Much of the focus in AI infrastructure deployments is on the compute side: how many petaFLOPS and how much I/O can you deploy to analyse the data, train and test the model, and run the inference?
The compute side of the AI puzzle is also where the shiny new tech lives. There is now a wider set of compute options that suit AI and are being developed for the specific demands of AI, including the latest NVIDIA GPUs and DGX systems, and Graphcore's IPUs.
Storage is often the overlooked component of AI infrastructure. However, getting the data storage aspect right is critical to ensure the AI compute can perform as designed and deliver AI models that actually work!
Traditional Enterprise Storage
Traditional enterprise IT workloads involve a large number of writes to commit data to storage, and usually a small number of reads to retrieve that data. Many enterprise storage systems evolved to incorporate Hierarchical Storage Management (HSM), which allows data to move between tiers of storage based on the capacity and performance profile of each tier. These HSM policies are designed to protect data (usually by keeping duplicate copies on different tiers) and to manage “cold” data by moving it to the cheapest, highest-capacity tier, usually tape or object storage. These cheaper tiers are often slower as well, taking seconds to minutes to retrieve a file, which is acceptable if you only want a few files once in a blue moon.
Generalising across enterprises, research has shown that on average 80% of data is “cold” and isn't accessed regularly. The point at which data becomes cold can be identified with a number of file scanning tools, but it is usually 3-6 months after the data was created.
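As a rough illustration of how such a scan works, here is a minimal Python sketch that flags files untouched for 180 days. The threshold, the example path, and the reliance on access times (atime) are assumptions for illustration; commercial HSM and file-scanning tools apply far richer policies.

```python
import time
from pathlib import Path

COLD_AGE_DAYS = 180  # assumed threshold; real HSM policies are tunable

def find_cold_files(root: str, age_days: int = COLD_AGE_DAYS):
    """Yield files not accessed in `age_days` - candidates for a cold tier."""
    cutoff = time.time() - age_days * 86_400
    for path in Path(root).rglob("*"):
        # Note: filesystems mounted with noatime will not update access times.
        if path.is_file() and path.stat().st_atime < cutoff:
            yield path

# Hypothetical usage: report how much data could be demoted to tape/object.
cold = list(find_cold_files("/data/projects"))
total_tb = sum(p.stat().st_size for p in cold) / 1e12
print(f"{len(cold)} cold files, {total_tb:.2f} TB eligible for demotion")
```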
Because of this ratio of cold data to hot data, most enterprise storage systems are built with a small-capacity tier of high-performance flash or NVMe, a medium-capacity disk tier for data 1-6 months old, and a large, ever-growing cold tier of tape or object storage for data more than 6 months old. The performance profile of enterprise storage systems is similarly skewed towards writes, which make up a greater proportion of I/O than reads.
AI’s Unique Storage Requirements
Compared to enterprise storage requirements for business-as-usual data, AI storage differs in a number of critical areas:
- Data sets are incredibly large. A simple facial recognition AI model requires 100 million images just to determine the difference between male and female; further distinctions of age, ethnicity and so on require even more images. To be accurate, the images also need to be high resolution, so this dataset could easily reach 4.6PB (assuming 100 million uncompressed 8-bit images*). The arithmetic is sketched after this list.
- Data is read, read, and read again as the model is trained, with reads issued randomly across the whole data set, often from many nodes simultaneously.
- The results generated are small and written once. Compared to the volume of data read, the write volume is very small.
- AI compute resources are built to run at extremely high throughput, and scale linearly as more AI compute is added. Storage bandwidth requirements of 200GB/s or higher are normal.
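To make the first and last points above concrete, here is the back-of-envelope arithmetic as a short Python sketch. The 4,000 × 4,000 resolution and three 8-bit channels are illustrative assumptions that land near the footnoted 4.6PB figure, not the exact numbers from the linked article.

```python
# Back-of-envelope sizing for an image-recognition training set.
# Resolution and channel count are illustrative assumptions.
NUM_IMAGES = 100_000_000            # 100 million images
WIDTH = HEIGHT = 4_000              # assumed "high resolution" frame
CHANNELS = 3                        # 8 bits per channel, uncompressed

bytes_per_image = WIDTH * HEIGHT * CHANNELS      # 48 MB per image
total_bytes = NUM_IMAGES * bytes_per_image
print(f"{bytes_per_image / 1e6:.0f} MB per image")
print(f"{total_bytes / 1e15:.1f} PB for the full data set")      # ~4.8 PB

# Even at the 200GB/s aggregate bandwidth cited above, one full pass
# (epoch) over the data set takes hours.
BANDWIDTH = 200e9                   # bytes per second
print(f"{total_bytes / BANDWIDTH / 3600:.1f} hours per epoch")   # ~6.7 hours
```

Training runs many such epochs over the same data, which is why reads dominate the workload and sustained read bandwidth, rather than write performance, becomes the binding constraint.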
Traditional enterprise HSM storage solutions can work for AI, but they struggle to keep up with the bandwidth demands of the compute as data is repeatedly fetched from the slower tiers of storage. This latency slows down model training and overall model development. In the medium term, this read pattern also affects system longevity, as the colder tiers of storage are not designed for heavy, repeated read workloads.
Storage Has Evolved
There is a range of new storage options designed to address the demands of AI compute. These solutions generally all have:
- High bandwidth to saturate AI compute nodes.
- A scale-out cluster architecture that allows storage bandwidth to increase linearly as more nodes are added.
- RDMA (Remote Direct Memory Access) to connect the AI processors directly to the data, bypassing the CPU to lower latency and accelerate data flows to the AI processors (see the sketch after this list).
- A single namespace with an “unlimited” number of files and storage capacity. While nothing is truly unlimited, the practical ceilings are typically measured in billions or trillions of files and exabytes (thousands of PB) of capacity.
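To illustrate what this direct data path looks like from application code, below is a minimal sketch using NVIDIA GPUDirect Storage through the RAPIDS KvikIO Python library, which reads file data straight into GPU memory. The file path and buffer size are hypothetical, and this is just one of several RDMA-based paths that AI storage platforms expose.

```python
# Minimal sketch: read a training shard directly into GPU memory with
# GPUDirect Storage via RAPIDS KvikIO. Path and size are hypothetical.
import cupy as cp
import kvikio

# Allocate the destination buffer in GPU memory, not host RAM.
shard = cp.empty(50_000_000, dtype=cp.uint8)   # ~50 MB chunk

# When GPUDirect Storage is available, the read DMAs from storage to the
# GPU and bypasses the CPU bounce buffer; otherwise KvikIO falls back to
# a conventional host-memory copy.
with kvikio.CuFile("/data/train/shard_0000.bin", "r") as f:
    f.read(shard)
```

On the network side, these platforms typically pair this kind of host-bypass I/O with RDMA over InfiniBand or RoCE, so data moves from the storage cluster to the GPUs with minimal CPU involvement.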
Range of Options
Storage designed for AI workloads will ensure investments in AI compute are fully realised. XENON builds solutions with a number of leading providers who specialise in AI-focused storage, including:
- DDN and their new A3I range. DDN storage is the reference architecture behind the NVIDIA DGX SuperPOD and is used by NVIDIA in-house.
- VAST Data, whose all-flash storage architecture combines QLC flash and Intel Optane (3D XPoint) to deliver blindingly fast speeds that scale linearly.
- Weka, whose software-defined storage runs on a range of hardware options and provides a fast, single namespace.
All three solutions are endorsed by both NVIDIA and Graphcore, and are excellent choices depending on your specific AI requirements.
Contact the XENON team today, and we will work with you to ensure that your storage is as capable as your AI compute!
Footnote
* See this post in Enterprise Storage Forum for more details on these calculations. Image-based AI is a leading area of AI development, especially in medical diagnostics examining scans and images from X-rays, CT scans and MRIs. In these areas, datasets quickly grow past 100 million images, with profound examples of life-saving results in cancer diagnosis and life-creating results in IVF.