Updates from SuperComputing 2022

Dr Werner Scholz, XENON CTO and Head of Research and Development, had the opportunity to attend SuperComputing 2022 (SC22) in Dallas earlier this month. Of particular interest to XENON customers, a range of new products and solutions were announced at SC22 that will impact High Performance Computing (HPC) and Deep Learning and Artificial Intelligence (DL/AI) workloads.

If you are planning investments or upgrades in HPC or DL/AI solutions, or if you are interested in how you can leverage these new solutions in your work, contact us today and we will be happy to provide an in-depth briefing of topics most relevant to you.

Here is our summary of the key points of general interest to XENON customers.

CPUs – More Compute, Higher Performance, Higher Efficiency

AMD has launched its fourth generation EPYC processors (Genoa), with more cores, more processing power and higher efficiency. The new EPYC CPUs support PCIe Gen5, CXL, and DDR5, which promises larger memory capacity, higher memory bandwidth, and lower latency. We’ll take a deeper dive into the AMD release in a separate blog post.

Intel 4th generation Xeon Scalable CPUs (Sapphire Rapids) will launch in early 2023.

Also on the horizon is the NVIDIA Grace CPU and Grace-Hopper Superchip. We are looking forward to that arriving in 2023.

GPUs – Wider Range and More Powerful Processing

GPUs play an increasingly important role in HPC and especially AI workloads. XENON is already providing solutions based on AMD MI200 series GPUs as well as NVIDIA A100 and H100 GPUs. AMD MI300 GPUs and Intel Data Center GPU Max Series accelerators will become available in 2023. NVIDIA had this area to itself for quite some time, but competitors are catching up in performance and power efficiency, and some vendors, such as Graphcore and Cerebras, are designing solutions specifically for AI workloads and larger model sizes.

As GPUs represent a larger investment in the data centre, organisations are looking for ways to optimise their usage. A number of interesting solutions are emerging that enable fractional GPU allocation to users, and also aggregation of GPU power into single large instances.

[Image: AMD MI200 series GPU, NVIDIA H100 and A100 GPU]

Compute Express Link (CXL)

New platforms like AMD Genoa and Intel Sapphire Rapids support CXL, which provides a cache coherent link between the CPUs and PCIe connected CXL devices like accelerators, network cards, and memory devices. CXL is an open standard for high speed CPU-to-device and CPU-to-memory connections, and we look forward to the implementation of CXL technologies across the new platforms. With PCIe Gen5 and DDR5, we expect lower latency and faster processing, moving more HPC solutions towards exaflop scale.

Cloud for HPC

XENON has long seen the advantages of cloud for bursting workloads or discrete projects, with XENON Cloud leveraging the best of public cloud providers. SC22 saw some interesting solutions that allow HPC workloads to take advantage of spot compute instances. These solutions provide a mechanism to save processing-in-progress if the spot instance is lost, and allow that work in progress to be relaunched on another spot instance when one becomes available. Taking advantage of spot compute instances dramatically lowers the cost of HPC in the cloud, especially when architected in a design that makes the most of these instances, like XENON Cloud.
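To make the save-and-resume idea concrete, here is a minimal Python sketch of checkpointing a long-running job. It is an illustration only and not how any particular product shown at SC22 works. The function names, the JSON checkpoint file, and the `interrupted` callback (a stand-in for whatever interruption notice a cloud provider sends) are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a kill mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, interrupted=lambda step: False, every=10):
    """Sum 0..n_steps-1, resuming from the checkpoint stored at `path`.

    `interrupted` is a hypothetical hook that returns True when the
    instance is about to be reclaimed. Returns (state, finished).
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if interrupted(step):
            save_checkpoint(path, state)  # persist progress before exit
            return state, False
        state["total"] += step  # stand-in for the real computation
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)  # periodic safety checkpoint
    save_checkpoint(path, state)
    return state, True
```

In practice the checkpoint would need to live on storage that outlives the instance, such as a shared file system or object store, so that a replacement spot instance can call `run` with the same path and continue from the last saved step.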

Storage

The big, persistent issue in HPC – where to put all the data!

Analytics

Many storage providers are looking to provide more analytics within their storage systems and file systems. By leveraging metadata and the increased compute power in CPUs and DPUs, 2023 is shaping up as the year of storage analytics, with solutions expected from VAST, Panasas, and DDN.

High Performance, High Bandwidth – Cloud or On-Prem

Storage is an ever expanding field of solutions and competing value propositions, as we saw at eResearch 2022. High performance storage with high bandwidth and RDMA access is increasingly the domain of software defined storage vendors such as WEKA and VAST. Being software defined, these solutions can potentially run in the public cloud and achieve the same throughput as on-premises installations. In the cloud, bandwidth is available on demand – WEKA demonstrated 2 TB/s in Oracle cloud in April!

LTO Roadmap

LTO tape continues to be the dominant storage technology for large archives and “forever” storage. LTO-9 is now shipping, with 18TB native and up to 45TB compressed per cartridge. The next leap forward is LTO-10, which is forecast to double these capacities to 36TB native and 90TB compressed.
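As a quick worked example of what these figures mean for planning, the Python sketch below derives the compression ratio implied by the quoted capacities and estimates cartridge counts. The 1,000 TB archive size in the usage note and the function names are assumptions for illustration, and the LTO-10 figures are the forecast values above, not shipping specifications.

```python
import math

# Capacities in TB as quoted in the text; LTO-10 values are forecasts.
LTO = {
    "LTO-9": {"native_tb": 18, "compressed_tb": 45},
    "LTO-10": {"native_tb": 36, "compressed_tb": 90},
}


def compression_ratio(generation):
    """Compression ratio implied by the native and compressed figures."""
    spec = LTO[generation]
    return spec["compressed_tb"] / spec["native_tb"]


def cartridges_needed(archive_tb, generation, compressed=False):
    """Whole cartridges required to hold an archive of `archive_tb` TB."""
    key = "compressed_tb" if compressed else "native_tb"
    return math.ceil(archive_tb / LTO[generation][key])
```

For a hypothetical 1,000 TB archive, `cartridges_needed(1000, "LTO-9")` gives 56 cartridges at native capacity, and `cartridges_needed(1000, "LTO-10")` gives 28. Both generations imply the same 2.5:1 compression ratio, and real-world compression depends heavily on the data being stored.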

In addition to capacity, tape is an increasingly popular storage technology for security reasons. WORM (write once, read many) media and a variety of mechanical tape locking mechanisms provide very secure, air-gapped protection of data.

Tape is also an area of innovation, with object storage on tape. There are a number of ways this can be implemented, including using redundant arrays of tape libraries to spread object data across multiple libraries.

Archive Management

Archives are managed either manually or with a hierarchical storage management (HSM) system. Unfortunately, many HSMs come with vendor lock-in and proprietary data formats. Versity addresses this with its open format archive and large data solution, which is an ideal replacement for HPSS, DMF and SanFS users.