What is NVIDIA’s DGX POD?
08 May 2019
NVIDIA DGX POD™ offers a proven design approach for building your GPU-accelerated AI data center with NVIDIA DGX-1, leveraging NVIDIA’s best practices and insights gained from real-world deployments.
The NVIDIA DGX POD™ is an optimised data centre rack containing up to nine DGX-1 servers, twelve storage servers, and three networking switches to support single- and multi-node AI model training and inference using NVIDIA AI software.
The NVIDIA DGX POD™ is also designed to be compatible with leading storage and networking technology providers. XENON offers a portfolio of NVIDIA DGX POD™ reference architecture solutions, including NetApp, IBM Spectrum, DDN and Pure Storage. All incorporate the best of NVIDIA DGX POD™ and are delivered as fully integrated, ready-to-deploy solutions to make your data centre AI deployments simpler and faster.
NVIDIA DGX POD Reference Architecture
This reference architecture* is based on a single 35 kW high-density rack to provide the most efficient use of costly data center floorspace and to simplify network cabling. As GPU usage grows, the average power per server and power per rack continue to increase. However, older data centers may not yet be able to support the power and cooling densities required; hence the three-zone design, which allows the DGX POD components to be installed in up to three lower-power racks.
The NVIDIA DGX POD is designed to fit within a standard-height 42 RU data center rack. A taller rack can be used to include redundant networking switches, a management switch, and login servers. This reference architecture uses an additional utility rack for login and management servers, and has been sized and tested with up to six NVIDIA DGX PODs. Larger configurations of NVIDIA DGX PODs can be defined by an NVIDIA solution architect.
A primary 10 GbE (minimum) network switch is used to connect all servers in the NVIDIA DGX POD and to provide access to a data center network.
A 36-port Mellanox 100 Gbps switch is configured to provide four 100 Gbps InfiniBand connections to each of the nine DGX-1 servers in the rack. This provides the best possible scalability for multi-node jobs. In the event of switch failure, multi-node jobs can fall back to the 10 GbE switch for communications. The Mellanox switch can also be configured in 100 GbE mode for organizations that prefer to use Ethernet networking. Alternatively, by configuring only two 100 Gbps ports per DGX-1 server, the remaining ports on the Mellanox switch can be used by the storage servers.
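The port arithmetic behind this layout can be checked with a short sketch. It assumes that the four InfiniBand connections are per DGX-1 server, which is the reading under which nine servers exactly fill a 36-port switch. The function name `port_budget` is hypothetical and used only for illustration.

```python
# Port-budget sketch for one 36-port switch, using the figures quoted above.
# Assumption: the "four 100 Gbps connections" are per DGX-1 server.
SWITCH_PORTS = 36
DGX_SERVERS = 9
PORT_SPEED_GBPS = 100


def port_budget(ports_per_server, switch_ports=SWITCH_PORTS, servers=DGX_SERVERS):
    """Return (ports used by DGX-1 servers, ports left for other devices)."""
    used = ports_per_server * servers
    if used > switch_ports:
        raise ValueError(f"{used} ports needed, only {switch_ports} available")
    return used, switch_ports - used


# Four links per server: every switch port goes to a DGX-1.
all_compute = port_budget(4)   # (36, 0)

# Two links per server: half the ports remain, e.g. for storage servers.
shared = port_budget(2)        # (18, 18)

# Aggregate link bandwidth per DGX-1 in each layout, in Gbps.
bandwidth_full = 4 * PORT_SPEED_GBPS    # 400
bandwidth_shared = 2 * PORT_SPEED_GBPS  # 200
```

With two links per server, 18 ports stay free on the switch at the cost of halving each server's aggregate InfiniBand bandwidth from 400 Gbps to 200 Gbps.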
With the DGX family of servers, AI and HPC workloads are fusing into a unified architecture. For organizations that want to utilize multiple NVIDIA DGX PODs to run cluster-wide jobs, a core InfiniBand switch is configured in the utility rack in conjunction with a second 36-port Mellanox switch in the NVIDIA DGX POD.
Storage architecture is important for optimized DL training performance. The NVIDIA DGX POD uses a hierarchical design with multiple levels of cache storage using the DGX-1 SSD and additional cache storage servers in the NVIDIA DGX POD. Long-term storage of raw data can be located on a wide variety of storage devices outside of the NVIDIA DGX POD, either on-premises or in public clouds.
The NVIDIA DGX POD baseline storage architecture consists of standard NFS on the storage servers in conjunction with the local DGX SSD cache. Additional storage performance may be obtained by using the Ceph object-based file system or other caching file system on the storage servers.
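The hierarchical read path described above can be pictured with a toy model. This is a minimal sketch rather than how the DGX POD implements caching: in practice the file system layer handles this, not application code. The `Tier` class, the `read` function, the tier names, and the sample key are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class Tier:
    """One storage level; a dict stands in for the data it currently holds."""

    def __init__(self, name: str, data: Optional[Dict[str, bytes]] = None):
        self.name = name
        self.data = dict(data or {})


def read(key: str, tiers: List[Tier]) -> Tuple[bytes, str]:
    """Search tiers from fastest to slowest and return (value, tier that served it).

    On a hit in a slower tier, the value is copied into every faster tier
    so that later reads are served closer to the GPU server.
    """
    for i, tier in enumerate(tiers):
        if key in tier.data:
            value = tier.data[key]
            for faster in tiers[:i]:
                faster.data[key] = value
            return value, tier.name
    raise KeyError(key)


# Fastest to slowest: local DGX SSD, cache storage servers in the POD,
# then long-term storage outside the POD.
local_ssd = Tier("dgx-local-ssd")
cache_servers = Tier("pod-cache-servers")
long_term = Tier("long-term-storage", {"train/shard-0001": b"raw bytes"})
tiers = [local_ssd, cache_servers, long_term]

first = read("train/shard-0001", tiers)   # served from long-term storage
second = read("train/shard-0001", tiers)  # now served from the local SSD
```

The first read goes all the way to long-term storage and fills both cache levels. The repeat read is served from the local SSD, which is the behaviour that helps training jobs that read the same dataset every epoch.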
The NVIDIA DGX POD is also designed to be compatible with a number of third-party storage solutions; see the reference architectures from DDN, NetApp, IBM and Pure Storage, all offered by XENON.
The DGX POD architecture has been updated for each release of the DGX in the last few years. Learn more about the latest DGX POD.
For more information please contact XENON.
*Reference Architecture – as outlined in the whitepaper below:
NVIDIA DGX Data Center Reference Design – Easy Deployment of DGX Servers for Deep Learning