High density system performance boost meets growing science needs
The National Computational Infrastructure (NCI) is a High Performance Computing (HPC) resource catering to the Australian research community. Based at the Australian National University in Canberra, NCI received a government grant in 2012 to build a supercomputer dedicated to scientific research and technology innovation. With a strong focus on meteorology and climate research, the HPC system was named Raijin, after the Shinto god of thunder, lightning, and storms.
In order to cover operational expenditures such as power, cooling, and maintenance, NCI decided to partner with leading Australian scientific research bodies including CSIRO, Geoscience Australia, and the Australian Bureau of Meteorology. Several Australian Research Council Centres of Excellence, NCRIS projects, medical research institutes, universities, and industry partners also contributed to the project.
Approximately three quarters of Raijin’s total computing capability is allocated to NCI’s partners, with the remainder open to individual scientific research projects. Researchers from anywhere in Australia can apply for computing resources; a scientific committee reviews each application and decides how much time and resources to allocate per project.
Challenged by its Own Success
By 2016, NCI was receiving requests for up to 320 million hours of computing time, over three times its total capacity. As it could no longer keep up with demand, NCI decided to boost the overall computing capability of Raijin. To avoid lengthening the wait-list even further, the system boost needed to be delivered as quickly and painlessly as possible. “We needed to minimize interruption,” said Allan Williams, Associate Director, Services and Technology at NCI. “A delay in deployment would have risked significantly increasing rather than decreasing our backlog.”
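To put the demand figures in perspective, the following is a minimal arithmetic sketch based only on the two numbers quoted above (320 million requested hours, more than three times capacity). The function name and the use of exactly 3.0 as the ratio are illustrative assumptions; since the article says "over three times", the result is an upper bound on capacity, not NCI's actual figure.

```python
def implied_capacity_ceiling(requested_hours: float, min_oversubscription: float) -> float:
    """Return an upper bound on capacity, given that requests exceed
    capacity by more than `min_oversubscription` times."""
    if min_oversubscription <= 0:
        raise ValueError("oversubscription ratio must be positive")
    return requested_hours / min_oversubscription


# Figures quoted in the article; 3.0 is the lower bound of "over 3 times".
requested = 320e6
ceiling = implied_capacity_ceiling(requested, 3.0)

# Because real capacity is below the ceiling, unmet demand is at least this much.
min_unmet = requested - ceiling

print(f"Capacity is below {ceiling / 1e6:.1f} million hours")
print(f"Unmet demand is at least {min_unmet / 1e6:.1f} million hours")
```

Under these assumptions, roughly two thirds of the requested hours could not be served before the boost.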
Following a call for proposals, NCI selected a solution from XENON and its technology partners Lenovo, Mellanox, and Cumulus that would be integrated with Raijin’s existing infrastructure. “We chose to work with XENON due to their prior experience in designing and delivering HPC systems,” said Dr Muhammad Atif, Manager of HPC Systems and Cloud Services at NCI. “Our system is being pushed at full capacity 24 hours a day, 7 days a week, and so we needed to work with a team of HPC experts who understood the importance of ensuring that each compute node is able to run at full capacity at all times.”
In designing the solution, XENON considered not only technical aspects such as processing, networking, and memory performance, but also operational factors such as cooling, power, and redundancy. As experienced HPC specialists, XENON also planned for installation logistics such as the location of loading docks, lifts, and even door dimensions, so that the new system could be installed with minimal delay.
The XENON Solution
The new Raijin boost system includes 22,792 Intel® Xeon® Broadwell processor cores and 144 terabytes of memory, including ten nodes with 1 terabyte each. The additional processing power, combined with more RAM per node, enhances Raijin’s capability of handling large amounts of data at greater speeds. “Once online, the new system immediately acted as a pressure release valve to take up some of the load while introducing the latest processing technology,” said Mr Williams.
To further boost overall system performance, XENON equipped Raijin with a 100 gigabits per second EDR InfiniBand interconnect from Mellanox. The new interconnect significantly lowered overall system latency and increased bandwidth, enabling researchers to process workloads faster.
XENON designed the new system to fit within just twelve 42RU racks. It delivers over three times the compute density of the 50 racks of computing capacity already installed in Raijin. This not only frees up valuable floor space, but also minimizes operational expenditures related to the daily running and maintenance of the supercomputer.
System Delivered Ahead of Schedule and with Minimal Interruption
The new Raijin system was up and running ahead of schedule. Service disruption was minimized as all the performance tests were successfully completed over the Christmas holiday period. “The system boost was up and running in time for acceptance testing over the holidays,” said Mr Williams. “XENON staff along with its partners worked extra long hours to make sure that everything went as planned. In fact we were back online one day ahead of schedule.”
Through careful project planning and coordination, XENON managed an on-site team that included key technology partners and industry experts needed to complete the installation on time. “Working with supercomputers requires more than just a supplier-customer relationship, it’s all about working as a partnership towards a common goal. As a partnership with XENON and its technology partners Mellanox, Lenovo, and Cumulus, it was fantastic!” said Dr Atif.
The combination of technology, expertise, and experience in building HPC systems was key to the successful deployment of this project. “We were very impressed with their capability, expertise, and professionalism. XENON clearly delivered what was required,” said Mr Williams.
Raijin – The Fastest Computer in Oceania
The system boost has enabled NCI to nearly double the computing capacity available to Australian public researchers. The high-density solution will enable NCI to expand its HPC capabilities in the years ahead. “It’s all about science in the end. With the new system delivered by XENON, we are now able to offer the best possible solution to their research needs,” said Dr Atif.
At ISC 2017, it was announced that Raijin was ranked in the TOP500 list as the fastest computer in Oceania. “Following the system boost we achieved HPL efficiency of over 90% which is very high, especially for a heterogeneous system across different CPU and fabric types,” said Ben Menadue, Senior HPC Systems Specialist at NCI. “XENON, as a multi-vendor HPC specialist, was aware of all the different technologies and therefore able to help us achieve this ranking.”
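For readers unfamiliar with the metric, HPL efficiency is conventionally the measured LINPACK result (Rmax) divided by the theoretical peak (Rpeak), where the peak of a heterogeneous system is the sum over its CPU partitions. The sketch below illustrates that calculation. All partition names, core counts, clock speeds, and the Rmax value are hypothetical placeholders chosen for easy arithmetic; they are not Raijin's actual specifications or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One homogeneous group of CPUs within a heterogeneous system."""
    name: str
    cores: int
    clock_ghz: float
    flops_per_cycle: int  # double-precision FLOPs per core per cycle

    def rpeak_tflops(self) -> float:
        # cores * GHz * FLOPs/cycle gives GFLOPS; divide by 1000 for TFLOPS.
        return self.cores * self.clock_ghz * self.flops_per_cycle / 1000.0


def hpl_efficiency(rmax_tflops: float, partitions: list[Partition]) -> float:
    """Return Rmax / Rpeak, with Rpeak summed across all partitions."""
    rpeak = sum(p.rpeak_tflops() for p in partitions)
    if rpeak <= 0:
        raise ValueError("theoretical peak must be positive")
    return rmax_tflops / rpeak


# Hypothetical two-generation system (NOT Raijin's real configuration).
partitions = [
    Partition("older-cpu", cores=1000, clock_ghz=2.0, flops_per_cycle=8),
    Partition("newer-cpu", cores=500, clock_ghz=2.0, flops_per_cycle=16),
]

# Hypothetical measured LINPACK result in TFLOPS.
efficiency = hpl_efficiency(29.5, partitions)
print(f"HPL efficiency: {efficiency:.1%}")
```

In a mixed system like this, the slower partition and the links between fabric types tend to drag the measured result down, which is why a figure above 90% is described in the quote as very high.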