Thursday, June 22, 2023

The Aurora Supercomputer Is Installed: 2 ExaFLOPS, Tens of Thousands of CPUs and GPUs

Argonne National Laboratory and Intel said on Thursday that they had installed all 10,624 blades for the Aurora supercomputer, a machine with a particularly bumpy history that was first announced back in 2015. The system promises to deliver a peak theoretical compute performance of over 2 FP64 ExaFLOPS using its array of tens of thousands of Xeon Max 'Sapphire Rapids' CPUs with on-package HBM2E memory as well as Data Center GPU Max 'Ponte Vecchio' compute GPUs. The system will come online later this year.

"Aurora is the first deployment of Intel's Max Series GPU, the biggest Xeon Max CPU-based system, and the largest GPU cluster in the world," said Jeff McVeigh, Intel corporate vice president and general manager of the Super Compute Group.

The Aurora supercomputer looks impressive by the numbers alone. The machine is powered by 21,248 general-purpose processors with over 1.1 million cores for workloads that require traditional CPU horsepower and 63,744 compute GPUs that will serve AI and HPC workloads. On the memory side, the CPUs have 1.36 PB of on-package HBM2E memory and 19.9 PB of DDR5 memory at their disposal, while the Ponte Vecchio compute GPUs carry 8.16 PB of HBM2E.
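As a quick sanity check on these totals, the short Python sketch below divides the quoted figures by the blade, CPU, and GPU counts. It uses only numbers stated in the article and assumes decimal units (1 PB = 1,000,000 GB); the function name `aurora_ratios` is purely illustrative. Under those assumptions, the totals work out to two CPUs and six GPUs per blade, with roughly 64 GB of HBM2E per CPU and roughly 128 GB of HBM2E per GPU.

```python
# Totals quoted in the article.
BLADES = 10_624
CPUS = 21_248
GPUS = 63_744
CPU_HBM2E_PB = 1.36   # on-package HBM2E attached to the CPUs
CPU_DDR5_PB = 19.9    # DDR5 attached to the CPUs
GPU_HBM2E_PB = 8.16   # HBM2E carried by the GPUs

# Assumption: decimal units, i.e. 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000


def aurora_ratios():
    """Derive per-blade and per-device figures from the quoted totals."""
    return {
        "cpus_per_blade": CPUS / BLADES,
        "gpus_per_blade": GPUS / BLADES,
        "hbm2e_gb_per_cpu": CPU_HBM2E_PB * GB_PER_PB / CPUS,
        "ddr5_gb_per_cpu": CPU_DDR5_PB * GB_PER_PB / CPUS,
        "hbm2e_gb_per_gpu": GPU_HBM2E_PB * GB_PER_PB / GPUS,
    }


if __name__ == "__main__":
    for name, value in aurora_ratios().items():
        print(f"{name}: {value:.1f}")
```

The per-device memory figures are derived from rounded totals, so they are approximations rather than official specifications.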

The Aurora machine uses 166 racks that house 64 blades each. It spans eight rows and occupies a space equivalent to two basketball courts. That figure does not include Aurora's storage subsystem, which employs 1,024 all-flash storage nodes offering 220 PB of storage capacity and a total bandwidth of 31 TB/s. For now, Argonne National Laboratory has not published official power consumption numbers for Aurora or its storage subsystem.

The supercomputer, which will be used for a wide variety of workloads from nuclear fusion simulations to weather prediction and from aerodynamics to medical research, uses HPE's Shasta supercomputer architecture with Slingshot interconnects. Before the system passes ANL's acceptance tests, it will be used to train large-scale generative AI models for science.

"While we work toward acceptance testing, we are going to be using Aurora to train some large-scale open-source generative AI models for science," said Rick Stevens, Argonne National Laboratory associate laboratory director. "Aurora, with over 60,000 Intel Max GPUs, a very fast I/O system, and an all-solid-state mass storage system, is the perfect environment to train these models."

Even though Aurora's blades have been installed, the supercomputer still has to undergo and pass a series of acceptance tests, a common procedure for supercomputers. Once it successfully clears these and comes online later in the year, it is projected to attain a theoretical performance exceeding 2 ExaFLOPS (two billion billion floating point operations per second). With that level of performance, it is expected to secure the top position in the Top500 list.

The installation of the Aurora supercomputer marks several milestones: it is the industry's first supercomputer with performance higher than 2 ExaFLOPS and the first Intel-based ExaFLOPS-class machine. It also marks the conclusion of the Aurora saga that began eight years ago, a journey that has seen its fair share of bumps.

Originally unveiled in 2015, Aurora was initially intended to be powered by Intel's Xeon Phi co-processors and was projected to deliver approximately 180 PetaFLOPS in 2018. However, Intel decided to abandon the Xeon Phi in favor of compute GPUs, resulting in the need to renegotiate the agreement with Argonne National Laboratory to provide an ExaFLOPS system by 2021.

The delivery of the system was further delayed by complications with the compute tile of Ponte Vecchio, which stemmed from the delay of Intel's 7 nm (now known as Intel 4) production node and the necessity to redesign the tile for TSMC's N5 (5 nm-class) process technology. Intel finally introduced its Data Center GPU Max products late last year and has now shipped over 60,000 of these compute GPUs to ANL.



from AnandTech https://ift.tt/QBfmSHi