At the big Supercomputing 19 (SC19) show, Intel, Nvidia, and Cerebras discussed new architectures for faster machines.
At this year’s big supercomputing conference, SC19, the top of the list of the fastest machines in the world is unchanged. But several new technologies under discussion portend the era of exascale computing, meaning machines theoretically capable of a billion billion (a quintillion) calculations per second.
As it has been since June of last year, the Summit computer at the Department of Energy’s Oak Ridge National Laboratory (ORNL) remains on top of the Top500 list, with a sustained performance of 148.6 petaflops on the High-Performance Linpack benchmark used to rank the list. This machine, built by IBM, has 4,608 nodes, each equipped with two 22-core IBM Power9 CPUs and six Nvidia Tesla V100 GPUs, connected by a Mellanox EDR InfiniBand network. A similar but somewhat smaller IBM system called Sierra, at Lawrence Livermore National Laboratory, comes in second at 94.6 petaflops. In third place is the Sunway TaihuLight supercomputer at China’s National Supercomputing Center in Wuxi, powered by Sunway SW26010 processors, at 93 petaflops.
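For a rough sense of how that node count maps onto the petaflop figures, here is a back-of-the-envelope sketch in C++. The figure of roughly 7 double-precision teraflops per Tesla V100 is my own working assumption (actual peak depends on clock speeds), not a number from the Top500 entry.

    #include <cstdio>

    int main() {
        // Summit's published configuration, per the Top500 entry.
        const double nodes = 4608;        // total nodes
        const double gpus_per_node = 6;   // Nvidia Tesla V100s per node
        // Assumed double-precision peak per V100, in teraflops (an estimate;
        // the real figure depends on the clocks Summit runs at).
        const double tflops_per_gpu = 7.0;

        // GPU-only peak in petaflops (1 petaflop = 1,000 teraflops).
        const double gpu_peak_pf = nodes * gpus_per_node * tflops_per_gpu / 1000.0;
        std::printf("Estimated GPU-only peak: ~%.0f petaflops\n", gpu_peak_pf);
        return 0;
    }

That works out to roughly 190 petaflops of theoretical peak from the GPUs alone, which squares with Summit sustaining 148.6 petaflops on Linpack; sustained Linpack numbers always come in below peak.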
The entire top 10 on the list has been unchanged since June. The most powerful new system comes in at number 25: the Advanced Multiprocessing Optimized System (AMOS) at Rensselaer Polytechnic Institute’s Center for Computational Innovations (CCI).
Like Summit and Sierra, this is an IBM system with Power9 CPUs and Nvidia Tesla V100s. According to the list, it is a smaller, five-rack system with a sustained Linpack score of 8 petaflops.
(As an alum, I was glad to see this, and particularly tickled that the machine is named AMOS after Rensselaer’s first senior professor, Amos Eaton; I spent a lot of time as an undergraduate waiting for the mainframe in Amos Eaton Hall. I doubt anyone ever ran Linpack on the old IBM 360/67, but the new machine is probably millions of times faster; it has 130,000 cores, compared with the single-digit count on the old mainframe.)
Looking over the whole list, China continues to rise and now accounts for 227 of the Top500 installations, while the US has 118, near its all-time low. The top three system vendors are Lenovo, Sugon, and Inspur, all based in China, followed by Cray and HPE (HPE now owns Cray). Four hundred seventy systems use Intel CPUs, another 14 use IBM Power processors, and three use AMD. There are now two ARM-based supercomputers on the list: the Astra system deployed at Sandia National Laboratories, equipped with Marvell’s ThunderX2 processors, and Fujitsu’s A64FX prototype system in Japan. Nvidia remains the dominant accelerator vendor, with GPUs in 136 of the 145 accelerated systems. Ethernet is still used in over half of the systems, but the fastest machines use InfiniBand or proprietary interconnects such as Cray Aries and Intel Omni-Path.
Still, while there isn’t much change on the list so far, a lot of work is being done on new architectures to produce an exascale machine within the next two years. The US has announced work on two big new supercomputers. The first is the Aurora project at the DOE’s Argonne National Laboratory, which will be built by Cray (now part of HPE) and Intel. The second is Frontier at Oak Ridge, which will use custom AMD Epyc processors and Radeon Instinct GPUs connected over an Infinity Fabric interconnect.
Leading up to SC19, Intel announced more details of the Aurora project, saying it will use nodes that consist of two 10nm++ Sapphire Rapids Xeon processors and six of the new Ponte Vecchio GPU accelerators based on the forthcoming Xe graphics architecture, along with the firm’s Optane DC persistent memory. Intel said Aurora will support over 10 petabytes of memory and 230 petabytes of storage, and will use the Cray Slingshot fabric to connect nodes across more than 200 racks. (It did not give exact numbers for total nodes or performance.)
Intel gave more detail on the Ponte Vecchio processors, saying they will be built around the Xe architecture but optimized for high-performance computing and AI workloads. This version will be manufactured on 7nm technology and will use Intel’s Foveros 3D and EMIB packaging to combine multiple dies in a single package. It will also support high-bandwidth memory and the Compute Express Link (CXL) interconnect. (Intel has said to expect a version of the Xe architecture in a consumer GPU sometime in 2020, presumably on its 10nm or 14nm process.)
Intel also gave more details on its oneAPI project, including libraries and a new language variant called Data Parallel C++, designed to help developers write code that can run on CPUs, GPUs, and FPGAs.
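For a flavor of what that looks like in practice, here is a minimal Data Parallel C++-style sketch of a vector-add kernel. It follows the SYCL conventions that DPC++ builds on; treat the header path and device selection as illustrative assumptions rather than a definitive rendering of Intel’s toolchain.

    #include <CL/sycl.hpp>   // SYCL/DPC++ header; exact path varies by toolchain
    #include <iostream>
    #include <vector>

    int main() {
        const size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        {
            // The queue targets whichever device the runtime selects
            // (CPU, GPU, or FPGA), which is the point of the model.
            cl::sycl::queue q;
            cl::sycl::buffer<float, 1> bufA(a.data(), cl::sycl::range<1>(n));
            cl::sycl::buffer<float, 1> bufB(b.data(), cl::sycl::range<1>(n));
            cl::sycl::buffer<float, 1> bufC(c.data(), cl::sycl::range<1>(n));

            q.submit([&](cl::sycl::handler& h) {
                auto A = bufA.get_access<cl::sycl::access::mode::read>(h);
                auto B = bufB.get_access<cl::sycl::access::mode::read>(h);
                auto C = bufC.get_access<cl::sycl::access::mode::write>(h);
                // The same kernel body can be compiled for CPU, GPU, or FPGA targets.
                h.parallel_for<class vector_add>(cl::sycl::range<1>(n),
                    [=](cl::sycl::id<1> i) { C[i] = A[i] + B[i]; });
            });
        }   // buffers synchronize back to the host vectors when they go out of scope

        std::cout << "c[0] = " << c[0] << std::endl;   // expect 3
        return 0;
    }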
Not to be outdone, Nvidia, whose GPUs are the most popular accelerators, announced a reference design for building servers that combine ARM-based processors with Nvidia GPUs. Nvidia worked with Ampere, Fujitsu, and Marvell, all of which are developing ARM-based server processors, as well as with Cray and HPE, which have separately built some of the early ARM-based HPC systems with Nvidia GPU accelerators.
AMD said more companies are using its second-generation Epyc processors and Radeon Instinct accelerators, highlighting its selection for the Frontier computer, which the firm said is expected to be the highest-performing supercomputer in the world when it ships in 2021. AMD also announced several other wins, including deals with Atos, whose BullSequana XH2000 supercomputers will use AMD parts for weather forecasting and for atmospheric, ocean, and climate research; and with Cray, whose Shasta architecture will power the forthcoming Archer2 and Vulcan systems in the UK. AMD also talked about ROCm 3.0, a new version of the open-source GPU-computing software the firm supports.
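AMD’s announcement did not include code, but ROCm’s programming layer, HIP, deliberately mirrors CUDA. The following SAXPY sketch is a generic illustration of that model, assuming a working hipcc toolchain; it is not taken from the ROCm 3.0 release itself.

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // A HIP kernel looks essentially like a CUDA kernel: y = a*x + y (SAXPY).
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float* h_x = new float[n];
        float* h_y = new float[n];
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Allocate device memory and copy the inputs over, CUDA-style.
        float *d_x = nullptr, *d_y = nullptr;
        hipMalloc((void**)&d_x, bytes);
        hipMalloc((void**)&d_y, bytes);
        hipMemcpy(d_x, h_x, bytes, hipMemcpyHostToDevice);
        hipMemcpy(d_y, h_y, bytes, hipMemcpyHostToDevice);

        // 256 threads per block; enough blocks to cover all n elements.
        hipLaunchKernelGGL(saxpy, dim3((n + 255) / 256), dim3(256), 0, 0,
                           n, 2.0f, d_x, d_y);

        hipMemcpy(h_y, d_y, bytes, hipMemcpyDeviceToHost);
        std::printf("y[0] = %f\n", h_y[0]);   // expect 4.0

        hipFree(d_x); hipFree(d_y);
        delete[] h_x; delete[] h_y;
        return 0;
    }

The CUDA-like structure (a __global__ kernel, explicit device allocations and copies) is the design point: it keeps porting existing GPU code to Radeon hardware relatively mechanical.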
AMD also highlighted that Microsoft Azure now offers a preview of an HPC instance based on its second-generation Epyc 7742 processor. Meanwhile, Nvidia announced a new Azure instance type that can scale up to 800 V100 GPUs interconnected over a single Mellanox InfiniBand backend network. Nvidia said it used 64 of these instances on a pre-release version of the cluster to train BERT, a popular conversational AI model, in roughly three hours.
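A quick bit of arithmetic puts those figures in context; the eight V100s per instance below is my own assumption about this instance class, not a number stated in the announcement.

    #include <cstdio>

    int main() {
        // Assumed GPUs per Azure instance; not stated in the announcement above.
        const int gpus_per_instance = 8;

        const int instances_for_bert = 64;   // instances Nvidia said it used
        const int max_gpus = 800;            // stated maximum GPU scale

        std::printf("GPUs in the BERT training run: ~%d\n",
                    instances_for_bert * gpus_per_instance);    // ~512
        std::printf("Instances needed to reach %d GPUs: %d\n",
                    max_gpus, max_gpus / gpus_per_instance);    // 100
        return 0;
    }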