New Servers Drive Acceleration For Training And Inference
It’s only been a few months since Nvidia launched its new A100 data center GPU capable of delivering accelerated performance for deep learning training and inference. But even in the early innings, the chipmaker said the GPU is already driving "meaningful" revenue, thanks to hyperscaler adoption.
The Santa Clara, Calif.-based company has made its A100 GPU available for the server market in two form factors: SXM, which requires Nvidia’s HGX A100 compute board, and PCIe, which has made the A100 available for a much broader range of server options.
Based on Nvidia’s 7-nanometer Ampere architecture, the A100 has been pitched by the company as a game-changing GPU that can deliver high and flexible performance for both scale-up and scale-out data centers, thanks in part to its multi-instance GPU feature, which can partition a single A100 into as many as seven GPU instances. The A100 also comes with 40 GB of HBM2 GPU memory and can drive 1.6 TBps of memory bandwidth.
Nvidia’s A100 SXM GPU was custom-designed to support maximum scalability, with the ability to interconnect 16 A100 GPUs using Nvidia’s NVLink and NVSwitch interconnect technology, which provides nearly 10 times the bandwidth of PCIe 4.0.
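The "nearly 10 times" figure can be sanity-checked with quick arithmetic, sketched in Python below. The 600-GBps total bandwidth for third-generation NVLink comes from Nvidia's published DGX A100 specs; the roughly 32 GBps per direction for a PCIe 4.0 x16 link is an assumption drawn from the PCIe 4.0 spec, not from this article:

```python
# Back-of-the-envelope check of the NVLink-vs-PCIe-4.0 bandwidth claim.
# Assumed figures: third-generation NVLink moves 600 GB/s total per GPU;
# a PCIe 4.0 x16 link moves roughly 32 GB/s in each direction,
# so about 64 GB/s bidirectional.

nvlink_total = 600                       # GB/s, third-gen NVLink, per GPU
pcie4_x16_per_direction = 32             # GB/s, PCIe 4.0 x16, one direction
pcie4_x16_bidirectional = 2 * pcie4_x16_per_direction

ratio = nvlink_total / pcie4_x16_bidirectional
print(f"NVLink bandwidth is roughly {ratio:.1f}x PCIe 4.0 x16")  # ~9.4x
```

Note that the comparison baseline matters: measured against a single PCIe direction the ratio would be closer to 19x, so the "nearly 10 times" claim is consistent with a bidirectional comparison.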
The A100 PCIe GPU, on the other hand, offers lower performance and scalability, as it can link only two GPUs over an NVLink bridge, but its advantage is much wider server support, making it much easier to integrate into existing data center infrastructure.
“A100 PCIe provides great performance for applications that scale to one or two GPU at a time, including AI inference and some HPC applications,” Paresh Kharya, Nvidia’s senior director of product management for accelerated computing, said in June. “The A100 SXM configuration, on the other hand, provides the highest application performance with 400 watts of TDP. This configuration is ideal for customers with applications scaling to multiple GPUs in a server as well as across servers.”
When Nvidia unveiled the A100 PCIe GPU in June, the company said Dell Technologies, Cisco Systems and several other OEMs would release more than 50 A100-based servers this year.
What follows are 10 A100 servers that are available now or coming out soon, from Nvidia’s DGX A100 AI system to Gigabyte’s G492-Z51 that can support up to 10 A100s.
Asus ESC4000A-E10
The ESC4000A-E10 is a 2U single-socket GPU server from Asus that can support up to four A100 PCIe GPUs. Its single processor, from AMD’s second-generation EPYC lineup, can support up to 64 cores, 128 threads, eight-channel DDR4 3200 memory and 128 lanes of PCIe 4.0 connectivity. Beyond the PCIe 4.0 x16 slots for the A100s, the server comes with two PCIe 4.0 x16 slots for butterfly riser cards and an additional PCIe 4.0 x8 slot for a front riser card. To provide even faster throughput between GPUs, the server supports Nvidia’s NVLink bridge. For storage, the server comes with eight 2.5-inch or 3.5-inch hot-swap storage device bays, four of which can support NVMe. On memory, the server can support up to 2 TB total. Other features include Asus Thermal Radar 2.0, which uses ambient sensors throughout the server to perform “intelligent fan-curve adjustments” for a reduced total cost of ownership. For networking, the server comes with one dual-port Gigabit LAN controller and one management LAN port.
Atos BullSequana X2415
The BullSequana X2415 is a 1U dual-socket GPU blade server from Atos that comes with four A100 SXM GPUs connected with NVLink, thanks to its use of Nvidia’s HGX A100 compute board. Atos said the server, which is being used in the JUWELS supercomputer (pictured) in Germany, is more than two times as powerful as its previous GPU blade system, and its energy consumption is optimized thanks to a patented direct liquid cooling solution. The server comes with two AMD EPYC processors, either from the Rome lineup or the upcoming Milan lineup, and it supports 16 DDR4 memory slots at 32 GB each. The server also comes with up to four Mellanox InfiniBand ports that are connected via a Dragonfly+ configuration. The server has an optional storage slot for an M.2 SATA SSD of up to 1.92 TB or an M.2 NVMe SSD of 960 GB. In addition, it comes with two interconnect mezzanine boards.
Gigabyte G492-Z51
The G492-Z51 is a 4U dual-socket GPU server from Gigabyte that can support up to 10 A100 PCIe GPUs, thanks to two PCIe switches providing five connections each. The server comes with two second-generation AMD EPYC processors, supporting up to 128 cores and PCIe 4.0 connectivity. Beyond the 10 PCIe 4.0 slots for the GPUs, the server also has three low-profile PCIe 4.0 expansion slots. In addition, there is an OCP 3.0 Gen4 mezzanine slot at the rear. For memory, it comes with 32 DIMM slots for eight-channel DDR4. On the storage side, the server comes with eight slots for 3.5-inch NVMe and SATA storage and eight hot-swappable bays for 3.5-inch SATA and SAS storage. For networking, the server comes with two 10GBASE-T LAN ports and one management LAN port.
HPE Apollo 6500 Gen10
The HPE Apollo 6500 Gen10 is a 4U dual-socket GPU server that can support up to eight A100 PCIe GPUs with the use of the HPE XL270d eight-PCIe-GPU module. Using Intel’s first- and second-generation Xeon Scalable lineup, the server’s two processors can support a total of 56 cores. With 24 DIMM slots, the server can support up to 3 TB of HPE DDR4 SmartMemory. On the storage side of things, the server comes with one HPE Smart Array S100i, one HPE Smart Array P408i-a or one HPE Smart Array P816i-a. For expansion slots beyond those reserved for GPUs, the server has four PCIe 3.0 x16 slots from the XL270d module for high-speed fabrics and one PCIe 3.0 x16 slot for a full-height, half-length card. For networking, the server comes with an embedded Ethernet adapter with four ports or optional HPE FlexibleLOM and PCIe adapters for high-speed connectivity.
Inspur NF5488A5
The NF5488A5 is a 4U dual-socket GPU server that supports up to eight A100 SXM GPUs, thanks to the server’s use of Nvidia’s HGX A100 compute board. Inspur says the new server was ranked first by MLPerf for single-server performance on the ResNet-50 model. Using AMD’s second-generation EPYC lineup, the server’s processors support up to a total of 128 cores and PCIe 4.0 connectivity. With 32 DIMM slots, the server supports up to 4 TB in DDR4 2,933MHz memory. On the storage side, the server comes with eight 2.5-inch bays for SAS or SATA drives or, alternatively, four bays for NVMe drives and four bays for SATA or SAS drives, in addition to four M.2 NVMe slots. It also comes with an additional two M.2 SATA slots. The server also has four low-profile PCIe 4.0 x16 slots and a 10G Ethernet optical interface.
Lenovo ThinkSystem SR670
The ThinkSystem SR670 is a 2U dual-socket GPU server that can support up to four A100 PCIe GPUs. Using Intel’s first- and second-generation Xeon Scalable processors, the server’s two processors support up to 28 cores each for a total of 56 cores. With 24 DIMM slots, the server supports up to 1.5 TB in TruDDR4 2,933MHz memory. For accelerator support, the server comes with up to four PCIe 3.0 x16 slots for double-wide cards like the A100 and up to eight PCIe 3.0 x8 slots for single-wide cards. For I/O expansion, the server comes with two PCIe 3.0 x16 slots and one PCIe 3.0 x4 slot. On the storage side, the server comes with up to eight 2.5-inch hot-swap rear bays for SAS or SATA drives and up to two non-hot-swap internal bays for M.2 SATA SSDs at 6 Gbps. For networking, the server comes with one port for dedicated 1 GbE system management.
Nvidia DGX A100
The DGX A100 is a purpose-built artificial intelligence system from Nvidia that comes with eight A100 SXM GPUs and a total of 320 GB of GPU memory. The GPUs are interconnected with Nvidia’s third-generation NVLink, providing GPU-to-GPU direct bandwidth of 600 GBps. The system also comes with six Nvidia NVSwitches, which provide 4.8 TBps of bi-directional bandwidth, and nine Mellanox ConnectX-6 200-Gbps InfiniBand adapters, which together provide 450 GBps of bi-directional bandwidth. For networking, the system also comes with an Ethernet port for up to 200-Gbps speeds. The system is outfitted with two 64-core AMD EPYC processors and 1 TB of memory. It also comes with 15 TB of NVMe Gen4 SSD storage, capable of hitting 25 GBps of peak bandwidth.
Penguin Computing Relion XE4118GT
The Relion XE4118GT is a 4U dual-socket GPU server that can support up to eight A100 PCIe GPUs. Using Intel’s second-generation Xeon Scalable lineup, the server’s two processors support up to 28 cores each for a total of 56 cores in the server. With 24 DIMM slots, the server can support up to 3 TB in DDR4 2,933MHz memory and 6 TB with Intel Optane Persistent Memory. On the storage side, the server comes with a 12-Gbps SAS/SATA hard drive backplane, 12 3.5-inch hot-swap bays for SAS or SATA drives and 10 2.5-inch hot-swap bays for SAS or SATA drives. Beyond the eight PCIe 3.0 slots for GPUs, the server comes with an additional two PCIe 3.0 slots for low-profile cards. For networking, the server comes with an Intel I350-AM2 Ethernet controller and two GbE LAN ports.
QCT QuantaGrid D43KQ-2U
The QuantaGrid D43KQ-2U is a 2U dual-socket GPU server from Quanta Cloud Technology that supports up to eight A100 PCIe GPUs. Using AMD’s second-generation EPYC lineup, the server’s two processors support up to 64 cores each for a total of 128 cores in one node. For memory, the server has 32 DDR4 DIMM slots that can support up to 4 TB of 3200MHz memory. For storage, the server comes in five variants that consist of different configurations of SATA, SAS and NVMe drive bays on the front and rear of the server, while all variants come with two onboard M.2 SATA drives for operating system installation. The five variants also come with different PCIe 4.0 slot configurations, though all of them come with eight PCIe 4.0 mezzanine slots for the eight-way GPU support. For networking, the server comes with one dedicated GbE management port.
Supermicro AS-2124GQ-NART
The AS-2124GQ-NART is an upcoming 2U dual-socket GPU server from Supermicro, part of its A+ server family, that comes with four A100 SXM GPUs, thanks to its use of Nvidia’s HGX A100 compute board. The GPUs are interconnected using Nvidia’s third-generation NVLink and connected to the CPUs through PCIe 4.0 connectivity. Using AMD’s second-generation EPYC lineup, the processors support up to 64 cores each for a total of 128 cores in the server. For memory, the server has 32 DIMM slots for DDR4-3200MHz SDRAM. On the storage side, it comes with four 2.5-inch hot-swap drive bays for SATA, NVMe or SAS storage. It also has four PCIe 4.0 x16 slots and one PCIe 4.0 x8 slot. For networking, the server comes with two 10 GbE host LAN ports and a 1 GbE dedicated IPMI management port.