Nvidia’s new game-changing A100 data center GPU is getting broad support from Dell Technologies, Cisco Systems and several other OEMs with more than 50 A100-based servers coming out this year.
The new A100 servers, unveiled Monday during the virtual ISC 2020 high-performance computing conference, are made possible in part by a new A100 PCIe 4.0 card that fits in existing server motherboards, eliminating the need for Nvidia’s HGX A100 server board, which supports the original SXM form factor. The tradeoff is slightly lower performance and fewer scalability options.
“Between these two products, we now have the ability to do mainstream server acceleration as well as scale-up server acceleration from our Ampere architecture-based GPUs,” Paresh Kharya, Nvidia’s director of product management for accelerated computing, told CRN.
The chipmaker unveiled the A100 at its digital GPU Technology Conference in May, saying that the new chip will “revolutionize” artificial intelligence by unifying training and inference into one architecture that can outperform its V100 and T4 GPUs several times over. Unlike any other GPU on the market, the A100 can be partitioned into as many as seven separate GPU instances for parallel computing jobs.
Servers supporting Nvidia’s A100 include Atos’ BullSequana X2415, Cisco’s Unified Computing System and HyperFlex servers, Dell’s PowerEdge servers, Hewlett Packard Enterprise’s ProLiant DL380 Gen10 and Apollo 6500 Gen10 systems and Lenovo’s ThinkSystem SR670. Other server vendors supporting the A100 include Asus, Fujitsu, Gigabyte, Inspur, One Stop Systems, Quanta and Supermicro.
Nvidia said 30 of the servers are expected to start shipping this summer while more than 20 additional systems will arrive by the end of the year. The company added that server vendors can receive certifications to highlight systems that are optimized to run Nvidia’s GPU-accelerated NGC software.
Kharya said the main tradeoffs between the A100 PCIe card and A100 SXM chip are performance and scalability. While the A100 SXM chip has a thermal design power of 400 watts, the A100 PCIe card has a TDP of 250 watts and can only achieve about 90 percent of the performance of the former.
“It's designed to run at a lower TDP, so while the peak performance is the same, the sustained performance is going to be different,” he said.
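The tradeoff Kharya describes can be put as simple arithmetic. A minimal sketch using the figures from the article (400-watt SXM, 250-watt PCIe at roughly 90 percent of SXM’s sustained performance); the perf-per-watt framing is illustrative, not a claim Nvidia made:

```python
# Illustrative arithmetic for the A100 PCIe vs. SXM tradeoff described above.
# Figures from the article: SXM TDP 400 W; PCIe TDP 250 W at roughly 90% of
# SXM's sustained performance (peak performance is the same).
sxm_tdp_w = 400
pcie_tdp_w = 250
pcie_relative_perf = 0.90  # sustained performance relative to SXM

# Performance per watt, normalizing SXM's sustained performance to 1.0.
sxm_perf_per_watt = 1.0 / sxm_tdp_w
pcie_perf_per_watt = pcie_relative_perf / pcie_tdp_w

print(f"SXM:  {sxm_perf_per_watt:.5f} perf units/W")
print(f"PCIe: {pcie_perf_per_watt:.5f} perf units/W")
print(f"PCIe advantage: {pcie_perf_per_watt / sxm_perf_per_watt:.2f}x perf/W")
```

On these numbers the PCIe card comes out roughly 1.4 times more power-efficient per sustained unit of work, which helps explain its appeal for mainstream servers even though the SXM configuration delivers the highest absolute performance.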
The A100 SXM chip is also better suited for scale-up deployments, supporting configurations of four, eight or even 16 A100 GPUs interconnected with Nvidia’s NVLink and NVSwitch technology, which provides nearly 10 times the bandwidth of PCIe 4.0, according to Kharya. In comparison, only two A100 PCIe cards can be connected with NVLink, which means any additional GPUs have to rely on lower-bandwidth PCIe connectivity to communicate.
“A100 PCIe provides great performance for applications that scale to one or two GPUs at a time, including AI inference and some HPC applications,” he said. “The A100 SXM configuration, on the other hand, provides the highest application performance with 400 watts of TDP. This configuration is ideal for customers with applications scaling to multiple GPUs in a server as well as across servers.”
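The “nearly 10 times” bandwidth figure checks out against commonly cited specs. A quick sketch; the per-link numbers below are not from the article and are assumptions based on published PCI-SIG and Nvidia figures (PCIe 4.0 x16 at roughly 64 GB/s bidirectional, third-generation NVLink on the A100 at roughly 600 GB/s per GPU):

```python
# Rough bandwidth comparison behind the "nearly 10x" claim.
# Assumed figures (not from the article): PCIe 4.0 x16 ~64 GB/s bidirectional;
# third-generation NVLink on the A100 ~600 GB/s total per GPU.
pcie4_x16_gb_s = 64.0
nvlink3_gb_s = 600.0

ratio = nvlink3_gb_s / pcie4_x16_gb_s
print(f"NVLink vs. PCIe 4.0 x16: {ratio:.1f}x")
```

That works out to roughly 9.4x, consistent with Kharya’s “nearly 10 times” characterization.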
But while the A100 PCIe card has lower performance and scalability, it will likely have broader appeal among channel partners because of how it can fit in a wide range of existing server motherboards. The A100 SXM chip, on the other hand, requires Nvidia’s HGX server board, which was custom-designed to support maximum scalability and serves as the basis for the chipmaker’s flagship DGX A100 system.
Mike Trojecki, vice president of digital solutions and services at New York-based Nvidia partner Logicalis, said his company is looking at Nvidia as a key partner for driving sales for AI-based services and products this year and that the A100 will help unlock more opportunities than any previous Nvidia GPU.
“The interesting thing here with the A100 is it brings AI down to a level that makes it available to everyone now,” he told CRN last month, citing Nvidia's claim that its DGX A100 can perform the same level of training and inference work as 50 DGX-1 and 600 CPU systems at a tenth of the cost and a twentieth of the power. “It's not just this giant system that was out there where you couldn't afford it. This brings it down and makes AI consumable.”
With the A100 bringing down the costs of running AI workloads, Trojecki said the GPU will help speed up sales cycles and move more customer conversations into real deployments.
“There's a difference between a $7 million sale getting someone into an AI platform rather than a million-dollar [sale],” he said.
In addition to the new PCIe card and expanded OEM support, Nvidia announced the new Mellanox UFM Cyber-AI platform, a hardware system that is designed to minimize downtime in data centers using the company’s Mellanox InfiniBand interconnect technology. The company said the platform uses AI-powered analytics to detect security threats and operational issues, allowing data centers to reduce downtime and potentially save hundreds of thousands of dollars an hour.
"The UFM Cyber-AI platform determines a data center’s unique vital signs and uses them to identify performance degradation, component failures and abnormal usage patterns,” said Gilad Shainer, senior vice president of marketing for Mellanox networking at Nvidia, in a statement. “It allows system administrators to quickly detect and respond to potential security threats and address upcoming failures, saving cost and ensuring consistent service to customers.”