Accelerating deep learning (DL) training – on GPUs, TPUs, FPGAs or other accelerators – is in the early days of scale-out architecture, much like the server market was in the mid-2000s. DL training enables the advanced pattern recognition behind modern artificial intelligence (AI) based services. NVIDIA GPUs have been a major driver for DL development and commercialization, but IBM just made an important contribution to scale-out DL acceleration. Understanding what IBM did and how that work advances AI deployments takes some explanation.
Inference scales-out. Trained DL models can be simplified for faster processing with good enough pattern recognition to create profitable services. Inference can scale-out as small individual tasks running on multiple inexpensive servers. There is a lot of industry investment aimed at lowering the cost of delivering inference; we'll discuss that in a future article.
The immediate challenge for creating deployable inference models is that, today, training scales-up. Training requires large data sets and high numeric precision; aggressive system designs are needed to meet real-world training times and accuracies. But cloud economics are driven by scale-out.
The challenge for cloud companies deploying DL-based AI services, such as Microsoft’s Cortana, Amazon’s Alexa and Google Home, is that DL training has not scaled well. Poor off-the-shelf scaling is mostly due to the immature state of DL acceleration, forcing service providers to invest (in aggregate) hundreds of millions of dollars in research and development (R&D), engineering and deployment of proprietary scale-out systems.
NVLink Scales-Up in Increments of Eight GPUs
GPU evolution has been a key part of DL success over recent years. General purpose processors were, and still are, too slow at processing DL math with large training data sets. NVIDIA invested early in leveraging GPUs for DL acceleration, in both new GPU architectures to further accelerate DL and in DL software development tools to enable easy access to GPU acceleration.
An important part of NVIDIA’s GPU acceleration strategy is NVLink. NVLink is a high-speed, scale-up GPU-to-GPU interconnect architecture that directly connects two to eight GPU sockets. NVLink enables GPUs to train together with minimum processor intervention. Prior to NVLink, GPUs did not have the low-latency interconnect, data flow control sophistication, or unified memory space needed to scale-up by themselves. NVIDIA implements NVLink using its SXM2 socket instead of PCIe.
NVIDIA’s DGX-1, Microsoft’s Open Compute Project (OCP) Project Olympus HGX-1 GPU chassis and Facebook’s “Big Basin” server contribution to OCP are very similar designs that each house eight NVIDIA Tesla SXM2 GPUs. The DGX-1 design includes a dual-processor x86 server node in the chassis, while the HGX-1 and Big Basin designs must be paired with separate server chassis.
Microsoft’s HGX-1 can bridge four GPU chassis by using its PCIe switch chips to connect the four NVLink domains to one to four server nodes. While all three designs are significant feats of server architecture, the HGX-1’s 32-GPU maximum marks a practical upper limit for directly connected scale-up GPU systems.
The list price for each DGX-1 is $129,000 using NVIDIA’s P100 SXM2 GPU and $149,000 using its V100 SXM2 GPU (including the built-in dual-processor x86 server node). While this price range is within reach of some high-performance computing (HPC) cluster bids, it is not a typical cloud or academic purchase.
Original Design Manufacturers (ODMs) like Quanta Cloud Technology (QCT) manufacture variants of OCP’s HGX-1 and Big Basin chassis, but do not publish pricing. NVIDIA P100 modules are priced from about $5,400 to $9,400 each. Because NVIDIA’s SXM2 GPUs account for most of the cost of both Big Basin and HGX-1, we believe that system pricing for both is in the range of $50,000 to $70,000 per chassis unit (not including matching x86 servers), in cloud-sized purchase quantities.
Facebook’s Big Basin Performance Claims
Facebook published a paper in June describing how it connected 32 Big Basin systems over its internal network to aggregate 256 GPUs and train a ResNet-50 image recognition model in under an hour with about 90% scaling efficiency and 72% accuracy.
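"Scaling efficiency" here means how close the measured aggregate training throughput comes to perfect linear scaling across all GPUs. A minimal sketch of the calculation, with illustrative throughput numbers that are our own assumptions, not figures from either paper:

```python
# Scaling efficiency for data-parallel training: the ratio of measured
# aggregate throughput to the ideal of N GPUs each running at full
# single-GPU speed. The numbers below are hypothetical.

def scaling_efficiency(single_gpu_imgs_per_sec, n_gpus, measured_imgs_per_sec):
    ideal = single_gpu_imgs_per_sec * n_gpus  # perfect linear scaling
    return measured_imgs_per_sec / ideal

# Hypothetical example: one GPU sustains 100 images/sec, while 256 GPUs
# together sustain 23,040 images/sec rather than the ideal 25,600.
eff = scaling_efficiency(100.0, 256, 23040.0)
print(f"{eff:.0%}")  # 90%
```

The gap between measured and ideal throughput is mostly communication overhead: the time GPUs spend exchanging gradient updates instead of computing.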
While 90% scaling efficiency is an impressive state-of-the-art achievement, there are several challenges with Facebook’s paper.
The eight-GPU Big Basin chassis is the largest possible scale-up NVIDIA NVLink instance. It is expensive, even if you could buy OCP gear as an enterprise buyer. Plus, Facebook’s paper does not mention which OCP server chassis design and processor model were used for its benchmarks. The processor choice may be a moot point, because if you are not a cloud giant, it is very difficult to buy a Big Basin chassis or any of the other OCP servers that Facebook uses internally. Using different hardware, your mileage is guaranteed to vary.
Facebook also does not divulge the operating system or development tools used in the paper, because Facebook has its own internal cloud instances and development environments. No one else has access to them.
The net effect is that it is nearly impossible to replicate Facebook’s achievement if you are not Facebook.
IBM Scales-Out with Four GPUs in a System
IBM recently published a paper as a follow-up to the Facebook paper. IBM’s paper describes how to train a ResNet-50 model in under an hour at 95% scaling efficiency and 75% accuracy, using the same data sets that Facebook used for training. IBM’s paper is notable in several ways:
- Not only did IBM beat Facebook on all the metrics, but its 95% efficiency approaches truly linear scaling.
- Anyone can buy the equipment and software to replicate IBM’s work. Equipment, operating systems and development environments are called out in the paper.
- IBM used smaller scale-out units than Facebook. Assuming Facebook used its standard dual-socket compute chassis, IBM’s design has half the GPU-to-CPU ratio: Facebook’s is 4:1, while IBM’s is 2:1.
IBM sells its OpenPOWER “Minsky” deep learning reference design as the Power Systems S822LC for HPC. IBM’s PowerAI software platform with Distributed Deep Learning (DDL) libraries includes IBM-Caffe and “topology aware communication” libraries. PowerAI DDL is specific to OpenPOWER-based systems, so it will run on similar POWER8 Minsky-based designs and upcoming POWER9 “Zaius”-based systems (Zaius was designed by Google and Rackspace), such as those shown at various events by Wistron, E4, Inventec and Zoom.
PowerAI DDL enables creating large scale-out systems out of smaller, more affordable, GPU-based scale-up servers. It optimizes communications between GPU-based servers based on network topology, the capabilities of each network link, and the latencies for each phase of a DL model.
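PowerAI DDL’s internals are proprietary, but the core collective operation that topology-aware libraries of this kind optimize is an all-reduce: summing gradient updates across all workers so each holds the global result. The sketch below is a generic, single-process simulation of a ring all-reduce; it is our own illustration of the technique, not IBM’s implementation or API.

```python
# Generic ring all-reduce sketch, simulated in one process. Each worker's
# gradient vector is split into n chunks; in 2*(n-1) steps each link
# carries only ~1/n of the data per step, which is why the ring pattern
# is bandwidth-efficient. This is illustrative, not IBM's DDL code.

def ring_allreduce(worker_grads):
    """Return buffers where every worker holds the element-wise sum."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    bounds = [(i * size) // n for i in range(n + 1)]  # chunk boundaries
    buf = [list(g) for g in worker_grads]

    # Reduce-scatter: after n-1 steps, worker w holds the complete sum
    # for chunk (w + 1) % n.
    for t in range(n - 1):
        sends = [(w, (w - t) % n) for w in range(n)]
        payloads = [buf[w][bounds[c]:bounds[c + 1]] for w, c in sends]
        for (w, c), data in zip(sends, payloads):
            r, lo = (w + 1) % n, bounds[c]
            for i, v in enumerate(data):
                buf[r][lo + i] += v

    # All-gather: circulate the completed chunks around the ring until
    # every worker has every fully reduced chunk.
    for t in range(n - 1):
        sends = [(w, (w + 1 - t) % n) for w in range(n)]
        payloads = [buf[w][bounds[c]:bounds[c + 1]] for w, c in sends]
        for (w, c), data in zip(sends, payloads):
            r, lo = (w + 1) % n, bounds[c]
            buf[r][lo:lo + len(data)] = data

    return buf
```

A topology-aware library goes further than this flat ring: it can, for example, reduce over fast NVLink within a chassis first, then over the slower inter-server network, matching each phase to the links available.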
IBM used 64 Power System S822LC systems, each with four NVIDIA Tesla P100 SXM2-connected GPUs and two POWER8 processors, for a total of 256 GPUs – matching Facebook’s paper. Even with twice as many IBM GPU-accelerated chassis required to host the same number of GPUs as in Facebook’s system, IBM achieved a higher scaling efficiency than Facebook. That is no small feat.
Commercial availability of IBM’s S822LC for low-volume buyers will be a key element enabling academic and enterprise researchers to buy a few systems and test IBM’s hardware and software scaling efficiencies. The base price for an IBM S822LC for Big Data (without GPUs) is $6,400, so the total price of an S822LC for High Performance Computing should be in the $30,000 to $50,000 ballpark (including the dual-processor POWER8 server node), depending on which P100 model is installed and other options.
Half the battle is knowing that something can be done. We believe IBM’s paper and product availability will spur a lot of DL development work by other hardware and software vendors.
— The author and members of the TIRIAS Research staff do not hold equity positions in any of the companies mentioned. TIRIAS Research tracks and consults for companies throughout the electronics ecosystem from semiconductors to systems and sensors to the cloud.