Chip giant Nvidia has long dominated what's known as the "training" of neural networks, the compute-intensive task of fashioning and refashioning a network's neural "weights," or "parameters," until it reaches optimal performance. The company has always had competition from a variety of chip makers, including giants such as Intel and Advanced Micro Devices, and startups such as Graphcore.
The latest benchmark tests of speed, however, suggest Nvidia really has no competition, if competition means parties that meaningfully challenge the best the company can do.
The MLCommons, the industry consortium that compiles multiple benchmark reports each year on AI chip performance, on Wednesday announced numbers for how different chips perform when training neural nets for a variety of tasks, including training Meta's Llama large language model to make predictions, and training the Stable Diffusion image model to produce pictures.
Nvidia swept the benchmarks, taking the top score, and also the second-best, in all nine competitions. Competitors such as AMD, Intel, and Google's cloud computing division didn't even come close.
Also: Nvidia sweeps AI benchmarks, but Intel brings meaningful competition
It was the third time in a row that Nvidia faced no real competition for the top scores, and even the encouraging showings that competitors made in past rounds failed to materialize this time around.
The Training 4.0 test, which comprises nine separate tasks, measures the time it takes to train a neural network, the process of refining its settings over multiple experiments until it reaches a target level of accuracy. Training is one half of neural network performance; the other half is so-called inference, where the finished neural network makes predictions as it receives new data. Inference is covered in separate releases from MLCommons.
Most of MLPerf's tasks are by now well-established neural nets that have been in development for years, such as 3-D U-Net, a program for studying volumetric data for things such as solid tumor detection, which was introduced by Google's DeepMind back in 2016.
However, MLCommons continues to periodically update the benchmarks with new tasks to reflect emerging workloads. This round of training was the first time submitters competed on the time to "fine-tune" a version of Meta's Llama language model, in which the AI model is retrained after its initial training using a smaller, more focused training data set. Also added was a "graph neural network" task, which trains a neural net on graphs, sets of connected data points, something that can be useful for applications such as drug discovery.
Also: Nvidia teases Rubin GPUs and CPUs to succeed Blackwell in 2026
In the test to fine-tune Meta's Llama 2 70B, Nvidia took just a minute and a half with a collection of 1,024 of its "H100" GPU chips, a mainstream part currently powering AI workloads across the industry. Nvidia chips scored the top twenty-three results, with Intel's "Gaudi" AI accelerator showing up in twenty-fourth place.
Even when adjusted for the number of chips, no substantial challenge materializes. Consider eight-chip configurations, which are more common among enterprises than 1,024-chip systems, and where Intel posted promising results last summer. This time around, Intel's best (and only) submission in that class was for the Llama 2 70B task.
It took Intel's system, aided by two of Intel's Xeon CPUs, 78 minutes to fine-tune Llama. An eight-way Nvidia H100 system, aided by two of Advanced Micro Devices' EPYC processors and assembled by open-source vendor Red Hat, took less than half that time, just over 31 minutes.
In the test to train OpenAI's GPT-3 for things such as chat, Intel was able to use just 1,024 Gaudi chips, less than a tenth of the 11,616 H100s Nvidia used. But Intel's score of 67 minutes was nearly twenty times as long as Nvidia's leading score of 3.4 minutes. Of course, some enterprises may find the difference between an hour of training and three minutes negligible given the cost savings of using far fewer chips, and given that much of the work of training AI models lies in factors other than strict wall-clock training time, such as the time required for data preparation.
Also: Intel shows off latest 'Gaudi' AI chip, pitched towards enterprises
Other vendors had an equally hard time catching Nvidia. On the venerable image-recognition test, using the neural net known as ResNet, Advanced Micro Devices took 167 minutes to train the network using six of its "Radeon RX 7900 XTX" accelerator chips, versus just 122 minutes for a six-way Nvidia "GeForce RTX 4090" system.
Google's four submissions using its "TPU" version 5 chip, all for the GPT-3 test, posted times far behind Nvidia's, between 12 and 114 minutes to train versus Nvidia's 3.4 minutes. Past competitors such as Graphcore have since bowed out of the race.
Also conspicuous in the results is Nvidia's dominance as a system vendor. All of its winning scores were achieved with systems engineered by Nvidia itself, even though a raft of system vendors participated, including Dell, Fujitsu, Hewlett Packard Enterprise, Juniper Networks, and Lenovo.
An interesting future development could be systems using Nvidia's "Grace" CPU. All of the chip results submitted, whether from Nvidia, Intel, AMD, or Google, continue to rely on one of two x86 CPUs, Intel's Xeon or AMD's EPYC. With Nvidia aiming to sell more complete computing systems built around Grace, it seems only a matter of time before that CPU is paired with Nvidia's GPUs in submitted systems. That could have an interesting impact on Nvidia's already substantial lead.
An interesting first this time around for the benchmark suite was the inclusion of a measurement of the energy consumed to train neural nets. The Singapore company Firmus Technologies submitted results for its cloud platform, Sustainable Metal Cloud, running tens and hundreds of Nvidia H100s, and reported the total "energy-to-train" measured in joules. Firmus, in fact, posted two of the second-place scores won by Nvidia chips.
To run the Llama 2 70B fine-tuning, for example, Firmus's cloud computing system took between 45 million and 46 million joules to train the network using 512 H100 chips. That training run took two minutes, a little longer than Nvidia's best time on its own systems. It required four times as much energy as an eight-chip system that took fifteen times as long to train, or 29 minutes, demonstrating the remarkable increase in energy consumed by giant training systems.
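Those figures lend themselves to a quick sanity check. The sketch below is a minimal back-of-the-envelope calculation in Python, assuming the rounded numbers quoted above (roughly 45.5 million joules for the 512-chip run, and one-quarter of that energy for the eight-chip run); the exact values in the MLPerf results tables will differ, and the implied average power draw is a derived estimate, not a reported measurement.

```python
# Back-of-the-envelope check of the energy-to-train figures reported above.
# Inputs are the article's rounded numbers, not the exact MLPerf table values,
# so treat the output as illustrative only.

MJ = 1_000_000  # joules per megajoule

# Firmus Sustainable Metal Cloud runs of the Llama 2 70B fine-tuning task
big_run = {"chips": 512, "energy_j": 45.5 * MJ, "minutes": 2}
small_run = {"chips": 8, "energy_j": 45.5 * MJ / 4, "minutes": 29}

# Average power implied by total energy divided by wall-clock time
for name, run in (("512-chip run", big_run), ("8-chip run", small_run)):
    avg_kw = run["energy_j"] / (run["minutes"] * 60) / 1_000
    print(f"{name}: ~{run['energy_j'] / MJ:.1f} MJ over {run['minutes']} min, "
          f"~{avg_kw:.0f} kW average draw")

# The trade-off the article describes: ~4x the energy for ~1/15 the time
print(f"Energy ratio: ~{big_run['energy_j'] / small_run['energy_j']:.1f}x")
print(f"Time ratio:   ~{small_run['minutes'] / big_run['minutes']:.1f}x")
```

Run as written, the sketch suggests the 512-chip system draws on the order of hundreds of kilowatts while it works, versus single-digit kilowatts for the eight-chip box, which is the scaling effect the energy-to-train metric is meant to expose.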
The cost of training AI has been a hot-button issue in recent years, both in terms of the dollar burden on companies and the environmental burden. It remains to be seen whether other submitters to the MLPerf results will join Firmus in offering energy measurements in the next round of benchmarks.