Intel’s Habana Labs submitted improved scores for its second-generation Gaudi2 training accelerator.
An 8-Gaudi2 system can train ResNet in 16.6 minutes or BERT in 15.6 minutes. This is a slight improvement over Gaudi2’s scores from July 2022, which Eitan Medina, COO at Habana Labs, told EE Times came down to optimizations in the company’s SynapseAI software stack.
These scores win in the “available” category for 8-accelerator systems, beating the closest 8x Nvidia A100 scores—28.2 and 16.8, respectively (Nvidia’s H100 is in the “preview” category as it’s not commercially available yet). Medina points out that the A100 is on the same process node as Gaudi2, but that Gaudi2 has more memory—96 GB compared to 80 GB for the A100—and also integrates networking on-chip.
Medina fancies Gaudi2’s chances against H100, given that Gaudi2’s scores in this round used BF16; he expects that using FP8 for future submissions will further boost Habana’s scores (Gaudi2 supports both FP8 formats).
“We have double the compute in FP8… this is something we’re really looking forward to enabling for our customers,”
he said. “We do expect that additional software optimization, just good old engineering, will reveal more and more things we can do, both on the host side, as well as what [the tensor processing core] does, and how the graph compiler works.”
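Mixed-precision training of the kind Medina describes is usually expressed as an autocast region around the forward pass. The sketch below is a minimal, generic PyTorch illustration using BF16 on CPU, with a placeholder model and random data; Gaudi2 itself is driven through Habana’s SynapseAI PyTorch integration, which is not shown here.

```python
# Minimal sketch of BF16 mixed-precision training (generic PyTorch, placeholder
# model and data; not Habana's SynapseAI integration).
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)                      # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(32, 1024)
targets = torch.randn(32, 1024)

# Matrix multiplies inside the autocast region run in bfloat16, while master
# weights and the optimizer step stay in FP32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(inputs), targets)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```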
While Gaudi2’s power envelope is slightly bigger than the A100’s, Gaudi2’s on-chip RoCE reduces component count, so customers shouldn’t notice a big difference when comparing power consumption at the server level, Medina said.
Intel Sapphire Rapids
Intel submitted its first set of training scores for its fourth-generation Xeon Scalable CPUs, code-named Sapphire Rapids. A total of 32 Sapphire Rapids CPUs can train BERT in 47.3 minutes, or ResNet in 89.0 minutes. Two CPUs can train DLRM in under an hour.
“We proved that on a standard 2-socket Xeon scalable processor, that you can train,”
Jordan Plawner, senior director of AI products at Intel, told Vivax Media. “And within 1-16 nodes, you can train intuitively, in a reasonable amount of time.”
While CPU training won’t suit all users of AI, some data center customers will be more than happy with this, Plawner said.
“Part of the market is very happy using shared, general-purpose infrastructure to do intermittent training,”
he said. “Look at the number of minutes and the number of nodes. This either resonates with you or it doesn’t. Either you’re in this camp or you’re not.”
New to fourth-gen Xeons is AMX (Advanced Matrix Extensions), a set of instructions specifically for accelerating the matrix multiplication at the heart of AI/ML workloads. Plawner expects a 3-6× speedup across inference and training for different model types; the MLPerf scores also reflect Sapphire Rapids’ larger die size and higher core count.
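As a rough illustration of what AMX accelerates, the hedged sketch below times the same matrix multiply in FP32 and BF16 with PyTorch on CPU; on a Sapphire Rapids part with a recent oneDNN-backed build, the BF16 path can be lowered to AMX tile instructions. The sizes, iteration counts, and any speedup observed are illustrative only.

```python
# Hedged sketch: compare FP32 vs BF16 matmul throughput on CPU. On 4th-gen
# Xeon with a recent PyTorch/oneDNN build, the BF16 path may use AMX.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

def bench(x, y, iters=10):
    # Average wall-clock time of a matrix multiply over a few iterations.
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.mm(x, y)
    return (time.perf_counter() - t0) / iters

fp32_ms = bench(a, b) * 1e3
bf16_ms = bench(a.bfloat16(), b.bfloat16()) * 1e3
print(f"fp32: {fp32_ms:.1f} ms, bf16: {bf16_ms:.1f} ms")
```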
Against Intel’s previous MLPerf training submissions for third-gen Xeons (Cooper Lake) from July 2021, the only direct comparison is the recommendation benchmark, DLRM. (DLRM may not be a good indicator of AMX’s contribution, since the workload is typically more memory-bound than compute-bound, but Sapphire Rapids’ higher memory bandwidth and hardware data-streaming accelerator block no doubt contribute here.)
Four Sapphire Rapids CPUs can train the DLRM benchmark in 38.0 minutes, 3.3× faster than four Cooper Lake CPUs, and for 8x CPU systems, the improvement was 2.9×.
Plawner said that Intel is currently running fine-tuning/transfer learning experiments using Sapphire Rapids, a type of setup where big models trained on accelerator systems can be fine-tuned with a small amount of training in just a few minutes.
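The sketch below shows the general shape of such a fine-tuning job: a pretrained backbone is frozen and only a small task head is trained. The backbone, task, and data are placeholders in generic PyTorch/torchvision; this is not Intel’s experimental setup.

```python
# Hedged sketch of transfer learning / fine-tuning: freeze a pretrained
# backbone, train only a new classification head. Placeholder model and data.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                         # freeze pretrained weights

backbone.fc = nn.Linear(backbone.fc.in_features, 10)  # hypothetical 10-class task
optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)

# One toy training step on random data; a real job loops over a small
# labelled dataset for a few minutes.
x = torch.randn(16, 3, 224, 224)
y = torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(backbone(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```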
MosaicML
MosaicML showed scores in the open division (unlike the closed division, the open division allows changes to the model).
MosaicML took a popular version of BERT and trained it on a typical DGX-A100 system (8x Nvidia A100 GPUs). The company added its own efficiency speedups via its software library, Composer, which cut the time to train from 21.4 to 7.9 minutes, a 2.7× improvement. This actually brings Mosaic’s A100 score close to Nvidia’s H100 score (6.4 minutes, albeit for a slightly different version of BERT).
“We’ve built a software optimization framework that makes it easy for folks to plug and play different software,”
Hanlin Tang, co-founder of MosaicML, told Vivax Media. “To get these speedups, we did a few things. We added a whole bunch of our efficiency methods, some are system-level things like kernel fusions, and better kernels including better attention kernels such as [HazyResearch’s] FlashAttention… and the third thing is tuning, which leads to better data efficiency for the model.”
Better data efficiency, meaning training to the same accuracy with less data, lets training finish faster. This has implications for large language models, whose size today can be limited by access to enough training data.
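One of the system-level methods Tang mentions, a better attention kernel, can be illustrated with stock PyTorch: scaled_dot_product_attention dispatches to a fused, FlashAttention-style kernel on supported hardware. The sketch below compares it against naive attention on random tensors; it is a generic illustration, not Composer’s implementation.

```python
# Hedged illustration of swapping naive attention for a fused kernel.
# PyTorch's scaled_dot_product_attention can dispatch to a FlashAttention-style
# kernel on supported GPUs; this is not MosaicML's Composer code.
import torch
import torch.nn.functional as F

# Random (batch, heads, seq, head_dim) tensors standing in for BERT activations.
q = torch.randn(8, 12, 512, 64)
k = torch.randn(8, 12, 512, 64)
v = torch.randn(8, 12, 512, 64)

# Naive attention materializes the full (seq x seq) score matrix.
scores = (q @ k.transpose(-2, -1)) / (64 ** 0.5)
naive_out = torch.softmax(scores, dim=-1) @ v

# The fused path computes the same result without writing the score matrix
# out to memory, which is where FlashAttention-style kernels save time.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-5))
```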
Tang said that data quality also matters—for Mosaic’s previous ResNet submissions, the company used techniques such as training on smaller images in the initial parts of training when the model is learning coarse-grained features, for example. The company intends to apply techniques like this to NLP training in the future.
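A bare-bones version of that progressive-resizing idea is sketched below: batches are downscaled early in training and the resolution ramps up to full size later. The schedule and sizes are illustrative and not MosaicML’s actual recipe.

```python
# Hedged sketch of progressive resizing: train on small images early, when the
# model is learning coarse features, and grow the resolution later.
import torch.nn.functional as F

def resize_batch(images, epoch, total_epochs, start=160, end=224):
    # Linearly ramp the training resolution from `start` to `end` pixels,
    # reaching full size at roughly 70% of the way through training.
    frac = min(epoch / (0.7 * total_epochs), 1.0)
    size = int(start + frac * (end - start))
    return F.interpolate(images, size=(size, size), mode="bilinear",
                         align_corners=False)

# Inside the training loop, each batch is resized before the forward pass:
#   images = resize_batch(images, epoch, total_epochs)
#   loss = criterion(model(images), labels)
```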
“A general concept that we’re seeing more and more of is that the neural network architecture starts becoming less important over time,”
Naveen Rao, co-founder of MosaicML, told Vivax Media. “It’s really about how you select data to cause more learning. Our brains do this very well; we get naturally filtered data points that are more useful, and throw away the ones that are less useful. Not every data point has something to be learned from, and I think that’s a key concept.”
MosaicML runs customer training in its Nvidia A100-powered cloud, where optimizations can be invoked with a single command. While the optimization concepts aren’t unique to particular hardware, the implementations are; much is hardware- and system-specific, which is why the company offers a cloud service. The company’s aim is to offer training for very large models at efficient cost points.
“One of the reasons we founded the company was to have these state-of-the-art methods be accessible to many industries,”
Rao said. “The problem we now have is [AI] can do amazing things, but it’s just being used by a small number of large tech companies. That’s not really what we want to see.”
GreenWaves Technologies NE16
As well as training results, this round of MLPerf also showcased new tinyML inference scores.
In the tinyML category, European startup GreenWaves Technologies, a first-time submitter, swept the board with its 10-core RISC-V GAP9 processor, which features the NE16 AI accelerator.
Martin Croome, VP of marketing at GreenWaves, told Vivax Media that the company’s staple diet is bigger audio networks, but there are some instances where customers have many smaller networks they want to run simultaneously.
GreenWaves’ GAP9 can perform keyword-spotting inference in 0.73 ms using 18.6 µJ, or 0.48 ms using 26.7 µJ. This is both faster and lower energy than its nearest challenger, Syntiant, but Croome stressed that GreenWaves’ product targets a different market with a different cost point.
The company had several tricks up its sleeve for smaller networks like the MLPerf Tiny benchmarks. For most of the benchmarks, GreenWaves’ team was able to keep everything in the device’s large shared L1 cache between layers, minimizing data transfer and the associated energy. Almost all weights were quantized to 6 bits (the NE16 can support weights down to 2 bits).
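For reference, the sketch below shows what symmetric per-tensor quantization to a reduced bit width (6 bits here) looks like in plain NumPy. It illustrates the general technique only; it is not GreenWaves’ GAPflow quantizer.

```python
# Hedged sketch of symmetric per-tensor weight quantization to n bits.
import numpy as np

def quantize_symmetric(weights, n_bits=6):
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 31 for 6-bit signed
    scale = np.max(np.abs(weights)) / qmax        # per-tensor scale factor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_symmetric(w, n_bits=6)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
```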
“NE16 has proven to be very good at optimizing pointwise convolutions, and the overall architecture is good, and we’ve done a lot of work on the tools over the last five years, so it’s a combination of multiple things,”
Croome said.
GreenWaves uses a combination of custom and non-custom kernels assembled by the company’s GAPflow toolchain, which can fuse together convolutions, pooling layers, activations of different types, and more. This is particularly useful in audio, GreenWaves’ target market, where neural networks are in general more diverse than in computer vision.
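Operator fusion of this kind also exists at the framework level. As a loose analogy, the hedged PyTorch sketch below folds a Conv + BatchNorm + ReLU sequence into a single module so the intermediate activations are never stored separately; GAPflow performs its own fusion for GAP9, which is not shown here.

```python
# Hedged analogy for operator fusion: fold Conv + BatchNorm + ReLU into one
# module using PyTorch's fusion utility (eval mode required for BN folding).
import torch
import torch.nn as nn
from torch.ao.quantization import fuse_modules

class SmallBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(8, 16, 3, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

model = SmallBlock().eval()
fused = fuse_modules(model, [["conv", "bn", "relu"]])
print(fused)   # conv now carries the folded BN and ReLU; bn/relu become Identity

x = torch.randn(1, 8, 32, 32)
print(torch.allclose(model(x), fused(x), atol=1e-5))
```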
Plumerai
European startup Plumerai’s software offering is an inference engine for tinyML models running on Arm Cortex-M, which typically halves memory footprint and increases inference speed by as much as 70% without affecting accuracy, according to the company (compared with TensorFlow Lite for Microcontrollers and CMSIS-NN). This is achieved without additional quantization.
Plumerai submitted scores using its inference engine for Arm Cortex-M33, M4, and M7 microcontrollers. Inference speed improved by 2-6% over Plumerai’s scores from the last round.
Compared with other results on the same Arm Cortex-M4 microcontroller (STM32L4R5ZIT6U) running at the same clock speed, Plumerai’s image classification scores were 1.3× faster than STMicro’s own result, which was in turn faster than OctoML’s. (STMicro’s inference engine, part of its X-Cube-AI software stack, is based on an optimized version of CMSIS-NN.)
On Cortex-M33, Plumerai again beat STMicro and OctoML scores, even achieving lower latency than Silicon Labs’ M33 device with its on-chip accelerator (Plumerai did not submit power scores).
Other notable MLPerf Tiny entries
Syntiant submitted a second round of scores for its NDP120, which features a second-generation Syntiant in-memory compute core. Keyword-spotting latency improved from 1.80 to 1.48 ms (1.2×) and energy from 35.29 to 31.5 µJ (1.1×).
This is notably the first time Syntiant submitted benchmarks for workloads other than keyword spotting—visual wake words and image classification scores are also available for the NDP120. The company said all the tinyML benchmarks used less than a third of the on-chip resources, meaning it’s suited to applications like sensor fusion or running more than one neural network simultaneously.
STMicro improved its inference latency by up to 33% and energy scores by up to 37% compared to the last round. This was achieved by adding more optimizations in the company’s X-Cube-AI stack—users can now optimize for memory, latency, or a balance of the two.
Silicon Labs once again entered its MG24 part, a multi-protocol SoC for IoT applications, which includes Silicon Labs’ home-grown AI accelerator alongside an Arm Cortex-M33 core. The MG24’s scores improved 1.5-1.9× across the board, for both latency and energy consumption, except for anomaly detection, which was similar to the last round. Silicon Labs used TensorFlow Lite for Microcontrollers and CMSIS-NN for its submissions.
OctoML’s offering is a developer platform focused on code portability, performance, and tooling, intended to allow model deployment without specialized ML expertise. The company submitted scores using two different compilation flows: the baseline used Apache TVM and CMSIS-NN, while the other scores used microTVM and OctoML’s autotuning optimization process.
The company’s intent was to show that autotuning with native microTVM schedules achieved similar performance to CMSIS-NN; visual wake words were within 11% on latency and power.
Qualcomm submitted in the preview category (for systems not yet commercially available) with a next-gen version of the Qualcomm Sensing Hub. The Sensing Hub is an on-chip AI accelerator block designed for always-on processing of sensor data (mainly camera data) in smartphones.
In Snapdragon mobile processors, this block offloads always-on processing from the Hexagon processor, which is used for bigger AI tasks. This latest generation of the Sensing Hub includes a second AI accelerator core alongside DSP and memory. It can perform anomaly detection in less than 0.1 ms, faster than any score submitted in the “available” category.