

Neuchips Demos Recommendation Accelerator for LLM Inference


Taiwanese AI accelerator maker Neuchips is now targeting LLM inference with its chip originally designed for recommendation workloads, the company’s new CEO, Ken Lau, told Vivax Media. YounLong Lin, the previous CEO, is now the company’s chairman.


The company has demonstrated Llama2-7B up and running on its single-chip PCIe card at 60 tokens/second across 16 users (batch size = 16), or on a new four-chip PCIe card at 240 tokens/second, said Lau, who previously served as general manager of Intel Taiwan.


The existing single-chip, full-height, full-length PCIe card runs Neuchips’ domain-specific AI accelerator within a 55-W TDP, alongside 33 GB of LPDDR5 memory delivering 1.6 Tbps of memory bandwidth; an existing M.2 card offers a 25-W TDP with the same memory and bandwidth. The new four-chip card’s TDP is 300 W for the accelerators, with 256 GB of LPDDR5 at 6.4 Tbps. Llama2 performance scales linearly, at least as far as eight quad-chip cards (32 chips), which can generate 1,920 tokens/second.
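
The scaling arithmetic is easy to reproduce. The short Python sketch below (a hypothetical helper, not Neuchips software) multiplies out the per-chip figure quoted above under the article’s linear-scaling assumption.

```python
# Back-of-the-envelope check of the linear-scaling claim, using the
# figures quoted above (illustrative helper, not Neuchips software).

SINGLE_CHIP_TOKENS_PER_S = 60   # Llama2-7B, batch of 16 users
CHIPS_PER_QUAD_CARD = 4
QUAD_CARDS = 8

def aggregate_tokens_per_s(chips: int) -> int:
    """Aggregate throughput assuming perfectly linear scaling per chip."""
    return chips * SINGLE_CHIP_TOKENS_PER_S

total_chips = QUAD_CARDS * CHIPS_PER_QUAD_CARD          # 32 chips
print(aggregate_tokens_per_s(CHIPS_PER_QUAD_CARD))      # 240 tokens/s (one quad card)
print(aggregate_tokens_per_s(total_chips))              # 1,920 tokens/s (eight quad cards)
```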


Neuchips’ RecAccel chip has been rebranded as the N3000, but the Llama2 demo uses the same silicon and the same software stack as the company’s earlier MLPerf inference results, the company confirmed to Vivax Media. It will continue to target hyperscalers’ recommendation workloads while also going after the growing LLM market.


“So far, feedback from the market [on LLMs] has been positive, with customers challenging us to provide results for different models,” Lau told Vivax Media. “[For recommendation], we are talking to different customers, but they often have specific things they want us to do rather than taking what we’ve got off the shelf.”

While Neuchips’ chip was not designed with LLMs in mind, there are a number of similarities between recommendation and LLM workloads that make the same chip suitable for accelerating both. (Competing recommendation accelerator maker Esperanto also recently demonstrated LLMs on its platform.)

“The whole industry is transitioning to LLMs,” Lau said. “Because of the way our chip is architected, we found it’s quite suitable for LLMs, because of the LPDDR5 and the large pipe of PCIe Gen 5. This is what we have available today, and we don’t believe others have it.”


Both recommendation and LLM workloads are extremely sensitive to memory throughput. Beyond providing high bandwidth, Neuchips has a dedicated engine designed to accelerate recommendation embeddings, which can also help with traffic optimization and caching for LLMs (the company holds patents on its memory-traffic-balanced table sharding, lossless table compression and caching schemes). The embedding engine has a direct link to multiple LPDDR5 modules.
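
To see why bandwidth dominates, note that each LLM decode step must stream the full weight set from memory, so memory bandwidth caps the token rate. The Python sketch below works through a rough roofline using illustrative assumptions (7 billion one-byte weights and the 1.6-Tbps figure quoted above); it is a back-of-the-envelope model, not Neuchips data.

```python
# Rough roofline sketch: why LLM decode is memory-bandwidth-bound.
# Each generated token (per batch step) streams the full weight set
# from DRAM, so bandwidth caps the decode rate. Numbers are assumptions
# for illustration, not Neuchips-published figures.

PARAMS = 7e9                          # Llama2-7B parameters
BYTES_PER_PARAM = 1                   # 8-bit weights (e.g. an FP8-style format)
BANDWIDTH_BYTES_PER_S = 1.6e12 / 8    # 1.6 Tbps quoted above ~= 0.2 TB/s

weight_bytes = PARAMS * BYTES_PER_PARAM
max_decode_steps_per_s = BANDWIDTH_BYTES_PER_S / weight_bytes
print(f"~{max_decode_steps_per_s:.1f} decode steps/s upper bound")  # ~28.6

# With a batch of 16 users, each decode step yields one token per user,
# so the batched ceiling is ~16x that figure; the measured 60 tokens/s
# sits comfortably below this bandwidth roofline.
```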


The company’s secret sauce is its patented flexible FP8 (FFP8) number format, which it says allows 8-bit throughput with 16-bit accuracy.

Characteristics of this format include configurable exponent and mantissa widths, as well as an optional unsigned version (dropping the sign bit frees a bit that can be used to add accuracy). A proprietary calibrator chooses the exact quantization and number format based on model and data characteristics.
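
The exact FFP8 encoding and calibrator are proprietary, but a generic configurable small-float quantizer conveys the idea: exponent and mantissa widths are parameters, and dropping the sign bit frees a bit of precision for non-negative data. The Python sketch below is an illustrative stand-in, not the FFP8 specification.

```python
import math

def quantize(x: float, exp_bits: int = 4, man_bits: int = 3, signed: bool = True) -> float:
    """Round x to the nearest value representable in a small float format
    with the given exponent/mantissa widths (a generic stand-in for FFP8)."""
    assert (1 if signed else 0) + exp_bits + man_bits == 8, "must pack into 8 bits"
    if x == 0.0:
        return 0.0
    if not signed and x < 0:
        return 0.0                       # unsigned variant clips negatives
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    bias = (1 << (exp_bits - 1)) - 1
    e = math.floor(math.log2(mag))
    e = max(min(e, bias), 1 - bias)      # clamp to the normal exponent range
    step = 2.0 ** (e - man_bits)         # spacing of representable values near x
    q = round(mag / step) * step
    max_normal = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias
    return sign * min(q, max_normal)

# Dropping the sign bit lets an unsigned variant spend more bits on precision:
print(quantize(0.3, 4, 3, signed=True))    # 0.3125  (coarser grid)
print(quantize(0.3, 3, 5, signed=False))   # 0.296875 (finer grid, non-negative data)
```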


Neuchips’ demo showed Llama2 with weights quantized to FFP8 (activations in BF16) producing results comparable to Meta’s FP16 version, not identical but similar, while an INT8-quantized version returned gibberish. The company said it is also testing larger versions of Llama2 with different prompt lengths.


CEO Lau joined Intel in 2002 to run its data center business but also worked across Intel’s client portfolio, eventually serving as general manager of Intel Taiwan before joining Neuchips this year. The company currently has 60 employees.


Neuchips’ single-chip PCIe cards and M.2 cards are available now, while samples of the four-chip PCIe card are due by the end of this year.
