Tiny But Mighty: Ceva Reveals New NPUs for Tiny Machine Learning Devices

5 days ago by Duane Benson

The new neural processing unit is designed for edge AI applications that require targeted machine learning in resource-constrained settings.

At Sensors Converge 2024 in Santa Clara, California, Ceva introduced the Ceva-NeuPro-Nano family of self-sufficient neural processing units (NPUs). The NeuPro-Nano is an NPU IP set designed for third-party processor manufacturers to integrate into their systems-on-chip (SoCs) or microcontrollers.

 

All About Circuits' Dale Wilson meets with Ceva at Sensors Converge 2024.
 

Architected to operate without a separate coprocessor, the NPUs bring machine learning (ML) to applications that previously were too resource-constrained to support AI functionality. The IP's self-sufficiency opens up new application possibilities for ML by reducing design costs, power requirements, and physical footprint.

 

Ceva-NeuPro-Nano Supports TinyML

The Ceva-NeuPro-Nano architecture is designed for AI smart sensors and edge AI devices operating in the arena known as tiny machine learning (TinyML).

 

Edge AI applications that benefit from TinyML. Image used courtesy of Ceva
 

TinyML brings machine learning and targeted artificial intelligence to IoT devices that are limited in size, processing power, and resources. While broad-based and generative AI require massive data sets covering a wide variety of interconnected information parameters, TinyML applications use narrow-focus, small data sets. For example, a TinyML thermostat would likely need a data set covering little more than environmental parameters (like temperature or humidity) and the user’s behavior preferences.
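
As a rough sketch of how narrow such a model can be, the C snippet below imagines a thermostat whose entire input is four quantized features feeding a single int8 layer. The feature names, weights, and decision rule are illustrative assumptions, not anything Ceva has published.

```c
#include <stdint.h>

/* Hypothetical feature vector for a TinyML thermostat: the entire "data
 * set" is a few environmental readings plus the user's preference. */
typedef struct {
    int8_t temperature_c;    /* quantized room temperature       */
    int8_t humidity_pct;     /* quantized relative humidity      */
    int8_t hour_of_day;      /* 0-23, coarse time-of-day context */
    int8_t user_setpoint_c;  /* the user's preferred temperature */
} thermostat_features_t;

/* Toy int8 weights and bias; in practice these would come from
 * offline training on the device's own narrow data set. */
static const int8_t  weights[4] = { -90, -5, 3, 96 };
static const int32_t bias       = 12;

/* A single int8 linear layer plus a threshold decides "heat on/off". */
int heat_on(const thermostat_features_t *f)
{
    const int8_t x[4] = { f->temperature_c, f->humidity_pct,
                          f->hour_of_day, f->user_setpoint_c };
    int32_t acc = bias;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)weights[i] * x[i];  /* int8 multiply-accumulate */
    return acc > 0;
}
```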

Built for these small data sets, Ceva's NPUs can be programmed with key neural network functionality, such as feature extraction, control code, and DSP code. They support the most advanced machine learning data types and operators, including native transformer computation, sparsity acceleration, and fast quantization.
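
Of those operations, fast quantization is the easiest to picture in software. The sketch below shows generic symmetric int8 quantization of floating-point weights, a common TinyML technique offered here only as an assumption about the kind of operation involved, not as Ceva's actual scheme.

```c
#include <math.h>
#include <stdint.h>

/* Generic symmetric int8 quantization of a float weight array: pick a
 * scale from the largest magnitude, then round each weight into [-128, 127].
 * This is a common TinyML scheme, not necessarily Ceva's exact method. */
static void quantize_int8(const float *w, int8_t *q, int n, float *scale_out)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs)
            max_abs = fabsf(w[i]);

    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        long v = lroundf(w[i] / scale);
        if (v > 127)  v = 127;
        if (v < -128) v = -128;
        q[i] = (int8_t)v;
    }
    *scale_out = scale;  /* kept so outputs can be rescaled to real units */
}
```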

 

Lightweight Deployment Requirements

As an IP core, the Ceva-NeuPro-Nano family can be deployed as an internal part of an existing SoC solution or even as a stand-alone processing unit. It is purpose-built to help edge AI and TinyML solution providers get their products to market faster at a lower per-unit cost.

 

The Ceva-NeuPro-Nano family combines neural processing and a scalar processing unit on a single core to reduce overall resource requirements. Image used courtesy of Ceva

 

The Ceva-NeuPro-Nano family supports up to 64 int8 multiply-accumulate operations (MACs) per cycle and 4-bit through 32-bit integer math. The family offers sparsity acceleration, the ability to skip computation when a weight matrix contains a large number of zero or insignificant values. It can also accelerate non-linear activation functions.
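
A software analogy helps illustrate what sparsity acceleration buys: in the sketch below, the multiply-accumulate is simply skipped whenever a weight is zero. The function and data layout are illustrative; the NPU performs the equivalent skip in hardware rather than with a branch.

```c
#include <stdint.h>

/* Software analogy of sparsity acceleration: a zero weight contributes
 * nothing to the dot product, so its multiply-accumulate can be skipped.
 * The NPU does this with dedicated logic; this loop only shows the idea. */
int32_t sparse_dot_int8(const int8_t *w, const int8_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        if (w[i] == 0)
            continue;                  /* zero weight: no work performed   */
        acc += (int32_t)w[i] * x[i];   /* int8 MAC, accumulated in 32 bits */
    }
    return acc;
}
```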

The edge NPU can be deployed in single-core form with less than 10 MB of flash and less than 500 KB of code plus dynamic data memory. Model weights (parameters) can range from 10 KB to 10 MB. The minimum computational resources are 10 GOPS, and the system, equipped with automatic on-the-fly energy tuning, is optimized to run at 10 mW or less.

Ceva supports the NeuPro-Nano with a complete AI SDK for use in its NeuPro Studio. It also supports open AI frameworks to reduce development time.

 

The New NPU Uses Ceva-NetSqueeze AI Compression

TinyML applications are often battery-powered and may even be hosted within a sensor. Edge AI developers often try to repurpose legacy DSPs for TinyML, but doing so doesn’t deliver sufficient neural processing per watt. One of the major bottlenecks is decompressing stored model weights and feeding those decompressed weights into an NPU engine. This decompression step consumes a disproportionate amount of the system's resources.

To address this bottleneck, Ceva developed its NetSqueeze AI compression technology, employed in the Ceva-NeuPro-Nano, to process compressed model weights directly without first decompressing them. Skipping the decompression step reduces processing power and time requirements. It also cuts memory footprint by up to 80% compared with systems that require intermediate decompression.
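
As a rough software analogy for this direct-processing idea, the sketch below consumes a toy zero-run-length-encoded weight stream inside the MAC loop itself, so a decompressed weight buffer is never materialized. The encoding, struct, and function names are assumptions for illustration only; Ceva has not published NetSqueeze's internal format.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy compressed weight stream: each token holds a run of zero weights to
 * skip, followed by one nonzero int8 weight. Purely illustrative; the real
 * NetSqueeze format has not been published. */
typedef struct {
    uint8_t zero_run;  /* zero weights preceding 'value' */
    int8_t  value;     /* the next nonzero weight        */
} packed_weight_t;

/* The MAC loop walks the compressed stream directly, so no decompressed
 * weight buffer is ever written to memory. */
int32_t dot_compressed(const packed_weight_t *pw, size_t n_tokens,
                       const int8_t *x, size_t n_inputs)
{
    int32_t acc = 0;
    size_t  col = 0;
    for (size_t t = 0; t < n_tokens && col < n_inputs; t++) {
        col += pw[t].zero_run;                       /* skipped weights are zero */
        if (col < n_inputs)
            acc += (int32_t)pw[t].value * x[col++];  /* int8 MAC on the fly      */
    }
    return acc;
}
```

In this toy scheme, the saving comes from never storing the expanded weights; the up-to-80% memory-footprint figure is Ceva's own claim for NetSqueeze, not a property of this example.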

“With NetSqueeze, you don't have to decompress, and you don't have to fill up the memory buffers with the model weights,” said Chad Lucien, Ceva's VP and general manager of the sensors and audio business unit. “We're able to actually process the compressed model weights directly to speed things up and reduce the overall memory footprint, which delivers a huge positive impact to cost.”