Metadata
Title
Developing quantization strategies for energy-efficient AI inference
Category
general
UUID
6f7f00d7beff4b2c8ff91aca7cca645c
Source URL
https://www.tue.nl/en/news-and-events/news-overview/19-03-2026-developing-quanti...
Parent URL
https://www.tue.nl/en/research
Crawl Time
2026-03-23T15:10:05+00:00
Rendered Raw Markdown

Developing quantization strategies for energy-efficient AI inference




March 19, 2026

Floran de Putter defended his PhD thesis at the Department of Electrical Engineering on March 18.

Artificial intelligence has improved rapidly over the last decade, driven by better training algorithms, access to large datasets, and increasingly powerful hardware. AI models have also grown dramatically in size, making them far more energy‑intensive to run. At the same time, there is a shift from running AI in the cloud to running it directly on edge devices such as phones, smart cameras, and robots. This shift helps reduce latency, improve reliability and privacy, and ensure functionality without an internet connection. However, edge devices are limited by battery capacity, heat dissipation, and available compute resources. As a result, energy consumption is a key bottleneck for bringing AI into everyday products. In his PhD research, Floran de Putter focuses on making AI inference (running a trained model) more energy‑efficient through quantization, while minimizing accuracy loss.

Quantization reduces numerical precision by using fewer bits to represent each value—for example, using 4 bits instead of 16. This lowers storage requirements and reduces data movement, and it can also simplify computing hardware. The downside is that reducing precision may hurt accuracy, and the optimal bit‑width depends on the model and its application. Because quantization needs vary across AI models and tasks, the hardware must be both flexible and efficient. Floran de Putter introduces ‘BrainTTA’, an energy‑efficient accelerator for AI inference that supports multiple precision levels (1/2/4/8‑bit). He implemented BrainTTA in modern semiconductor technologies (22nm and 28nm) and successfully fabricated it in 28nm, demonstrating practical viability. Compared to other programmable accelerators with similar flexibility goals, BrainTTA achieves up to 3.1× higher energy efficiency.
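To make the idea concrete, here is a minimal sketch of symmetric uniform quantization: map real-valued weights onto a small signed integer grid and back. This is a generic textbook quantizer for illustration only, not the specific scheme used in the thesis or in BrainTTA.

```python
def quantize_uniform(values, bits):
    """Toy symmetric uniform quantizer: snap each float to the nearest point
    on a signed integer grid with `bits` bits, then map back to floats."""
    levels = 2 ** (bits - 1) - 1                     # e.g. 7 magnitude levels for 4-bit
    scale = max(abs(v) for v in values) / levels     # largest magnitude fills the grid
    return [round(v / scale) * scale for v in values]

weights = [-1.0, -0.53, -0.11, 0.0, 0.27, 0.64, 1.0]
w4 = quantize_uniform(weights, bits=4)               # 4-bit approximation
max_err = max(abs(a - b) for a, b in zip(weights, w4))
```

For this symmetric scheme the rounding error per value is bounded by half the grid spacing (`scale / 2`), which is why fewer bits (a coarser grid) trade accuracy for smaller storage and cheaper arithmetic.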

Cover of Floran de Putter's thesis

Binary neural networks

De Putter also investigates the extreme end of quantization: binary neural networks. These networks appear to be a holy grail for efficiency because they use only 1 bit per value, but they typically suffer from large drops in accuracy. His research classifies repair methods that aim to recover accuracy and evaluates them in isolation. This is valuable because many research papers combine multiple techniques, making it unclear which ones truly contribute. Furthermore, De Putter quantifies the energy cost of these repair methods and compares repaired binary networks with higher‑precision alternatives. The results show that some repairs can significantly narrow the accuracy gap, but further improvements become increasingly expensive in terms of energy. Well‑designed 4‑bit networks often provide a more attractive balance between accuracy and energy efficiency.
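At the 1-bit extreme, each weight keeps only its sign. A common analytic "repair" in the literature is to attach a shared scaling factor (for example, the mean absolute weight, as in XNOR-Net-style binarization). The sketch below illustrates that idea; it is a hypothetical toy example and not necessarily one of the repair methods evaluated in the thesis.

```python
def binarize(weights):
    """Toy 1-bit weight binarization: keep only the sign of each weight and
    rescale by a shared factor alpha (here the mean absolute weight)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]

w = [0.8, -0.3, 0.1, -0.9]
wb = binarize(w)   # each weight now costs 1 bit of sign plus one shared scale
```

The shared scale narrows the gap to the full-precision network at almost no storage cost, which is the appeal of such repairs; the thesis goes further and measures what each repair costs in energy.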

Comparing quantization strategies

Finally, De Putter proposes a methodology to systematically compare a broad range of quantization strategies using detailed energy modelling. This enables fair comparisons across different models and hardware assumptions. The results consistently reveal a practical sweet spot: 4‑bit quantization combined with a small set of targeted high‑precision repairs. This mixed‑precision approach achieves up to 5.3× energy improvement compared to a 16‑bit baseline, without accuracy loss. Together, this research contributes to making AI more energy‑efficient and offers practical approaches for running accurate AI models on edge devices.
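The intuition behind such comparisons can be sketched with a first-order energy model in which multiply energy grows roughly quadratically with operand bit-width and accumulation grows linearly. This toy model is an assumption for illustration, not the detailed energy modelling from the thesis; real end-to-end gains (such as the reported 5.3×) are smaller than per-operation ratios because memory traffic and high-precision repairs also consume energy.

```python
def mac_energy(bits, e_unit=1.0):
    """Toy first-order model of one multiply-accumulate (MAC): multiply energy
    ~ bits**2, accumulate energy ~ bits. Assumed for illustration only."""
    return e_unit * (bits ** 2 + bits)

e16 = mac_energy(16)     # energy of a 16-bit MAC in arbitrary units
e4 = mac_energy(4)       # energy of a 4-bit MAC
ratio = e16 / e4         # per-MAC advantage of 4-bit in this toy model
```

Under this model the per-MAC ratio is 13.6×, illustrating why 4-bit arithmetic is attractive even after the overhead of targeted high-precision repairs eats into the headline gain.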

Title of PhD thesis: Neural Network Quantization from Training to Silicon - Challenges and Opportunities for Energy-Efficiency. Supervisors: Prof. Henk Corporaal and Dr. Gijs Dubbelman.

Media Contact

Linda Milder

(Communications Officer)

l.m.g.milder@tue.nl