Near-Memory Computing Architectures for Scalable Edge AI Applications
(2025) In Series of Licentiate and Doctoral Theses - Abstract
- Artificial intelligence (AI) and machine learning (ML) are rapidly permeating nearly every aspect of modern life, from
personal devices and autonomous systems to industrial automation and environmental monitoring. The growing demand for intelligence at the network edge is reshaping how computing hardware is conceived and built. Edge AI platforms are expected to deliver high throughput within tight energy and area budgets, operate reliably at low voltages, and adapt to diverse workloads, while data movement between processors and memory continues to dominate system cost. These trends position memory-centric computing as a compelling alternative to conventional architectures.

Addressing these challenges begins with rethinking on-chip memory. A dual-port six-transistor (6T) static random-access memory (SRAM) was developed that combines high density with concurrent read and write capability, enabling reliable low-voltage operation and significant energy savings. Targeted at resource-constrained edge platforms, this design also achieves notable area reductions compared with conventional dual-port implementations. Building on this foundation, a synthesizable distributed-SRAM (SD-SRAM) framework was devised to extend the scalability of on-chip memories. By pushing the area–energy break-even point from 2–4 kb to about 14–16 kb and enabling modular partitioning with parallel access, this framework provides a robust memory substrate well suited to data-centric processing.
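
To make the concurrent-access and banking ideas concrete, here is a minimal behavioral sketch in Python. It is illustrative only: the class names, the 8-bit word width, and the 4 x 512 bank geometry are assumptions for the example, not the thesis design or its RTL.

```python
# Behavioral sketch only (hypothetical names, not the thesis RTL):
# a dual-port memory bank that serves one read and one write in the
# same cycle, partitioned SD-SRAM-style into independently addressed banks.

class DualPortBank:
    """One memory bank with separate read and write ports (behavioral)."""

    def __init__(self, depth: int, width_bits: int = 8):
        self.mask = (1 << width_bits) - 1
        self.cells = [0] * depth

    def cycle(self, raddr: int, waddr: int | None = None, wdata: int = 0) -> int:
        """One clock cycle: read raddr while optionally writing waddr."""
        rdata = self.cells[raddr]                    # read port
        if waddr is not None:
            self.cells[waddr] = wdata & self.mask    # write port, same cycle
        return rdata


class DistributedSram:
    """Flat address space split across banks that can operate in parallel."""

    def __init__(self, n_banks: int, bank_depth: int):
        self.bank_depth = bank_depth
        self.banks = [DualPortBank(bank_depth) for _ in range(n_banks)]

    def write(self, addr: int, data: int) -> None:
        bank = self.banks[addr // self.bank_depth]
        bank.cycle(0, addr % self.bank_depth, data)  # dummy read of word 0

    def read(self, addr: int) -> int:
        bank = self.banks[addr // self.bank_depth]
        return bank.cycle(addr % self.bank_depth)


# 4 banks x 512 words x 8 bit = 16 kb, around the reported break-even point.
mem = DistributedSram(n_banks=4, bank_depth=512)
mem.write(100, 0x5A)                                 # write and read can share a cycle
assert mem.read(100) == 0x5A
```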

Extending these memory innovations, a digital near-memory computing (NMC) architecture was integrated into RISC-V-based microcontroller units (MCUs), performing multiply–accumulate (MAC) operations directly within cache SRAM macros without altering the conventional SRAM architecture. The NMC engine supports configurable convolutional neural network (CNN) dataflows and achieves nearly two orders of magnitude higher throughput while sharply reducing internal data movement relative to processor–memory baselines. To further improve efficiency under strict energy constraints, a family of approximate multipliers with operand-aware error control provides tunable accuracy–cost trade-offs, enabling substantial reductions in area and power while preserving inference quality for error-tolerant workloads.
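
A short Python sketch can picture these last two ideas together. The truncation scheme shown (leading-one truncation, in the spirit of published designs such as DRUM) and all names, including the accuracy knob k, are assumptions for illustration; the thesis multipliers and dataflow may differ.

```python
# Illustrative only: an operand-aware approximate multiplier based on
# leading-one truncation, feeding the kind of MAC loop an NMC engine
# runs next to the SRAM macros. Not the thesis error-control scheme.

def _truncate(x: int, k: int) -> tuple[int, int]:
    """Keep the k MSBs of x from its leading one; return (mantissa, shift)."""
    if x < (1 << k):
        return x, 0                        # small operands are kept exact
    shift = x.bit_length() - k
    return (x >> shift) | 1, shift         # forcing the LSB halves the truncation bias


def approx_mul(a: int, b: int, k: int = 4) -> int:
    """Approximate unsigned multiply: short exact product, then rescale."""
    ma, sa = _truncate(a, k)
    mb, sb = _truncate(b, k)
    return (ma * mb) << (sa + sb)          # k x k multiplier instead of full width


def approx_mac(weights: list[int], activations: list[int], k: int = 4) -> int:
    """Dot product (e.g. one CNN output element) with the approximate multiplier."""
    return sum(approx_mul(w, x, k) for w, x in zip(weights, activations))


w, x = [23, 200, 7, 131], [91, 14, 3, 58]
exact = sum(a * b for a, b in zip(w, x))
approx = approx_mac(w, x, k=4)             # larger k trades area/power for accuracy
print(exact, approx, f"rel. err = {abs(exact - approx) / exact:.3%}")
```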

These contributions converge in a scalable and modular NMC platform capable of flexible CNN acceleration at the edge. By tightly coupling local memory, approximate processing, and near-memory execution, this work advances the development of compact, high-throughput, and energy-efficient AI hardware. The presented concepts and implementations also establish a foundation for compiler–hardware co-optimization, automated NMC generation, and seamless integration into multimodal, sensor-driven embedded systems.

Please use this URL to cite or link to this publication:
https://lup.lub.lu.se/record/73947649-6ca2-4dd3-94fe-36ec148841cd
- author
- Nouripayam, Masoud
- supervisor
- Joachim Rodrigues
- Liang Liu
- opponent
- Prof. Aunet, Snorre, Norwegian University of Science and Technology (NTNU), Norway.
- publishing date
- 2025
- type
- Thesis
- publication status
- published
- subject
- in
- Series of Licentiate and Doctoral Theses
- issue
- 189
- pages
- 170 pages
- publisher
- Electrical and Information Technology, Lund University
- defense location
- Lecture Hall E:1406, building E, Ole Römers väg 3, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream.
- defense date
- 2025-11-21 09:15:00
- ISSN
- 1654-790X
- ISBN
- 978-91-8104-742-4
- 978-91-8104-741-7
- language
- English
- LU publication?
- yes
- id
- 73947649-6ca2-4dd3-94fe-36ec148841cd
- date added to LUP
- 2025-10-28 18:54:48
- date last changed
- 2025-10-30 08:09:48
@phdthesis{73947649-6ca2-4dd3-94fe-36ec148841cd,
abstract = {{Artificial intelligence (AI) and machine learning (ML) are rapidly permeating nearly every aspect of modern life, from personal devices and autonomous systems to industrial automation and environmental monitoring. The growing demand for intelligence at the network edge is reshaping how computing hardware is conceived and built. Edge AI platforms are expected to deliver high throughput within tight energy and area budgets, operate reliably at low voltages, and adapt to diverse workloads, while data movement between processors and memory continues to dominate system cost. These trends position memory-centric computing as a compelling alternative to conventional architectures. Addressing these challenges begins with rethinking on-chip memory. A dual-port six-transistor (6T) static random-access memory (SRAM) was developed that combines high density with concurrent read and write capability, enabling reliable low-voltage operation and significant energy savings. Targeted at resource-constrained edge platforms, this design also achieves notable area reductions compared with conventional dual-port implementations. Building on this foundation, a synthesizable distributed-SRAM (SD-SRAM) framework was devised to extend the scalability of on-chip memories. By pushing the area–energy break-even point from 2–4 kb to about 14–16 kb and enabling modular partitioning with parallel access, this framework provides a robust memory substrate well suited to data-centric processing. Extending these memory innovations, a digital near-memory computing (NMC) architecture was integrated into RISC-V-based microcontroller units (MCUs), performing multiply–accumulate (MAC) operations directly within cache SRAM macros without altering the conventional SRAM architecture. The NMC engine supports configurable convolutional neural network (CNN) dataflows and achieves nearly two orders of magnitude higher throughput while sharply reducing internal data movement relative to processor–memory baselines. To further improve efficiency under strict energy constraints, a family of approximate multipliers with operand-aware error control provides tunable accuracy–cost trade-offs, enabling substantial reductions in area and power while preserving inference quality for error-tolerant workloads. These contributions converge in a scalable and modular NMC platform capable of flexible CNN acceleration at the edge. By tightly coupling local memory, approximate processing, and near-memory execution, this work advances the development of compact, high-throughput, and energy-efficient AI hardware. The presented concepts and implementations also establish a foundation for compiler–hardware co-optimization, automated NMC generation, and seamless integration into multimodal, sensor-driven embedded systems.}},
author = {{Nouripayam, Masoud}},
isbn = {{978-91-8104-742-4}},
issn = {{1654-790X}},
language = {{eng}},
number = {{189}},
publisher = {{Electrical and Information Technology, Lund University}},
school = {{Lund University}},
series = {{Series of Licentiate and Doctoral Theses}},
title = {{Near-Memory Computing Architectures for Scalable Edge AI Applications}},
year = {{2025}},
}