
Lund University Publications


Near-Memory Computing Architectures for Scalable Edge AI Applications

Nouripayam, Masoud (2025) In Series of Licentiate and Doctoral Theses
Abstract
Artificial intelligence (AI) and machine learning (ML) are rapidly permeating nearly every aspect of modern life, from personal devices and autonomous systems to industrial automation and environmental monitoring. The growing demand for intelligence at the network edge is reshaping how computing hardware is conceived and built. Edge AI platforms are expected to deliver high throughput within tight energy and area budgets, operate reliably at low voltages, and adapt to diverse workloads, while data movement between processors and memory continues to dominate system cost. These trends position memory-centric computing as a compelling alternative to conventional architectures.
Addressing these challenges begins with rethinking on-chip memory. A dual-port six-transistor (6T) static random-access memory (SRAM) was developed that combines high density with concurrent read and write capability, enabling reliable low-voltage operation and significant energy savings. Targeted at resource-constrained edge platforms, this design also achieves notable area reductions compared with conventional dual-port implementations. Building on this foundation, a synthesizable distributed-SRAM (SD-SRAM) framework was devised to extend the scalability of on-chip memories. By pushing the area–energy break-even point from 2–4 kb to about 14–16 kb and enabling modular partitioning with parallel access, this framework provides a robust memory substrate well suited to data-centric processing.
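
The abstract only names the SD-SRAM's properties; as a rough behavioral illustration of what modular partitioning with parallel access can mean, and not the actual design, the C model below splits a memory into independently addressable banks so that consecutive words land in different banks and could be served concurrently in hardware. The bank count, word width, and all identifiers are hypothetical.

#include <stdint.h>

/* Behavioral model of a partitioned (banked) on-chip memory.
 * NUM_BANKS and BANK_WORDS are illustrative values, not figures
 * from the thesis. Low address bits select the bank, so a
 * unit-stride access pattern spreads consecutive words across
 * banks, and up to NUM_BANKS words can be served in parallel. */
#define NUM_BANKS   4u
#define BANK_WORDS  1024u

typedef struct {
    uint32_t bank[NUM_BANKS][BANK_WORDS];
} sd_sram_t;

static inline uint32_t sd_sram_read(const sd_sram_t *m, uint32_t addr)
{
    uint32_t b = addr % NUM_BANKS;        /* bank select: low bits  */
    uint32_t w = addr / NUM_BANKS;        /* word index within bank */
    return m->bank[b][w];
}

static inline void sd_sram_write(sd_sram_t *m, uint32_t addr, uint32_t data)
{
    m->bank[addr % NUM_BANKS][addr / NUM_BANKS] = data;
}

/* Reads NUM_BANKS consecutive words; in hardware these hit
 * distinct banks and could complete in a single cycle. */
static void sd_sram_read_burst(const sd_sram_t *m, uint32_t addr,
                               uint32_t out[NUM_BANKS])
{
    for (uint32_t i = 0; i < NUM_BANKS; i++)
        out[i] = sd_sram_read(m, addr + i);
}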
Extending these memory innovations, a digital near-memory computing (NMC) architecture was integrated into RISC-V-based microcontroller units (MCUs), performing multiply–accumulate (MAC) operations directly within cache SRAM macros without altering the conventional SRAM architecture. The NMC engine supports configurable convolutional neural network (CNN) dataflows and achieves nearly two orders of magnitude higher throughput while sharply reducing internal data movement relative to processor–memory baselines. To further improve efficiency under strict energy constraints, a family of approximate multipliers with operand-aware error control provides tunable accuracy–cost trade-offs, enabling substantial reductions in area and power while preserving inference quality for error-tolerant workloads.
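
The abstract does not spell out the NMC engine's programming model. Purely as a sketch of the kind of computation that gets moved next to the SRAM macro, the C fragment below runs the MAC-heavy inner loop of a one-dimensional convolution over operands held in a local buffer, so that only final accumulations would cross back to the processor; the 8-bit operand width, 32-bit accumulator, and all names are assumptions rather than thesis specifications.

#include <stdint.h>
#include <stddef.h>

/* Sketch of the MAC kernel an NMC engine could execute beside an
 * SRAM macro: 8-bit operands, 32-bit accumulation. Widths and
 * buffer sizes are assumptions, not thesis specifications. */
static int32_t nmc_mac(const int8_t *weights, const int8_t *acts, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)weights[i] * (int32_t)acts[i];   /* one MAC */
    return acc;
}

/* Minimal output-stationary 1-D convolution built on the MAC
 * kernel: each output element is one accumulation pass over data
 * that never leaves the memory macro. */
static void nmc_conv1d(const int8_t *x, size_t xn,
                       const int8_t *w, size_t k,
                       int32_t *y)
{
    for (size_t o = 0; o + k <= xn; o++)
        y[o] = nmc_mac(w, &x[o], k);
}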
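Likewise, the approximate-multiplier family is described only through its tunable accuracy–cost trade-off. One well-known operand-aware scheme in this spirit, shown here as an illustrative stand-in rather than the design from the thesis, is dynamic range selection: an operand wider than K bits is truncated to the K bits below its leading one, the dropped region is represented by forcing the kept low bit to 1 so the error stays roughly zero-mean, and the product is shifted back into position. The value K = 6 is arbitrary.

#include <stdint.h>

/* Operand-aware approximate multiplier using dynamic range
 * selection (illustrative stand-in, not the thesis design).
 * K sets the accuracy-cost trade-off: the core multiply is
 * only K x K bits wide, so smaller K means a smaller array
 * but a larger worst-case relative error. */
#define K 6u

static unsigned msb_pos(uint32_t v)       /* index of the leading one */
{
    unsigned p = 0;
    while (v >>= 1) p++;
    return p;
}

static uint64_t approx_mul(uint32_t a, uint32_t b)
{
    unsigned sa = 0, sb = 0;
    uint32_t ta = a, tb = b;

    /* Range selection: operands that already fit in K bits are
     * used exactly, so small values incur no error at all. */
    if (a >= (1u << K)) { sa = msb_pos(a) - (K - 1); ta = (a >> sa) | 1u; }
    if (b >= (1u << K)) { sb = msb_pos(b) - (K - 1); tb = (b >> sb) | 1u; }

    /* K x K core multiply, shifted back into position. */
    return (uint64_t)(ta * tb) << (sa + sb);
}

In a MAC loop like the one sketched above, such a multiplier would replace the exact product, trading a bounded relative error per multiplication for a smaller and lower-power datapath.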
These contributions converge in a scalable and modular NMC platform capable of flexible CNN acceleration at the edge. By tightly coupling local memory, approximate processing, and near-memory execution, this work advances the development of compact, high-throughput, and energy-efficient AI hardware. The presented concepts and implementations also establish a foundation for compiler–hardware co-optimization, automated NMC generation, and seamless integration into multimodal, sensor-driven embedded systems.
author: Nouripayam, Masoud
supervisor
opponent: Prof. Aunet, Snorre, NTNU Norwegian University of Science and Technology, Norway.
organization
publishing date: 2025
type: Thesis
publication status: published
subject
in: Series of Licentiate and Doctoral Theses
issue: 189
pages: 170
publisher: Electrical and Information Technology, Lund University
defense location: Lecture Hall E:1406, building E, Ole Römers väg 3, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream.
defense date: 2025-11-21 09:15:00
ISSN: 1654-790X
ISBN: 978-91-8104-742-4, 978-91-8104-741-7
language: English
LU publication? yes
id: 73947649-6ca2-4dd3-94fe-36ec148841cd
date added to LUP: 2025-10-28 18:54:48
date last changed: 2025-10-30 08:09:48
@phdthesis{73947649-6ca2-4dd3-94fe-36ec148841cd,
  abstract     = {{Artificial intelligence (AI) and machine learning (ML) are rapidly permeating nearly every aspect of modern life, from personal devices and autonomous systems to industrial automation and environmental monitoring. The growing demand for intelligence at the network edge is reshaping how computing hardware is conceived and built. Edge AI platforms are expected to deliver high throughput within tight energy and area budgets, operate reliably at low voltages, and adapt to diverse workloads, while data movement between processors and memory continues to dominate system cost. These trends position memory-centric computing as a compelling alternative to conventional architectures. Addressing these challenges begins with rethinking on-chip memory. A dual-port six-transistor (6T) static random-access memory (SRAM) was developed that combines high density with concurrent read and write capability, enabling reliable low-voltage operation and significant energy savings. Targeted at resource-constrained edge platforms, this design also achieves notable area reductions compared with conventional dual-port implementations. Building on this foundation, a synthesizable distributed-SRAM (SD-SRAM) framework was devised to extend the scalability of on-chip memories. By pushing the area–energy break-even point from 2–4 kb to about 14–16 kb and enabling modular partitioning with parallel access, this framework provides a robust memory substrate well suited to data-centric processing. Extending these memory innovations, a digital near-memory computing (NMC) architecture was integrated into RISC-V-based microcontroller units (MCUs), performing multiply–accumulate (MAC) operations directly within cache SRAM macros without altering the conventional SRAM architecture. The NMC engine supports configurable convolutional neural network (CNN) dataflows and achieves nearly two orders of magnitude higher throughput while sharply reducing internal data movement relative to processor–memory baselines. To further improve efficiency under strict energy constraints, a family of approximate multipliers with operand-aware error control provides tunable accuracy–cost trade-offs, enabling substantial reductions in area and power while preserving inference quality for error-tolerant workloads. These contributions converge in a scalable and modular NMC platform capable of flexible CNN acceleration at the edge. By tightly coupling local memory, approximate processing, and near-memory execution, this work advances the development of compact, high-throughput, and energy-efficient AI hardware. The presented concepts and implementations also establish a foundation for compiler–hardware co-optimization, automated NMC generation, and seamless integration into multimodal, sensor-driven embedded systems.}},
  author       = {{Nouripayam, Masoud}},
  isbn         = {{978-91-8104-742-4}},
  issn         = {{1654-790X}},
  language     = {{eng}},
  number       = {{189}},
  publisher    = {{Electrical and Information Technology, Lund University}},
  school       = {{Lund University}},
  series       = {{Series of Licentiate and Doctoral Theses}},
  title        = {{Near-Memory Computing Architectures for Scalable Edge AI Applications}},
  year         = {{2025}},
}