
Lund University Publications


Near-Memory Computing Architectures for Scalable Edge AI Applications

Nouripayam, Masoud (2025) In Series of Licentiate and Doctoral Theses
Abstract
Artificial intelligence (AI) and machine learning (ML) are rapidly permeating nearly every aspect of modern life, from personal devices and autonomous systems to industrial automation and environmental monitoring. The growing demand for intelligence at the network edge is reshaping how computing hardware is conceived and built. Edge AI platforms are expected to deliver high throughput within tight energy and area budgets, operate reliably at low voltages, and adapt to diverse workloads, while data movement between processors and memory continues to dominate system cost. These trends position memory-centric computing as a compelling alternative to conventional architectures.
Addressing these challenges begins with rethinking on-chip memory. A dual-port six-transistor (6T) static random-access memory (SRAM) was developed that combines high density with concurrent read and write capability, enabling reliable low-voltage operation and significant energy savings. Targeted at resource-constrained edge platforms, this design also achieves notable area reductions compared with conventional dual-port implementations. Building on this foundation, a synthesizable distributed-SRAM (SD-SRAM) framework was devised to extend the scalability of on-chip memories. By pushing the area–energy break-even point from 2–4 kb to about 14–16 kb and enabling modular partitioning with parallel access, this framework provides a robust memory substrate well suited to data-centric processing.
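
The abstract only names the SD-SRAM's properties; as a rough behavioral illustration of what modular partitioning with parallel access can mean, and not the actual design, the C model below splits a memory into independently addressable banks so that consecutive words land in different banks and could be served concurrently in hardware. The bank count, word width, and all identifiers are hypothetical.

#include <stdint.h>

/* Behavioral model of a partitioned (banked) on-chip memory.
 * NUM_BANKS and BANK_WORDS are illustrative values, not figures
 * from the thesis. Low address bits select the bank, so a
 * unit-stride access pattern spreads consecutive words across
 * banks, and up to NUM_BANKS words can be served in parallel. */
#define NUM_BANKS   4u
#define BANK_WORDS  1024u

typedef struct {
    uint32_t bank[NUM_BANKS][BANK_WORDS];
} sd_sram_t;

static inline uint32_t sd_sram_read(const sd_sram_t *m, uint32_t addr)
{
    uint32_t b = addr % NUM_BANKS;        /* bank select: low bits  */
    uint32_t w = addr / NUM_BANKS;        /* word index within bank */
    return m->bank[b][w];
}

static inline void sd_sram_write(sd_sram_t *m, uint32_t addr, uint32_t data)
{
    m->bank[addr % NUM_BANKS][addr / NUM_BANKS] = data;
}

/* Reads NUM_BANKS consecutive words; in hardware these hit
 * distinct banks and could complete in a single cycle. */
static void sd_sram_read_burst(const sd_sram_t *m, uint32_t addr,
                               uint32_t out[NUM_BANKS])
{
    for (uint32_t i = 0; i < NUM_BANKS; i++)
        out[i] = sd_sram_read(m, addr + i);
}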
Extending these memory innovations, a digital near-memory computing (NMC) architecture was integrated into RISC-V-based microcontroller units (MCUs), performing multiply–accumulate (MAC) operations directly within cache SRAM macros without altering the conventional SRAM architecture. The NMC engine supports configurable convolutional neural network (CNN) dataflows and achieves nearly two orders of magnitude higher throughput while sharply reducing internal data movement relative to processor–memory baselines. To further improve efficiency under strict energy constraints, a family of approximate multipliers with operand-aware error control provides tunable accuracy–cost trade-offs, enabling substantial reductions in area and power while preserving inference quality for error-tolerant workloads.
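
The abstract does not spell out the NMC engine's programming model. Purely as a sketch of the kind of computation that gets moved next to the SRAM macro, the C fragment below runs the MAC-heavy inner loop of a one-dimensional convolution over operands held in a local buffer, so that only final accumulations would cross back to the processor; the 8-bit operand width, 32-bit accumulator, and all names are assumptions rather than thesis specifications.

#include <stdint.h>
#include <stddef.h>

/* Sketch of the MAC kernel an NMC engine could execute beside an
 * SRAM macro: 8-bit operands, 32-bit accumulation. Widths and
 * buffer sizes are assumptions, not thesis specifications. */
static int32_t nmc_mac(const int8_t *weights, const int8_t *acts, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)weights[i] * (int32_t)acts[i];   /* one MAC */
    return acc;
}

/* Minimal output-stationary 1-D convolution built on the MAC
 * kernel: each output element is one accumulation pass over data
 * that never leaves the memory macro. */
static void nmc_conv1d(const int8_t *x, size_t xn,
                       const int8_t *w, size_t k,
                       int32_t *y)
{
    for (size_t o = 0; o + k <= xn; o++)
        y[o] = nmc_mac(w, &x[o], k);
}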
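Likewise, the approximate-multiplier family is described only through its tunable accuracy–cost trade-off. One well-known operand-aware scheme in this spirit, shown here as an illustrative stand-in rather than the design from the thesis, is dynamic range selection: an operand wider than K bits is truncated to the K bits below its leading one, the dropped region is represented by forcing the kept low bit to 1 so the error stays roughly zero-mean, and the product is shifted back into position. The value K = 6 is arbitrary.

#include <stdint.h>

/* Operand-aware approximate multiplier using dynamic range
 * selection (illustrative stand-in, not the thesis design).
 * K sets the accuracy-cost trade-off: the core multiply is
 * only K x K bits wide, so smaller K means a smaller array
 * but a larger worst-case relative error. */
#define K 6u

static unsigned msb_pos(uint32_t v)       /* index of the leading one */
{
    unsigned p = 0;
    while (v >>= 1) p++;
    return p;
}

static uint64_t approx_mul(uint32_t a, uint32_t b)
{
    unsigned sa = 0, sb = 0;
    uint32_t ta = a, tb = b;

    /* Range selection: operands that already fit in K bits are
     * used exactly, so small values incur no error at all. */
    if (a >= (1u << K)) { sa = msb_pos(a) - (K - 1); ta = (a >> sa) | 1u; }
    if (b >= (1u << K)) { sb = msb_pos(b) - (K - 1); tb = (b >> sb) | 1u; }

    /* K x K core multiply, shifted back into position. */
    return (uint64_t)(ta * tb) << (sa + sb);
}

In a MAC loop like the one sketched above, such a multiplier would replace the exact product, trading a bounded relative error per multiplication for a smaller and lower-power datapath.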
These contributions converge in a scalable and modular NMC platform capable of flexible CNN acceleration at the edge. By tightly coupling local memory, approximate processing, and near-memory execution, this work advances the development of compact, high-throughput, and energy-efficient AI hardware. The presented concepts and implementations also establish a foundation for compiler–hardware co-optimization, automated NMC generation, and seamless integration into multimodal, sensor-driven embedded systems.
author: Nouripayam, Masoud
supervisor
opponent: Prof. Aunet, Snorre, NTNU Norwegian University of Science and Technology, Norway.
organization
publishing date: 2025
type: Thesis
publication status: published
subject
in: Series of Licentiate and Doctoral Theses
issue: 189
pages: 170
publisher: Electrical and Information Technology, Lund University
defense location: Lecture Hall E:1406, building E, Ole Römers väg 3, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream.
defense date: 2025-11-21 09:15:00
ISSN: 1654-790X
ISBN: 978-91-8104-742-4, 978-91-8104-741-7
language: English
LU publication? yes
id: 73947649-6ca2-4dd3-94fe-36ec148841cd
date added to LUP: 2025-10-28 18:54:48
date last changed: 2025-10-30 08:09:48
@phdthesis{73947649-6ca2-4dd3-94fe-36ec148841cd,
  abstract     = {{Artificial intelligence (AI) and machine learning (ML) are rapidly permeating nearly every aspect of modern life, from personal devices and autonomous systems to industrial automation and environmental monitoring. The growing demand for intelligence at the network edge is reshaping how computing hardware is conceived and built. Edge AI platforms are expected to deliver high throughput within tight energy and area budgets, operate reliably at low voltages, and adapt to diverse workloads, while data movement between processors and memory continues to dominate system cost. These trends position memory-centric computing as a compelling alternative to conventional architectures. Addressing these challenges begins with rethinking on-chip memory. A dual-port six-transistor (6T) static random-access memory (SRAM) was developed that combines high density with concurrent read and write capability, enabling reliable low-voltage operation and significant energy savings. Targeted at resource-constrained edge platforms, this design also achieves notable area reductions compared with conventional dual-port implementations. Building on this foundation, a synthesizable distributed-SRAM (SD-SRAM) framework was devised to extend the scalability of on-chip memories. By pushing the area–energy break-even point from 2–4 kb to about 14–16 kb and enabling modular partitioning with parallel access, this framework provides a robust memory substrate well suited to data-centric processing. Extending these memory innovations, a digital near-memory computing (NMC) architecture was integrated into RISC-V-based microcontroller units (MCUs), performing multiply–accumulate (MAC) operations directly within cache SRAM macros without altering the conventional SRAM architecture. The NMC engine supports configurable convolutional neural network (CNN) dataflows and achieves nearly two orders of magnitude higher throughput while sharply reducing internal data movement relative to processor–memory baselines. To further improve efficiency under strict energy constraints, a family of approximate multipliers with operand-aware error control provides tunable accuracy–cost trade-offs, enabling substantial reductions in area and power while preserving inference quality for error-tolerant workloads. These contributions converge in a scalable and modular NMC platform capable of flexible CNN acceleration at the edge. By tightly coupling local memory, approximate processing, and near-memory execution, this work advances the development of compact, high-throughput, and energy-efficient AI hardware. The presented concepts and implementations also establish a foundation for compiler–hardware co-optimization, automated NMC generation, and seamless integration into multimodal, sensor-driven embedded systems.}},
  author       = {{Nouripayam, Masoud}},
  isbn         = {{978-91-8104-742-4}},
  issn         = {{1654-790X}},
  language     = {{eng}},
  number       = {{189}},
  publisher    = {{Electrical and Information Technology, Lund University}},
  school       = {{Lund University}},
  series       = {{Series of Licentiate and Doctoral Theses}},
  title        = {{Near-Memory Computing Architectures for Scalable Edge AI Applications}},
  year         = {{2025}},
}