
LUP Student Papers

LUND UNIVERSITY LIBRARIES

A General Purpose Near Data Processing Architecture Optimized for Data-intensive Applications

Li, Xingda LU and Hu, Haidi LU (2023) EITM02 20221
Department of Electrical and Information Technology
Abstract
In recent years, as Internet of Things (IoT) and machine learning technologies have advanced, there has been increasing interest in the study of energy-efficient and flexible architectures for embedded systems. To bridge the performance gap between microprocessors and memory systems, Near-Data Processing (NDP) was introduced. Although some works have implemented NDP, few of them utilize the microprocessor’s cache memory.
In this thesis, we present an NDP architecture that integrates static random access memory (SRAM) serving as the L2 cache of a microcontroller unit (MCU). The proposed NDP is tailored for data-intensive applications and addresses several of their challenges. A coarse-grained reconfigurable array (CGRA)-based strategy is used to maximize flexibility while reducing power consumption. In addition, several techniques, such as convolution-and-pooling-integrated computation and two-level clock gating, further improve energy efficiency.
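The convolution-and-pooling-integrated computation mentioned above can be sketched in plain NumPy: instead of materializing the full convolution feature map and then pooling it, each pooled output is produced directly from the few convolution results in its pool window, so the intermediate map never needs to be written back to memory. This is an illustrative software model under our own assumptions, not the thesis's CGRA datapath; the function and variable names are ours.

```python
import numpy as np

def conv_pool_fused(image, kernel, pool=2):
    """Valid 2-D convolution followed by max-pooling, fused: each pooled
    output cell is computed from only the pool*pool convolution results
    that feed it, so the full intermediate feature map is never stored.
    Illustrative sketch, not the hardware implementation."""
    kh, kw = kernel.shape
    ch = image.shape[0] - kh + 1          # conv output height
    cw = image.shape[1] - kw + 1          # conv output width
    out = np.empty((ch // pool, cw // pool))
    for py in range(out.shape[0]):
        for px in range(out.shape[1]):
            best = -np.inf
            # compute only the conv outputs inside this pool window
            for dy in range(pool):
                for dx in range(pool):
                    y, x = py * pool + dy, px * pool + dx
                    v = np.sum(image[y:y + kh, x:x + kw] * kernel)
                    best = max(best, v)
            out[py, px] = best
    return out
```

In a cache-resident NDP setting this fusion matters because the intermediate feature map is the largest transient data structure in a conv+pool layer; skipping it cuts both SRAM traffic and capacity pressure.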
The design was implemented in STMicroelectronics (STM) 65 nm Low Power Low VT (LPLVT) technology with a maximum clock rate of 167 MHz. Two popular algorithms, the convolutional neural network (CNN) and K-means, were mapped onto the hardware to evaluate it. The power efficiency of the CNN and K-means implementations is improved by 12x and 26x relative to field-programmable gate array (FPGA) and MCU implementations, respectively, and by several orders of magnitude relative to other K-means accelerators.
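K-means, one of the two benchmark algorithms, is dominated by streaming the entire dataset through a distance computation on every iteration; that memory-bound access pattern is exactly what near-data processing targets. A minimal NumPy sketch of one iteration follows (illustrative only; the names are our own and do not come from the thesis):

```python
import numpy as np

def kmeans_step(points, centroids):
    """One K-means iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster. Every
    iteration reads the full dataset, so the computation is dominated
    by memory traffic rather than arithmetic."""
    # pairwise distances: shape (n_points, n_centroids)
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # keep an empty cluster's centroid unchanged
    new_centroids = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids
```

Because each iteration touches every point but performs only a few operations per byte fetched, moving this loop next to the SRAM holding the data avoids the repeated cache-to-core transfers that dominate its energy cost on a conventional MCU.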
Popular Abstract
In the past two decades, Machine Learning (ML) has rapidly advanced and been widely adopted in research, technology, and commerce [1]. In a variety of domains, including automated driving, computer vision, and speech recognition, ML has proven to be an effective technique for producing useful applications. However, existing ML methods and computer systems now face a major obstacle posed by the exponential growth of data.
The most prevalent method for accelerating data-intensive ML applications is to increase the speed of existing central processing units (CPUs) and graphics processing units (GPUs). However, given the flexibility of these two processors, they can be rather power-hungry when executing particular ML workloads. Other candidates, such as FPGA/ASIC accelerators, can strike a decent balance between performance and power consumption. Unfortunately, due to their high cost and still-considerable power consumption, they may not be suitable for many rapidly expanding consumer electronics, such as smart watches, robot vacuums, and smart speakers.
To address these issues, the proposed solution must be as flexible as possible, capable of supporting a wide variety of applications in the ML domain. Additionally, novel techniques should be employed to improve power efficiency so that the design can be integrated into IoT devices.
author: Li, Xingda LU and Hu, Haidi LU
course: EITM02 20221
year: 2023
type: H2 - Master's Degree (Two Years)
report number: LU/LTH-EIT 2023-950
language: English
id: 9134395
date added to LUP: 2023-09-21 09:45:18
date last changed: 2023-09-21 09:45:18
@misc{9134395,
  abstract     = {{In recent years, as Internet of Things (IoT) and machine learning technologies have advanced, there has been increasing interest in the study of energy-efficient and flexible architectures for embedded systems. To bridge the performance gap between microprocessors and memory systems, Near-Data Processing (NDP) was introduced. Although some works have implemented NDP, few of them utilize the microprocessor’s cache memory.
In this thesis, we present an NDP architecture that integrates static random access memory (SRAM), which is regarded as the L2 cache of a microcontroller unit (MCU). The proposed NDP is tailored for data-intensive applications and seeks to address multiple problems. A coarse-grained reconfigurable array (CGRA)-based strategy is utilized to maximize flexibility while decreasing power consumption. Additionally, numerous approaches, such as convolution-and-pooling-integrated computation, two-level clock gating, etc., are implemented to improve energy efficiency even more.
The design was constructed utilizing STMicroelectronics (STM) 65 nm Low Power Low VT (LPLVT) technology with a maximum clock rate of 167 MHz. Two popular algorithms, the convolutional neural network (CNN) and K-means, were mapped onto the hardware to evaluate it. As a result, the power efficiency of CNN and K-means algorithms can be boosted by 12x and 26x relative to field-programmable gate array (FPGA) and MCU implementations, respectively, and by several orders of magnitude relative to other K-Means accelerators.}},
  author       = {{Li, Xingda and Hu, Haidi}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{A General Purpose Near Data Processing Architecture Optimized for Data-intensive Applications}},
  year         = {{2023}},
}