Implementation of a Deep Learning Inference Accelerator on the FPGA.

Ramakrishnan, Shenbagaraman LU (2020) EITM02 20191
Department of Electrical and Information Technology
Abstract
Today, Artificial Intelligence is one of the most important technologies, ubiquitous in our daily lives. Deep Neural Networks (DNNs) have emerged as the state of the art for machine intelligence applications such as object detection, image classification, and face recognition, performing a myriad of tasks with exceptional prediction accuracy. AI is now moving towards embedded platforms for inference on the edge. This is essential to reduce latency, enhance data security, and realize real-time performance. However, DNN algorithms are compute- and memory-intensive: they demand immense energy, compute resources, and memory bandwidth, which makes them difficult to deploy on embedded devices. To solve this problem and realize on-device AI acceleration, dedicated energy-efficient hardware accelerators are paramount.

This thesis implements such a dedicated deep learning accelerator on an FPGA. NVIDIA's Deep Learning Accelerator (NVDLA) is adopted in this work to explore SoC designs for integrated inference acceleration. NVDLA, an open-source architecture, standardizes deep learning inference acceleration in hardware. It optimizes inference across the full stack, from application software down to hardware, to balance energy efficiency with demanding throughput requirements. The thesis therefore probes the NVDLA framework to understand the workflow across the entire hardware-software programming hierarchy. In addition, the hardware design parameters, optimization features, and system configurations of NVDLA systems are analyzed for efficient implementation. Finally, a comparative study of two NVDLA SoC implementations (nv_small and nv_medium) with respect to performance metrics such as power, area, and throughput is presented.

Our approach prototypes NVIDIA's Deep Learning Accelerator on a Zynq UltraScale+ ZCU104 FPGA to examine its system functionality. The hardware design is carried out in Verilog using Xilinx's Vivado Design Suite 2018.3, while the on-device software runs Linux kernel 4.14 on the Zynq MPSoC; the software ecosystem is built with Xilinx's PetaLinux tools. The entire system architecture is validated using pre-built regression tests that verify individual CNN layers. In addition, the NVDLA hardware design runs a pre-compiled AlexNet model as a benchmark for performance evaluation and comparison.
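As a rough illustration of the software flow described above, the following shell sketch outlines how a PetaLinux 2018.3 image for the ZCU104 might be built and how an NVDLA benchmark could then be launched on the target. The project name, BSP file name, and loadable/image file names are hypothetical placeholders; the exact steps depend on the board configuration and the NVDLA software release used.

```shell
# Create a PetaLinux project for the ZCU104 from a board support package
# (BSP file name is a placeholder for the release actually used).
petalinux-create -t project -s xilinx-zcu104-v2018.3-final.bsp -n nvdla_proj
cd nvdla_proj

# Configure the kernel (interactive menu), then build the full image,
# including the Linux 4.14 kernel and root filesystem.
petalinux-config -c kernel
petalinux-build

# Package the boot image (FSBL + U-Boot) for the SD card.
petalinux-package --boot --fsbl images/linux/zynqmp_fsbl.elf \
    --u-boot images/linux/u-boot.elf --force

# On the target, run the NVDLA user-mode runtime against a pre-compiled
# AlexNet loadable (file names are illustrative).
./nvdla_runtime --loadable alexnet.nvdla --image input.pgm
```

These commands follow the standard PetaLinux flow and the NVDLA open-source runtime's command-line interface; they are a sketch of the workflow, not the thesis's exact build scripts.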
Popular Abstract
Today, Artificial Intelligence is at the edge. Edge and endpoint devices are becoming more sophisticated with the evolution of the Internet of Things (IoT) and 5G. These devices appear in applications such as autonomous cars, drones, and other IoT gadgets. At present, a self-driving car is a data center on wheels, a drone is a data center on wings, and a robot is a data center with arms and legs. All of these systems collect vast amounts of real-world information that must be processed in real time. In such applications there is no time to send data to the cloud and wait for a response, because decisions need to be instantaneous. Processing is therefore shifting to the edge devices.

Edge acceleration brings computation and data storage closer to the device. With the evolution of specialized hardware providing increased computational capability, AI models can be processed on the edge. As a result, overall system latency is reduced, the bandwidth cost of data transfers is lowered, and processing the data locally improves privacy. For example, an autonomous car requires a near-instant reaction to avoid potential hazards on the road. Consider a self-driving car collecting real-world information such as images and video, and suppose it is sensing for a stop sign. If the system sent that image to the cloud for processing and waited for a decision, the vehicle could have blown through the stop sign before the response arrived. It is therefore paramount to process the data in real time, which can be accomplished with dedicated hardware that processes it locally.

This thesis primarily explores such hardware architectures for efficient processing of AI algorithms, along with the setup of their software execution environment. The work was carried out as a joint collaboration between Ericsson and Lund University. NVIDIA's Deep Learning Accelerator architecture serves as the target for comprehending the complete system as a hardware-software co-design. This architecture is an essential component of NVIDIA's Xavier chip, which is used in their autonomous driving platforms.

This thesis is addressed to a variety of audiences who are passionate about Deep Learning, Computer Architecture, and System-on-Chip Design. It illustrates a comprehensive implementation of an AI accelerator to envision AI processing on the edge.
author
Ramakrishnan, Shenbagaraman LU
supervisor
organization
alternative title
FPGA implementation av en Deep Learning inferensaccelerator
course
EITM02 20191
year
type
H2 - Master's Degree (Two Years)
subject
keywords
Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks, Deep Learning Accelerators, NVDLA, FPGA
report number
LU/LTH-EIT 2020-751
language
English
id
9007070
date added to LUP
2020-03-26 11:37:33
date last changed
2020-03-26 11:37:33
@misc{9007070,
  abstract     = {Today, Artificial Intelligence is one of the most important technologies, ubiquitous in our daily lives. Deep Neural Networks (DNNs) have emerged as the state of the art for machine intelligence applications such as object detection, image classification, and face recognition, performing a myriad of tasks with exceptional prediction accuracy. AI is now moving towards embedded platforms for inference on the edge. This is essential to reduce latency, enhance data security, and realize real-time performance. However, DNN algorithms are compute- and memory-intensive: they demand immense energy, compute resources, and memory bandwidth, which makes them difficult to deploy on embedded devices. To solve this problem and realize on-device AI acceleration, dedicated energy-efficient hardware accelerators are paramount.

This thesis implements such a dedicated deep learning accelerator on an FPGA. NVIDIA's Deep Learning Accelerator (NVDLA) is adopted in this work to explore SoC designs for integrated inference acceleration. NVDLA, an open-source architecture, standardizes deep learning inference acceleration in hardware. It optimizes inference across the full stack, from application software down to hardware, to balance energy efficiency with demanding throughput requirements. The thesis therefore probes the NVDLA framework to understand the workflow across the entire hardware-software programming hierarchy. In addition, the hardware design parameters, optimization features, and system configurations of NVDLA systems are analyzed for efficient implementation. Finally, a comparative study of two NVDLA SoC implementations (nv\_small and nv\_medium) with respect to performance metrics such as power, area, and throughput is presented.

Our approach prototypes NVIDIA's Deep Learning Accelerator on a Zynq UltraScale+ ZCU104 FPGA to examine its system functionality. The hardware design is carried out in Verilog using Xilinx's Vivado Design Suite 2018.3, while the on-device software runs Linux kernel 4.14 on the Zynq MPSoC; the software ecosystem is built with Xilinx's PetaLinux tools. The entire system architecture is validated using pre-built regression tests that verify individual CNN layers. In addition, the NVDLA hardware design runs a pre-compiled AlexNet model as a benchmark for performance evaluation and comparison.},
  author       = {Ramakrishnan, Shenbagaraman},
  keyword      = {Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks, Deep Learning Accelerators, NVDLA, FPGA},
  language     = {eng},
  note         = {Student Paper},
  title        = {Implementation of a Deep Learning Inference Accelerator on the FPGA.},
  year         = {2020},
}