
DSP Design With Hardware Accelerator For Convolutional Neural Networks

Hille, Julian and Santos Ferreira, Lucas (2019) EITM02 20182
Department of Electrical and Information Technology
Abstract
Convolutional Neural Networks impressed the world in 2012 by reaching state-of-the-art accuracy in the ImageNet Large Scale Visual Recognition Challenge. The era of machine learning has arrived, and with it countless applications ranging from autonomous driving to unstructured robotic manipulation. The computational complexity of these networks has grown exponentially in recent years, demanding new, highly efficient, low-power hardware architectures capable of executing them.
In this work, we performed optimization at three levels of hardware design: the algorithmic, system, and accelerator levels. The design of a DSP with Tensilica tools and the integration of Xenergic dual-port SRAMs, providing direct memory access for a convolution hardware accelerator, led to a four-orders-of-magnitude speed-up of the initially identified bottleneck, yielding an estimated threefold overall speed-up for classifying a single handwritten-digit image compared to the pure software implementation. Higher speed-ups are expected for deeper convolutional architectures and larger image dimensions, since the convolution hardware accelerator scales with linear time complexity, in contrast to conventional non-linear software-based approaches.
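The complexity argument above can be illustrated with a minimal sketch of direct (naive) convolution, the kind of nested software loop that typically dominates CNN inference time. This is an illustrative sketch in Python, not the thesis's actual C/C++ or accelerator code; the function name and MAC counting are ours.

```python
def conv2d_naive(image, kernel):
    """Naive direct 2D convolution (valid padding), as a pure software
    implementation might compute it. Returns the output feature map and
    the number of multiply-accumulate (MAC) operations performed."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out_h, out_w = H - K + 1, W - K + 1
    macs = 0
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):          # slide the kernel over every
        for j in range(out_w):      # output position...
            acc = 0.0
            for ki in range(K):     # ...and accumulate K*K products
                for kj in range(K): # at each position
                    acc += image[i + ki][j + kj] * kernel[ki][kj]
                    macs += 1
            out[i][j] = acc
    return out, macs

# For a 28x28 MNIST image and a 3x3 kernel, one filter already costs
# (28-3+1)^2 * 3^2 = 676 * 9 = 6084 MACs, and the cost grows with both
# image area and kernel size -- the non-linear scaling a dedicated
# streaming accelerator avoids.
```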
Popular Abstract
Artificial Intelligence is used in more and more new technologies, from voice recognition in mobile phones to autonomous cars and modern industry, and the "intelligence" requirements of such applications keep increasing. Today, many of these computations are offloaded to the cloud; in mobile phones, for example, voice assistants only work with an internet connection. The same approach is not possible for autonomously controlled vehicles, whose essential control features have to run inside the vehicle itself.

We therefore need to bring Artificial Intelligence into mobile devices. This thesis implements a benchmark classification problem (MNIST) using a programmable processor, designed with a commercial tool, and a flexible hardware accelerator that speeds up a convolutional neural network recognizing handwritten digits between 0 and 9.
To this end, we designed and trained a reference architecture in the programming language Python, from which the weights were obtained to implement the same architecture on the designed processor (in C/C++). By profiling the most resource-consuming functions, we found that the convolution has the highest computational cost. Hence the accelerator was implemented, connected directly to the processor, and matching instructions were added. The results show a speed-up of four orders of magnitude on the identified bottleneck, yielding an estimated threefold overall speed-up for classifying a single handwritten-digit image compared to a pure software implementation on the same processor. Additionally, an open-source processor alternative is proposed.
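The gap between the bottleneck speed-up (four orders of magnitude) and the overall speed-up (about threefold) is what Amdahl's law predicts when only part of the runtime is accelerated. A back-of-the-envelope sketch, assuming convolution accounted for roughly two thirds of total runtime (our illustrative figure, not one reported in the thesis):

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speed-up when only `fraction` of the original runtime
    is accelerated by `local_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# If convolution took ~2/3 of the runtime and is sped up 10^4 times,
# the whole classification speeds up by roughly 3x: the remaining 1/3
# of unaccelerated software now dominates.
print(round(amdahl_speedup(2 / 3, 1e4), 2))  # ≈ 3.0
```

This is why deeper networks and larger images, where convolution makes up a larger fraction of total work, are expected to benefit more from the accelerator.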
author
Hille, Julian and Santos Ferreira, Lucas
supervisor
organization
course
EITM02 20182
year
type
H2 - Master's Degree (Two Years)
subject
keywords
CNN, Hardware, Accelerator, DSP, Convolutional, Convolution, Neural, Network, Processor, Tensilica, FIR, Memory, SRAM
report number
LU/LTH-EIT 2019-686
language
English
id
8973041
date added to LUP
2019-03-27 11:46:33
date last changed
2019-03-27 11:46:33
@misc{8973041,
  abstract     = {Convolutional Neural Networks impressed the world in 2012 by reaching state-of-the-art accuracy in the ImageNet Large Scale Visual Recognition Challenge. The era of machine learning has arrived, and with it countless applications ranging from autonomous driving to unstructured robotic manipulation. The computational complexity of these networks has grown exponentially in recent years, demanding new, highly efficient, low-power hardware architectures capable of executing them.
In this work, we performed optimization at three levels of hardware design: the algorithmic, system, and accelerator levels. The design of a DSP with Tensilica tools and the integration of Xenergic dual-port SRAMs, providing direct memory access for a convolution hardware accelerator, led to a four-orders-of-magnitude speed-up of the initially identified bottleneck, yielding an estimated threefold overall speed-up for classifying a single handwritten-digit image compared to the pure software implementation. Higher speed-ups are expected for deeper convolutional architectures and larger image dimensions, since the convolution hardware accelerator scales with linear time complexity, in contrast to conventional non-linear software-based approaches.},
  author       = {Hille, Julian and Santos Ferreira, Lucas},
  keyword      = {CNN,Hardware,Accelerator,DSP,Convolutional,Convolution,Neural,Network,Processor,Tensilica,FIR,Memory,SRAM},
  language     = {eng},
  note         = {Student Paper},
  title        = {DSP Design With Hardware Accelerator For Convolutional Neural Networks},
  year         = {2019},
}