
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Trainable Region of Interest Prediction: Hard Attention Framework for Hardware-Efficient Event-Based Computer Vision Neural Networks on Neuromorphic Processors

Arjmand, Cina LU (2023) EITM01 20231
Department of Electrical and Information Technology
Abstract
Neuromorphic processors are a promising new type of hardware for optimizing neural network computation using biologically inspired principles. They can effectively leverage information sparsity, such as in images from event-based cameras, and are well adapted to processing event-based data in an energy-efficient fashion. However, neuromorphic processors struggle to process high resolution event-based data due to the computational cost of high resolution processing. This work introduces the Trainable Region of Interest Prediction (TRIP) framework for attaining hardware-efficient processing of event-based vision on a neuromorphic processor. TRIP uses active region-of-interest (ROI) generation to perform hard attention by cropping selected regions of input images, automatically filtering out unnecessary information and learning to process only the most important information in an image. TRIP is implemented in several neural networks tested on various event-based datasets. It leverages extensive hardware optimization to maximize efficiency with respect to hardware-related metrics such as power, memory utilization, latency, number of network parameters, and area. The algorithm is implemented and benchmarked on the SENECA neuromorphic processor. The algorithms employing the TRIP framework exhibit intelligent ROI selection behavior and the capability to dynamically adjust ROI size and position to fit various targets, while matching or in some cases improving on state-of-the-art accuracy. Utilizing lower resolution input reduces the computation requirements of TRIP by 46× compared to state-of-the-art solutions. The embedded hardware implementation of TRIP more than doubles the speed and energy efficiency of classification on the DVS Gesture recognition dataset compared to a baseline network, and generally outperforms other state-of-the-art neuromorphic processors benchmarked using DVS Gesture.
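To make the hard attention mechanism concrete, the sketch below illustrates the core idea the abstract describes: predict a region of interest from a cheap low-resolution view of the frame, then crop that region and classify only the crop. This is a minimal, generic PyTorch sketch; the module names, layer sizes, and the differentiable affine-grid crop are illustrative assumptions and do not reproduce the actual TRIP architecture or its SENECA hardware mapping.

# Minimal sketch of hard attention via trainable ROI prediction (PyTorch).
# Module names, layer sizes, and the differentiable crop are assumptions for
# illustration, not the exact TRIP design from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoiPredictor(nn.Module):
    """Predicts a normalized ROI (center x, center y, scale) from a
    downsampled view of an event frame."""
    def __init__(self, in_channels=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, 3)  # outputs (cx, cy, scale), squashed to [0, 1]

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x).flatten(1)))

def crop_roi(frames, roi, out_size=32):
    """Crops the predicted ROI with a differentiable affine grid, so the ROI
    predictor can be trained end to end together with the classifier."""
    cx, cy, scale = roi[:, 0], roi[:, 1], roi[:, 2].clamp(min=0.1)
    theta = torch.zeros(frames.size(0), 2, 3, device=frames.device)
    theta[:, 0, 0] = scale        # zoom: fraction of the frame kept in x
    theta[:, 1, 1] = scale        # zoom: fraction of the frame kept in y
    theta[:, 0, 2] = 2 * cx - 1   # shift to the ROI center (grid coords in [-1, 1])
    theta[:, 1, 2] = 2 * cy - 1
    grid = F.affine_grid(theta, (frames.size(0), frames.size(1), out_size, out_size),
                         align_corners=False)
    return F.grid_sample(frames, grid, align_corners=False)

# Example: accumulate events into a 2-channel frame, predict the ROI from a
# low-resolution copy, and pass only the small crop on to the classifier.
frames = torch.rand(4, 2, 128, 128)               # stand-in for DVS event frames
roi = RoiPredictor()(F.avg_pool2d(frames, 4))     # predict from a 32x32 view
crops = crop_roi(frames, roi)                     # shape (4, 2, 32, 32)

In this sketch the downstream classifier only ever sees the small crop, which is where the computational savings reported in the abstract would come from; an actual event-driven implementation on SENECA would operate on sparse events rather than dense tensors.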
Popular Abstract
Artificial intelligence (AI) is being pervasively adopted across all facets of life, but one thing holding back the expansion of AI capabilities is its massive energy requirements. Servers used to deliver AI services consume more electricity than entire countries (https://doi.org/10.1016/j.joule.2023.09.004), and the cost is felt not only in the wallets of companies but on the planet as well. While the things that AI can do are extremely impressive, the current cost of powering AI is simply unsustainable.

Humans are looking for ways to make more powerful AI systems fit on smaller devices that consume less energy. Not only will this be cheaper for both the planet and those paying for cloud servers, it will also enable AI systems like robots and self-driving cars to be faster, more sophisticated, and more effective. When an AI-powered system like a self-driving car can perform its advanced object detection and decision-making efficiently within the device itself, it does not have to send huge amounts of data back and forth to a cloud server. Instead, it can operate locally and be fully self-contained. This is faster, potentially more energy efficient, and eliminates the risk of sensitive data being leaked. Furthermore, the less space and energy an advanced AI takes up on a device, the more advanced AI we can fit on a single device. Super energy-efficient AI can enable us to cram a multitude of sophisticated functions into a single wearable or implantable device. The brain of a robot will be able to hold even more neurons and still run on the same battery. More efficient AI will enable a future that looks even more like science fiction.

Making AI more efficient requires rethinking both the hardware that we build to run it and the algorithms that we design for the software. In this thesis, an algorithm is implemented which mimics the way that human eyes analyze a scene. Instead of taking in an entire scene at once (kind of like an insect eye) as most computer vision AI does, the algorithm in this thesis figures out where the most interesting place to look in a picture is, and then focuses only on that part. This is called a hard attention algorithm, and doing this turns out to be much more effective than doing it the insect way. Additionally, this algorithm is implemented on a special type of hardware known as a neuromorphic processor, which is inspired by how neurons in the human brain communicate with one another. The reason why neuromorphic processors try to mimic the human brain is the same reason why the algorithm tries to mimic human vision: the human brain is simply incredibly energy efficient! Sure, AI might be able to beat you at a board game. However, your brain only consumes as much power as a single lightbulb, while the AI needs something like the electricity consumption of an entire household. For that reason, many researchers believe that there is something to learn about how to make AI more energy-efficient by studying our own efficient and intelligent brains. Naturally, computers and AI are very different from humans and biological neurons, so human biology is not a blueprint but rather a source of potential inspiration and ideas.

The hard attention algorithm used in this work is designed to maximize efficiency on hardware. This means making it as fast as possible while requiring as little power, memory, and area as possible. The algorithm itself drastically improves efficiency over state-of-the-art solutions for computer vision using a special kind of camera known as an event-based camera (the design of which is also inspired by human vision in the interest of greater efficiency). It utilizes a whole set of hardware optimization techniques, and is implemented on a neuromorphic processor to produce an incredibly efficient computer vision algorithm for detecting gestures performed by a human. When a person waves their arms in front of the camera, the system automatically zooms in on the most interesting region (usually the arm), filtering out everything else to minimize the amount of information it needs to process. It automatically adjusts its own focus to filter out irrelevant objects or to focus on bigger or smaller objects in the camera's view. It detects the gesture with high accuracy, high speed, and minimal memory and energy usage.

By overcoming the limitations typically faced by embedded AI, the solution presented in this work addresses several open problems in current research. One of these is the high computational cost of processing high resolution images, which makes it impractical for neuromorphic systems to process data from high resolution event-based cameras. High resolution data is important for obtaining high accuracy, but because of current hardware inefficiencies, state-of-the-art solutions are forced to resort to low-resolution processing instead. By focusing processing on small, low-resolution areas, the hard attention framework introduced in this thesis could be very useful for high resolution processing of event-based data on neuromorphic systems.
author
Arjmand, Cina LU
supervisor
organization
course
EITM01 20231
year
2023
type
H2 - Master's Degree (Two Years)
subject
keywords
Artificial Intelligence, Machine Learning, Neuromorphic Engineering, Computer Vision
report number
LU/LTH-EIT 2023-960
language
English
id
9142302
date added to LUP
2023-12-21 13:04:32
date last changed
2023-12-21 13:04:32
@misc{9142302,
  abstract     = {{Neuromorphic processors are a promising new type of hardware for optimizing neural network computation using biologically inspired principles. They can effectively leverage information sparsity, such as in images from event-based cameras, and are well adapted to processing event-based data in an energy-efficient fashion. However, neuromorphic processors struggle to process high resolution event-based data due to the computational cost of high resolution processing. This work introduces the Trainable Region of Interest Prediction (TRIP) framework for attaining hardware-efficient processing of event-based vision on a neuromorphic processor. TRIP uses active region-of-interest (ROI) generation to perform hard attention by cropping selected regions of input images, automatically filtering out unnecessary information and learning to process only the most important information in an image. TRIP is implemented in several neural networks tested on various event-based datasets. It leverages extensive hardware optimization to maximize efficiency with respect to hardware-related metrics such as power, memory utilization, latency, number of network parameters, and area. The algorithm is implemented and benchmarked on the SENECA neuromorphic processor. The algorithms employing the TRIP framework exhibit intelligent ROI selection behavior and the capability to dynamically adjust ROI size and position to fit various targets, while matching or in some cases improving on state-of-the-art accuracy. Utilizing lower resolution input reduces the computation requirements of TRIP by $46\times$ compared to state-of-the-art solutions. The embedded hardware implementation of TRIP more than doubles the speed and energy efficiency of classification on the DVS Gesture recognition dataset compared to a baseline network, and generally outperforms other state-of-the-art neuromorphic processors benchmarked using DVS Gesture.}},
  author       = {{Arjmand, Cina}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Trainable Region of Interest Prediction: Hard Attention Framework for Hardware-Efficient Event-Based Computer Vision Neural Networks on Neuromorphic Processors}},
  year         = {{2023}},
}