Interconnecting Multiple FPGAs for Distributed Machine Learning Inference
(2024) EITM02 20241, Department of Electrical and Information Technology
- Abstract
- Machine learning has been growing in popularity for many years now and is currently popular among both businesses and private individuals. Running machine learning models takes a lot of computational resources, which means a large amount of energy is consumed.
The aim of this thesis is to investigate the usage of machine learning model specific hardware on Field Programmable Gate Arrays (FPGAs) as a potential solution to low-latency machine learning inference. However, as FPGAs have limited resources, not all models can fit on one FPGA. Therefore, the model has to be partitioned to be run over two or more FPGAs to distribute the need for resources.
It was found that partitioning the machine learning model over multiple FPGAs introduces a few challenges, such as an unbalanced resource utilization over the FPGAs and a possible bottleneck of inter-FPGA communication when the goal is high throughput. The main difficulty when designing the hardware accelerator is how to store all the millions of parameters. On the other hand, the introduced latency of the inter-FPGA communication is negligible, which means focus can be placed on reducing the computational latency. Top-1 inference accuracy is also close to the fully floating-point model, while top-5 accuracy is slightly higher when implementing the convolutional neural network in hardware.
- Popular Abstract
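The partitioning challenge described in the abstract can be illustrated with a simple greedy split: assign consecutive layers to one FPGA until its parameter-memory budget is exhausted, then move on to the next device. This is a minimal sketch, not the partitioning method used in the thesis; the layer names and parameter counts below are hypothetical.

```python
# Hypothetical sketch: greedily partition consecutive CNN layers across FPGAs
# by a per-device parameter-memory budget. Illustrative only.

def partition_layers(layer_params, budget):
    """Split a list of (name, param_count) pairs into consecutive groups,
    each fitting within `budget` parameters (one group per FPGA)."""
    groups, current, used = [], [], 0
    for name, params in layer_params:
        if params > budget:
            raise ValueError(f"layer {name} alone exceeds the per-FPGA budget")
        if used + params > budget:
            groups.append(current)       # close out the current FPGA
            current, used = [], 0
        current.append(name)
        used += params
    if current:
        groups.append(current)
    return groups

# Hypothetical layer sizes, in millions of parameters
layers = [("conv1", 4), ("conv2", 8), ("conv3", 7), ("fc1", 10)]
print(partition_layers(layers, budget=12))
# → [['conv1', 'conv2'], ['conv3'], ['fc1']]
```

Note that a greedy consecutive split can easily produce the unbalanced resource utilization the abstract mentions: here the second FPGA holds 7 M parameters while its neighbors hold 12 M and 10 M.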
- Nowadays, Artificial Intelligence (AI) is widely recognized for its ability to solve simple daily problems and even program complex algorithms. AI can be used for everything from chatbots that can give you a recipe based on what you have in the pantry, to determining the type of flower in a picture, to intricate problem-solving such as figuring out the optimal logistical setup in a store. When an AI makes a decision based on some input data, this is called inference.
Using AI to solve such problems takes a lot of energy. Using specialized hardware that is designed for a specific AI model to reduce the energy footprint shows great promise. One way to create energy-optimized hardware is to use re-programmable integrated circuits called FPGAs, which allow for the creation of specific hardware that has a single goal and can achieve that goal efficiently with low power consumption. The cloud providers of today offer a wide range of scalable FPGA environments that can easily be used to deploy large clusters of FPGAs. It is in the interest of cloud providers to reduce power consumption in their data centers, so we expect further expansion of the cloud FPGA market.
FPGAs have limited resources, and when AI models become larger, multiple FPGAs need to be connected together in a chain. The data from one FPGA needs to be transferred to the next FPGA in the chain, introducing additional latency. One of the goals of this thesis is to explore how much this additional overhead influences the resulting performance.
Combining the scalability of the cloud and the flexibility of the FPGAs, we can show that connecting multiple FPGAs to solve inference adds no significant latency and has a relatively low impact on overall bandwidth requirements. The work done in this thesis has created a solid foundation for future research to explore interconnecting clusters of FPGAs, to solve even larger inference problems such as chatbots, or even general compute problems.
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9163353
- author
- Carlsson, Linus and Malmsjö, Felix
- supervisor
-
- Erik Larsson
- Mohammad Attari
- organization
- course
- EITM02 20241
- year
- 2024
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- Artificial Intelligence, AI, Machine Learning, ML, FPGA, Field Programmable Gate Array, Hardware Accelerator, AWS, Cloud, Chisel, CNN, Inference, DMA, DSP, PCIe, RTL, Scala, XDMA, Hardware Design, Accelerator, VGG16, Neural Network, Heterogeneous Computing, Green Computing, Inceptron, Computer Vision
- report number
- LU/LTH-EIT 2024-992
- language
- English
- id
- 9163353
- date added to LUP
- 2024-06-18 14:15:19
- date last changed
- 2024-06-18 14:15:19
@misc{9163353,
  abstract  = {{Machine learning has been growing in popularity for many years now and is currently popular among both businesses and private individuals. Running machine learning models takes a lot of computational resources, which means a large amount of energy is consumed. The aim of this thesis is to investigate the usage of machine learning model specific hardware on Field Programmable Gate Arrays (FPGAs) as a potential solution to low-latency machine learning inference. However, as FPGAs have limited resources, not all models can fit on one FPGA. Therefore, the model has to be partitioned to be run over two or more FPGAs to distribute the need for resources. It was found that partitioning the machine learning model over multiple FPGAs introduces a few challenges, such as an unbalanced resource utilization over the FPGAs and a possible bottleneck of inter-FPGA communication when the goal is high throughput. The main difficulty when designing the hardware accelerator is how to store all the millions of parameters. On the other hand, the introduced latency of the inter-FPGA communication is negligible, which means focus can be placed on reducing the computational latency. Top-1 inference accuracy is also close to the fully floating-point model, while top-5 accuracy is slightly higher when implementing the convolutional neural network in hardware.}},
  author    = {{Carlsson, Linus and Malmsjö, Felix}},
  language  = {{eng}},
  note      = {{Student Paper}},
  title     = {{Interconnecting Multiple FPGAs for Distributed Machine Learning Inference}},
  year      = {{2024}},
}