System level cache prefetching algorithms for complex GPU workloads
(2024) EITM01 20241, Department of Electrical and Information Technology
- Abstract
- Prefetching is a well-known concept for CPUs, but for GPUs it is fairly
unexplored. The memory management of a GPU plays a crucial role
in its performance, and cache prefetching has the potential to lower
the overall latency. This thesis compares different types of prefetching
methods for GPUs, adapting some CPU prefetchers to fit the
GPU architecture. All of these prefetchers were placed inside the
system level cache (SLC), between the GPU and external memory.
Five different methods were tested on a framework based on ARM’s
GPU model. The thesis was mainly based upon prefetching techniques
discussed in the following papers: Adaptive Stream Detection [11],
Best-offset [18], Many-thread aware [17], APOGEE [26],
and Last-level collective cache prefetcher [19]. The prefetchers
produced in this thesis were either heavily inspired by or implemented
as closely as possible to the designs in the papers.
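To make the best-offset technique [18] concrete, the following is a minimal sketch of its learning loop: candidate offsets are scored against a table of recently seen line addresses, and the highest-scoring offset becomes the prefetch distance. This is an illustrative simplification only; the offset list, table size, and thresholds here are arbitrary choices, not the parameters used in the thesis or in the original design.

```python
# Sketch of a best-offset-style prefetcher (after Michaud [18]).
# Simplified for illustration: the real design inserts into the
# recent-requests table on fill completion, not on every access.
from collections import deque

class BestOffsetPrefetcher:
    OFFSETS = [1, 2, 3, 4, 6, 8]   # candidate offsets, in cache lines
    SCORE_MAX = 31                 # end a learning round early at this score
    ROUND_MAX = 20                 # ...or after this many passes over OFFSETS

    def __init__(self, rr_size=64):
        self.rr = deque(maxlen=rr_size)  # recent-requests table (FIFO here)
        self.scores = {d: 0 for d in self.OFFSETS}
        self.test_idx = 0
        self.rounds = 0
        self.best = 1                    # current prefetch offset

    def access(self, line_addr):
        """Called on each cache access; returns a prefetch address."""
        d = self.OFFSETS[self.test_idx]
        # If line_addr - d was seen recently, offset d would have
        # prefetched this access in time: reward it.
        if line_addr - d in self.rr:
            self.scores[d] += 1
        self.test_idx = (self.test_idx + 1) % len(self.OFFSETS)
        if self.test_idx == 0:
            self.rounds += 1
        # End of a learning phase: adopt the best offset and start over.
        if self.scores[d] >= self.SCORE_MAX or self.rounds >= self.ROUND_MAX:
            self.best = max(self.scores, key=self.scores.get)
            self.scores = {k: 0 for k in self.OFFSETS}
            self.rounds = 0
        self.rr.append(line_addr)
        return line_addr + self.best
```

Feeding the sketch a constant-stride stream (e.g. every 8th cache line) drives the scores so that, after a learning round, `best` converges to the stream's stride and subsequent accesses prefetch one stride ahead.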
The thesis concludes that for graphics workloads the best prefetchers
implemented can achieve a 0.51-1.29% decrease in GPU cycles on
average, depending on the chosen GPU configuration, while also
lowering the estimated energy usage.
- Popular Abstract
- Prefetching is the process of guessing what data a system will want before it asks for it. This idea has been used for Central Processing Units (CPUs) for a long time but is
rather unexplored for Graphics Processing Units (GPUs). This thesis looks at five
different prefetchers and evaluates their performance.
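The "guessing" can be made concrete with a toy stride detector: watch the address stream, and once the same step between addresses repeats, predict the next address before it is requested. This generic sketch is for illustration only and is not one of the five designs evaluated in the thesis.

```python
# Toy stride prefetcher: predicts the next address once a constant
# stride has repeated often enough to be trusted.
class StridePrefetcher:
    def __init__(self, confidence_threshold=2):
        self.last_addr = None
        self.last_stride = None
        self.confidence = 0
        self.threshold = confidence_threshold

    def access(self, addr):
        """Return a predicted next address once a stride repeats, else None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride:
                self.confidence += 1      # same step again: trust grows
            else:
                self.confidence = 0       # pattern broke: start over
            self.last_stride = stride
            if self.confidence >= self.threshold and stride != 0:
                prediction = addr + stride  # prefetch candidate
        self.last_addr = addr
        return prediction
```

For the access sequence 100, 104, 108, 112 the detector stays silent while it builds confidence, then predicts 116 — the data is fetched before the program asks for it.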
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9159122
- author
- Eklund Sigleifs, Fabian and Häggström, Erik
- supervisor
- organization
- course
- EITM01 20241
- year
- 2024
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- GPU, Prefetching, Cache prefetching, Memory management, System level cache, SLC, LLC, Last Level Cache, GPU architecture, Graphics workloads, ARM GPU model, Adaptive Stream Detection, Best-offset prefetcher, Many-thread aware prefetcher, APOGEE prefetcher, Last-level collective cache prefetcher, GPU performance, GPU latency reduction, Energy efficiency in GPUs, External memory, Prefetching techniques, GPU cycle reduction
- report number
- LU/LTH-EIT 2024-971
- language
- English
- id
- 9159122
- date added to LUP
- 2024-06-10 10:49:41
- date last changed
- 2024-06-10 10:49:41
@misc{9159122,
  abstract = {{Prefetching is a well-known concept for CPUs, but for GPUs it is fairly unexplored. The memory management of a GPU plays a crucial role in its performance, and cache prefetching has the potential to lower the overall latency. This thesis compares different types of prefetching methods for GPUs, adapting some CPU prefetchers to fit the GPU architecture. All of these prefetchers were placed inside the system level cache (SLC), between the GPU and external memory. Five different methods were tested on a framework based on ARM’s GPU model. The thesis was mainly based upon prefetching techniques discussed in the following papers: Adaptive Stream Detection [11], Best-offset [18], Many-thread aware [17], APOGEE [26], and Last-level collective cache prefetcher [19]. The prefetchers produced in this thesis were either heavily inspired by or implemented as closely as possible to the designs in the papers. The thesis concludes that for graphics workloads the best prefetchers implemented can achieve a 0.51-1.29% decrease in GPU cycles on average, depending on the chosen GPU configuration, while also lowering the estimated energy usage.}},
  author   = {{Eklund Sigleifs, Fabian and Häggström, Erik}},
  language = {{eng}},
  note     = {{Student Paper}},
  title    = {{System level cache prefetching algorithms for complex GPU workloads}},
  year     = {{2024}},
}