LUP Student Papers

LUND UNIVERSITY LIBRARIES

A QR Decomposition Accelerator for Digital Beamforming

Singh, Vinay LU (2026) EITM02 20251
Department of Electrical and Information Technology
Abstract
QR decomposition (QRD) is a computationally intensive matrix factorization method that is widely used in signal processing. Meeting the stringent processing budgets of real-time applications necessitates dedicated hardware acceleration. This thesis presents the architecture and Register-Transfer-Level (RTL) implementation of a QRD accelerator based on Givens rotations, specifically designed to achieve an end-to-end latency of ≤ 50 μs for an 8 × 8 complex covariance matrix. The architecture transforms the complex input into a 16 × 16 realified representation, which is processed by a CORDIC-based datapath to compute both the Q and R matrices. To manage data dependencies, the design employs a stage-wise binary-tree elimination schedule enforced by a hard memory-visibility barrier.
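
As an illustrative sketch of the two steps named above (not the thesis's RTL or its CORDIC datapath), the Python below realifies a complex matrix with the common block embedding [[Re, -Im], [Im, Re]] and then triangularizes it with floating-point Givens rotations. The function names `realify` and `givens_qr`, the block layout, and the sequential bottom-up rotation order are assumptions made for illustration; the abstract does not specify them, and the actual design uses a binary-tree schedule with fixed-point arithmetic.

```python
import math


def realify(a):
    """Map an n x n complex matrix to the 2n x 2n real block form [[Re, -Im], [Im, Re]]."""
    top = [[z.real for z in row] + [-z.imag for z in row] for row in a]
    bottom = [[z.imag for z in row] + [z.real for z in row] for row in a]
    return top + bottom


def givens_qr(m):
    """Factor a real square matrix as m = q * r using Givens rotations.

    Rotations are applied to adjacent row pairs, bottom-up within each column
    (an assumed textbook order, not the thesis's schedule).
    """
    n = len(m)
    r = [list(map(float, row)) for row in m]
    # qt accumulates the product of all rotations, i.e. Q transposed.
    qt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(n - 1, col, -1):
            a, b = r[row - 1][col], r[row][col]
            h = math.hypot(a, b)
            if h == 0.0:
                continue  # nothing to eliminate
            c, s = a / h, b / h
            for mat in (r, qt):
                for k in range(n):
                    u, v = mat[row - 1][k], mat[row][k]
                    mat[row - 1][k] = c * u + s * v
                    mat[row][k] = -s * u + c * v
            r[row][col] = 0.0  # clear rounding residue in the eliminated entry
    q = [[qt[j][i] for j in range(n)] for i in range(n)]
    return q, r


# Small Hermitian example standing in for a covariance matrix.
a = [[2 + 0j, 1 - 1j], [1 + 1j, 3 + 0j]]
q, r = givens_qr(realify(a))
```

For an 8 × 8 complex input, `realify` returns the 16 × 16 real matrix mentioned above, and `givens_qr` returns an orthogonal `q` and an upper-triangular `r`.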

While the compute core is fully pipelined with a fixed latency of 16 cycles, the system throughput is governed by the on-chip memory service model, resulting in a sustained initiation interval of 2.
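
A rough way to read these two numbers: in a pipeline with fixed latency L and initiation interval II, a burst of N independent operations takes about L + (N - 1) · II cycles when nothing else stalls. The helper below is a hypothetical back-of-the-envelope model, not taken from the thesis; the number of rotations issued per stage and the cost of the barrier between stages are not stated in this abstract, so the burst size used in the example is arbitrary.

```python
def burst_cycles(n_ops, latency=16, initiation_interval=2):
    """Cycles until the last of n_ops back-to-back operations leaves the pipeline.

    Assumes one new operation is accepted every `initiation_interval` cycles
    and each takes `latency` cycles from issue to result, with no other stalls.
    """
    if n_ops <= 0:
        return 0
    return latency + (n_ops - 1) * initiation_interval


# One operation costs the full pipeline latency; each extra one adds only II cycles.
single = burst_cycles(1)
burst_of_eight = burst_cycles(8)  # arbitrary burst size, for illustration only
```

Under this model, an initiation interval of 2 caps the issue rate at one operation every two cycles regardless of how deep the compute pipeline is, which matches the statement that memory service, not the core, sets the throughput.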

Functional correctness is established via bit-exact verification against a golden fixed-point C reference model using Q1.15 arithmetic. The design was implemented and validated on a Zynq™ UltraScale+™ Field Programmable Gate Array (FPGA) development board. Operating at 245.76 MHz, the accelerator completes a single QRD in 9.83 μs, satisfying the target requirement with a significant performance margin and achieving a sustained throughput of approximately 101 kQRD/s. The results indicate that end-to-end latency is primarily dominated by system-level data movement and synchronization barriers rather than raw arithmetic computation. These findings motivate future research into relaxed consistency models and inter-stage data forwarding to further optimize scaling for higher-dimensional matrices. The novel architecture and scheduling methodology developed in this work have been filed for patent protection.
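
To make the Q1.15 format concrete, the sketch below treats it as a 16-bit two's-complement value with 1 sign bit and 15 fractional bits, covering the range [-1, 1 - 2^-15]. The helper names, the round-to-nearest step, and the saturation behaviour are assumptions for illustration; the thesis's C reference model may round or handle overflow differently, and a bit-exact comparison would need to follow its choices.

```python
Q15_MIN = -(1 << 15)      # -32768 represents -1.0
Q15_MAX = (1 << 15) - 1   # 32767 represents 1 - 2**-15


def saturate_q15(v):
    """Clamp an integer to the signed 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, v))


def to_q15(x):
    """Quantize a float to Q1.15 (assumed: round to nearest, then saturate)."""
    return saturate_q15(round(x * (1 << 15)))


def from_q15(v):
    """Convert a Q1.15 integer back to a float."""
    return v / (1 << 15)


def q15_mul(a, b):
    """Multiply two Q1.15 values.

    The 32-bit product has 30 fractional bits; adding 2**14 before the
    shift by 15 rounds to nearest (assumed), and the result is saturated.
    """
    return saturate_q15((a * b + (1 << 14)) >> 15)


half = to_q15(0.5)
quarter = q15_mul(half, half)
```

The one case that overflows is (-1.0) × (-1.0): the exact result, +1.0, is not representable in Q1.15, so this sketch saturates it to 32767.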
Popular Abstract
Delivering 5G by Speeding Up the Math

How can 5G become faster and
more reliable, especially in crowded places? One important piece is how quickly a
base station can do the math needed to aim its radio beams.

A 5G base station does not send the same signal in every direction like a
lightbulb. Instead, it can focus energy into narrow beams and point them toward
different users. This is called beamforming. It helps users get stronger signals and
reduces interference.

To aim these beams, the base station must repeatedly solve a heavy math
problem called QR Decomposition (QRD). The radio environment changes very
quickly, so this math has to be done again and again within a very short time
window. If the calculation takes too long, the beam settings become outdated and
performance drops.

This thesis presents a dedicated hardware accelerator—a small specialized “engine”—
built to run this QRD math much faster than a general-purpose processor.
The key idea is to do more work in parallel. A simple analogy is a sports tournament:
many matches happen at the same time, and only the winners move
forward. In the same way, the accelerator arranges the QRD steps so that many
parts can run simultaneously, instead of one after another.
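
The tournament picture can be sketched in a few lines of Python. In each round, the surviving rows are paired off, every pair is one independent "match" (one rotation that clears an entry in the second row), and only the first row of each pair advances. The function name `tree_schedule` and the exact pairing are assumptions for illustration; this record does not spell out how the thesis pairs rows.

```python
def tree_schedule(rows):
    """Return rounds of (survivor, eliminated) row pairs, tournament style.

    All pairs within one round are independent and could run in parallel;
    a round must finish before the next one starts.
    """
    rounds = []
    alive = list(rows)
    while len(alive) > 1:
        # Pair neighbours; an unpaired last row simply advances to the next round.
        pairs = [(alive[i], alive[i + 1]) for i in range(0, len(alive) - 1, 2)]
        rounds.append(pairs)
        alive = [alive[i] for i in range(0, len(alive), 2)]
    return rounds


for number, pairs in enumerate(tree_schedule(range(8)), start=1):
    print(f"round {number}: {pairs}")
# round 1: [(0, 1), (2, 3), (4, 5), (6, 7)]
# round 2: [(0, 2), (4, 6)]
# round 3: [(0, 4)]
```

With 16 rows this gives 4 rounds instead of 15 one-after-another steps, which is where the parallel speed-up comes from.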

With this approach, the design can finish one QRD in 9.83 μs at 245.76 MHz,
well below the project target of ≤ 50 μs. For comparison, a human blink takes
roughly 100,000 μs. At this speed, the accelerator can handle about 100,000 QRDs
per second.
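
These figures can be cross-checked with simple arithmetic. The short script below uses only the numbers quoted above and assumes that QRDs run back to back with no overlap, so that throughput is simply one divided by the latency; the helper names are made up for illustration.

```python
CLOCK_HZ = 245.76e6   # clock frequency quoted above
LATENCY_S = 9.83e-6   # time for one QRD quoted above
TARGET_S = 50e-6      # project target quoted above
BLINK_S = 100_000e-6  # rough duration of a human blink quoted above


def cycles_for(latency_s, clock_hz):
    """Clock cycles that fit into the given time span."""
    return latency_s * clock_hz


def per_second(latency_s):
    """Operations per second if they run back to back without overlap (assumed)."""
    return 1.0 / latency_s


print(round(cycles_for(LATENCY_S, CLOCK_HZ)))  # about 2,416 clock cycles per QRD
print(round(per_second(LATENCY_S)))            # about 101,700 QRDs per second
print(round(TARGET_S / LATENCY_S, 1))          # about 5.1 times faster than the target
print(round(BLINK_S / LATENCY_S))              # roughly ten thousand QRDs per blink
```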

By reducing this processing delay, base stations can update beam directions
more quickly and serve more users efficiently. The scheduling method used in this
work was considered novel enough to be filed for patent protection.
author
Singh, Vinay LU
supervisor
organization
alternative title
A QRD Accelerator for Digital Beamforming
course
EITM02 20251
year
type
H2 - Master's Degree (Two Years)
subject
keywords
QR Decomposition (QRD), Digital Beamforming, 5G Massive MIMO, Scalable Hardware Accelerator, Low-Latency hardware architecture, Algorithm-Architecture Co-design, Real-Time Processing, Binary-Tree Scheduling, Data Dependency Handling, Givens Rotation, CORDIC
report number
LU/LTH-EIT 2026-1112
language
English
id
9224035
date added to LUP
2026-04-07 08:59:26
date last changed
2026-04-07 08:59:26
@misc{9224035,
  abstract     = {{QR decomposition (QRD) is a computationally intensive matrix factorization method that is widely used in signal processing. Meeting the stringent processing budgets of real-time applications necessitates dedicated hardware acceleration. This thesis presents the architecture and Register-Transfer-Level (RTL) implementation of a QRD accelerator based on Givens rotations, specifically designed to achieve an end-to-end latency of ≤ 50 μs for an 8 × 8 complex covariance matrix. The architecture transforms the complex input into a 16 × 16 realified representation, which is processed by a CORDIC-based datapath to compute both the Q and R matrices. To manage data dependencies, the design employs a stage-wise binary-tree elimination schedule enforced by a hard memory-visibility barrier.

While the compute core is fully pipelined with a fixed latency of 16 cycles, the system throughput is governed by the on-chip memory service model, resulting in a sustained initiation interval of 2.

Functional correctness is established via bit-exact verification against a golden fixed-point C reference model using Q1.15 arithmetic. The design was implemented and validated on a Zynq™ UltraScale+™ Field Programmable Gate Array (FPGA) development board. Operating at 245.76 MHz, the accelerator completes a single QRD in 9.83 μs, satisfying the target requirement with a significant performance margin and achieving a sustained throughput of approximately 101 kQRD/s. The results indicate that end-to-end latency is primarily dominated by system-level data movement and synchronization barriers rather than raw arithmetic computation. These findings motivate future research into relaxed consistency models and inter-stage data forwarding to further optimize scaling for higher-dimensional matrices. The novel architecture and scheduling methodology developed in this work have been filed for patent protection.}},
  author       = {{Singh, Vinay}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{A QR Decomposition Accelerator for Digital Beamforming}},
  year         = {{2026}},
}