

### Adaptive Baseband Processing and Configurable Hardware for Wireless Communication

Gangarajaiah, Rakesh

2017

Document Version: Publisher's PDF, also known as Version of record


Citation for published version (APA):

Gangarajaiah, R. (2017). Adaptive Baseband Processing and Configurable Hardware for Wireless Communication. [Doctoral Thesis (compilation), Lund University]. Department of Electrical and Information Technology, Lund University.


## Adaptive Baseband Processing and Configurable Hardware for Wireless Communication

## Rakesh Gangarajaiah



Doctoral Dissertation Electrical Engineering Lund, March 2017

Department for Electrical and Information Technology Lund University P.O. Box 118 SE-221 00 LUND SWEDEN

Series of licentiate and doctoral dissertations, No. 99. ISSN 1654-790X. ISBN 978-91-7753-180-7 (print). ISBN 978-91-7753-181-4 (pdf).

© Rakesh Gangarajaiah 2017. Produced using LaTeX Documentation System. Printed in Sweden by *Tryckeriet i E-huset*, Lund. March 2017.



## Abstract

The world of information is literally at one's fingertips, allowing access to previously unimaginable amounts of data, thanks to advances in wireless communication. The growing demand for high speed data has necessitated the use of wider bandwidths, and wireless technologies such as Multiple-Input Multiple-Output (MIMO) have been adopted to increase spectral efficiency. These advanced communication technologies require sophisticated signal processing, often leading to higher power consumption and reduced battery life. Therefore, increasing energy efficiency of baseband hardware for MIMO signal processing has become extremely vital. High Quality of Service (QoS) requirements invariably lead to a larger number of computations and a higher power dissipation. However, recognizing the dynamic nature of the wireless communication medium in which only some channel scenarios require complex signal processing, and that not all situations call for high data rates, allows the use of an adaptive channel aware signal processing strategy to provide a desired QoS. Information such as interference conditions, coherence bandwidth and Signal to Noise Ratio (SNR) can be used to reduce algorithmic computations in favorable channels. Hardware circuits which run these algorithms need flexibility and easy reconfigurability to switch between multiple designs for different parameters. These parameters can be used to tune the operations of different components in a receiver based on feedback from the digital baseband. This dissertation focuses on the optimization of digital baseband circuitry of receivers which use feedback to trade power and performance. A co-optimization approach, where designs are optimized starting from the algorithmic stage through the hardware architectural stage to the final circuit implementation is adopted to realize energy efficient digital baseband hardware for mobile 4G devices. 
These concepts are also extended to the next generation 5G systems where the energy efficiency of the base station is improved.

This work includes six papers that examine digital circuits in MIMO wireless receivers. Several key blocks in these receivers include analog circuits that have residual non-linearities, leading to signal intermodulation and distortion. Paper-I introduces a digital technique to detect such non-linearities and calibrate analog circuits to improve signal quality. The concept of a digital non-linearity tuning system developed in Paper-I is implemented and demonstrated in hardware. The performance of this implementation is tested with an analog channel select filter, and results are presented in Paper-II. MIMO systems such as the ones used in 4G may employ QR Decomposition (QRD) processors to simplify the implementation of tree search based signal detectors. However, the small form factor of the mobile device increases spatial correlation, which is detrimental to signal multiplexing. Consequently, a QRD processor capable of handling high spatial correlation is presented in Paper-III. The algorithm
and hardware implementation are optimized for carrier aggregation, which increases requirements on signal processing throughput, leading to higher power dissipation. Paper-IV presents a method to perform channel-aware processing with a simple interpolation strategy to adaptively reduce QRD computation count. Channel properties such as coherence bandwidth and SNR are used to reduce multiplications by 40% to 80%. These concepts are extended to use time domain correlation properties, and a full QRD processor for 4G systems fabricated in 28 nm FD-SOI technology is presented in Paper-V. The design is implemented with a configurable architecture and measurements show that circuit tuning results in a highly energy efficient processor, requiring 0.2 nJ to 1.3 nJ for each QRD. Finally, these adaptive channel-aware signal processing concepts are examined in the scope of the next generation of communication systems. Massive MIMO systems increase spectral efficiency by using a large number of antennas at the base station. Consequently, the signal processing at the base station has a high computational count. Paper-VI presents a configurable detection scheme which reduces this complexity by using techniques such as selective user detection and interpolation based signal processing. Hardware is optimized for resource sharing, resulting in a highly reconfigurable and energy efficient uplink signal detector.

## Popular Science

It was not so long ago that the internet was accessible only through computers hooked up to a fixed telephone line. Data connections were expensive and slow, often requiring minutes to download a picture or a video. Thanks to the advances in wireless communication and semiconductor technology, cheap smart telephones that can perform multiple functions ranging from simple text messaging to live video streaming are now commonly used. Multiple users can seamlessly communicate with one another using different technologies such as Wi-Fi, Bluetooth, 3G, and 4G. The demand for high speed mobile data connections is growing, and 4G networks are being deployed all over the world. Several proposals have already been made for the next generation or 5G communication systems that promise download speeds of several gigabits/s even in crowded places such as stadiums or city centers. It is estimated that more than seven billion mobile subscriptions are now in use and almost half of them have internet connectivity. Although data download speeds have increased over generations, satisfying the needs of online gaming and social media, one aspect of the smartphone has left users dissatisfied. The battery life is woefully short, requiring a recharge almost every day. Modern semiconductor chips consume much less power compared to chips ten years ago, but it appears that battery life has actually reduced over time. One of the important reasons for this is the processing algorithms that remove noise and interference from the extremely low power signals that are received by the mobile device. Imagine sunbathing on a good sunny day and compare it to sitting in front of a candle. The power of the signal transmitted from the base station is similar to what you would experience from the sun whereas what a mobile phone receives is comparable to the power from a candle placed several kilometers away!
Furthermore, other sources of interference are continually present due to the wireless nature of the communication medium, increasing the complexity of algorithms needed for signal detection, and thus power consumption.

Despite these challenging conditions, 4G provides higher data rates than previous generations. One of the techniques that has enabled these high speeds is multi-antenna communication. Several parallel streams of data are transmitted by the base station to a user. However, even more complex signal processing is needed to recover those parallel streams compared to single antenna communication. One way of reducing this complexity is to adaptively change the processing based on environmental conditions and user requirements. For example, streaming data for a high-definition video when sitting at home is relatively easy when compared to streaming data on a high speed train. Additionally, channel conditions are highly variable in mobile communication environments, and user requirements also have a wide range. For instance, dropping a few frames when streaming a live sporting event is acceptable, whereas downloading a file
has to be done with almost no errors. Adaptively detecting these requirements enables the phone to become smarter, allowing performance adjustments to be made to conserve battery power. Hardware has to be designed to support this adaptability, and a certain degree of reconfigurability is therefore required in chip implementations.

This dissertation explores such adaptive processing techniques and reconfigurable hardware for wireless receivers that use feedback to trade performance and power. Different methods of using awareness about operating conditions to improve energy efficiency are presented. The focus is on algorithms and digital implementations for wide-band multi-antenna receivers. Applications where the analog part of the radio receiver is optimized for performance with the support of digital algorithms are also considered. Additionally, some of these concepts are applied to receivers designed for 5G systems to dynamically optimize their operation. The dissertation includes a brief introduction to the research field and six papers that present details on the experiments. The results show that using feedback with channel-aware adaptive techniques not only improves energy efficiency but is often necessary to produce cost effective mobile devices for high speed data communication.

This work was performed as part of the "Digitally Assisted Radio Evolution (DARE)" project, funded by the Swedish Foundation for Strategic Research. Chip fabrication was supported by STMicroelectronics.

# Contents

- Abstract
- Popular Science
- Contents
- Preface
- Acknowledgments
- List of Abbreviations
- Introduction
  - 1 Motivation
    - 1.1 Scope of this Dissertation
    - 1.2 Contributions and Outline
  - 2 Digital Baseband Processing
    - 2.1 Introduction
    - 2.2 The Digital Baseband
    - 2.3 Channel Properties and Adaptive Processing
  - 3 Complexity and Power Reduction
    - 3.1 Introduction
    - 3.2 Algorithmic Techniques for Adaptive Processing
    - 3.3 Architectural Techniques for Configurability
    - 3.4 Circuit Techniques for Power Reduction
  - 4 Applications in Digitally Assisted Radio Receivers
    - 4.1 Non-linearity Mitigation for Analog Circuits
    - 4.2 Channel Preprocessor for Small Scale MIMO Systems
    - 4.3 Detectors for Massive MIMO Systems
  - 5 Paper Summary and Discussion
    - 5.1 Research Contributions
    - 5.2 Discussion and Future Work
  - References
- Included Papers
  - I A Digitally Assisted Non-Linearity Suppression Scheme for RF front ends
  - II A Digitally Assisted Non-Linearity Mitigation System for Tunable Channel Select Filters
  - III A High Speed QR Decomposition Processor for Carrier-Aggregated LTE-A Downlink Systems
  - IV Low Complexity Adaptive Channel Estimation and QR Decomposition for an LTE-A Downlink
  - V An Adaptive QR Decomposition Processor for Carrier-Aggregated LTE-A in 28 nm FD-SOI
  - VI A Cholesky Decomposition based Massive MIMO Uplink Detector with Adaptive Interpolation

## Preface

This dissertation summarizes my work from October 2011 to February 2017 as a PhD student in the Digital ASIC group, at the Department of Electrical and Information Technology, Lund University, Sweden. The dissertation is divided into two parts, and the first part has five chapters that provide a general introduction to the research field. The second part includes papers written during the past four years which are also listed below.

#### **Included Research Papers**

The main contribution is derived from the following publications:

- I R. Gangarajaiah, M. Abdulaziz, L. Liu, and H. Sjöland, "A Digitally Assisted Non-Linearity Suppression Scheme for RF front ends", reprinted from the *Proceedings of IEEE 25th Annual International Symposium on Personal, Indoor and Mobile Radio Communications*, Washington DC, USA, September 2014, pp. 623–627.
- II R. Gangarajaiah, M. Abdulaziz, H. Sjöland, P. Nilsson, and L. Liu, "A Digitally Assisted Non-Linearity Mitigation System for Tunable Channel Select Filters", reprinted from the *IEEE Transactions on Circuits and Systems-II: Express Briefs*, vol. 63, no. 1, pp. 69–73, January 2016.
- III R. Gangarajaiah, P. Nilsson, L. Liu and O. Edfors, "A High Speed QR Decomposition Processor for Carrier-Aggregated LTE-A Downlink Systems", reprinted from the *Proceedings of IEEE European Conference on Circuit Theory and Design*, Dresden, Germany, September 2013, pp. 1–4.
- IV R. Gangarajaiah, P. Nilsson, O. Edfors and L. Liu, "Low Complexity Adaptive Channel Estimation and QR Decomposition for an LTE-A Downlink", reprinted from the *Proceedings of IEEE 25th Annual International Symposium on Personal, Indoor and Mobile Radio Communications*, Washington DC, USA, September 2014, pp. 459–463.
- V R. Gangarajaiah, O. Edfors and L. Liu, "An Adaptive QR Decomposition Processor for Carrier Aggregated LTE-A in 28 nm FD-SOI", accepted for publication in the *IEEE Transactions on Circuits and Systems-I: Regular Papers*, 2017.
- VI R. Gangarajaiah, O. Edfors, and L. Liu, "A Cholesky Decomposition based Massive MIMO Uplink Detector with Adaptive Interpolation", accepted for publication in the *IEEE International Symposium on Circuits and Systems*, 2017.


#### Other Publications

I have also contributed to other publications during my research studies, which are listed below. They are, however, not considered to be a part of this dissertation.

- P. Nilsson, Y. Sun, **R. Gangarajaiah**, and E. Hertz, "Low Power Unrolled CORDIC Architectures", *IEEE Nordic Circuits and Systems Conference*, Oslo, Norway, October 2015, pp. 1–4.
- P. Nilsson, A.U.R. Shaik, **R. Gangarajaiah**, and E. Hertz, "Hardware Implementation of the Exponential Function using Taylor Series", *IEEE NORCHIP*, Tampere, Finland, October 2014, pp. 1–4.
- M. Stala, **R. Gangarajaiah**, O. Edfors and V. Öwall, "Area and Power Reduction in DFT Based Channel Estimators for OFDM systems", *IEEE NORCHIP*, Vilnius, Lithuania, November 2013, pp. 1–4.

## Acknowledgments

First and foremost, I would like to express my sincere and deepest gratitude to Professor Peter Nilsson, Associate Professor Liang Liu, and Professor Ove Edfors for their continuous support and guidance. They accepted me as a student and I am forever thankful for their belief that I could reach this stage.

I have known Professor Peter Nilsson from my time as a master student. I will forever be indebted to him for his guidance, encouragement and invaluable feedback. It will be hard to find a better teacher and a more humble person. I feel fortunate to have had the opportunity to work with him. The way Peter interacted with students will always inspire me. His approach that no idea is unwise, and no question is foolish, has made me comprehend what it takes to be a good teacher. I will forever cherish fond memories of Peter.

Associate Professor Liang Liu has mentored me with clear direction and vision through difficult times when I felt lost. He has been instrumental in helping me understand many technical concepts, especially writing research articles. I cannot thank him enough for not getting bored of pointing out the same mistakes, and for having been so patient in helping me improve skills on all fronts. I also thank him for finding time to discuss different projects, for spending many weekends reading over reports, papers, this dissertation and providing feedback on innumerable occasions.

I started out as a student in chip design and Professor Ove Edfors introduced me to wireless communication, a fascinating field which I still find hard to understand well enough. He is always full of dazzling ideas and suggestions. I thank him for his feedback, inspiration and motivation.

I would also like to thank the seniors at the department of EIT. Associate Professor Joachim Rodrigues recommended me for the position of PhD student; without him I would probably not be here today. Professor Viktor Öwall, Professor Henrik Sjöland, Associate Professor Pietro Andreani and Associate Professor Erik Larsson have provided insightful comments on multiple occasions. I also thank them for giving me the opportunity to work with many talented researchers.

The past few years with the Digital ASIC group have been immensely enjoyable. Ahmed, Yang, Mojtaba, Dimitar, Farrokh, Breeta, Babak, Oskar, Xiaodong, Siyu, Hemanth, Steffen, Muris, Mohammed, Anders, Waqas, Johan, Isael, Deepak, Chenxin, Yasser, Reza and others I may have not mentioned here have helped me on numerous occasions. I would also like to thank Michal and Magnus at Mistbase for a very fruitful internship.

Many others at EIT have made life easier. Pia has helped on multiple occasions with administrative issues. Stefan, Martin, Andreas, Eric J., Joseph and Robert have assisted with lab equipment and computer issues.

I would not have been in Sweden today, if not for the sacrifices made by my family. I am forever grateful to my parents and sister for supporting my choices. I will never be able to repay this debt, but I will strive to make them proud every day. Megi has been the pillar of support over the past four years. It would have been impossible to reach this stage without her unconditional love and affection. She has encouraged me at every step. I thank her for her understanding about the late hours and several work related weekends.

Many friends from India supported me during undergraduate studies and encouraged me to come to Sweden. I would like to thank them all, especially the ashrama boys and friends from Tumkur.

Sweden has welcomed me with open arms and has provided many opportunities. I would like to thank everyone who has directly and indirectly made it possible for students like me to study here. I would like to acknowledge the Swedish Foundation for Strategic Research for funding this work and Ericsson for supporting conference visits. I would also like to thank others who I may not have named here for all their help and support over the past many years.

Rakesh Gangarajaiah


## List of Abbreviations

| Abbreviation | Meaning |
|--------------|---------|
| 3GPP | 3rd Generation Partnership Project |
| ADC | Analog to Digital Converter |
| ASIC | Application Specific Integrated Circuit |
| BER | Bit Error Rate |
| BS | Base Station |
| BW | Bandwidth |
| CA | Carrier Aggregation |
| CD | Cholesky Decomposition |
| CE | Channel Estimation |
| CFO | Carrier Frequency Offset |
| CMOS | Complementary Metal Oxide Semiconductor |
| CP | Cyclic Prefix |
| CSF | Channel Select Filter |
| DVFS | Dynamic Voltage and Frequency Scaling |
| EPA | Extended Pedestrian A |
| ETU | Extended Typical Urban |
| EVA | Extended Vehicular A |
| EVM | Error Vector Magnitude |
| FD-SOI | Fully Depleted Silicon on Insulator |
| FDD | Frequency Division Duplex |
| FFT | Fast Fourier Transform |
| FIR | Finite Impulse Response |
| FPGA | Field Programmable Gate Array |
| GS | Gram-Schmidt |
| HB | Half Band |
| HHT | Householder Transform |
| HLS | High Level Synthesis |
| IM3 | 3rd Order Intermodulation |
| IQ | In-phase and Quadrature |
| ISI | Inter Symbol Interference |
| LMS | Least Mean Squares |
| LNA | Low Noise Amplifier |
| LS | Least Squares |
| LTE | Long Term Evolution |
| LTE-A | Long Term Evolution-Advanced |
| MF | Matched Filtering |
| MGS | Modified Gram-Schmidt |
| MIMO | Multiple-Input Multiple-Output |
| ML | Maximum Likelihood |
| MMSE | Minimum Mean Square Error |
| OFDM | Orthogonal Frequency Division Multiplexing |
| PDP | Power Delay Profile |
| PVT | Process, Voltage and Temperature |
| QAM | Quadrature Amplitude Modulation |
| QoS | Quality of Service |
| QRD | QR Decomposition |
| RF | Radio Frequency |
| SNR | Signal to Noise Ratio |
| SVD | Singular Value Decomposition |
| TDD | Time Division Duplex |
| UE | User Equipment |
| ZF | Zero-Forcing |

## Chapter 1

## **Motivation**

Wireless communication systems have undergone a revolutionary change over the past few decades, transforming society and heralding an era of information exchange. New standards have emerged, and the frequency spectrum has become crowded to cater to the increasing demand for high speed data connections. The progress in Complementary Metal Oxide Semiconductor (CMOS) technology has been one of the key contributors to the growth of the wireless industry and the evolution of the smartphone. The transistor, which is the building block of semiconductor designs, has shrunk in size and the number of transistors on a chip has roughly doubled every two years, following Moore's predictions closely [1]. These denser integrated circuits have facilitated the implementation of complex digital algorithms for a multitude of functions found in a modern day smartphone. This increased functionality, however, comes with the drawback of higher power consumption [2]. While this may not be an issue in devices operated with a direct power source, it is a major concern in battery operated devices. Furthermore, the wireless communication channel is highly variable due to fading effects, interference and noise sources. Regardless of the channel conditions, users desire a high Quality of Service (QoS). Traditional design methodologies with optimization of individual blocks can provide the required QoS. However, a strategy with only in-block optimizations is not efficient in terms of either silicon area or power dissipation. Therefore, a more global approach which involves cross optimization of several blocks together with local optimizations is needed to achieve energy efficiency.

To address this, chip designers and researchers have looked at different methods of optimizing the performance of receivers with low power consumption. Figure 1.1 shows a high level block diagram of a wireless receiver. The analog front end is used to receive, amplify and filter signals transmitted from another device. Analog to digital converters transform the output of the analog front end into the digital domain where most of the processing is performed. Local feedback within the signal domains, through the local controllers shown in Figure 1.1, is used to optimize the performance of individual blocks. Nonetheless, to achieve the goal of high performance at low power, a more global feedback strategy is needed, where different blocks interact across these domains. This feedback is essential for receivers implemented with the latest CMOS technologies where the design of analog blocks has become more challenging due to increased process, voltage and temperature variations and low supply voltage headroom. On the other hand, newer CMOS technologies have immensely benefited digital circuits. The computational capability has increased tremendously, allowing complex algorithms to run on these circuits, and the supply voltage reduction has lowered power dissipation. Though several design techniques are used to improve the performance of analog blocks [3–5], it has become essential to harness the digital computation power to compensate for some of the losses in the analog front end. Purely analog compensation techniques have been proposed for intermodulation mitigation [6] and fully digital compensation methods have been employed to efficiently compensate for errors such as Carrier Frequency Offset (CFO). A third method, which combines analog and digital domain solutions to tune the analog component towards optimal operation, may also be used. This can be performed by monitoring the performance in the digital baseband and using a control structure similar to the one shown in Figure 1.1. The same tuning concept can also be applied to circuits in the digital baseband.

Figure 1.1: Digitally assisted radio receiver.

A mobile receiver operates in different channel scenarios with varying levels of interference and noise. Algorithms used to decode signals in such conditions have a significant impact on both performance and power dissipation, with higher complexity algorithms providing better performance. The inherent variability in the channel indicates that different levels of effort are required to achieve a particular QoS. Thus, local feedback of channel conditions and user requirements can be used to optimize performance of algorithms and corresponding circuits towards lower power dissipation. This can be achieved with flexible hardware architectures that allow on-the-fly switching between different algorithms at the cost of increased silicon area. Careful architectural level choices are thus required to minimize the area overhead while keeping flexibility for the adaptive tradeoff between power and performance. Together with algorithmic and architectural design choices, several circuit level modifications can be used to reduce power dissipation. Dynamic voltage and frequency scaling is a commonly used method in processors to tune performance based on run-time requirements. Multiple power domains are used in designs to drastically reduce power consumption in unused blocks. Features such as the back gate in Fully Depleted Silicon on Insulator (FD-SOI) technology provide another degree of control to reduce static power or to dynamically increase operational frequency. An efficient implementation of a wireless receiver requires the right balance of local and global feedback together with algorithmic, architectural and circuit level optimizations. A design approach with co-optimization across these different levels is required to achieve energy efficiency and cost effectiveness.
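As a brief aside on why dynamic voltage and frequency scaling is effective, the standard first-order textbook model of dynamic power in CMOS logic can be written as follows. The model and its symbols ($\alpha$, $C_L$, $V_{DD}$, $f$) are assumptions introduced here for illustration and are not taken from the measurements reported in this dissertation.

$$
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f
$$

Here $\alpha$ is the switching activity factor, $C_L$ the switched capacitance, $V_{DD}$ the supply voltage and $f$ the clock frequency. Under this model, lowering $f$ alone reduces power proportionally but leaves the energy per operation, $\alpha C_L V_{DD}^{2}$, unchanged, whereas also lowering $V_{DD}$ reduces the energy per operation quadratically. This is why the two are usually scaled together when run-time requirements permit.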

#### 1.1 Scope of this Dissertation

The goal of this research is to improve the energy efficiency of high data rate receivers by adaptively tuning performance based on operating conditions. Co-optimization techniques are employed to achieve this goal: strategies that combine feedback, algorithms, reconfigurable hardware architectures and corresponding circuit implementations are explored to realize power efficient receivers for current and next generation wireless devices. Multiple-Input Multiple-Output (MIMO) communication has become the norm in high performance receivers. The driving force behind this performance is digital baseband processing, which is the main topic of research.

MIMO technology enhances data rates and spectral efficiency but at the cost of higher complexity and power consumption at the receiver. Although most tasks related to MIMO are digital in nature, the interface to the outside world is still analog. High linearity is required in these analog blocks to fully extract MIMO gains in the digital baseband. Performance and power reduction are on different ends of a see-saw, where reducing one may allow higher gains in the other. This dissertation addresses the design of adaptive receivers and aims to answer the following questions:

- How can a low complexity feedback mechanism be designed to improve the performance of analog circuits?
- How can the complexity of digital baseband circuits be reduced by using information about operating conditions and QoS requirements?
- How can co-optimization be leveraged to design efficient hardware with sufficient configurability and low overhead?
- How can these concepts be applied to improve the energy efficiency of circuits in the current and next generation wireless receivers?

#### 1.2 Contributions and Outline

Figure 1.2 shows a design space with multiple levels of optimizations for different wireless systems (standards) and two feedback strategies. The work performed in this dissertation is mapped into this design space as highlighted. The optimization process is divided into algorithmic, architectural and circuit levels. The algorithmic design stage generally provides the highest flexibility to improve performance. The architectural level exploration is used to find efficient hardware implementations while the circuit optimizations are aimed at reducing power dissipation and improving throughput. Although energy efficiency can be enhanced by optimizing the design at each level, higher gains are obtained by combining all three levels. Two feedback strategies, digital to analog (Dig. $\rightarrow$ Ana.) and digital to digital (Dig. $\rightarrow$ Dig.) are considered.

The dissertation is divided into two sections. The first section is organized into five chapters and provides an introduction to the research field.

**Chapter 1** presents the motivation and the outline of this work.

**Chapter 2** provides an overview of baseband processing. A few methods to enhance spectrum efficiency are introduced and challenges in wireless receiver design are discussed. Properties of the channel are also examined.

**Chapter 3** introduces techniques for reducing complexity and power consumption in different stages of a circuit design cycle.

**Chapter 4** presents the techniques in Chapter 3 in light of three example applications. These applications are also presented in more detail in the included papers.

**Chapter 5** summarizes the research contributions and includes a brief discussion on possible improvements and future research topics.

The second part of this dissertation includes six papers that target different standards by combining optimizations and feedback as depicted in Figure 1.2. The analog front end blocks in a radio usually have some level of reconfigurability, used to change parameters such as gains, oscillator frequencies and resolution of Analog to Digital Converters (ADCs). Some implementations also include controls to reduce distortion and improve linearity. A system which detects non-linearities of analog circuits in the digital domain and provides Digital $\rightarrow$ Analog feedback is presented in **Paper I**. The non-linearity tuning system proposed in Paper I is optimized, implemented in a Field Programmable Gate Array (FPGA) and tested by interfacing a tunable analog filter. These results are presented in **Paper II**.

Figure 1.2: Dissertation contribution mapping.

In the entirely digital feedback domain, baseband circuits used in MIMO systems are optimized for energy efficiency. Small scale MIMO systems such as the ones in the Long Term Evolution-Advanced standard [7,8] rely on non-linear signal detectors to fully realize spatial multiplexing gains. The QR Decomposition (QRD) channel preprocessor is often used to ease the implementation of these detectors. A high speed QRD processor capable of decoding wideband signals is presented in **Paper III**. To reduce the computation count of QRD in Carrier Aggregation (CA) scenarios, an adaptive strategy which exploits channel conditions is proposed in **Paper IV**. A design which incorporates this adaptive approach with Digital $\rightarrow$ Digital feedback is optimized on the architectural and circuit level. The design is fabricated in 28 nm CMOS and measurement results are presented to highlight the advantages of co-optimization in **Paper V**. These concepts are extended to the next generation of communication systems based on massive MIMO [9]. An adaptive uplink detector with hybrid detection that combines two algorithms to reduce complexity is presented in **Paper VI**. Several architecture level techniques are combined with algorithmic optimizations to result in an energy efficient, high throughput signal detector for massive MIMO base stations.

#### Notation

The following notation is used throughout this dissertation. Non-bold letters, e.g. $a$ or $\alpha$, are used for scalars. Bold upper case letters, e.g. $\boldsymbol{A}$, are used for matrices and bold lower case letters, e.g. $\boldsymbol{a}$, are used for column vectors. The Hermitian conjugate is $(\cdot)^H$. The transpose of a matrix or a vector is denoted by the $(\cdot)^T$ operator, and the complex conjugate is denoted by $(\cdot)^*$. Numbered subscripts are used for column vectors in a matrix, e.g. $\boldsymbol{a}_1$ refers to the first column of a matrix $\boldsymbol{A}$. Two subscripts are used when referring to a single element of a matrix, e.g. $a_{14}$ is the element in the first row and fourth column of a matrix $\boldsymbol{A}$.

## Chapter 2

## Digital Baseband Processing

This chapter begins by presenting methods to achieve high data rates with good reliability in communication links between a Base Station (BS) and a mobile User Equipment (UE). The effects of the wireless channel on data reception are examined, and methods for increasing the resilience of communication links to channel fading while improving spectral efficiency are presented. Later, the building blocks of a typical wireless receiver are introduced with a focus on the tasks performed in the digital baseband. A brief introduction to properties of wireless channels and challenges in system design for cellular standards are presented. Finally, the need for adaptive processing and techniques exploiting channel properties in order to achieve energy efficiency are discussed. The following chapter will present optimizations of algorithms and hardware implementation details for a few key blocks presented in this chapter.

#### 2.1 Introduction

The demand for cellular data has increased exponentially in the past few years and is expected to reach around 50 Exabytes/month by the end of this decade [10]. Increasing communication bandwidth is one of the ways to increase capacity, but the price/Hz for sub-6 GHz frequencies has reached exorbitant levels, with a recent auction fetching \$45 billion for 65 MHz of spectrum [11]. Thus, spectral efficiency has become critical, which has spurred researchers into finding methods to maximize the data rate while serving an increasing number of users in a cellular network. Time and frequency multiplexing allow the BS to share a limited spectrum with multiple users, and spatial multiplexing with multiple antennas is used to increase communication link capacity. However, advanced signal processing algorithms are required at the receiver to decode data and to obtain the high data rates promised by these methods. The additional performance gains from these algorithms come at the cost of increased power dissipation, which is a major concern in mobile devices operating on limited battery energy. Thus, the design of wireless receivers optimized for energy efficiency has been at the forefront of research for several years.

Today's smartphones can communicate using different wireless standards, for example, Bluetooth, Wi-Fi, 3G, more recently 4G or Long Term Evolution-Advanced (LTE-A), and the next generation of 5G devices are just around the corner. These mobile devices operate in cellular systems, where a BS transmits and receives data from multiple devices simultaneously. The signals transmitted from the BS are scattered by objects in the environment such as trees and buildings before reaching the UEs. Thus, the signals arrive at the UE through multiple paths, each with a different delay and attenuation. These multi-path components vary over time, resulting in a time varying frequency selective channel, causing small scale fading at the UE. Additionally, the movement of the UEs and scatterers in the environment results in shadow fading. The received signal also experiences a path loss in proportion to the distance between the BS and UEs [12]. Furthermore, many network operators and mobile devices may communicate with different standards in nearby frequencies, causing interference. Nonetheless, mobile receivers are expected to operate even with very low received signal power, in the range of -95 dBm for LTE-A compliant devices [13]. Techniques such as increasing transmit power, data pre-coding, choosing between modulation alphabets, spread spectrum signaling [12], channel coding with interleaving [14] and diversity schemes [15] have been introduced to increase the reliability and capacity of communication links between the BS and UEs.

#### 2.1.1 Wireless Access Technologies

The received signal power and the communication Bandwidth (BW) govern the capacity of the communication link, or the data rate at which information can be reliably exchanged. The Shannon-Hartley theorem defines an upper limit on this data rate in an Additive White Gaussian Noise (AWGN) channel. This rate, C, can be computed by

$$C = \mathrm{BW} \times \log_2 \left( 1 + \frac{S}{N} \right), \tag{2.1}$$

where BW is the bandwidth measured in Hz, S is the signal power and N is the noise power. Thus, capacity can be increased by either using more bandwidth or increasing the signal power. The latter provides limited gains due to its logarithmic dependency. Furthermore, the power transmitted from the BS has to satisfy regulatory limits and also has practical limitations. Hence, increasing BW has been the preferred choice in the cellular standardization process. A BW of 200 kHz was used in earlier 2G systems, which has now increased to 100 MHz in 4G systems. Though a theoretically linear increase in channel capacity can be obtained by increasing BW, the Bit Error Rate (BER) performance, which determines the reliability of the link, is dependent on channel properties such as fading and interference from external sources. Strong interferers cause problems in the Radio Frequency (RF) front end circuits in the receiver, which thus require careful design for linearity. Fading can be effectively handled in the digital baseband and some methods of reducing fading effects are presented below.

Figure 2.1: OFDM signaling.
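As a rough numerical illustration of (2.1), the sketch below compares doubling the bandwidth with doubling the signal power. The function name `shannon_capacity` and the operating point (20 MHz, 20 dB) are assumptions chosen for illustration. $S/N$ is also assumed to stay fixed when the bandwidth changes, which ignores the fact that noise power normally grows with bandwidth.

```python
import math


def shannon_capacity(bw_hz, snr_db):
    """Upper bound on the data rate in bit/s for an AWGN channel, as in (2.1)."""
    snr_linear = 10.0 ** (snr_db / 10.0)  # S/N expressed as a power ratio
    return bw_hz * math.log2(1.0 + snr_linear)


# Hypothetical operating point: 20 MHz of bandwidth at 20 dB SNR.
base = shannon_capacity(20e6, 20.0)
double_bw = shannon_capacity(40e6, 20.0)  # twice the bandwidth, same S/N
double_power = shannon_capacity(20e6, 20.0 + 10.0 * math.log10(2.0))  # twice S

print(f"baseline       : {base / 1e6:.1f} Mbit/s")
print(f"2x bandwidth   : {double_bw / 1e6:.1f} Mbit/s")
print(f"2x signal power: {double_power / 1e6:.1f} Mbit/s")
```

Under these assumptions, twice the bandwidth doubles the bound, whereas twice the signal power raises it by roughly 15%, which reflects the logarithmic dependency noted above.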

#### Orthogonal Frequency Division Multiplexing

Multi-path propagation of signals with different delays causes frequency selective fading. Additionally, the mobility of the UEs results in a time varying frequency selectivity. The Orthogonal Frequency Division Multiplexing (OFDM) technique makes it easier to equalize these fading effects and has been adopted in many wireless standards. It operates by dividing the communication BW into multiple bands called sub-carriers, each with a much smaller BW. The division is performed by examining channel properties and the sub-carrier BWs are chosen so that the channel's frequency selectivity is invisible over any given sub-carrier. More importantly, the sub-carriers are tightly spaced as shown in Figure 2.1(a) and are orthogonal in both time and frequency. Figure 2.1(b) depicts the time domain equivalent of the four sub-carriers, each with an increasing frequency. The OFDM symbol period is chosen to include an integer number of cycles of each component sub-carrier. However, a Cyclic Prefix (CP) is often necessary when operating in multi-path wireless channels, which increases the length of the symbol.
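The orthogonality property can be checked numerically. The sketch below is an illustration under assumed parameters (64 samples per symbol, hypothetical helper names `subcarrier` and `correlation`): sub-carriers that complete an integer number of cycles in the symbol period have zero correlation with each other, while a tone with a non-integer number of cycles does not.

```python
import cmath


def subcarrier(cycles, n_samples):
    """One symbol period of a complex tone completing `cycles` cycles."""
    return [cmath.exp(2j * cmath.pi * cycles * n / n_samples) for n in range(n_samples)]


def correlation(a, b):
    """Normalized inner product of two tones over one symbol period."""
    return sum(x * y.conjugate() for x, y in zip(a, b)) / len(a)


N = 64  # assumed number of samples per OFDM symbol
same = abs(correlation(subcarrier(3, N), subcarrier(3, N)))
different = abs(correlation(subcarrier(1, N), subcarrier(2, N)))
leaky = abs(correlation(subcarrier(1, N), subcarrier(1.5, N)))

print(f"same sub-carrier        : {same:.3f}")
print(f"distinct integer cycles : {different:.3f}")
print(f"non-integer cycle count : {leaky:.3f}")
```

The last case hints at why frequency offsets between transmitter and receiver are harmful in OFDM: a shifted tone no longer completes an integer number of cycles and leaks into its neighbours.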

UEs in a cell experience different levels of selectivity and fading due to their distributed nature and dissimilar mobility. This may result in situations where some UEs have a good channel at a frequency where other UEs do not. The BS may exploit this phenomenon when using OFDM to distribute the full communication BW among multiple UEs with a goal of maximizing the Signal to Noise Ratio (SNR) at each receiver [12]. Each sub-carrier can also be individually modulated with a different alphabet, allowing the transmitter to increase spectral efficiency with higher order modulation alphabets such as 64-Quadrature Amplitude Modulation (QAM) or 256-QAM [16]. From a hardware perspective, OFDM can be implemented efficiently with the Fast Fourier Transform (FFT) algorithm, and its corresponding butterfly structures provide a simple way to divide the communication BW among several users.

Figure 2.2: Multiple antenna communication.

Notwithstanding these advantages, practical utilization of OFDM modulation has an overhead in the form of a CP, which is needed to avoid inter symbol interference in multi-path channels. Accurate timing and carrier synchronization are also needed, requiring analog calibration and digital compensation to maintain orthogonality between the digitized sub-carriers. Furthermore, the peak to average power ratio of OFDM signaling is high, requiring more than a few dB of back-off in power amplifiers during transmission [14]. Though tight spacing of the sub-carriers increases spectral efficiency, 10% - 40% of the BW is not used for data transmission to minimize adjacent channel leakage. Nevertheless, OFDM has been widely adopted in several standards such as 802.11ac high speed Wi-Fi and LTE-A, and is also considered for next generation cellular communication.
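The role of the CP can be sketched with a toy link. The example below is a simplified illustration, not a model of any standard: it assumes a noiseless two-tap channel, eight sub-carriers, a direct DFT instead of an FFT, and a receiver that knows the channel exactly. The helper names `dft` and `max_symbol_error` are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(N^2) DFT, adequate for a short illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def max_symbol_error(symbols, taps, cp_len):
    """Send one OFDM symbol through a noiseless channel and return the worst error."""
    n = len(symbols)
    time_signal = dft(symbols, inverse=True)
    tx = time_signal[n - cp_len:] + time_signal  # prepend the cyclic prefix
    # Linear convolution with the channel; samples before the burst are zero.
    rx = [sum(taps[l] * tx[i - l] for l in range(len(taps)) if i - l >= 0)
          for i in range(len(tx))]
    body = rx[cp_len:cp_len + n]  # discard the prefix at the receiver
    channel_freq = dft(list(taps) + [0j] * (n - len(taps)))
    equalized = [y / h for y, h in zip(dft(body), channel_freq)]  # one tap per tone
    return max(abs(e - s) for e, s in zip(equalized, symbols))


# Assumed test data: eight 4-QAM symbols and a two-path channel.
qam = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
taps = [1.0, 0.5]

with_cp = max_symbol_error(qam, taps, cp_len=1)
without_cp = max_symbol_error(qam, taps, cp_len=0)
print(f"worst error with CP   : {with_cp:.2e}")
print(f"worst error without CP: {without_cp:.2e}")
```

With a prefix at least as long as the channel memory, the channel acts as a circular convolution on the symbol body and a single complex division per sub-carrier recovers the data. Without it, the same equalizer leaves a residual error on every sub-carrier.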

#### Multiple Antenna Communication

Although the amount of BW allocated for cellular communication has increased, the number of subscribers for mobile connectivity has grown at an even faster rate. Resource scheduling with frequency and time multiplexing are two techniques used by network operators to share a limited BW among multiple users. The BS can use feedback on channel conditions to schedule downlink data transmissions, with the goal of maximizing the data rates and QoS offered to users. The spatial domain offers a third alternative to increase data rates and link reliability, and is exploited by using multiple antennas at the BS and the UE [17]. These antennas can be used to either increase the SNR with diversity techniques or to create multiple communication streams. Figure 2.2 depicts such a spatial multiplexing system which has four antennas at both transmitter and receiver. A MIMO system with N transmitter and M receiver antennas may be represented in the baseband by

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{2.2}$$

where $\mathbf{y} = [y_1, y_2, ..., y_M]^T$ is the received data vector, $\mathbf{H} \in \mathbb{C}^{M \times N}$ is the complex-valued channel gain matrix and $\mathbf{x} = [x_1, x_2, ..., x_N]^T$ is the baseband equivalent of the transmitted signal. All other sources of interference and noise are modeled by the vector $\mathbf{n}$. The receiver estimates the effects of the channel by the use of predefined reference signals sent from the transmitter at regular intervals and informs the BS about the channel conditions during uplink data transfer. Based on this feedback, the BS adapts the communication link by methods such as pre-coding to cancel some of the channel effects, changing modulation format or switching between diversity schemes to improve SNR. In good channel conditions, such as when $\mathbf{H}$ in (2.2) is full rank with a low condition number and SNR at the receiver is high, parallel data communication with multiple streams can take place between BS and UE. On the other hand, diversity schemes may be employed in scenarios when the UE experiences low SNR due to deep fading [15].
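The influence of the condition number of $\mathbf{H}$ in (2.2) can be illustrated with a $2 \times 2$ system. The sketch below uses zero-forcing detection, a simple linear detector chosen here only as a stand-in, and the matrices, noise vector and function names are all assumptions made for the example.

```python
import math


def condition_number_2x2(h):
    """Ratio of largest to smallest singular value of a non-singular 2x2 matrix."""
    (a, b), (c, d) = h
    trace = abs(a) ** 2 + abs(b) ** 2 + abs(c) ** 2 + abs(d) ** 2  # trace of H^H H
    det = abs(a * d - b * c) ** 2  # determinant of H^H H
    spread = math.sqrt(max(trace ** 2 - 4.0 * det, 0.0))
    return math.sqrt((trace + spread) / (trace - spread))


def receive(h, x, n):
    """Baseband model y = Hx + n for two streams."""
    return [h[0][0] * x[0] + h[0][1] * x[1] + n[0],
            h[1][0] * x[0] + h[1][1] * x[1] + n[1]]


def zero_forcing_2x2(h, y):
    """Estimate x by multiplying y with the inverse of H."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return [(d * y[0] - b * y[1]) / det, (a * y[1] - c * y[0]) / det]


x = [1 + 1j, -1 + 1j]  # transmitted 4-QAM symbols
noise = [0.01, -0.01]  # same small disturbance in both cases
good = [[1.0, 0.3], [0.2, 1.0]]  # nearly decoupled streams
ill = [[1.0, 0.99], [0.99, 1.0]]  # nearly identical paths

results = {}
for name, h in (("good", good), ("ill", ill)):
    estimate = zero_forcing_2x2(h, receive(h, x, noise))
    error = max(abs(e - s) for e, s in zip(estimate, x))
    results[name] = (condition_number_2x2(h), error)
    print(f"{name:4s}: condition number {results[name][0]:7.1f}, worst error {error:.3f}")
```

The same disturbance that is negligible for the well-conditioned matrix is amplified strongly when the two spatial paths are almost identical, which is one reason why multi-stream transmission is reserved for good channel conditions.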

#### Massive MIMO

The communication link quality can be improved by increasing the number of antennas at either the transmitter, the receiver or at both ends. However, increasing the number of antennas at the BS is easier due to factors such as the availability of a direct source of power and physical space. Furthermore, FPGAs and digital signal processors can be used to implement baseband algorithms instead of Application Specific Integrated Circuits (ASICs) as hardware efficiency and power consumption are not as great a concern as in a UE. Massive MIMO systems are based on this concept and typically employ around 100 or more BS antennas to serve 10 to 20 active users [18, 19]. Such a system is shown in Figure 2.3 where M antennas at the BS communicate with K single antenna UEs. A large number of antennas at the BS provides high spatial resolution which may be used to accurately precode transmit signals. This precoding may be chosen to equalize channel effects and minimize interference, thus simplifying baseband processing on the battery operated UEs. The high spatial resolution also enables the full BW to be simultaneously assigned to multiple users, increasing system capacity. Small scale fading, which degrades the performance in small scale MIMO systems such as the ones in Figure 2.2, can be reduced significantly with simple signal processing. The transmit power from each of the antennas at the BS can also be reduced to the order of milliwatts [18], compared to the power in the range of tens of watts in currently deployed BSs [20]. Moreover, the robustness of the BS is increased as the impact of failure from a few transceivers is limited. The massive MIMO system can also be represented by (2.2), with $\mathbf{H} \in \mathbb{C}^{M \times K}$ and an additional constraint of $M \gg K$, where M is the number of antennas at the BS and K is the single antenna UE count. Furthermore, linear pre-coding and detection schemes can be used in massive MIMO systems as the number of BS antennas is typically much larger than the number of users.

Figure 2.3: A massive MIMO system with M antennas at the BS and K single antenna UEs.

Although massive MIMO brings several advantages, it comes with its fair share of challenges. The data processing required at the BS increases due to a large number of radios. A central processing unit is needed to combine this data and process it in a timely fashion. Even though linear detection algorithms provide good performance, the dimension of the channel matrix $\mathbf{H}$ presents implementation challenges concerning processing latency and silicon area. Accurate channel estimation is another problem. Time Division Duplex (TDD) mode is often employed in massive MIMO systems and pilot symbols from the UEs may be used to simplify channel estimation. However, this may limit the number of users that can be simultaneously served and lead to interference and pilot contamination between nearby cells [9]. The quality of downlink pre-coding is directly dependent on the estimation accuracy, and reciprocity calibration is needed in TDD systems [21]. Notwithstanding these challenges, massive MIMO has shown promising results and is thus one of the leading candidates for the 5G communication era [22].
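One intuition for why linear schemes suffice when $M \gg K$ is that the user channels become nearly orthogonal as the number of BS antennas grows. The sketch below illustrates this under an assumed i.i.d. Rayleigh fading model with unit-variance entries, which is an idealization and not a measured channel. The function name, the user count of four and the seed are hypothetical choices.

```python
import random


def mean_cross_correlation(num_bs_antennas, num_users, seed=0):
    """Average magnitude of the off-diagonal entries of H^H H / M."""
    rng = random.Random(seed)
    scale = 0.5 ** 0.5  # real and imaginary parts each carry half the variance
    h = [[complex(rng.gauss(0, scale), rng.gauss(0, scale)) for _ in range(num_users)]
         for _ in range(num_bs_antennas)]
    total, pairs = 0.0, 0
    for i in range(num_users):
        for j in range(i + 1, num_users):
            inner = sum(h[m][i].conjugate() * h[m][j] for m in range(num_bs_antennas))
            total += abs(inner) / num_bs_antennas
            pairs += 1
    return total / pairs


small = mean_cross_correlation(4, 4)
large = mean_cross_correlation(256, 4)
for m, value in ((4, small), (256, large)):
    print(f"M = {m:3d}: mean cross-correlation between users = {value:.3f}")
```

As the cross-correlation shrinks, $\mathbf{H}^H\mathbf{H}$ approaches a scaled identity matrix, so a matched filter or a low-cost approximate inverse separates the users with little residual interference.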



Figure 2.4: Structure of a multi-antenna wireless receiver.



Figure 2.5: Digital baseband blocks in a MIMO-OFDM receiver.

#### 2.2 The Digital Baseband

MIMO and OFDM, presented in the previous subsection, are two methods to improve communication capacity and reliability. A practical implementation of these techniques in low power mobile radios needs efficient hardware architectures to maximize energy efficiency. A simplified block diagram of a wireless receiver capable of using MIMO and OFDM is shown in Figure 2.4. The RF signals transmitted from a BS are collected by multiple antennas at the receiver, and are amplified and down converted to baseband frequencies with Low Noise Amplifiers (LNAs) and mixers. Analog Channel Select Filters (CSFs) remove interference close to the baseband and the ADCs digitize the filtered signal. These samples are processed by the digital baseband circuits where further filtering, equalization and signal detection are performed before the data is used by the application software. The digital baseband also detects, calibrates and compensates for imperfections in the analog front end. Figure 2.5 shows a more detailed view of the digital blocks. The ADCs often oversample the signals, which are further filtered and converted to the baseband rate with decimators. Synchronization is used to lock the UE to the BS followed by detection of residual timing and frequency offsets. The analog calibration block provides feedback to components such as mixers to adjust carrier frequencies if needed, or tunes the performance of blocks such as LNAs, CSFs and ADCs. The digital compensation block improves the signal quality by performing tasks such as Carrier Frequency Offset (CFO) and In-phase and Quadrature (IQ) imbalance removal. The OFDM demodulator, implemented with FFTs, converts the signal into its frequency domain equivalent. The Channel Estimation (CE) unit detects the effects of the multi-path channel and feeds data to the channel preprocessor. The received signals and the data from the preprocessor are used for symbol detection followed by decoding. The combined goal of the digital baseband blocks is to use information such as the estimates of channel gains, modulation format and SNR to decode signals from the analog front end into a stream of bits (symbols). The performance of the receiver while achieving this goal may be measured by using metrics such as the Bit Error Rate (BER), which indicates the fraction of bits incorrectly detected in a long stream of data bits. This dissertation adopts BER as one of the metrics to evaluate the performance of baseband algorithms and corresponding hardware solutions in different operating conditions. The next few subsections describe the functionality of the blocks in the MIMO-OFDM receiver in Figure 2.5 in more detail. Some common problems in these receivers and corresponding algorithmic implementations to mitigate their effects to reduce BER are also discussed.

#### 2.2.1 Digital Front End

The digital front end performs tasks such as compensation and calibration of analog blocks, sample rate conversion and synchronization on each stream of data from the multiple receiver chains in a MIMO system.

#### Decimation

The ADC is the interface between the analog signals and the digital baseband circuits. Different implementations of ADCs are available, and oversampling is often employed to reduce inband noise and enhance resolution. $\Delta\Sigma$ ADCs are one such implementation that provide high resolution with low power consumption [23]. These ADCs require filters to remove out-of-band noise and sample rate converters to match the output rate to the baseband rate. Oversampling may also be used to relax decimation filter requirements and to improve timing synchronization in the baseband [24]. Furthermore, multi-mode cellular devices capable of decoding signals from different wireless standards also require sample rate converters. The filtering process and the sample rate conversion are usually combined into the decimation process [25]. Finite Impulse Response (FIR) configurations are popular due to their simple architecture and half band FIR filter implementations are used to reduce cost [26]. These filters are implemented to provide configurable output rates, to match different BWs and sample rates of corresponding standards depending on the receiver settings [27]. The decimated data is used by the synchronization block to find the reference carrier frequency and OFDM symbol timing.
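A minimal sketch of combined filtering and rate conversion is given below. The 7-tap half band coefficients are an assumed textbook-style example, not taken from the cited designs. They show the property that makes half band filters cheap: every second tap, apart from the center one, is zero.

```python
# Assumed 7-tap half band low-pass filter; taps 1 and 5 are zero by construction.
HALF_BAND = [-1 / 32, 0.0, 9 / 32, 16 / 32, 9 / 32, 0.0, -1 / 32]


def decimate_by_two(samples, taps=HALF_BAND):
    """Low-pass filter and halve the sample rate in one pass.

    Only the outputs that survive the rate change are computed, which is
    what makes a combined decimator cheaper than filtering followed by
    discarding samples.
    """
    out = []
    for n in range(len(taps) - 1, len(samples), 2):  # skip the start-up transient
        out.append(sum(taps[k] * samples[n - k] for k in range(len(taps))))
    return out


dc_input = [1.0] * 32  # lowest in-band frequency
nyquist_input = [(-1.0) ** n for n in range(32)]  # highest out-of-band frequency

dc_output = decimate_by_two(dc_input)
nyquist_output = decimate_by_two(nyquist_input)
print("DC passes with gain    :", dc_output[0])
print("Nyquist tone suppressed:", nyquist_output[0])
```

In hardware, the zero-valued taps need no multipliers and the symmetric taps can share one multiplier per pair, which is one reason such structures are attractive for configurable decimation chains.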



Figure 2.6: Effects of timing errors.

#### Synchronization

Wireless receivers do not share a common reference with the BS and hence require both timing and frequency synchronization (Sync.). Additionally, OFDM systems rely on maintaining the orthogonality of sub-carriers. To ease the synchronization process, OFDM systems organize data transmission into structures called frames or packets by combining several data symbols and insert special reference or training symbols at regular intervals [28]. The UE may perform timing synchronization using these reference signals in two steps. First, coarse frame synchronization is done to get a rough estimate of the starting sample of a frame [29, 30]. Next, accurate symbol start may be obtained to find the first sample in the OFDM symbol with the cyclic prefix [24]. However, fine synchronization may be hard to accomplish in the presence of noise and interference. Methods such as scheduling the inputs to the FFT unit in the right order can be used to improve the result after coarse synchronization is achieved.

The effects of incorrect timing synchronization are shown in Figure 2.6(a), where Inter Symbol Interference (ISI) causes a complete breakdown of communication. Figure 2.6(b) shows the constellation diagrams obtained with coarse synchronization, where the ISI is completely removed, and with fine synchronization, which results in an ideal 4-QAM constellation.

The second type of synchronization required is between the carrier frequencies of the BS and the UE. The local oscillator at the UE needs to feed the mixers with the same carrier frequency as used by the BS to upconvert the baseband transmit signals. Good carrier synchronization is required to avoid inter-carrier interference, and the next subsection presents the effects of Carrier Frequency Offset (CFO) and a method to mitigate its effect.

#### Digital Compensation and Analog Calibration

Designers have invented many techniques to improve the performance of components in the analog front end. However, Process, Voltage and Temperature (PVT) variations, which cannot be completely controlled at design time, affect linearity and precision. Thus, these variations must be compensated for, either with analog techniques [4, 31, 32] or in the digital domain [33, 34]. Digital implementations are more robust to PVT variations due to the on-off nature of digital circuits. As an example of digital compensation, consider the problem of inter-carrier interference due to CFO. Coherent detection, used in a majority of modern wireless communication systems, relies on the capability of the receiver's oscillators to lock to the transmitter's carrier frequency. However, a perfect lock is not always achieved and CFO values of around 0.1 ppm are accepted in communication standards such as LTE-A [13]. CFO causes inter-carrier interference in OFDM systems and Figure 2.7(a) shows the effect of 150 Hz and 500 Hz offsets, corresponding to approximately 1\% and 3\% of the sub-carrier spacing in LTE-A, respectively [13]. There are mainly two methods of compensating for such problems due to analog imperfections in the digital domain.

The *first* method uses the computation power of digital circuits to cancel the non-linearities in the baseband, and algorithms with their corresponding hardware implementations have been proposed in literature to detect and mitigate the effects [35,36]. One of the choices is the use of a digital compensation block, such as the one in Figure 2.5. For example, a fractional CFO value of $+\Delta f$, which is less than half the sub-carrier spacing in an OFDM system, may be canceled by multiplying the time domain samples by a complex sinusoid as

$$\begin{aligned} y(n)_{\text{cfo\_fixed}} &= y(n) \times e^{\frac{-j2\pi\Delta fn}{F_s}}, \quad \forall n \in \{0, 1, 2, \cdots\} \bmod BL, \\ BL &= \frac{F_s}{\Delta f}, \end{aligned} \tag{2.3}$$

where  $F_s$  is the baseband sample rate and BL is the block repetition length. Similarly IQ imbalance from mixers may be canceled in the digital domain by estimating the amplitude and phase imbalance [36].
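The compensation in (2.3) can be sketched as a table lookup. The example below assumes that $F_s/\Delta f$ is an integer so that the phasor table repeats exactly after BL samples, and uses a hypothetical sample rate of 1.92 MHz with a 150 Hz offset. The function names are illustrative only.

```python
import cmath


def apply_cfo(samples, delta_f, fs):
    """Model a carrier frequency offset of +delta_f on the baseband samples."""
    return [s * cmath.exp(2j * cmath.pi * delta_f * n / fs) for n, s in enumerate(samples)]


def compensate_cfo(samples, delta_f, fs):
    """Cancel the offset with a phasor table of length BL = fs / delta_f, as in (2.3)."""
    block_len = round(fs / delta_f)  # assumes fs / delta_f is an integer
    table = [cmath.exp(-2j * cmath.pi * delta_f * n / fs) for n in range(block_len)]
    return [s * table[n % block_len] for n, s in enumerate(samples)]


FS = 1.92e6  # assumed baseband sample rate in Hz
DELTA_F = 150.0  # assumed residual offset in Hz, giving BL = 12800

# Repeating 4-QAM pattern, long enough to wrap around the table more than twice.
pattern = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
clean = [pattern[n % 4] for n in range(30000)]

rotated = apply_cfo(clean, DELTA_F, FS)
fixed = compensate_cfo(rotated, DELTA_F, FS)

worst_before = max(abs(r - c) for r, c in zip(rotated, clean))
worst_after = max(abs(f - c) for f, c in zip(fixed, clean))
print(f"worst error before compensation: {worst_before:.3f}")
print(f"worst error after compensation : {worst_after:.2e}")
```

Because the correction phasor is periodic with BL, a hardware implementation only needs to store or generate one block of values, which is the motivation for the modulo operation in (2.3).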

Figure 2.7: Effects of analog imperfections.

The simplified baseband model of a MIMO system in (2.2) assumes linearity of the wireless channel which is valid for all practical cellular communication systems. Additionally, the model also relies on the linearity of the analog front end which is difficult to achieve in all conditions, especially under PVT variations. Non-linearities create intermodulation, increasing interference in the baseband, and can be modeled by

$$\begin{aligned} \mathbf{y}_{ant} &= \mathbf{H}\mathbf{x} + \mathbf{n}, \\ \mathbf{y} &= \begin{bmatrix} Gy_{ant1} + \alpha y_{ant1}^2 + \beta y_{ant1}^3 + \cdots \\ \vdots \\ Gy_{antM} + \alpha y_{antM}^2 + \beta y_{antM}^3 + \cdots \end{bmatrix} + \mathbf{n}, \end{aligned} \tag{2.4}$$

where y is the baseband signal similar to (2.2),  $y_{ant1}$  is the baseband equivalent of the signal at the receiver antenna 1,  $y_{antM}$  is the baseband equivalent of the signal at the receiver antenna M. The combined linear gain of the analog front end is denoted by G. The scaling factors for the second and third order non-linearity components  $y_{ant}^2$  and  $y_{ant}^3$  are  $\alpha$  and  $\beta$  respectively. In practical receivers, the values of  $\alpha$  and  $\beta$  are much lower than G. Nonetheless, if the signal  $y_{ant}$  contains frequency components  $f_1$  and  $f_2$ , intermodulation between these frequencies results in components at

$$(f_1 - f_2), (f_1 + f_2), (2f_1 - f_2), (2f_1 + f_2), (2f_2 - f_1), (2f_2 + f_1), \cdots \tag{2.5}$$

The second order intermodulation terms $(f_1 - f_2)$, $(f_1 + f_2)$ are reduced to the level of device mismatch by using differential signaling in the analog front end. In the presence of strong interference, the third order intermodulation may affect receiver sensitivity and corrupt the signals of interest. Figure 2.7(b) shows the effects of $3^{rd}$ Order Intermodulation (IM3) distortion on the baseband signal. Another important metric for receiver performance in addition to BER is the Error Vector Magnitude (EVM). IM3 distortion with a power 26 dB lower than the baseband signal results in an EVM of around 5%. The maximum allowed EVM is dependent on the signaling constellation and 4G systems require the combined EVM of the analog front end to be lower than 8% for 64-QAM modulation [37]. Additional phase noise from the oscillators [38], IQ imbalance in the mixers and sample clock offsets in the ADCs will further increase EVM. The variable nature of these non-linearities and their dependence on external factors such as interference signals require frequent monitoring and calibration to ensure signal fidelity.
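The figures quoted above can be reproduced with a short calculation. The sketch assumes that the distortion is the only error contribution and that EVM is defined as the ratio of RMS error to RMS signal amplitude. The function names and the interferer frequencies are hypothetical.

```python
import math


def evm_percent(distortion_db):
    """EVM caused by a distortion `distortion_db` dB relative to the signal power."""
    return 100.0 * 10.0 ** (distortion_db / 20.0)  # amplitude ratio, hence /20


def distortion_db_for_evm(evm_pct):
    """Distortion level relative to the signal that yields the given EVM."""
    return 20.0 * math.log10(evm_pct / 100.0)


def third_order_products(f1, f2):
    """Third order intermodulation frequencies of two tones, as listed in (2.5)."""
    return sorted({2 * f1 - f2, 2 * f2 - f1, 2 * f1 + f2, 2 * f2 + f1})


print(f"IM3 at -26 dB gives an EVM of {evm_percent(-26.0):.1f}%")
print(f"an 8% EVM budget corresponds to {distortion_db_for_evm(8.0):.1f} dB")

# Two assumed interferers at 10 MHz and 11 MHz (values in MHz).
print("third order products:", third_order_products(10, 11))
```

The products at $2f_1 - f_2$ and $2f_2 - f_1$ fall right next to the original tones, which is why third order distortion cannot be removed by filtering and must be handled through linearity.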

The **second** approach of mitigating these non-linearities, in contrast to the first completely digital compensation technique, relies on detection and tuning the RF components towards higher linearity. This approach may be more power efficient as the digital circuits associated with tuning can be switched off once the desired level of performance is achieved.

#### OFDM Demodulator

The data from the digital compensation block in Figure 2.5 is processed by the OFDM demodulator. One of the key benefits of using OFDM is its low complexity FFT based implementation. Several efficient algorithms with different radices have been proposed with the radix-2 and radix-4 adopted in many hardware implementations [39]. Mixed radices may also be beneficial in some scenarios [40, 41]. The OFDM demodulator block converts the time domain signals into their frequency domain equivalents to be used by the following channel estimators and symbol detectors. For MIMO systems, one FFT unit is used for each antenna port of the receiver.

#### 2.2.2 Channel Estimation

The signal at receiver antennas is a modified and combined version of the transmitted signals as depicted in Figure 2.2. The effects of the wireless channel on the transmitted data have to be estimated before they can be equalized. The CE block in the baseband performs this task with the help of reference symbols and tones transmitted by the BS at different frequencies and time instants. Though the addition of reference signals reduces resources available to transmit data, a certain minimum number of pilots are essential to detect frequency selectivity and time variations in the channel. Figure 2.8(a) shows 4-QAM data received through such a frequency selective channel. The pilot tones, interspersed with data tones, are compared against a set of reference values to estimate the effects of the channel. These estimates are then used for equalization resulting in the constellation shown in Figure 2.8(b). It is evident from Figure 2.8 that accurate CE is necessary to recover data. Several implementations with varying complexity based on the Singular Value Decomposition (SVD) [42], the Minimum Mean Square Error (MMSE), the Least Squares (LS) criterion [12, 43], and matching pursuit algorithms [41, 44] have been proposed. The UE may use the CE data to provide feedback to the BS for changing parameters such as BW, modulation and spatial multiplexing order.

Figure 2.8: Effects of frequency selective channel. (a) Received data @ SNR = $25\,\mathrm{dB}$. (b) Data equalized with Least Squares (LS) estimates.
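A minimal sketch of pilot based LS estimation for a single antenna link is shown below. The pilot spacing, the pilot value, the two-path channel and the linear interpolation between pilots are all assumptions made for illustration. Practical receivers use standard-specific pilot patterns and more elaborate interpolation.

```python
import cmath


def ls_channel_estimate(rx_pilots, tx_pilots):
    """Per-tone LS estimate: received pilot divided by the known pilot."""
    return [y / x for y, x in zip(rx_pilots, tx_pilots)]


def interpolate(pilot_estimates, spacing, n_tones):
    """Linearly interpolate pilot estimates onto every tone between the pilots."""
    full = []
    for k in range(n_tones):
        left = min(k // spacing, len(pilot_estimates) - 2)
        frac = (k - left * spacing) / spacing
        full.append((1 - frac) * pilot_estimates[left] + frac * pilot_estimates[left + 1])
    return full


# Assumed setup: 17 tones, a pilot on every 4th tone, noiseless two-path channel.
N_TONES, SPACING = 17, 4
channel = [1 + 0.5 * cmath.exp(-2j * cmath.pi * k / 64) for k in range(N_TONES)]
tx = [1 + 1j if k % SPACING == 0 else complex((-1) ** k, (-1) ** (k // 2))
      for k in range(N_TONES)]
rx = [h * x for h, x in zip(channel, tx)]

pilot_idx = list(range(0, N_TONES, SPACING))
h_pilots = ls_channel_estimate([rx[k] for k in pilot_idx], [tx[k] for k in pilot_idx])
h_full = interpolate(h_pilots, SPACING, N_TONES)
equalized = [y / h for y, h in zip(rx, h_full)]

worst = max(abs(e - x) for e, x in zip(equalized, tx))
print(f"worst equalization error on any tone: {worst:.4f}")
```

The residual error here comes only from interpolating between pilots. A channel that varies faster across frequency would need denser pilots, which is the trade-off between overhead and estimation accuracy described above.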

#### 2.2.3 Channel Matrix Preprocessing

The data from the CE unit is used by the symbol detector to equalize channel effects and to decouple the spatial streams in MIMO systems. A channel preprocessing unit is employed to re-format the data from the CE unit to ease the implementation of these symbol detectors. This preprocessing usually involves matrix decompositions, the results of which may also be used in the precoding process for downlink transmission in massive MIMO systems. One of the methods proposed in [45] uses an SVD preprocessor to convert the channel matrix into a product of two unitary matrices and a diagonal matrix. Another preprocessor based on QRD converts the input matrix into a product of a unitary and an upper triangular matrix. LU decomposition may be used to solve systems of linear equations, similar to Gaussian elimination [46]. Massive MIMO systems may use approximate matrix inversions instead of direct matrix inversions [47] to reduce hardware cost and speed up the detection process. Cholesky Decomposition (CD), LDL decomposition [48] and Eigenvalue decomposition are other methods which can also be used in massive MIMO systems. In this dissertation, two of these preprocessing operations, namely the QRD and the CD, are considered.

### QR Decomposition

A matrix may be decomposed into a unique product of an orthonormal matrix Q and an upper triangular matrix R (up to their signs). The channel estimate H in (2.2) is in most cases non-singular and hence, the QRD process may be used. Consider an estimate H in a $4 \times 4$ MIMO system

$$\boldsymbol{H} = \begin{bmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{bmatrix}. \tag{2.6}$$

The QRD of this matrix yields

$$\boldsymbol{H} = \boldsymbol{Q}\boldsymbol{R} = \begin{bmatrix} q_{11} & q_{12} & q_{13} & q_{14} \\ q_{21} & q_{22} & q_{23} & q_{24} \\ q_{31} & q_{32} & q_{33} & q_{34} \\ q_{41} & q_{42} & q_{43} & q_{44} \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ 0 & r_{22} & r_{23} & r_{24} \\ 0 & 0 & r_{33} & r_{34} \\ 0 & 0 & 0 & r_{44} \end{bmatrix}. \tag{2.7}$$

The matrix Q is unitary, i.e., $Q^HQ = I$, where I is the identity matrix, and R is upper triangular with real entries on the diagonal. Algorithms used for QRD can be classified into two broad categories. The first category orthonormalizes the H matrix by a series of right multiplications with upper triangular matrices. An example of this triangular orthogonalization category is the Gram-Schmidt (GS) process. The second category, classified under orthogonal triangularization, converts H into an upper triangular matrix by a series of left multiplications with orthonormal matrices [46]. The Householder Transform (HHT) and Givens rotations fall into this category. These two sets of algorithms can be mathematically described as

$$\begin{aligned} \text{Triangular orthogonalization} &\to Q = HR_1R_2\cdots R_N, \text{ and} \\ Q_M\cdots Q_2Q_1H = R &\to \text{Orthogonal triangularization}, \end{aligned}$$

where  $R_1, R_2, \dots R_N$ , are upper triangular matrices and  $Q_1, Q_2, \dots, Q_M$  are unitary matrices.

Algorithm 1 lists the GS process, and a pictorial representation of the GS algorithm on a $2 \times 2$ matrix is presented in Figure 2.9. The process starts by using the first column vector $\boldsymbol{h}_1$ of the input $\boldsymbol{H}$ matrix as the starting reference, and $\boldsymbol{q}_1$ is obtained by normalizing the length of $\boldsymbol{h}_1$. The projection of $\boldsymbol{h}_2$ in the direction of $\boldsymbol{q}_1$, $(\boldsymbol{q}_1^H\boldsymbol{h}_2)\boldsymbol{q}_1$, is then subtracted from $\boldsymbol{h}_2$, resulting in a vector orthogonal to $\boldsymbol{q}_1$. The second orthonormal vector $\boldsymbol{q}_2$ is obtained by normalizing this result. The full $\boldsymbol{Q}$ matrix is constructed by using the individual vectors $\boldsymbol{q}_1, \boldsymbol{q}_2, \cdots, \boldsymbol{q}_N$. Although the algorithm is straightforward to

#### **Algorithm 1** Gram-Schmidt based QRD for $M \times N$ matrix.

```
procedure GramSchmidt(H)                        ▷ Multiplications
    ▷ Initialize Q and R to all-zero matrices
    Q ← 0
    R ← 0
    ▷ Start decomposition
    for j ← 1 to N do
        a_j ← h_j
        for i ← 1 to j − 1 do
            r_ij ← q_i^H h_j                    ▷ M
            a_j ← a_j − r_ij q_i                ▷ M
        end for
        r_jj ← sqrt(a_j^H a_j)                  ▷ M
        q_j ← a_j / r_jj                        ▷ M
        ▷ Update columns of Q
        Q_j ← q_j
    end for
    return Q, R
end procedure
```



Figure 2.9: Gram-Schmidt orthogonalization process.

implement in hardware, the GS process suffers from instabilities in fixed point implementations, and an alternative version of the algorithm called the Modified Gram-Schmidt (MGS) is used to improve stability [46]. The complexity of algorithms is often measured by the number of multiplication operations, and the total multiplication count of the GS algorithm for an $M \times N$ matrix is obtained by

$$\sum_{j=1}^{N} \left( \sum_{i=1}^{j-1} 2M + 2M \right) = MN^2 + MN. \tag{2.8}$$

The complexity is in the order of  $\mathcal{O}(M^3)$  for full rank square matrices of size  $M \times M$ .
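As a cross-check of the count in (2.8), the classical GS process can be sketched in plain Python with a multiplication counter. This is an illustrative sketch under the same counting convention as Algorithm 1 (each length-$M$ inner product, vector update, norm and normalization counts as $M$ multiplications); the function name and the list-of-rows matrix layout are assumptions made here.

```python
import math


def gram_schmidt_qr(H):
    """Classical Gram-Schmidt QRD of an M x N matrix given as a list of rows.

    Returns (Q, R, mults), where mults follows the counting of Algorithm 1.
    Assumes full column rank, so that no r_jj is zero.
    """
    M, N = len(H), len(H[0])
    cols = [[H[m][n] for m in range(M)] for n in range(N)]
    q_cols = []
    R = [[0j] * N for _ in range(N)]
    mults = 0
    for j in range(N):
        a = list(cols[j])
        for i in range(j):
            # r_ij = q_i^H h_j  (M multiplications)
            r_ij = sum(q.conjugate() * h for q, h in zip(q_cols[i], cols[j]))
            R[i][j] = r_ij
            # a_j = a_j - r_ij q_i  (M multiplications)
            a = [a_m - r_ij * q_m for a_m, q_m in zip(a, q_cols[i])]
            mults += 2 * M
        # r_jj = sqrt(a_j^H a_j)  (M multiplications)
        r_jj = math.sqrt(sum((x.conjugate() * x).real for x in a))
        R[j][j] = r_jj
        # q_j = a_j / r_jj  (counted as M multiplications by 1 / r_jj)
        q_cols.append([x / r_jj for x in a])
        mults += 2 * M
    Q = [[q_cols[n][m] for n in range(N)] for m in range(M)]
    return Q, R, mults
```

For a $3 \times 2$ input the counter returns $MN^2 + MN = 18$, in agreement with (2.8).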

On the other hand, the HHT operates by using reflections, and an implementation of the HHT is shown in Algorithm 2. The pictorial representation of the HHT for the same example matrix as used for the GS process is shown in Figure 2.10. The transform starts by operating on the first column  $h_1$ , similar to the GS process. However, instead of normalizing this vector to produce  $q_1$ , the HHT reflects the vector  $h_1$  onto the X-axis to produce  $r_1$ . This is achieved by left multiplying  $h_1$  with the reflection matrix of the form

$$Q = \left(I - 2\frac{vv^H}{v^H v}\right),\tag{2.9}$$

where v is a vector orthogonal to the reflection plane. There are two such possible reflection planes, the edges of which are marked by  $P^+$  and  $P^-$ . These planes can be visualized as being perpendicular to the page on which Figure 2.10 is printed, but placed along the dashed lines marked by  $P^+$  and  $P^-$ . The HHT picks the plane which is farthest from the vector being reflected, resulting in improved resilience to rounding errors in fixed point hardware. The reflection matrix is applied on all subsequent columns of the input matrix, resulting in reflected vectors as shown in Figure 2.10. For the next iteration, the matrix is deflated by removing the first column and row of the modified H matrix and the reflection procedure is repeated. It can be noticed from Algorithm 2 that the HHT operates on vectors with decreasing size as j increases from 1 to N and the computational complexity is obtained by

$$\sum_{j=1}^{N} \left( \sum_{k=j}^{N} 2c + 2c \right) \approx MN^2 - \frac{N^3}{3}, \tag{2.10}$$

where c is defined as (M+1-j). The complexity of the HHT algorithm is also in the order of  $\mathcal{O}(M^3)$  for full rank square matrices. Although the complexity is lower than the GS process, it has to be noted that the Q matrix is not explicitly computed in the HHT algorithm.

#### **Algorithm 2** Householder Transform based QRD for $M \times N$ matrix.

```
procedure Householder(H)                                  ▷ Multiplications
    ▷ Initialize V to identity matrix
    V ← I
    ▷ Start decomposition
    for j = 1 to N do
        c ← M + 1 − j
        x ← H_{j:M,j}
        e ← I_{j:M,j}
        v ← sign(x_1) ‖x‖_2 e + x                         ▷ c
        r ← 2 / (v^H v)                                   ▷ c
        for k = j to N do
            H_{j:M,k} ← H_{j:M,k} − r v (v^H H_{j:M,k})   ▷ 2c
        end for
        V_j ← v
    end for
    return V, H
end procedure
```



Figure 2.10: The first Householder reflection.
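The reflection steps of Algorithm 2 can be sketched in plain Python as follows. This is a hedged illustration rather than the implementation evaluated in the dissertation: the function name is hypothetical, $M \geq N$ is assumed, and $\mathrm{sign}(x_1)$ is generalized to the unit phase of $x_1$ for complex inputs, so the diagonal of the result is in general complex rather than real.

```python
import math


def householder_qr(H):
    """Householder triangularization of an M x N matrix (list of rows, M >= N).

    Returns (V, R): the list of reflection vectors and the triangularized matrix.
    Q is not formed explicitly, mirroring Algorithm 2.
    """
    M, N = len(H), len(H[0])
    R = [[complex(x) for x in row] for row in H]
    V = []
    for j in range(N):
        x = [R[m][j] for m in range(j, M)]
        norm_x = math.sqrt(sum(abs(t) ** 2 for t in x))
        # Unit phase of x_1 (plays the role of sign(x_1)); choosing this sign
        # avoids cancellation when forming v.
        phase = x[0] / abs(x[0]) if abs(x[0]) > 0 else 1.0
        v = list(x)
        v[0] += phase * norm_x
        vv = sum(abs(t) ** 2 for t in v)
        if vv == 0.0:
            V.append(v)  # zero column: nothing to reflect
            continue
        r = 2.0 / vv
        for k in range(j, N):
            # proj = v^H H_{j:M,k}
            proj = sum(vm.conjugate() * R[j + m][k] for m, vm in enumerate(v))
            for m, vm in enumerate(v):
                R[j + m][k] -= r * vm * proj
        V.append(v)
    return V, R
```

Because every reflection is unitary, the Gram matrix is preserved, i.e. $\boldsymbol{R}^H\boldsymbol{R} = \boldsymbol{H}^H\boldsymbol{H}$, which gives a simple way to validate the result without forming $\boldsymbol{Q}$.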

#### Cholesky Decomposition

Cholesky decomposition (CD) may be used for decomposing Hermitian positive definite matrices. Such matrices are used in Zero-Forcing (ZF) and MMSE based symbol detectors. In massive MIMO systems with K single antenna users, the linear system represented by (2.2) has a large dimension due to the value of M (the number of antennas at the BS). This dimension may be reduced by multiplying (2.2) from the left by the Hermitian conjugate $\mathbf{H}^H$, resulting in the Gram matrix $\mathbf{H}^H\mathbf{H}$ of dimension $K \times K$. Such Hermitian matrices can be decomposed with either the LDL decomposition or, when they are positive definite, the Cholesky decomposition. Consider a Gram matrix defined as

$$\boldsymbol{H}^{H}\boldsymbol{H} = \begin{bmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{12}^{*} & h_{22} & h_{23} & h_{24} \\ h_{13}^{*} & h_{23}^{*} & h_{33} & h_{34} \\ h_{14}^{*} & h_{24}^{*} & h_{34}^{*} & h_{44} \end{bmatrix}. \tag{2.11}$$

CD is a symmetric process that decomposes the above matrix into a product of two matrices by triangular triangularization and can be represented as

$$\begin{aligned} CD(\mathbf{H}^{H}\mathbf{H}) &= (\mathbf{L}_{1}\mathbf{L}_{2}\cdots\mathbf{L}_{N})\left(\mathbf{L}_{N}^{H}\cdots\mathbf{L}_{2}^{H}\mathbf{L}_{1}^{H}\right) \\ &= \mathbf{L}\mathbf{L}^{H}, \end{aligned} \tag{2.12}$$

where  $\boldsymbol{L}$  is a lower triangular matrix.

This process can be visualized as the expansion of $\boldsymbol{H}$ with a non-orthonormal basis, which are the columns of $\boldsymbol{L}$. Algorithm 3 shows an implementation of the CD process and Figure 2.11 depicts its operation on a Hermitian $2\times 2$ matrix. To begin with, the vector $\boldsymbol{s}_1$ is scaled by the square root of its first element to obtain $\boldsymbol{l}_1$. The component of $\boldsymbol{s}_2$ in the direction of the first basis vector $\boldsymbol{l}_1$ is subtracted from $\boldsymbol{s}_2$ to get the scaled version of the second basis vector $\beta \boldsymbol{l}_2$. Similar to the HHT algorithm, the CD process continues by operating on successively smaller submatrices to fully triangularize the input Hermitian matrix. This algorithm not only requires less memory than QRD, as in-place replacements can be performed, but also has a lower multiplication count given by

$$\sum_{i=1}^{K} \left( \sum_{j=i+1}^{K} \left( \sum_{k=i+1}^{j} 1 + 1 \right) + 1 \right) = \frac{1}{6} (K)(K+1)(K+2). \tag{2.13}$$

#### **Algorithm 3** Cholesky Decomposition for $K \times K$ matrix.

```
procedure CholeskyDecomposition(H^H H)          ▷ Multiplications
    ▷ Initialize L with lower triangular part
    L ← tril(H^H H)
    S ← L
    ▷ Start decomposition
    for i = 1 to K do
        d ← 1 / sqrt(L_ii)
        L_ii ← d L_ii                           ▷ 1
        for j = i + 1 to K do
            L_ji ← d L_ji                       ▷ 1
            for k = i + 1 to j do
                L_jk ← L_jk − L_ji L_ki^*       ▷ 1
            end for
        end for
    end for
    return L
end procedure
```



Figure 2.11: Cholesky decomposition based triangularization.
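The in-place procedure of Algorithm 3 and the count in (2.13) can be checked with a short plain-Python sketch. The trailing-submatrix update is written here in the standard outer-product form $L_{jk} \leftarrow L_{jk} - L_{ji}L_{ki}^{*}$, which is an assumption about the intended update; the function name and matrix layout are likewise illustrative.

```python
import math


def cholesky(A):
    """Outer-product Cholesky of a Hermitian positive definite K x K matrix.

    Works on a copy of the lower triangle and returns (L, mults), where
    mults counts one multiplication per scaling or update step.
    """
    K = len(A)
    L = [[complex(A[r][c]) if c <= r else 0j for c in range(K)] for r in range(K)]
    mults = 0
    for i in range(K):
        d = 1.0 / math.sqrt(L[i][i].real)
        L[i][i] *= d  # diagonal element becomes sqrt of the pivot
        mults += 1
        for j in range(i + 1, K):
            L[j][i] *= d  # scale column i below the diagonal
            mults += 1
            for k in range(i + 1, j + 1):
                # Update the lower triangle of the trailing submatrix.
                L[j][k] -= L[j][i] * L[k][i].conjugate()
                mults += 1
    return L, mults


# Example with an assumed 3 x 3 real symmetric positive definite matrix.
L_example, mults_example = cholesky([[4, 2, 2], [2, 5, 3], [2, 3, 6]])
# mults_example == 10 == K(K+1)(K+2)/6 for K = 3
```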

#### 2.2.4 Symbol Detection

Symbol detection is the process of finding an estimate of the transmitted vector  $\boldsymbol{x}$  in (2.2) with the lowest probability of error. A Maximum Likelihood (ML) detector for a MIMO system finds an estimate  $\boldsymbol{x}_{ML}$  by using the channel estimates  $\boldsymbol{H}$  and the received signal  $\boldsymbol{y}$  as

$$\boldsymbol{x}_{ML} = \underset{\widehat{\boldsymbol{x}} \in \mathcal{S}}{\operatorname{argmin}} \| \boldsymbol{y} - \boldsymbol{H} \widehat{\boldsymbol{x}} \|^2 \tag{2.14}$$

where $\widehat{\boldsymbol{x}}$ is the estimate of the transmit data vector in the set $\mathcal{S}$ of all possible transmit vectors. This set $\mathcal{S}$ grows exponentially with the constellation size and the number of transmitter antennas, leading to a high complexity for full ML detection. The sphere decoder reduces this complexity by looking for the estimate in a smaller subset of $\mathcal{S}$, called the search space. However, choosing the search space may not be easy and a variable amount of time is required to solve the detection problem [49]. A modified version of the sphere decoder is the K-Best algorithm, which does not guarantee that the best candidate vector is found, but has a fixed execution time. It may also be implemented with a parallel architecture and hence is preferred for hardware implementations [50,51]. The QRD preprocessor is often used to simplify the implementation of these K-Best detectors. Consider (2.2) and the QRD of $\mathbf{H} = \mathbf{Q}\mathbf{R}$. Left multiplying the received vector $\mathbf{y}$ in (2.2) with the Hermitian conjugate $\mathbf{Q}^H$ results in a rotated vector $\hat{\mathbf{y}}$ and a modified noise vector $\hat{\mathbf{n}}$ with properties similar to the original i.i.d. Gaussian noise vector $\mathbf{n}$. This may be represented as

$$\begin{aligned} \mathbf{Q}^{H}\mathbf{y} &= \mathbf{R}\mathbf{x} + \mathbf{Q}^{H}\mathbf{n} \\ \hat{\mathbf{y}} &= \mathbf{R}\mathbf{x} + \hat{\mathbf{n}} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ 0 & r_{22} & r_{23} & r_{24} \\ 0 & 0 & r_{33} & r_{34} \\ 0 & 0 & 0 & r_{44} \end{bmatrix} \mathbf{x} + \hat{\mathbf{n}}. \end{aligned} \tag{2.15}$$

Non-linear detectors operate on the upper triangular system in (2.15) to find the final estimate for the transmit vector.
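To make the cost of full ML detection in (2.14) concrete, the exhaustive search can be sketched as below. The sketch assumes an unnormalized 4-QAM alphabet and a hypothetical function name; it enumerates all $|\text{alphabet}|^{N}$ candidate vectors, which is exactly the growth that sphere decoding and K-Best detection are designed to avoid.

```python
from itertools import product

# Unnormalized 4-QAM alphabet (assumed for illustration).
QAM4 = [complex(re, im) for re in (-1, 1) for im in (-1, 1)]


def ml_detect(H, y, alphabet=QAM4):
    """Exhaustive ML detection: minimize ||y - H x||^2 over all candidate vectors."""
    n_tx = len(H[0])
    best, best_metric = None, float("inf")
    for cand in product(alphabet, repeat=n_tx):
        metric = sum(
            abs(y_m - sum(h * x for h, x in zip(row, cand))) ** 2
            for row, y_m in zip(H, y)
        )
        if metric < best_metric:
            best, best_metric = cand, metric
    return list(best)
```

For a $4 \times 4$ system with 64-QAM the same loop would visit $64^4 \approx 1.7 \times 10^{7}$ candidates per received vector.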

The complexity of the detection process may be further reduced with linear detectors at the expense of performance loss. A linear estimate  $x_L$  may be obtained by

$$\boldsymbol{x}_L = f(\boldsymbol{H})\boldsymbol{y},\tag{2.16}$$

where $f(\mathbf{H})$ represents the operation performed by the linear detector. The computationally simple Matched Filtering (MF) detector uses $f(\mathbf{H}) = \mathbf{H}^H$, where $\mathbf{H}^H$ is the Hermitian conjugate of the channel estimate. The MF process maximizes SNR, and is also called maximum ratio combining when used at the receiver, or maximum ratio transmission when used at the transmitter. The MF estimate of a received signal $\mathbf{y}$ is obtained from (2.2) by

$$\begin{aligned} \mathbf{x}_{MF} &= \mathbf{H}^H \mathbf{y} \\ &= \mathbf{H}^H \mathbf{H} \mathbf{x} + \mathbf{H}^H \mathbf{n}. \end{aligned} \tag{2.17}$$

The ZF detector uses $f(\mathbf{H}) = \mathbf{H}^{\dagger}$, where $\mathbf{H}^{\dagger}$ is the pseudo-inverse of the channel estimate, and operates on (2.2) to produce the ZF estimate $\mathbf{x}_{ZF}$ as

$$\begin{aligned} \mathbf{x}_{ZF} &= \mathbf{H}^{\dagger} \mathbf{y} \\ &= \mathbf{H}^{\dagger} \mathbf{H} \mathbf{x} + \mathbf{H}^{\dagger} \mathbf{n}. \end{aligned} \tag{2.18}$$

The ZF detector eliminates interference between the data streams but enhances noise. The MMSE detector balances interference cancellation and noise enhancement and may be obtained when

$$f(\mathbf{H}) = \left(\mathbf{H}^H \mathbf{H} + \alpha \mathbf{I}\right)^{-1} \mathbf{H}^H, \tag{2.19}$$

where  $\alpha$  is dependent on the SNR, and I is an identity matrix.
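The three linear detectors differ only in the choice of $f(\mathbf{H})$, which the following sketch makes explicit for two spatial streams. It assumes a full column rank channel, so that the pseudo-inverse equals $(\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H$ and ZF becomes the $\alpha = 0$ case of (2.19); the helper names and the closed-form $2 \times 2$ inverse are illustrative choices, not part of the dissertation.

```python
def hermitian(A):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[A[r][c].conjugate() for r in range(len(A))] for c in range(len(A[0]))]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
        for r in range(len(A))
    ]


def inv2(A):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def linear_detector(H, alpha=None):
    """Return f(H) for an M x 2 channel.

    alpha=None -> MF, alpha=0 -> ZF (full column rank assumed), alpha>0 -> MMSE.
    """
    Hh = hermitian(H)
    if alpha is None:
        return Hh
    G = matmul(Hh, H)  # 2 x 2 Gram matrix
    G[0][0] += alpha
    G[1][1] += alpha
    return matmul(inv2(G), Hh)
```

Applying `matmul(linear_detector(H, 0.0), H)` returns the identity up to rounding, which is the interference elimination property of ZF noted above.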

The performance of linear detectors deteriorates in the presence of fading and noise, and thus small scale MIMO systems employ non-linear detection schemes. However, in massive MIMO systems, channels have low correlation, or are nearly orthogonal under favorable conditions, and linear schemes such as ZF provide nearly optimal performance. Solving (2.18) and (2.19) in such systems requires the inverse of matrices with large dimensions. Instead of computing the full inverse, an approximate method based on the Neumann series is proposed in [47,52]. Nonetheless, accurate inverses may be needed as the number of users increases or when using higher order constellations.
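As a brief sketch of the idea, and not necessarily the exact formulation used in [47,52], let $\mathbf{A} = \mathbf{H}^H\mathbf{H} + \alpha\mathbf{I}$ and let $\mathbf{D}$ denote its diagonal part (the symbols $\mathbf{A}$, $\mathbf{D}$ and the number of terms $T$ are notation assumed here). A truncated Neumann series then approximates the inverse as

$$\mathbf{A}^{-1} \approx \sum_{n=0}^{T-1} \left(\mathbf{I} - \mathbf{D}^{-1}\mathbf{A}\right)^{n} \mathbf{D}^{-1}.$$

The series converges when the spectral radius of $\mathbf{I} - \mathbf{D}^{-1}\mathbf{A}$ is below one, which is favored when $\mathbf{A}$ is strongly diagonally dominant, i.e. when the user channels are nearly orthogonal. For $T = 1$ the approximation reduces to $\mathbf{D}^{-1}$ and needs only scalar reciprocals, while larger $T$ trades additional matrix multiplications for accuracy.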

# 2.3 Channel Properties and Adaptive Processing

A typical cellular communication system operates in an environment such as the one shown in Figure 2.12. The signals transmitted from the BS interact with multiple scattering objects such as trees, buildings and cars before reaching the receivers. The distributed nature of the users may also result in scenarios where a few of them receive higher signal power from the BS than others. Additionally, positions of mobile UEs in a cell change over time, resulting in varying environment conditions and channel gains. Furthermore, several network operators, external interference and noise sources are also present close to the signals of interest, which degrade the quality of the desired signal. The properties of these channels such as frequency selectivity, time variance and interference are examined next.



Figure 2.12: A cellular communication system.



Figure 2.13: Fading channels for different environments and mobility.

#### 2.3.1 Frequency Selectivity

The wireless channel can be modeled as a multi-tap linear filter and, like any filter, it can be characterized by its frequency response. As an example, the channels in 4G systems are classified into three categories based on frequency selectivity, namely the Extended Pedestrian A (EPA), Extended Vehicular A (EVA) and the Extended Typical Urban (ETU) models with the Power Delay Profiles (PDPs) listed in Table 2.1. The PDPs describe the power of a received signal in multi-path channels as a function of time [12].

| EPA Channel     |            | EVA Channel     |            | ETU Channel     |            |
|-----------------|------------|-----------------|------------|-----------------|------------|
| Path delay (ns) | Power (dB) | Path delay (ns) | Power (dB) | Path delay (ns) | Power (dB) |
| 0               | 0          | 0               | 0          | 0               | -1         |
| 30              | -1         | 30              | -1.5       | 50              | -1         |
| 70              | -2         | 150             | -1.4       | 120             | -1         |
| 90              | -3         | 310             | -3.6       | 200             | 0          |
| 110             | -8         | 370             | -0.6       | 230             | 0          |
| 190             | -17.2      | 710             | -9.1       | 500             | 0          |
| 410             | -20.8      | 1090            | -7         | 1600            | -3         |
|                 |            | 1730            | -12        | 2300            | -5         |
|                 |            | 2510            | -16.9      | 5000            | -7         |

Table 2.1: Power delay profiles of 4G channel models.

The EPA channel has the shortest PDP, resulting in low frequency selectivity with a response similar to the one in Figure 2.13(a). It can be noticed that in the frequency domain, the channel gain changes slowly over the range of tens of sub-carriers. An EVA channel has medium selectivity due to longer delays in its PDP of up to $2.5\,\mu\mathrm{s}$, and the ETU channel has the highest selectivity, similar to Figure 2.13(b). The level of selectivity, or the degree of correlation between two frequencies, can be expressed in terms of the coherence bandwidth, defined as the average frequency difference that is required for the correlation to drop below a certain threshold [12]. Highly selective channels have a small coherence bandwidth and vice versa, indicating that baseband tasks such as CE can be performed less frequently in channels with low selectivity. The coherence bandwidth of a channel is related to the root mean square (r.m.s.) delay spread of its PDP, and the largest value for the three models in Table 2.1 is 991 ns for the ETU model. The 90% coherence bandwidth ($C_{BW_{90}}$) of this channel may be calculated by

$$C_{BW_{90}} = \frac{1}{50 \times \text{(r.m.s. delay spread)}} \tag{2.20}$$

and has a value of around 20 kHz [53]. The 90% coherence bandwidth for the EPA channel is around $470\,\mathrm{kHz}$.
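The numbers quoted above can be reproduced from the ETU column of Table 2.1 with a few lines of Python. The sketch assumes the usual power-weighted definition of the r.m.s. delay spread (square root of the second central moment of the PDP); the function and constant names are hypothetical.

```python
import math


def rms_delay_spread_ns(delays_ns, powers_db):
    """Power-weighted r.m.s. delay spread of a power delay profile, in ns."""
    p = [10 ** (g / 10) for g in powers_db]  # dB -> linear power
    total = sum(p)
    mean = sum(pi * t for pi, t in zip(p, delays_ns)) / total
    second = sum(pi * t * t for pi, t in zip(p, delays_ns)) / total
    return math.sqrt(second - mean ** 2)


def coherence_bw_90_hz(rms_ns):
    """90% coherence bandwidth according to (2.20), in Hz."""
    return 1.0 / (50.0 * rms_ns * 1e-9)


# ETU profile from Table 2.1.
ETU_DELAYS_NS = [0, 50, 120, 200, 230, 500, 1600, 2300, 5000]
ETU_POWERS_DB = [-1, -1, -1, 0, 0, 0, -3, -5, -7]

etu_rms_ns = rms_delay_spread_ns(ETU_DELAYS_NS, ETU_POWERS_DB)  # about 991 ns
etu_cbw_hz = coherence_bw_90_hz(etu_rms_ns)                     # about 20 kHz
```

Running the same functions on the EPA column gives an r.m.s. delay spread of roughly 43 ns, consistent with the coherence bandwidth of several hundred kHz stated above.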

#### 2.3.2 Time Selectivity

The frequency response of the channel changes over time due to mobility of UEs and scatterers such as cars. Similar to the coherence bandwidth in the frequency domain, the rate of this change over time can be expressed by the coherence time, defined as the duration over which the channel correlation in time changes by a certain amount. Slow moving or fixed UEs experience smaller changes over time, as shown in Figure 2.13(a), as opposed to a UE moving at higher speeds as depicted in Figure 2.13(b). The variations in the channel response over time can be attributed to Doppler frequency shifts, which are dependent on the relative mobility between the UE and the BS, the carrier frequency and the angle at which the UE is moving towards or away from the multi-path components. The maximum difference in these frequency shifts is measured with the Doppler spread. The 4G channel models capture these changes over time with three different Doppler spreads of 5 Hz, 70 Hz and 300 Hz, corresponding to a mobility of around 2.5 km/h, 36 km/h and 150 km/h respectively at a carrier frequency of 2.1 GHz.

Figure 2.14: $3^{rd}$ order intermodulation interference.
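The correspondence between speed and Doppler frequency quoted above follows from the standard relation $f_D = v f_c / c$ for the maximum Doppler shift. A minimal sketch, with a hypothetical function name:

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def max_doppler_shift_hz(speed_kmh, carrier_hz):
    """Maximum Doppler shift f_D = v * f_c / c for a terminal moving at speed_kmh."""
    return (speed_kmh / 3.6) * carrier_hz / SPEED_OF_LIGHT_M_S


# At a 2.1 GHz carrier, 36 km/h corresponds to roughly 70 Hz.
doppler_36_kmh = max_doppler_shift_hz(36.0, 2.1e9)
```

The same function gives just under 5 Hz at 2.5 km/h and just under 300 Hz at 150 km/h, matching the rounded values in the text.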

#### 2.3.3 Spatial Selectivity

Mobile UEs are spatially distributed over a cell as shown in Figure 2.12. The multiple antennas can be used for beamforming, to transmit or receive power to/from a desired spatial direction. However, when the antennas are very closely spaced, or if there are not enough scatterers around, the signals received will be spatially correlated. This may introduce difficulties in resolving the parallel communication links and lead to increased requirements in signal detectors. Small scale MIMO systems where the UEs have a small form factor are affected the most. Thus, the design of baseband algorithms for such UEs should also consider spatial selectivity to ensure high performance. On the other hand, in BSs like those in massive MIMO systems, the large number of antennas allows accurate spatial selection, where even two physically close UEs may be separated with relative ease [54].

There may also be scenarios with significant differences in channel conditions, resulting in low correlation among the resolvable paths. The channels between the BS and some UEs may be spatially well separated from another group of UEs. Such information can be used in massive MIMO base stations to optimize computations for reducing power dissipation by switching between different algorithms.

#### 2.3.4 Interference

The reference sensitivity in wireless receivers is in the range of  $-95 \, dBm$ , allowing them to communicate at extremely low received signal power. Furthermore, interference influences the UE performance, especially in scenarios where the UE is close to the cell edge. In such scenarios, the UE may use the maximum allowed transmit power to communicate with the BS which can cause severe interference from intermodulation in its own receiver when operating in the Frequency Division Duplex (FDD) mode. For example, in 4G systems, band 17 in the 700 MHz range may cause IM3 distortion which falls in band 4 at 2100 MHz as shown in Figure 2.14 [13]. An off chip duplexer with attenuation of around 50 dB is typically employed to isolate the receiver of the UE from its transmitter. Nonetheless, a maximum transmit power of 25 dBm may act as a strong interferer even with the duplexer, when received signal power is close to the reference sensitivity. Furthermore, several other standards and operators share the frequency spectrum which may result in additional interferers at the receiver [23]. Designing highly linear analog front ends is a good way of minimizing the effects of external interferers. However, this is a difficult task to achieve, especially in mobile devices operating with limited battery energy.

#### 2.3.5 System Parameters

The Long Term Evolution (LTE) standard from the 3rd Generation Partnership Project (3GPP) is designed for 4G cellular communication systems and makes use of MIMO and OFDM techniques. There are many system parameters such as BW, modulation alphabet, coding rate etc., that can be dynamically changed to suit not only channel conditions, but also UE capabilities and requirements. The coherence bandwidth and coherence time of channels where these systems will be deployed play an important role in OFDM system design, influencing parameters such as sub-carrier spacing, pilot placement and CP length. The LTE standard employs pilots both in time and frequency to ease channel estimation. To support spatial multiplexing, pilots are defined in an orthogonal fashion for the antenna ports. Figure 2.15 shows an example placement of these pilots. A sub-carrier spacing of 15 kHz is chosen to ensure frequency flatness over individual sub-carriers, and pilots are placed every six sub-carriers to handle channels similar to the ETU model. In the time domain, the symbols are around 66 µs long and a CP of 5 µs is used to minimize inter symbol interference when operating in channels with long delay spreads. Very high velocities of 500 km/h are supported, resulting in Doppler shifts of around 1 kHz at a carrier frequency of 2.1 GHz. At least two pilots are placed every 0.5 ms to detect these variations over the time domain. The communication BW ranges from 1.4 MHz when using 128 sub-carriers to 20 MHz for operation with 2048 sub-carriers per component carrier, and pilots consume around 5% of these resources. Furthermore, some of the sub-carriers are reserved for the guard bands and only part of the full BW is used for actual data transmission. For example, 1200 sub-carriers are used when operating in the $20\,\mathrm{MHz}$ BW mode.

Figure 2.15: Pilot structure in $4 \times 4$ LTE-A.
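The figures quoted for the 20 MHz mode follow directly from the 15 kHz sub-carrier spacing. As a short worked check, with $\Delta f$ and $T_u$ introduced here as notation for the sub-carrier spacing and the useful symbol duration:

$$T_u = \frac{1}{\Delta f} = \frac{1}{15\,\mathrm{kHz}} \approx 66.7\,\mu\mathrm{s}, \qquad 1200 \times 15\,\mathrm{kHz} = 18\,\mathrm{MHz}, \qquad 2048 \times 15\,\mathrm{kHz} = 30.72\,\mathrm{MHz}.$$

The 1200 used sub-carriers therefore occupy 18 MHz of the 20 MHz channel, leaving about 10% for guard bands, while the 2048-point transform implies a baseband sample rate of 30.72 MHz.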

LTE release 10 or LTE-A is an evolved version of LTE with new features such as 8 × 8 MIMO configuration in the downlink and Carrier Aggregation (CA). The CA technique allows up to five component carriers to be combined, increasing communication BW to 100 MHz, enabling theoretical speeds of around 3 Gbps when used with  $8 \times 8$  MIMO. Most of the frequency bands used for cellular communication below 2.6 GHz have a BW less than 100 MHz. Thus, inter-band aggregation is supported, where component carriers from different frequency bands may be combined. These bands can be either contiguous or non-contiguous as depicted in Figure 2.16 and provide flexibility for network operators to use the precious frequency resources efficiently. LTE supports both FDD and TDD. FDD is used mainly in the lower frequency bands where paired spectra are available, and TDD is used at higher frequencies where wider BWs are available. When operating in the FDD mode, data is transmitted and received simultaneously, but on two different frequency bands. In the TDD mode, the same frequency band is used in a time multiplexed fashion for both uplink and downlink communication. Three modulation schemes are used for data transmission, namely 4-QAM, 16-QAM and 64-QAM. Control information is usually modulated with 4-QAM, and channel coding is employed for error detection and correction [14].



Figure 2.16: Carrier aggregation in LTE-A.

#### 2.3.6 The Need for Adaptive Processing

The previous sections introduced properties of wireless channels and interference conditions. Some of the central functionalities of a typical MIMO-OFDM receiver were presented, and features in recent 4G standards to increase the data rate were discussed. This subsection serves as a short review of some of the challenges and presents the need for adaptive processing in wireless receivers.

Typically, a BS serves hundreds of users by dynamically allocating resources to the UEs in a cellular communication system. However, applications such as high definition video streaming and web browsing require significantly different levels of QoS. Thus, the performance of the blocks in Figure 2.4 may be adaptively tuned based on the QoS requirements of the application in use. For example, the EVM requirements from the analog front end may be relaxed when receiving 4-QAM data as opposed to when receiving 64-QAM data [37]. Analog blocks in some of the many parallel receiver chains may be completely shut down when not using spatial multiplexing. On the digital side, the baseband sample rate may be reduced to match the BW assigned to the UE. Decimation filters can be reconfigured, or parts of them can be completely turned off and the resolution of the FFT block can be changed to reduce computations. When receiving wideband signals such as in CA scenarios, higher clock frequencies or multiple blocks may be operated in parallel. Frequency selectivity and user mobility can also be monitored to lower computations in favorable channels. Another solution would be to exploit properties such as coherence bandwidth and coherence time with interpolation techniques to lower computation cost.
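One way to picture the interpolation idea mentioned above is to compute channel estimates only at pilot sub-carriers and fill in the tones between them. The sketch below uses simple linear interpolation with edge values held constant; this is an illustrative choice and not necessarily the interpolation scheme studied in this dissertation, and the function name and arguments are hypothetical.

```python
def interpolate_channel(pilot_idx, pilot_est, n_subcarriers):
    """Linearly interpolate complex channel estimates between pilot sub-carriers.

    pilot_idx must be sorted in increasing order; tones outside the pilot range
    reuse the nearest pilot estimate.
    """
    est = []
    seg = 0
    for k in range(n_subcarriers):
        if k <= pilot_idx[0]:
            est.append(pilot_est[0])
            continue
        if k >= pilot_idx[-1]:
            est.append(pilot_est[-1])
            continue
        # Advance to the pilot pair that brackets sub-carrier k.
        while pilot_idx[seg + 1] < k:
            seg += 1
        k0, k1 = pilot_idx[seg], pilot_idx[seg + 1]
        w = (k - k0) / (k1 - k0)
        est.append((1 - w) * pilot_est[seg] + w * pilot_est[seg + 1])
    return est
```

With pilots every six sub-carriers, only one estimate per six tones has to be computed explicitly, and the approximation error stays small as long as the pilot spacing is well within the coherence bandwidth.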

A baseband processor designed for the most extreme scenario of 100 MHz BW and highly frequency selective channels which does not use adaptive techniques will be expensive in terms of silicon area, will consume high power and thus result in poor battery life. Hence, implementations optimized for the common use case with reconfigurability to support challenging scenarios and throughput requirements are needed. This dissertation examines some techniques to achieve this goal of providing a variable but good enough QoS, while optimizing computations in the baseband to achieve energy efficiency.

# Chapter 3

# Complexity and Power Reduction

The previous chapter described some of the building blocks in a typical wireless receiver. The chapter also introduced several challenges in implementing these components and opportunities for optimization. Consequently, this chapter presents methods to reduce computational complexity, thereby enabling a tradeoff between performance and energy efficiency. This is done by adopting a co-optimization approach between different stages of the design implementation cycle for circuits in wireless communication.

#### 3.1 Introduction

The design methodology for digital signal processing circuits can be divided into three stages. To begin with, algorithms suitable for a given task, such as CE, QRD or signal detection are evaluated, and the ones which provide the required performance with low complexity are chosen. In the next stage, hardware architectures for these algorithms are examined. Finally, an architecture capable of meeting the throughput demands is implemented with a selected CMOS technology. Several optimization techniques can be used throughout this design process, and Figure 3.1 shows some of the methods available at different stages. Although algorithmic level choices alone have a larger impact than architectural or circuit level decisions, even higher gains can be obtained by combining and co-optimizing across these three stages [55–57].

Digital circuit design for mobile wireless receivers follows a similar principle, with a hard requirement on energy efficiency. The wireless channel and interference sources are highly variable, increasing the difficulty of designing a single implementation that is energy efficient in all scenarios. Implementations for handling the worst case scenarios can satisfy the performance requirements in all scenarios. But at the same time, they may lead to inefficient hardware circuits regarding power and silicon area in good channel scenarios, such as when operating with low time and frequency selectivity. Recognizing the variable nature of the requirements allows a more flexible approach, where hardware may be optimized for typical scenarios, providing adequate



Figure 3.1: Multi level cross-optimization in a wireless receiver design cycle.

QoS, with reconfigurability to handle the more extreme conditions. Hence, flexible or adaptive algorithms that can provide a broad range of performance-complexity tradeoffs are of interest in wireless receivers. However, flexibility usually comes at an increased cost, for example, in terms of silicon area. The architectural design stage should thus ensure an acceptable balance between configurability and overhead, while at the same time satisfying throughput and latency requirements. Techniques such as parallelism and pipelining may be used to increase the throughput, while folding with resource sharing may be used to reduce silicon area. Switching to newer technology nodes is another way of improving throughput while lowering power dissipation. Additionally, other circuit level strategies such as Dynamic Voltage and Frequency Scaling (DVFS), clock/power gating and body biasing can be used for a dynamic power-throughput tradeoff.

It is essential to perform a system level analysis with parameters such as typical requirements, channel conditions, available power and silicon cost to examine and co-optimize hardware implementations. For example, choosing an algorithm which can be implemented in a parallel architecture is not only beneficial for increasing throughput but also allows clock/power gating to be used on parts of the parallel implementation when throughput requirements are low. Another example is the selection of multiple algorithms that share common operations. Choosing an algorithm with high performance that can be adapted to provide intermediate performance by omitting some computations is a good way of achieving reconfigurability while increasing hardware efficiency. Such implementations will allow the receiver to choose among algorithms based on QoS requirements, allowing power reduction in good channels. Feedback can then be used as shown in Figure 1.1 to reconfigure the receiver towards better energy efficiency. This feedback is most useful in the digital domain due to the additional computation power and easier hardware reconfigurability. Nonetheless, a balance has to be found to ensure that overhead from feedback and reconfigurability is manageable in terms of complexity when weighed against possible performance gains. Thus, hardware friendly algorithms with simple methods of reconfiguration are desired to cover a broad range of power-performance tuning.

This chapter investigates some of the techniques on each of the three design stages in Figure 3.1, and evaluates their impact on performance, throughput and power. It starts by examining algorithmic level decisions and later describes architectural and circuit level optimizations geared towards efficient implementations of these algorithms. Some metrics such as **complexity**, **performance**, **reconfigurability**, **area**, **throughput** and **power** that can be tuned at each of these design stages have been highlighted in Figure 3.1. These parameters will be used in the following sections in figures tailored to illustrate the interaction between the many design optimization methods available at each stage.

### 3.2 Algorithmic Techniques for Adaptive Processing

A wide range of algorithms are available to realize the functionality of the different components in a wireless receiver. However, an algorithm suitable for one channel scenario may be unsuitable in other scenarios. Additionally, the QoS requirements from UEs are also variable. For example, streaming an online video might require bursts of high speed data with longer periods of low data rate communication. Therefore, algorithms that can adapt their performance to match a wide range of QoS requirements and channel conditions are of particular interest for battery operated wireless receivers.

#### 3.2.1 Algorithm Selection

An example design space at the algorithmic stage is shown in Figure 3.2. The goal in this stage is to choose an algorithm that can meet a target performance while minimizing complexity and providing a large degree of adaptability. As an example, consider an algorithm Alg. 1 that has a low complexity but does not provide the minimum required performance. Thus, another algorithm Alg. 2, with better performance is needed. However, the added performance comes at the cost of increased complexity, which translates to a higher number of



Figure 3.2: Algorithmic level optimizations.

computations and a corresponding increase in power consumption. Although the non-adaptive version of Alg. 2 is an acceptable solution from the performance perspective, the complexity may be too high. Figure 3.2 also shows an adaptive version of Alg. 2, which can reduce complexity for a loss in performance and thus, would be suitable for scenarios with variable requirements. This adaptability may come with an increased silicon cost and simple methods for reconfigurability are also desired from a hardware architecture perspective. Thus the algorithmic stage choices play an important role, and some examples of improving energy efficiency are presented below.

The tasks performed by the digital baseband blocks in Figure 2.5 can be accomplished with a plethora of algorithms. For example, the FFT operation may be based on algorithms with different radices. Channel estimation may be performed with channel statistics [42], pilot tones [58] or blind estimation [59]. There are also solutions that combine these techniques. Similarly, different types of channel preprocessing algorithms may be used [46] and three commonly used ones in MIMO systems are presented in subsection 2.2.3. Symbol detection is another operation where several algorithms with varying complexity, from the simple MF to the more complicated ML algorithm, are available. The ML algorithm finds the most probable transmitted symbol, given the received symbol, but has a large complexity and needs a considerable amount of hardware resources. MF, on the other hand, is very simple to implement in hardware but provides adequate performance only in non-interference limited scenarios. The ZF algorithm has a higher complexity than MF and is often used in interference limited conditions. Choosing algorithms that share operations increases reconfigurability in hardware implementations. For example, some of the operations in the ZF algorithm are similar to the operations performed with MF. Choosing a K-Best non-linear detector with a tunable K parameter also increases reconfigurability and offers power-performance trade-offs. Non-linear detection schemes are needed to combat fading and are a necessity in small scale MIMO systems, whereas massive MIMO systems can provide good performance with just linear detection schemes. This dissertation explores channel preprocessing and signal detection in small and massive MIMO systems, and methods of lowering complexity for these two operations on the algorithmic level are presented below.
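The sharing of operations between MF and ZF can be illustrated with a minimal sketch. It assumes a real-valued, noise-free 2×2 system and the textbook forms $\mathbf{H}^T\mathbf{y}$ for MF and $(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}$ for ZF; the function names and the channel values are hypothetical and are not taken from the dissertation.

```python
def matched_filter(H, y):
    """Return H^T y for a real-valued channel matrix H given as a list of rows."""
    cols = len(H[0])
    return [sum(H[r][c] * y[r] for r in range(len(H))) for c in range(cols)]


def gram(H):
    """Return the Gram matrix H^T H."""
    cols = len(H[0])
    return [[sum(H[r][i] * H[r][j] for r in range(len(H))) for j in range(cols)]
            for i in range(cols)]


def zero_forcing_2x2(H, y):
    """ZF estimate (H^T H)^{-1} H^T y for two streams, reusing the MF output."""
    mf = matched_filter(H, y)            # operation shared with the MF detector
    (a, b), (c, d) = gram(H)
    det = a * d - b * c                  # closed-form 2x2 inverse
    return [(d * mf[0] - b * mf[1]) / det, (-c * mf[0] + a * mf[1]) / det]


H = [[1.0, 0.5], [0.2, 1.0]]             # hypothetical channel matrix
s = [1.0, -1.0]                          # transmitted symbols
y = [sum(H[r][c] * s[c] for c in range(2)) for r in range(2)]  # noise-free receive vector
mf_out = matched_filter(H, y)
zf_out = zero_forcing_2x2(H, y)
```

In this sketch the ZF detector recovers the transmitted symbols, while the MF output is only a scaled and interference-affected version of them. A hardware implementation could therefore expose the MF result as a low-power mode and enable the Gram matrix inversion only when interference suppression is needed.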

#### 3.2.2 Algorithmic Approximations

Different types of optimizations such as reducing word lengths, approximating division or square root operations with a look up table, can be performed after algorithmic selection to balance silicon cost, performance, and energy efficiency. For example, fixed point hardware implementation has a limited accuracy, and the quantization of the internal variables affects algorithmic performance. Wider variables, represented with a higher number of bits provide good accuracy but require larger silicon area for storage and corresponding logic. Not all computations and variables have an equal impact, and hence word length optimizations may be performed to ensure accurate representation of only the critical variables while reducing widths in non-critical parts. This requires co-optimization with the architectural level to find the right balance between performance and silicon cost. The FFT operation implemented with the multistage architecture requires longer word lengths towards later stages to retain precision. Alternate implementations with a floating point unit are also possible which may require additional hardware blocks to handle data scaling [60]. Another example is the output of the division operation in QRD and CD processes, that has a higher impact on the BER performance than other variables. The word lengths of these units can be optimized by examining algorithmic performance prior to implementation.
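The effect of word length on accuracy can be sketched with a simple rounding model. The example below assumes plain round-to-nearest quantization of the fractional part and ignores saturation and overflow; the function name and the chosen word lengths are illustrative only.

```python
def quantize(value, frac_bits):
    """Round value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


reciprocal = 1.0 / 3.0   # stands in for the output of a division unit
errors = {bits: abs(quantize(reciprocal, bits) - reciprocal) for bits in (4, 8, 12)}
```

Under this model the quantization error is bounded by half a least significant bit, $2^{-(b+1)}$ for $b$ fractional bits, so each additional bit halves the worst case error at the price of wider storage and arithmetic.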

Operations in the digital baseband can be characterized as the process of getting an estimate of a variable based on some limited information. For example, the channel gain matrix  $\mathbf{H}$ , at the pilot position is obtained with LS estimation as shown in Figure 3.3(a). The pilot spacing is chosen to handle worst case conditions in terms of frequency selectivity and time variance, and they are placed in an orthogonal fashion in time, frequency and spatial domains to ease CE. Interpolation is then used to obtain  $\mathbf{H}$  at the data tones from the LS estimates of pilot tones as shown in Figure 3.3(b). Weighted interpolation, which considers more than two pilot estimates, has a lower error than linear interpolation but requires larger storage and higher computations. The outputs of the CE unit are used by the channel preprocessor. For example, the interpolated estimate  $\mathbf{H}_4$  at sub-carrier 4 is used to produce the corresponding matrices with the QRD or the CD algorithms. Such a direct implementation



Figure 3.3: Channel estimation with interpolation.

which operates on channel estimates for every data tone will require hardware that is capable of delivering a high throughput. Even though this is an acceptable solution for wideband systems, it may not be efficient in terms of silicon area and power dissipation.
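Linear interpolation between two pilot estimates can be sketched as follows. The example is a simplification that treats the channel as a single complex gain per sub-carrier and assumes pilots at sub-carriers 1 and 7; the pilot positions and values are hypothetical.

```python
def interpolate_channel(h_first, h_last, spacing):
    """Linearly interpolate channel gains between two pilot estimates.

    Returns spacing + 1 values: the first pilot, the intermediate data
    tones, and the last pilot.
    """
    return [h_first + (h_last - h_first) * k / spacing for k in range(spacing + 1)]


h1 = complex(1.0, 0.0)   # hypothetical LS estimate at sub-carrier 1
h7 = complex(0.4, 0.6)   # hypothetical LS estimate at sub-carrier 7
tones = interpolate_channel(h1, h7, spacing=6)
h4 = tones[3]            # interpolated estimate at sub-carrier 4
```

For a MIMO system the same weights would be applied to every element of the channel matrix.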

Recognizing that channel conditions are variable and the UEs experience different levels of fading as depicted in Figure 2.13, allows interpolation techniques to be used for QRD and CD processes to reduce throughput requirements in good channel conditions. For example, the QRDs  $Q_1R_1$  and  $Q_7R_7$  of channel estimates  $H_1$  and  $H_7$  respectively can be interpolated to produce the intermediate  $Q_{5i}$  and  $R_{5i}$  matrices, without explicitly calculating  $H_5$ . This enables the QRD processor to operate at a lower throughput, as accurate QRD operations can be replaced by interpolation approximations. Several users share the time and frequency resources in a cellular system, and a single user is rarely assigned the full BW for extended periods of time. Such low BW assignment scenarios can also be exploited to dynamically reduce QRD throughput.

Linear interpolation is a simple method to realize dynamic reconfigurability. A single parameter, namely the interpolation distance, can control the accuracy of interpolation. Figure 3.4 shows the variations of the columns of the Q matrix obtained from the QRDs of channels generated from one of the LTE-A channel models in a $3 \times 3$ real valued MIMO system. It can be noticed that the orthogonal vectors move along smooth curves on the unit sphere. Linear interpolation may be used to estimate the intermediate values of the columns of the Q matrix (q vectors) as shown in Figure 3.4. The distance over which



Figure 3.4: Interpolation for Q columns.

interpolation can be performed may be related to how fast the columns of the Q matrix change, which in turn can be related to the coherence bandwidth and the coherence time of the channel. Linear interpolation of a matrix has a complexity in the order of  $\mathcal{O}(M^2)$  compared to the accurate QRD complexity in the order of  $\mathcal{O}(M^3)$ , where M is the length of the q vector.

The error introduced by interpolating the Q and R matrices affects BER and the QoS. In the following, an analysis of this error is presented and compared to the error obtained when using linear interpolation of channel estimates followed by their respective QRDs. Assume  $H_1$  is the channel estimate at subcarrier 1, and  $H_P$  is the estimate at a sub-carrier located P-1 frequency bins from sub-carrier 1. If interpolating over a distance of P sub-carriers, define

$$\delta \triangleq 0 : \frac{1}{P} : 1,\tag{3.1}$$

$$\alpha_i = (1 - \delta(i)) \ \forall i = 1, 2, \dots, P, \tag{3.2}$$

and let the QRD of  $\boldsymbol{H}_1$  and  $\boldsymbol{H}_P$  be

$$\begin{aligned} QRD(\boldsymbol{H}_1) &\triangleq \boldsymbol{Q}_1 \boldsymbol{R}_1, \\ QRD(\boldsymbol{H}_P) &\triangleq \boldsymbol{Q}_P \boldsymbol{R}_P. \end{aligned} \tag{3.3}$$

Define a full matrix  $E_1$  and an upper triangular matrix  $E_2$  such that

$$\begin{aligned} \mathbf{Q}_{P} &\triangleq (\mathbf{I} + \mathbf{E}_{1})\mathbf{Q}_{1}, \\ \mathbf{R}_{P} &\triangleq \mathbf{R}_{1}(\mathbf{I} + \mathbf{E}_{2}). \end{aligned} \tag{3.4}$$

The matrix  $E_1$  represents the differences between  $Q_1$  and  $Q_P$ , whereas the matrix  $E_2$  captures the differences in the corresponding R matrices. The linear interpolated  $Q_i$  and  $R_i$  of a matrix  $H_i$ ,  $\forall i = 1, 2, \dots, P$  is obtained by

$$\begin{aligned} Q_i &= \alpha_i Q_1 + (1 - \alpha_i) Q_P, \\ R_i &= \alpha_i R_1 + (1 - \alpha_i) R_P, \end{aligned} \tag{3.5}$$

and the QR product of (3.5) is

$$\begin{aligned} Q_{i}R_{i} &= (\alpha_{i}Q_{1} + (1 - \alpha_{i})Q_{P})(\alpha_{i}R_{1} + (1 - \alpha_{i})R_{P}), \\ &= H_{1} + (1 - \alpha_{i})(E_{1}H_{1} + H_{1}E_{2}) + (1 - \alpha_{i})^{2}(E_{1}H_{1}E_{2}). \end{aligned} \tag{3.6}$$
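The second line of (3.6) follows from substituting (3.4) into (3.5) and using $H_1 = Q_1 R_1$ from (3.3). Written out as a worked step, with no assumptions beyond those definitions:

$$\begin{aligned} Q_i &= \alpha_i Q_1 + (1 - \alpha_i)(I + E_1) Q_1 = \left(I + (1 - \alpha_i) E_1\right) Q_1, \\ R_i &= \alpha_i R_1 + (1 - \alpha_i) R_1 (I + E_2) = R_1 \left(I + (1 - \alpha_i) E_2\right), \\ Q_i R_i &= \left(I + (1 - \alpha_i) E_1\right) H_1 \left(I + (1 - \alpha_i) E_2\right). \end{aligned}$$

Expanding the last product gives the constant, linear and quadratic terms in $(1 - \alpha_i)$.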

If direct linear interpolation of channel matrices  $H_1$  and  $H_P$  is used instead of interpolating the outputs of their corresponding QRDs, then the intermediate matrix  $H_i$  is obtained as

$$\begin{aligned} \mathbf{H}_{i} &= \alpha_{i} \mathbf{H}_{1} + (1 - \alpha_{i}) \mathbf{H}_{P}, \\ &= \mathbf{H}_{1} + (1 - \alpha_{i}) (\mathbf{E}_{1} \mathbf{H}_{1} + \mathbf{H}_{1} \mathbf{E}_{2}) + (1 - \alpha_{i}) (\mathbf{E}_{1} \mathbf{H}_{1} \mathbf{E}_{2}). \end{aligned} \tag{3.7}$$

The result of the QRD interpolation can be compared with direct channel interpolation using (3.6) and (3.7). The error in interpolation is given by

$$QRD_{error} = \alpha_i (1 - \alpha_i) (E_1 H_1 E_2). \tag{3.8}$$

This error is proportional to the difference between $H_1$ and $H_P$, represented by $E_1$ and $E_2$, and can thus be controlled by choosing similar channel estimates.
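The relation in (3.8) can be checked numerically with a small sketch. Because the identity is purely algebraic, the example uses arbitrary 2×2 matrices for $E_1$ and $E_2$ instead of the outputs of two actual QRDs, so the resulting $Q_P$ is not exactly orthogonal. All matrix values and helper names are hypothetical.

```python
def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(s, a):
    """Multiply every element of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


I2 = [[1.0, 0.0], [0.0, 1.0]]
Q1 = [[0.6, -0.8], [0.8, 0.6]]        # orthogonal matrix
R1 = [[2.0, 1.0], [0.0, 1.5]]         # upper triangular matrix
E1 = [[0.0, -0.1], [0.1, 0.0]]        # small full perturbation
E2 = [[0.05, 0.02], [0.0, -0.03]]     # small upper triangular perturbation

QP = matmul(add(I2, E1), Q1)          # definition (3.4)
RP = matmul(R1, add(I2, E2))
H1, HP = matmul(Q1, R1), matmul(QP, RP)

alpha = 0.75
Qi = add(scale(alpha, Q1), scale(1 - alpha, QP))      # interpolation (3.5)
Ri = add(scale(alpha, R1), scale(1 - alpha, RP))
Hi = add(scale(alpha, H1), scale(1 - alpha, HP))      # direct interpolation (3.7)

residual = add(Hi, scale(-1.0, matmul(Qi, Ri)))       # H_i - Q_i R_i
predicted = scale(alpha * (1 - alpha), matmul(matmul(E1, H1), E2))  # (3.8)
max_gap = max(abs(r - p) for rr, pr in zip(residual, predicted) for r, p in zip(rr, pr))
```

The value of `max_gap` stays at the level of floating point rounding, which agrees with the difference between (3.7) and (3.6) being $\alpha_i(1 - \alpha_i)E_1H_1E_2$. The factor $\alpha_i(1 - \alpha_i)$ also shows that the error vanishes at both end points and peaks midway between them.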

A similar analysis can also be used to understand the effects of linear interpolation in the CD preprocessor. Assuming that  $H_1$  and  $H_P$  are positive definite Hermitian matrices, define

$$CD(\boldsymbol{H}_1) \triangleq \boldsymbol{L}_1 \boldsymbol{L}_1^H, \quad CD(\boldsymbol{H}_P) \triangleq \boldsymbol{L}_P \boldsymbol{L}_P^H. \tag{3.9}$$

Define a lower triangular matrix $T_2$ that represents the difference between the Cholesky factors $L_1$ and $L_P$ so that

$$\boldsymbol{L}_{P} \triangleq (\boldsymbol{I} + \boldsymbol{T}_{2})\boldsymbol{L}_{1}. \tag{3.10}$$

The linear interpolated  $L_i$  of a matrix  $H_i$ ,  $\forall i = 1, 2, \dots, P$  is obtained by

$$\boldsymbol{L}_i \triangleq \alpha_i \boldsymbol{L}_1 + (1 - \alpha_i) \boldsymbol{L}_P. \tag{3.11}$$

In order to compare the error in the interpolated $L_i$, construct the $LL^H$ product for (3.11) as

$$\begin{aligned} \boldsymbol{L}_{i}\boldsymbol{L}_{i}^{H} &= (\alpha_{i}\boldsymbol{L}_{1} + (1 - \alpha_{i})\boldsymbol{L}_{P})(\alpha_{i}\boldsymbol{L}_{1}^{H} + (1 - \alpha_{i})\boldsymbol{L}_{P}^{H}), \\ &= \boldsymbol{H}_{1} + (1 - \alpha_{i})\left(\boldsymbol{T}_{2}\boldsymbol{H}_{1} + \boldsymbol{H}_{1}\boldsymbol{T}_{2}^{H}\right) + (1 - \alpha_{i})^{2}\left(\boldsymbol{T}_{2}\boldsymbol{H}_{1}\boldsymbol{T}_{2}^{H}\right). \end{aligned} \tag{3.12}$$

The corresponding direct linear interpolation for  $H_i$  results in

$$\begin{aligned} \boldsymbol{H}_{i} &= \alpha_{i} \boldsymbol{H}_{1} + (1 - \alpha_{i}) \boldsymbol{H}_{P}, \\ &= \boldsymbol{H}_{1} + (1 - \alpha_{i}) \left( \boldsymbol{T}_{2} \boldsymbol{H}_{1} + \boldsymbol{H}_{1} \boldsymbol{T}_{2}^{H} \right) + (1 - \alpha_{i}) \left( \boldsymbol{T}_{2} \boldsymbol{H}_{1} \boldsymbol{T}_{2}^{H} \right). \end{aligned} \tag{3.13}$$

The result of CD interpolation can be compared with the interpolated $H_i$ using (3.12) and (3.13). The CD interpolation error is

$$CD_{error} = \alpha_i (1 - \alpha_i) \left( T_2 H_1 T_2^H \right). \tag{3.14}$$

The error term has a similar structure to the QRD error from (3.8) and depends on the similarity of $L_1$ and $L_P$. The similarity can be controlled by choosing the interpolation distance based on coherence bandwidth and coherence time. The CD interpolation error is smaller than the QRD interpolation error as the number of elements in the $L_i$ matrices is around 1/3 of the total number of elements in the $Q_i$ and $R_i$ matrices combined. Furthermore, the CD algorithm has a lower complexity than QRD, and is thus a better algorithmic level choice when using interpolation. Nevertheless, linear approximations in both QRD and CD reduce complexity by a factor in the order of $\mathcal{O}(M)$ and allow adaptive control of the QoS by changing the interpolation distance.

#### 3.2.3 Adaptive Digital Signal Processing

Adaptive algorithms can be used to automatically adjust system parameters based on feedback about receiver performance and environment. One application for such adaptive signal processing is the configuration of analog circuits in a radio to improve their performance. The analog components in a radio have a very difficult task of amplifying and filtering only the wanted signals in the presence of a multitude of interference and noise sources. This task is even more complicated in FDD systems, where the receiver has to detect and amplify signals with power levels of around  $-95\,\mathrm{dBm}$ . Due to the constant presence of variable interference, the linearity of the radio front end affects the quality



Figure 3.5: The LMS filter and its algorithmic description.

of the wireless data link. Digital circuits have been used in traditional radios to recalibrate these blocks towards a better operating point, corresponding to feedback from the digital to the analog domain in Figure 1.1. For example, automatic gain control detects the power of the signal and decreases the gain of the LNA or the CSF when receiving a strong signal. Another example is when the BW of a CSF is changed to match the assigned BW by using feedback based on control channel data. Similarly, the local oscillator frequency is tuned to match the carrier frequency using information from the baseband.

This dissertation has focused on one such application, where the nonlinearities in a CSF are detected and reduced with an adaptive algorithm. Although the transfer function of the CSF is designed to provide good linearity and a particular gain, the actual hardware implementation may deviate from the design specification due to PVT variations. Thus, analog hardware blocks are often implemented with a capability to adapt their performance at runtime to offset these variations. An algorithm may be used to detect the variations, enabling a control loop to tune the filter towards a better operating point. Although the algorithm can be realized in the analog domain, the implementation may itself suffer from PVT variations. Furthermore, extra care is needed during the layout stage, and the analog adaptive filter may increase noise and degrade receiver performance [6]. A fully digital solution overcomes these problems and can be implemented with a standard digital design flow. The variations in the CSF filter response can be detected by the use of digital algorithms such as the Least Mean Squares (LMS) algorithm. The algorithm is based on the stochastic gradient descent method and tracks the variations in the transfer function of a target system [61]. Such a system is shown in Figure 3.5, where h is the unknown transfer function, and q is the adaptive filter based on Alg. 3.5(b). The filter operates by detecting the error between its own output z(n) and the unknown system's output y(n) and evolves over time to minimize the average error e(n).
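The tracking behaviour described above can be sketched with the textbook LMS update, which may differ in detail from the description in Figure 3.5(b). The sketch assumes a noise-free FIR system; the tap values, step size `mu` and input statistics are hypothetical.

```python
import random


def lms_identify(x, y, num_taps, mu):
    """Adapt FIR weights so that the filter output z(n) tracks y(n)."""
    w = [0.0] * num_taps
    errors = []
    for n in range(len(x)):
        # Most recent num_taps input samples, newest first, zero padded at the start.
        window = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        z = sum(wk * xk for wk, xk in zip(w, window))        # adaptive filter output z(n)
        e = y[n] - z                                         # error e(n)
        w = [wk + mu * e * xk for wk, xk in zip(w, window)]  # stochastic gradient step
        errors.append(e)
    return w, errors


random.seed(0)
h = [0.8, -0.3, 0.1]     # hypothetical "unknown" system
x = [random.uniform(-1, 1) for _ in range(5000)]
y = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0) for n in range(len(x))]
w, errors = lms_identify(x, y, num_taps=3, mu=0.05)
```

After adaptation the weights `w` approach `h` and the error approaches zero. A larger `mu` speeds up convergence but raises the risk of instability, which is one of the parameters a hardware implementation would need to expose or fix at design time.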

If an analog circuit is represented by an equivalent transfer function h+h(n), where h(n) denotes the part due to PVT variations, an adaptive filter can be used to track these variations over time. This "digital" intelligence comes at an increased silicon cost from the adaptive algorithm's hardware implementation. Nonetheless, the algorithm allows the receiver to be fine-tuned towards its best operating point when the linearity of the front end is important, such as in the presence of strong interference signals. The overhead can be minimized by sharing the same control loop to tune multiple components in a time multiplexed fashion. The additional power dissipated by the control loop can be limited by switching off the algorithm once the desired level of linearity has been reached.

### 3.3 Architectural Techniques for Configurability

Algorithmic level choices provide the possibility to trade complexity, reconfigurability and performance. The hardware architecture chosen for the actual implementation also plays a major role in realizing this tradeoff. The primary goal at the architectural design stage is to minimize silicon area while providing a required throughput and a certain degree of configurability. Figure 3.6 depicts the interaction between these parameters and Figure 3.1 highlights some techniques available for hardware designers to achieve this. As an example, consider an implementation Impl. 1 of an algorithm which has a small area, limited configurability and a throughput lower than the minimum requirement. Throughput can be improved with the pipelining technique, which increases silicon area. For example, pipelined implementations of the FIR filters may be used to improve the throughput of the decimators [62]. On the other hand, if latency is important, then a parallel architecture that requires higher area may be employed [63]. The throughput can be doubled with two instances of Impl. 1 and quadrupled with four instances. These parallel blocks can be programmed to either process data or not, providing an easy method for reconfigurability. If silicon area is a critical parameter, folding can be used [64]. The same processing block can be run at a higher frequency in a time multiplexed fashion to process data faster. Careful scheduling of algorithmic operations may be required for reusing hardware blocks. Using just one of these techniques may not be suitable and hence these methods are often combined to reach the desired target in terms of throughput and area [65]. A good example is the use of polyphase decimation filters, which achieve parallelism with multiple filter banks. Each of these filters may be implemented in a pipelined FIR architecture to increase operational frequency. In the following subsections, a brief introduction to some of these techniques is presented.



Figure 3.6: Architectural level optimizations.

#### 3.3.1 Pipelining

The maximum operational frequency of a digital circuit is determined by the critical path in its hardware implementation. The maximum throughput of the design, dependent on this frequency, can be increased by pipelining. Consider an implementation of a function f(x) shown in Figure 3.7(a), with a maximum clock frequency clk1, resulting in a throughput below a specified minimum requirement, similar to Impl. 1 of Figure 3.6. Dividing the operations performed in the function f(x) into two sub-blocks $f_1(x)$ and $f_2(x)$ and adding a register in between these functions breaks the critical path and allows the design to process data at a higher frequency clk2. Figure 3.7(b) highlights the original critical path in a design before pipelining, which includes three multipliers and three adders, and the shorter critical path after pipelining, which consists of two multipliers and one adder. An added advantage of pipelining is that it may reduce dynamic power dissipation of the combinational logic by decreasing switching propagation. Pipelining also enables reconfigurability. For example, in Figure 3.7(a), if only the output $y_1$ from $f_1(x)$ is required, the second part of the pipeline, $f_2(x)$, can be turned off. On the downside, pipelining increases latency and requires registers in non-critical paths of the design to balance latency. Figure 3.6 shows a system level view of using pipelining, which allows the design to reach the required throughput with an increase in area cost.
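The effect of shortening the critical path can be estimated with a first-order model in which the maximum clock frequency is the inverse of the summed delays along the path. The multiplier and adder delays below are hypothetical, and register setup and clock-to-output delays are ignored.

```python
def max_clock_mhz(num_multipliers, num_adders, t_mul_ns, t_add_ns):
    """Maximum clock frequency in MHz for a path of cascaded multipliers and adders."""
    path_delay_ns = num_multipliers * t_mul_ns + num_adders * t_add_ns
    return 1000.0 / path_delay_ns


T_MUL_NS, T_ADD_NS = 2.0, 1.0                       # assumed delays, not from the source
clk1 = max_clock_mhz(3, 3, T_MUL_NS, T_ADD_NS)      # path before pipelining
clk2 = max_clock_mhz(2, 1, T_MUL_NS, T_ADD_NS)      # path after pipelining
speedup = clk2 / clk1
```

With these assumed delays the path shrinks from 9 ns to 5 ns, so the achievable clock frequency improves by a factor of 1.8 rather than 2, because the pipeline register does not split the path into two equal halves.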

#### 3.3.2 Parallelism/Unfolding

Instantiating multiple copies of the same hardware unit is another way of increasing throughput. A block level representation of such an operation is shown in Figure 3.8(a), where the original implementation f(x) is used twice to double the throughput. As an example, consider a matrix-vector product generator



Figure 3.7: Pipelining hardware for increasing throughput.



Figure 3.8: Parallelism for increasing throughput.

often used in digital baseband processing. The multiplication of a $4 \times 4$ matrix $\boldsymbol{A}$ by a $4 \times 1$ vector $\boldsymbol{b}$ to produce a $4 \times 1$ vector $\boldsymbol{c}$ can be represented by

$$c = Ab,$$

$$\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}, \tag{3.15}$$

where  $c_1$ , the first element of c is obtained by

$$c_1 = a_{11}b_1 + a_{12}b_2 + a_{13}b_3 + a_{14}b_4. \tag{3.16}$$

Figure 3.9: Folding to reduce silicon area.

Figure 3.8(b) shows an implementation of (3.16), which requires four multipliers and three adders. The unit has to be run four times when only one instance of the matrix-vector product generator is available, once for each element of the output vector c. A parallel implementation of the same matrix-vector product generator is shown in Figure 3.8(c), where four instances of the unit from Figure 3.8(b) run together to produce all the elements of the c vector at the same time instant. Configurability can be achieved by dynamically changing the number of instances based on throughput requirements. This is represented in Figure 3.6, where two versions of the parallel implementation of Impl. 1 are used. In the Unfolded by 2 (UF2) version, two instances double the throughput, and the Unfolded by 4 (UF4) version quadruples the throughput. Parallel architectures are also beneficial when employing supply or $V_{DD}$ scaling to lower dynamic power dissipation. Furthermore, some of the parallel instances may be completely turned off with circuit level techniques such as clock/power gating. The cost of using parallelism is the nearly linear increase in silicon area. Feedback paths, data dependency, and the silicon cost may also limit the degree of parallelism.
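The relation between the number of parallel instances and the number of passes needed for one matrix-vector product can be sketched behaviourally. The model below only counts passes and does not represent timing or area; the function names and the matrix values are hypothetical.

```python
def dot_product_unit(row, b):
    """One instance of the unit in (3.16): four multipliers and three adders."""
    return sum(a * x for a, x in zip(row, b))


def matvec(A, b, instances):
    """Compute c = A b with `instances` parallel units; returns (c, passes)."""
    c = []
    passes = 0
    for start in range(0, len(A), instances):
        # The rows in this slice would be processed concurrently in hardware.
        c.extend(dot_product_unit(row, b) for row in A[start:start + instances])
        passes += 1
    return c, passes


A = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 0, 2, 0], [1, 1, 1, 1]]
b = [1, 0, -1, 2]
c_uf1, passes_uf1 = matvec(A, b, instances=1)   # single instance
c_uf2, passes_uf2 = matvec(A, b, instances=2)   # UF2
c_uf4, passes_uf4 = matvec(A, b, instances=4)   # UF4
```

All three configurations produce the same vector, while the number of passes drops from four to two to one. Changing `instances` at run time corresponds to enabling or gating parallel units according to the throughput requirement.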

#### 3.3.3 Time Multiplexing/Folding

Savings in area can be obtained with folding or time multiplexing in scenarios where an implementation easily meets throughput requirements. Figure 3.9(a) shows a design where the implementation of f(x) is used twice to produce the output y. This opens up the possibility to reuse the same implementation to perform the function of two blocks, albeit operating at mutually exclusive time instants. Such an implementation requires lower total silicon area and utilizes hardware more efficiently. The time multiplexed implementation of the matrix-vector multiplier with basic building blocks such as a multiplier, an adder and a storage register is shown in Figure 3.9(b). This technique comes with an overhead of increased control logic and multiplexers to schedule the operations and additional registers to store intermediate results.
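A time multiplexed multiply-accumulate schedule can be sketched as follows. It assumes one multiplication and one addition per clock cycle and a single accumulator register, which may not match the exact schedule of Figure 3.9(b); names and values are illustrative.

```python
def folded_matvec(A, b):
    """Compute c = A b with one multiplier, one adder and one accumulator register."""
    c = []
    cycles = 0
    for row in A:
        acc = 0                    # accumulator cleared for each output element
        for a, x in zip(row, b):
            acc = acc + a * x      # one multiply-accumulate per clock cycle
            cycles += 1
        c.append(acc)              # result written to the output storage
    return c, cycles


A = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 0, 2, 0], [1, 1, 1, 1]]
b = [1, 0, -1, 2]
c, cycles = folded_matvec(A, b)
```

The folded version needs 16 cycles for a $4 \times 4$ product but only one multiplier and one adder, compared with the four multipliers and three adders of the unit in (3.16). The loop counters stand in for the control logic and multiplexers mentioned above.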

### 3.4 Circuit Techniques for Power Reduction

The previous section presented architectural techniques to improve throughput and reconfigurability, and this section introduces methods to reduce power dissipation. Figure 3.10 shows a design space for circuit level optimizations and Figure 3.1 provides a system level view of the techniques available to achieve a particular power/throughput goal. Variable requirements in digital baseband circuits can be exploited to reduce power dissipation with clock/power gating techniques, with a small cost to silicon area. Modern technologies such as  $28\,\mathrm{nm}$  FD-SOI provide access to the back gate for forward body biasing, a technique which can be used to increase throughput at the cost of higher power dissipation. Traditional methods such as supply voltage and clock frequency scaling can also be used to dynamically change the throughput, leading to a corresponding change in power dissipation. The total power  $P_{tot}$ , dissipated in a digital circuit can be expressed as

$$P_{tot} = P_{dyn} + P_{stat}, \tag{3.17}$$

where  $P_{dyn}$  is the dynamic power and the static component is  $P_{stat}$  [66]. The dynamic power is dependent on the supply voltage  $V_{DD}$ , the operation frequency of the circuit f, the switching activity  $\alpha$ , the equivalent capacitance of the circuit C and can be modeled as

$$P_{dyn} \propto \alpha C f V_{\rm DD}^2. \tag{3.18}$$

Technology scaling has enabled transistors to operate at lower supply voltages and has thus reduced the dynamic power dissipation due to its quadratic dependence on  $V_{DD}$ . On the other hand, a shorter transistor channel length has lowered the threshold voltage  $V_{TH}$ . The  $P_{stat}$  component has an exponential dependence on  $V_{TH}$ , which has led to an increase in its contribution to the total power. This is particularly the case in memories, which often occupy around 50% of silicon area in modern day chips. Consequently, several techniques such as transistor stacking and channel length stretching are employed to reduce  $P_{stat}$  [2].

Many operations in the digital baseband processing in wireless receivers are combinatorial in nature, involving matrix manipulations. The dynamic power component,  $P_{dyn}$  is usually more dominant than the static component in these processing blocks (not considering memory), and the following subsections introduce methods to control dynamic power dissipation.



Figure 3.10: Circuit level optimizations.

#### 3.4.1 Clock and Power Gating

Most of today's digital designs are synchronous in nature and rely on a clock signal for transitioning between different states to construct meaningful functionality in a controlled manner. The clock signal is used by sequential storage elements such as registers present throughout the design, and regular clock buffering is required to meet the strict timing constraints. This clock signal has the highest switching activity and often accounts for 30% to 40% of the total dynamic power dissipation in ASICs [67]. The interconnect and clock tree power consumption in FPGAs may be as high as 70% [68]. Hence, switching off the clock to idle blocks in a design is one of the methods of reducing dynamic power dissipation.

An implementation with two blocks, f(x) and g(x), operating with two different clocks is shown in Figure 3.11. In scenarios when the function unit f(x) is not needed, the clock to this unit can be turned off. Extra care has to be taken to ensure that glitches do not propagate, which requires specialized clock gating cells. Additionally, the design has to ensure that the inputs and outputs from the gated unit are changed in an appropriate way to maintain correct functionality when the clock is turned on again. Dynamic clock gating can be achieved by modifying signals such as "en1" and "en2" in Figure 3.11, which may be set in a top level control register.

Although dynamic power can be reduced with clock gating, the static power component is still present which can be reduced by the power gating technique. The supply voltage  $V_{\rm DD}$  to a particular block can be turned off via power gating cells to eliminate both the dynamic and static power consumption. On the downside, additional supply routing, power gating circuits and control logic are needed, which can increase the silicon area.



Figure 3.11: Clock gating for dynamic power control.

#### 3.4.2 Dynamic Voltage ($V_{DD}$) and Frequency Scaling

The dynamic power component from (3.18) shows a linear dependence on clock frequency and a quadratic dependence on the supply voltage  $V_{DD}$ . Thus, lowering clock frequency or  $V_{DD}$  results in a corresponding reduction in power dissipation. However, reducing  $V_{DD}$  below the standard operating voltage of the chosen CMOS technology slows down transistors and is detrimental from the throughput point of view. Additionally, process variations may affect the performance of transistors at lower supply voltages. Therefore, additional circuits are required to ensure functionality when using the  $V_{DD}$  scaling technique [69]. The use of multiple voltage islands is another method to lower power dissipation [70]. Different functional blocks such as memories and core logic can be designed to operate in their own power domains with separate controls for  $V_{DD}$  scaling, enabling a finer power control. This multi- $V_{DD}$  technique requires voltage conversion buffers to ensure proper signal transition between these domains. On chip supply generators may also be required to increase the efficiency of  $V_{DD}$  scaling.

Frequency scaling is an easier method of reducing power dissipation, where an almost linear decrease in power consumption may be obtained by reducing clock frequency. Modern digital design tools ensure that clock constraints are met at the desired target frequency and the circuits are guaranteed to be functionally correct at frequencies below this target. On the downside, clock generation circuits for variable clock frequency are required which may increase design complexity.

Notwithstanding some of the drawbacks in terms of increased area, dynamic  $V_{\rm DD}$  and frequency scaling are commonly used in many designs [71,72]. These two techniques are especially attractive for adaptive digital baseband blocks that have variable throughput requirements. Furthermore, techniques such as parallelism may be combined with DVFS to co-optimize the architecture and circuit implementation to further lower power dissipation.
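The scaling behaviour described in this subsection can be illustrated with a small numerical sketch. It assumes the common model $P_{dyn} \propto C \cdot V_{DD}^2 \cdot f$, which is consistent with the linear and quadratic dependencies stated above. The function name, the doubled switched capacitance for a two-way parallel design and the 70% supply voltage are illustrative assumptions, not values from this work.

```python
def relative_dynamic_power(cap_ratio, vdd_ratio, freq_ratio):
    """Dynamic power relative to a baseline, assuming P_dyn ~ C * VDD^2 * f."""
    return cap_ratio * vdd_ratio ** 2 * freq_ratio

# Frequency scaling alone: halving f halves power (and throughput).
freq_only = relative_dynamic_power(1.0, 1.0, 0.5)

# Hypothetical two-way parallel design: twice the switched capacitance,
# each unit clocked at half rate, so throughput is unchanged. The relaxed
# timing is assumed to allow VDD to drop to 70% of its nominal value.
parallel_dvfs = relative_dynamic_power(2.0, 0.7, 0.5)

print(f"frequency scaling only: {freq_only:.2f}x")
print(f"parallelism with VDD scaling: {parallel_dvfs:.2f}x")
```

Under these assumptions the parallel design keeps the baseline throughput at roughly half the dynamic power, at the cost of additional area. Static power and the overhead of level shifters or supply generators are ignored in this sketch.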

#### 3.4.3 Multi-$V_{TH}$ Transistors and Body Biasing

The threshold voltage  $(V_{TH})$  of a transistor determines its switching speed and affects the maximum operational frequency of digital circuits. Lower  $V_{TH}$  transistors provide higher speed but also result in larger static leakage when compared to transistors with a higher  $V_{TH}$ . Some CMOS technologies offer the choice of combining transistors with multiple  $V_{TH}$  to balance operational speed and power dissipation at design time. Modern digital design tools are capable of optimizing the critical paths by choosing low  $V_{TH}$  transistors to improve speed and high  $V_{TH}$  transistors for other parts of a design to reduce power dissipation. Additionally, the  $V_{TH}$  of a transistor can be dynamically modified by the body biasing technique, where a non-zero voltage is applied to the fourth terminal of the transistor. The 28 nm FD-SOI technology from STMicroelectronics extends the body biasing feature by introducing a back gate or a second gate below the channel, providing an even wider range of  $V_{TH}$  control [73,74]. This second gate can be biased by an additional supply voltage, either to reduce leakage, or to dynamically boost operational speed.

From a system level perspective, consider the original design in Figure 3.10 that meets the minimum required throughput but is unable to meet the maximum requirement. Forward body biasing can be used to increase the circuit throughput at the expense of increased power dissipation [75]. Combinations of the techniques presented in this subsection can also be used to reduce power dissipation. For example, DVFS and forward body biasing can be combined to reduce power dissipation while increasing throughput.

All these algorithmic, architectural and circuit level techniques provide a multitude of options for designers to examine what fits best for their application. The key to minimizing power dissipation is to employ co-optimization and combine methods across the different design levels in Figure 3.1. This dissertation uses this approach to perform algorithmic-architectural-circuit co-optimizations to increase hardware efficiency while providing the necessary reconfigurability to allow dynamic power-performance tradeoff. The next chapter describes specific examples in light of the methods presented in this chapter.

# Chapter 4

# Applications in Digitally Assisted Radio Receivers

The previous chapter described methods to perform optimizations across different layers in a receiver design cycle. Techniques to balance adaptability and complexity at an algorithmic level were introduced and architectural level options to increase throughput and reduce area were discussed. Circuit level design choices to balance power and throughput were reviewed. Some examples of exploiting these methods in scenarios that have a variable QoS requirement were also described. This chapter briefly describes the application of all these techniques with a co-optimization strategy that relies on feedback on operating conditions to reduce power dissipation and improve energy efficiency. The chapter begins with a description of a system that uses the Dig.$\rightarrow$Ana. feedback from Figure 1.1 to fine-tune a CSF towards higher linearity. Later, power reduction with Dig.$\rightarrow$Dig. feedback for baseband circuits in small scale and massive MIMO systems is presented.

## 4.1 Non-linearity Mitigation for Analog Circuits

Cellular communication standards such as 3GPP LTE support FDD communication, where a mobile UE can simultaneously transmit and receive data from a BS over multiple frequency bands. Figure 4.1 depicts such an FDD based system where a duplexer with a transmit and receive signal isolation of around 40 dB to 50 dB [76] is employed to block the transmit signal from reaching the receiver chain. However, the transmit signal leakage from the duplexer can still have a higher power than the desired signal when the UE is operating at the cell edge or when experiencing high path loss. Furthermore, external interference sources may also be present in nearby frequencies. Thus, amplification of only the wanted signal and filtering of interferers is needed, which is achieved by using an LNA and a CSF. High linearity is one of the important requirements in these blocks to minimize intermodulation. Figure 2.14 shows an example scenario, where the IM3 affects the downlink signal and Figure 2.7(b) highlights the received baseband signal under different levels of IM3 distortion. Two common methods of improving linearity are digital cancellation and analog calibration, presented in section 2.2.1. Digital cancellation is possible when the characteristics of the interferer are known, which may be hard to estimate when the interference source is external to the mobile UE. In such scenarios, non-linearity detection and tuning the analog blocks towards higher linearity is a better option. The tuning method is also beneficial from a system level perspective to adaptively optimize receiver performance with a global controller. For example, system level parameters such as the BER and EVM can be examined to decide on whether to tune the analog front end or not. This dissertation focuses on one such low cost calibration method to improve the intermodulation distortion performance of a CSF designed for receivers compatible with the LTE standard. The details of the proposed non-linearity tuning system are presented in **Paper I** and **Paper II**.

Figure 4.1: Transmit signal leakage, external interference and filtering.

Figure 4.2(a) shows a block diagram of the proposed system, where the time continuous signal x(t) from the mixers is filtered by a CSF and decimated to produce the baseband signal y(n). The digital tuning system is built around this main path. It uses an oversampling auxiliary ADC and a digital intermodulation generator to recreate non-linearities of different orders. A wideband auxiliary ADC is needed for the system to capture external interference signals and decimators are used to lower the sample rate after intermodulation generation. An adaptive filter tracks the transfer function of the CSF, while a correlator detects the level of non-linearities. A system controller then tunes the CSF towards higher linearity based on the output of the correlator. Figure 4.2(b) shows the measurement setup used to test the performance of the digital tuning system implemented on a Xilinx FPGA and interfaced to a CSF designed in 65 nm CMOS with an external ADC.
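To make the role of the adaptive filter more concrete, the following is a minimal sketch of a generic LMS update that tracks an unknown FIR response from its input and output samples. The three-tap response, the step size and the random test signal are hypothetical and do not model the actual CSF, the intermodulation generator or the correlator.

```python
import random

def lms_identify(x, d, num_taps, mu):
    """Adapt FIR weights w so that the filtered input tracks the reference d."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps             # most recent input sample first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = dn - y                     # estimation error
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]
    return w

random.seed(0)
unknown = [0.5, -0.3, 0.1]             # hypothetical stand-in for the tracked response
x = [random.uniform(-1.0, 1.0) for _ in range(3000)]
d = [sum(unknown[k] * x[n - k] for k in range(len(unknown)) if n - k >= 0)
     for n in range(len(x))]
w = lms_identify(x, d, num_taps=3, mu=0.1)
print([round(wi, 4) for wi in w])
```

Each iteration needs only multipliers and adders, which is why this family of algorithms maps well to low complexity hardware.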



Figure 4.2: A calibration system for intermodulation distortion.



Figure 4.3: Multi stage decimation.

#### Multirate Filters

Multi-mode receivers use decimators to match the ADC output rate to the baseband rate. FIR implementations are commonly employed due to their simple hardware architectures and inherent stability. Furthermore, it is beneficial to perform decimation in multiple stages as shown in Figure 4.3 [26]. The requirements on each stage in terms of transition bandwidth and filter order can be lowered, resulting in a more hardware efficient structure than a direct FIR implementation. Further optimizations are possible with the noble identity of decimation, with which the transfer function H(z) for a decimation factor N in Figure 4.4(a) may be transformed into a filter F(z) in Figure 4.4(c). Polyphase decomposition with the noble identity of decimation lowers complexity further and Figure 4.5(a) shows such an implementation of a decimate by 2 filter. A Half Band (HB) filter is a particular type of FIR filter in which nearly half of the coefficients are zero, and Figure 4.5(c) shows a polyphase filter implemented with the half band architecture.
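The equivalence between filtering at the input rate followed by downsampling and a polyphase structure running at the output rate can be checked with a short sketch. The 7-tap half band coefficients below are illustrative and are not the filters used in this work. Exact rational arithmetic is used so that the two structures can be compared for equality.

```python
from fractions import Fraction as F

# Illustrative 7-tap half band low-pass filter: every second tap around
# the centre tap is zero, and the taps sum to one.
HB = [F(-1, 32), F(0), F(9, 32), F(1, 2), F(9, 32), F(0), F(-1, 32)]

def fir(h, x):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def decimate_direct(h, x):
    """Filter at the input rate, then keep every second sample."""
    return fir(h, x)[::2]

def decimate_polyphase(h, x):
    """Decimate by 2 with both branches running at the output rate."""
    e0, e1 = h[0::2], h[1::2]          # polyphase components of H(z)
    x0 = x[0::2]                       # x[2m]
    x1 = ([0] + x[1::2])[:len(x0)]     # x[2m-1], with x[-1] = 0
    return [a + b for a, b in zip(fir(e0, x0), fir(e1, x1))]

def decimate_by_8(x):
    """Three cascaded half band stages, as in a multistage decimator."""
    for _ in range(3):
        x = decimate_polyphase(HB, x)
    return x

x = [F(n % 7 - 3) for n in range(64)]
print(decimate_polyphase(HB, x) == decimate_direct(HB, x))
print(HB[1::2])                        # odd branch: a single non-zero tap
```

For the half band filter, the odd polyphase branch reduces to one scaled and delayed tap, which is the source of the hardware savings noted above.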

Combining the multistage decimation process with polyphase half band decimators has several advantages. First, the orders of the leading filters can be reduced, resulting in shorter filter lengths and simpler hardware implementations. Second, the parallel architecture allows each filter to operate at half the rate of the original filter, reducing dynamic power dissipation. Finally, reconfigurability in terms of decimation factor may be obtained with multiplexers by picking outputs from the right stage. The digital tuning system in Figure 4.2(a) requires two decimators in the auxiliary path that are implemented with three stages, resulting in a combined decimation factor of 8. Figure 4.6(a) shows the individual frequency responses, which are compared against the response of an FIR filter with similar complexity in Figure 4.6(b). These architectural level modifications lead not only to lower hardware area but also to lower dynamic power dissipation when compared to a direct FIR implementation.

Figure 4.4: Sample rate simplification.

Figure 4.5: Polyphase and half band filters.

The digital tuning system based on Figure 4.2(a) was implemented with an external ADC and an FPGA. The proposed system requires 42 k gates of logic, corresponding to an area of around half that required by the CSF. The system was able to detect the optimal operating point of the CSF, even with a low resolution ADC, enabling a reduction in IM3 distortion by around 14 dB. The following items summarize the techniques used for implementing the system:

- On the algorithmic level, a low complexity adaptive filter that needs simple hardware units such as multipliers and adders is chosen to track the transfer function of the CSF.
- On the architectural level, pipelining, parallelism and clock scaling are combined to realize hardware efficient multirate decimation filters. A fully unrolled structure is used for the LMS filter and the correlator.
- On the circuit level, the system is implemented on an FPGA and interfaced to a tunable CSF.



Figure 4.6: (a) Decimation by 8 with three HB filters. (b) Comparison of frequency responses of multi stage and direct FIR decimation.

## 4.2 Channel Preprocessor for Small Scale MIMO Systems

MIMO communication is one of the techniques that has enabled network operators to offer high speed data links. Small scale MIMO systems use a low number of antennas (up to 8) at both the BS and the mobile UE to create parallel communication links. Figure 4.7 compares the uncoded BER performance of the linear ZF detector and a non-linear K-Best detector in a $4 \times 4$ MIMO system with full spatial multiplexing. The difference in performance is evident, and thus non-linear detectors are often used in mobile UEs to fully extract spatial multiplexing gains. Efficient hardware implementations of tree search based detectors such as the K-Best detector rely on the QRD channel preprocessor, and two commonly used algorithms for QRD are presented in subsection 2.2.3. Traditionally, QRD is performed for every data tone of each OFDM symbol. However, such implementations for wider bandwidths of up to 100 MHz supported by cellular standards such as LTE-A can lead to extremely high throughput requirements. Although architectural methods like pipelining or parallelism can be used to meet these requirements, the power required to process data with these computationally heavy algorithms increases significantly. Thus, algorithmic level changes are essential to lower computation count, and to support architectural level modifications. Additionally, energy efficiency of baseband circuits such as the CE, channel preprocessors and signal detectors may also be increased with Dig.→Dig. domain feedback on current channel conditions. This dissertation presents the application of this digital feedback approach which augments algorithmic, architectural and circuit level optimizations to improve efficiency of the QRD preprocessor. The details of the implementation are presented in **Paper III**, **Paper IV** and **Paper V**.



Figure 4.7: BER performance of linear and non-linear detectors in  $4 \times 4$  MIMO.

The usually small physical form factor of mobile UEs increases the risk of spatial correlation in the received signal. Spatial correlation adversely affects the performance of MIMO systems, and digital baseband blocks have to be designed to combat these effects in order to provide robust performance. The condition number of the channel gain matrix can be used to evaluate the level of spatial correlation between the BS and the UE, with smaller condition numbers indicating lower correlation. Figure 4.8(a) depicts the distribution of condition numbers in LTE-A channel models. Low spatial correlation results in values of around ten, whereas medium correlation scenarios can lead to values in the range of a few hundred [37].

Fixed point implementations of the QRD process need to accurately decompose the channel estimates  $\boldsymbol{H}$  even in highly correlated scenarios and the algorithm chosen for decomposition has an impact on the accuracy. A high level of orthogonality in the  $\boldsymbol{Q}$  matrix is desired to ensure that the properties of noise are not changed when transforming the linear system in (2.7) into an upper triangular system in (2.15). The error in the orthogonality of  $\boldsymbol{Q}$  can be measured by

$$\operatorname{Error}_{\boldsymbol{Q}} = \parallel \boldsymbol{Q}^H \boldsymbol{Q} - \boldsymbol{I} \parallel_F, \tag{4.1}$$

where $\boldsymbol{Q}$ is produced from the QRD of the channel estimate $\boldsymbol{H}$ and $\boldsymbol{I}$ is the identity matrix. Figure 4.8(b) compares the accuracy of fixed point implementations of the MGS algorithm and the HHT. It can be seen that the $\boldsymbol{Q}$ matrices from the HHT have lower error even at high condition numbers. Thus, on an algorithmic level, the HHT is more suitable for small scale MIMO systems and hence is chosen for hardware implementation. Additionally, the computational complexity of the HHT from (2.10) is lower than that of the GS process from (2.8) when the explicit calculation of $\boldsymbol{Q}$ matrices is not needed.

- (a) Condition number distribution of $\boldsymbol{H}$ for different channel correlation.
- (b) Error in Q Orthonormality.

Figure 4.8: Orthonormality comparison between MGS and HHT.
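The orthogonality metric in (4.1) can be evaluated with a few lines of code. The sketch below runs a floating point MGS decomposition on an arbitrary, well conditioned $3 \times 3$ complex matrix and computes the error. The test matrix is a hypothetical example, and neither the fixed point effects nor the HHT results of Figure 4.8(b) are reproduced here.

```python
import math

def mgs_qr(H):
    """Modified Gram-Schmidt QRD of a square complex matrix (list of rows).

    Returns Q as a list of orthonormal columns and R as a list of rows.
    """
    n = len(H)
    cols = [[H[i][j] for i in range(n)] for j in range(n)]
    Q = []
    R = [[0j] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for i, q in enumerate(Q):
            # Project the partially orthogonalized vector, not the original column.
            R[i][j] = sum(qk.conjugate() * vk for qk, vk in zip(q, v))
            v = [vk - R[i][j] * qk for qk, vk in zip(q, v)]
        norm = math.sqrt(sum(abs(vk) ** 2 for vk in v))
        R[j][j] = norm
        Q.append([vk / norm for vk in v])
    return Q, R

def orthogonality_error(Q):
    """Frobenius norm of Q^H Q - I, as in (4.1); Q is a list of columns."""
    n = len(Q)
    total = 0.0
    for i in range(n):
        for j in range(n):
            g = sum(a.conjugate() * b for a, b in zip(Q[i], Q[j]))
            total += abs(g - (1 if i == j else 0)) ** 2
    return math.sqrt(total)

# Hypothetical, diagonally dominant test matrix (not a measured channel).
H = [[4 + 1j, 1 - 1j, 0.5j],
     [0.5 - 2j, 5 + 0j, 1 + 1j],
     [1 + 0j, -1 + 1j, 6 - 1j]]
Q, R = mgs_qr(H)
print(orthogonality_error(Q))
```

In floating point and for a well conditioned matrix the error is close to machine precision. The comparison in Figure 4.8(b) concerns how this error grows with condition number under fixed point arithmetic.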

Regardless of the algorithm chosen, the QRD process has a complexity in the order of  $\mathcal{O}(M^3)$  for an  $M \times M$  full rank matrix, and the QRD throughput required in an OFDM system with a pilot pattern shown in Figure 2.15 can be obtained as

$$\mathbf{QRD}/s = BW_{sc} \times (N_{sym} - N_p/P_{space}) \times 1000 \tag{4.2}$$

where $BW_{sc}$ is the bandwidth measured in sub-carriers, $N_{sym}$ is the number of OFDM symbols in 1 ms, $N_p$ is the number of symbols carrying pilots, and $P_{space}$ is the pilot/reserved tone spacing. This turns out to be around 14 M QRD/s for a 20 MHz BW LTE signal and increases to 72 M QRD/s in five-band CA scenarios. Typical cellular networks distribute limited time and frequency resources among multiple users. Furthermore, channel conditions and hence the channel estimates do not change drastically in all situations as depicted by example channels in Figure 2.13. Thus, instead of computing QRD at every OFDM tone, algorithmic approximations with linear interpolation such as the ones proposed in subsection 3.2.2 and depicted in Figure 3.4 can be used to reduce throughput requirements. Accurate QRD computations can be replaced by interpolation approximations, provided that the error from interpolation is kept low. The interpolation distance can be chosen based on channel parameters such as coherence bandwidth and coherence time, to find a balance between the error due to interpolation and the throughput requirements. The complexity can also be reduced by an order of $\mathcal{O}(M)$ by adopting linear interpolation. More importantly, choosing interpolation factors of 4 and 8 reduces complexity further, as the multiplication operations in linear interpolation of (3.5) can be replaced by much simpler shift and add operations. Interpolation with constant multipliers not only requires lower silicon area, but also has a simpler architecture than the QRD unit.

- (a) Adaptive QRD processing for CA LTE-A.
- (b) Chip microphotograph of adaptive QRD preprocessor.

Figure 4.9: QRD processor for carrier aggregated (CA) LTE-A systems.
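The replacement of multipliers by shift and add operations for power-of-two interpolation distances can be sketched as follows. The code assumes that the linear interpolation of (3.5) has the usual two-point form and that matrix elements are integers representing fixed point values. The function names are hypothetical.

```python
def times_const(x, k):
    """Multiply integer x by a small non-negative constant k using shifts and adds."""
    acc, shift = 0, 0
    while k:
        if k & 1:
            acc += x << shift
        k >>= 1
        shift += 1
    return acc

def interpolate(a, b, log2_dist):
    """Linear interpolation between a and b over 2**log2_dist steps.

    Returns the values at offsets 0 .. dist-1. The division by the
    interpolation distance is an arithmetic right shift.
    """
    dist = 1 << log2_dist
    return [(times_const(a, dist - k) + times_const(b, k)) >> log2_dist
            for k in range(dist)]

print(interpolate(16, 48, 2))    # distance 4
print(interpolate(-40, 24, 3))   # distance 8
```

Since the weights are known constants smaller than the interpolation distance, each product needs at most two or three shifted additions, and the final division costs nothing in hardware beyond wiring.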

From a hardware architecture perspective, linear interpolation provides a simple mechanism to achieve reconfigurability and the architecture in Figure 4.9(a) is adopted. A single HHT based QRD unit is used to produce the elements of the R matrix. Instead of generating the Q matrix, the received data vectors (y) are rotated with the v vectors from (2.9) and pipelining is employed to increase the operational frequency of the rotation unit. Multiple instances of rotation units are organized into two rotation banks and operated in parallel to increase the throughput, when required. On a circuit level, clock gating is used when only one of the rotation banks is needed. The parallel architecture also enables aggressive supply voltage scaling. The design is implemented to support forward body biasing to dynamically boost QRD throughput in high frequency selectivity and fast mobility scenarios.

Figure 4.9(b) shows a microphotograph of the adaptive interpolating QRD chip fabricated in 28 nm FD-SOI technology. Measurement results indicate that the proposed system can meet the throughput requirements for CA scenarios even in highly frequency and time selective channels. Algorithmic level choices such as selecting the HHT over the MGS enable the QRD preprocessor to provide accurate decomposition outputs even in spatially correlated scenarios. Algorithmic approximation with interpolation reduces power dissipation by 70% to 80% over the more traditional implementations without interpolation. Furthermore, the QRD throughput can be optimized with digital domain feedback to switch between interpolation distances, leading to an energy requirement as low as 0.2 nJ/QRD. On an architectural level, a High Level Synthesis (HLS) based flow is used for design space exploration to maximize resource sharing with folding. On a circuit level, the combined use of forward body biasing, voltage scaling and clock gated parallel rotation units results in a reduction of power dissipation by around 2.5 times when compared to operation without parallelism. The adaptive channel aware QRD preprocessor combines multiple levels of design optimizations presented in Figure 3.1 and demonstrates the effectiveness of co-optimization to achieve energy efficiency. The following items recap the methods used for optimizing the design and implementation:

- On the algorithmic level, a QRD algorithm suitable for correlated scenarios is chosen. Additionally, approximations via linear interpolations are used to lower QRD throughput in favorable channel conditions.
- On the architectural level, the QRD processor uses a folded implementation to increase hardware utilization. Parallel rotation banks are employed to support aggressive DVFS.
- On the circuit level, the system is implemented in 28 nm FD-SOI technology with support for forward body biasing. Clock gating is also used to reduce dynamic power dissipation.

## 4.3 Detectors for Massive MIMO Systems

Small scale MIMO based 4G systems provide a significant improvement in data rates over older generations with single antennas. Unfortunately, these systems are reaching their limits with respect to downlink data rates and capacities. Therefore, solutions using wider bandwidths in the order of several gigahertz and improved spectral efficiency are being investigated for the next generation communication systems. Massive MIMO systems are one such alternative that provides advantages such as a high array gain, reduction in small scale fading, simplification of processing at the UE by the use of transmit signal precoding and reduction of transmit power from BS antennas together with higher link reliability [18]. However, processing data from a large number of antennas (in the hundreds) creates challenges in terms of power dissipation and latency in the digital baseband of the BS. Although a typical BS is equipped with a direct source of power, energy efficiency is still important. This dissertation focuses on the signal detector block in massive MIMO systems with a goal of finding an energy efficient and reconfigurable implementation to handle a variable number of users under different channel conditions. The design and implementation details are presented in **Paper VI**.

Signal detection in MIMO systems is a computationally intensive task and Figure 4.7 highlights the need for non-linear detectors in small scale MIMO systems. Fortunately, simpler linear detection schemes based on MF and ZF provide good performance in massive MIMO systems and non-linear detection schemes are rarely needed. MF based detection is beneficial when the received signal is dominated by noise and operates by multiplying the received signal with the Hermitian conjugate of the channel estimate as shown in (2.17). However, the ZF detection algorithm is needed in scenarios where interference is more dominant than noise. These two detection schemes can be represented as

$$\begin{aligned} \boldsymbol{x}_{MF} &= \boldsymbol{H}^H \boldsymbol{y}, \\ \boldsymbol{x}_{ZF} &= (\boldsymbol{H}^H \boldsymbol{H})^{-1} \boldsymbol{x}_{MF}. \end{aligned} \tag{4.3}$$

It can be noticed that $\boldsymbol{x}_{MF}$, the result of MF detection, can be reused by ZF detection to produce $\boldsymbol{x}_{ZF}$. Additionally, the inversion of the Gram matrix $\boldsymbol{H}^H\boldsymbol{H}$ can be obtained with the Cholesky Decomposition (CD) process and $\boldsymbol{x}_{ZF}$ can be solved in two steps, with forward substitution followed by backward substitution, as

$$\boldsymbol{x}_{ZF} = \left(\boldsymbol{H}^{H} \boldsymbol{H}\right)^{-1} \boldsymbol{x}_{MF} = \left(\boldsymbol{L} \boldsymbol{L}^{H}\right)^{-1} \boldsymbol{x}_{MF}, \tag{4.4}$$

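$$\begin{aligned} \boldsymbol{y}_F &= \boldsymbol{L}^{-1}\boldsymbol{x}_{MF}, \\ \boldsymbol{x}_{ZF} &= \boldsymbol{L}^{-H}\boldsymbol{y}_F. \end{aligned} \tag{4.5}$$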



Figure 4.10: Two linear detection schemes for users with similar SNR. Horizontal axis: average per receive antenna SNR of all users (in dB).

The MF detection scheme has a complexity in the order of $\mathcal{O}(MK)$ in a system with M antennas at the BS and K single antenna users, whereas ZF detection with CD based inversion has an additional complexity in the order of $\mathcal{O}(MK^2+K^3)$. The calculation of the Gram matrix in (4.3) is the most computationally intensive task and requires MK(K+1)/2 multiplications. Although MF has a lower complexity than ZF, the uplink BER curves in Figure 4.10 for a system with 128 antennas at the BS and 16 single antenna users indicate that the ZF algorithm is required when the channels are non-orthogonal.
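The two-step solution in (4.3) to (4.5) can be traced with a small floating point sketch. The $4 \times 2$ channel matrix and the transmitted symbols are hypothetical and noise is omitted, so ZF should recover the symbols up to rounding. The helper names are illustrative and do not correspond to hardware blocks in this work.

```python
import math

def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))] for j in range(len(A[0]))]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def gram(H):
    """Gram matrix H^H H (K x K) of an M x K channel matrix."""
    Hh = hermitian(H)
    return [[sum(a * b.conjugate() for a, b in zip(ri, rj)) for rj in Hh] for ri in Hh]

def gram_multiplications(M, K):
    """Multiplications for one triangle of the Hermitian Gram matrix."""
    return M * K * (K + 1) // 2

def cholesky(G):
    """Lower triangular L with L L^H = G for Hermitian positive definite G."""
    n = len(G)
    L = [[0j] * n for _ in range(n)]
    for j in range(n):
        d = G[j][j].real - sum(abs(L[j][k]) ** 2 for k in range(j))
        L[j][j] = complex(math.sqrt(d))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = (G[i][j] - s) / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub_hermitian(L, y):
    """Solve L^H x = y."""
    n = len(y)
    x = [0j] * n
    for i in reversed(range(n)):
        s = sum(L[k][i].conjugate() * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i].conjugate()
    return x

def detect(H, y):
    x_mf = matvec(hermitian(H), y)                            # matched filter, (4.3)
    L = cholesky(gram(H))                                     # H^H H = L L^H, (4.4)
    x_zf = backward_sub_hermitian(L, forward_sub(L, x_mf))    # two steps of (4.5)
    return x_mf, x_zf

# Hypothetical channel with M = 4 antennas and K = 2 users, no noise.
H = [[1 + 1j, 0.5 - 0.2j], [0.3 - 1j, 1 + 0.4j],
     [-0.8 + 0.2j, 0.6 + 1j], [0.4 + 0.6j, -1 + 0.3j]]
x = [1 + 1j, -1 + 1j]
x_mf, x_zf = detect(H, matvec(H, x))
print(x_zf)
print(gram_multiplications(128, 16), gram_multiplications(128, 8))
```

The last line evaluates the MK(K+1)/2 count for 128 antennas with 16 and with 8 users, which shows how reducing the number of users that need ZF detection lowers the cost of Gram matrix generation.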

Mobile users are generally distributed in a cell and may have dissimilar SNR in the absence of uplink power control. The signal received from users close to the BS or users that have a line of sight path may have a higher power compared to users farther away. For example, an SNR difference of 16 dB may be present when the distance between a group of users and the BS is four times smaller than the distance between other users and the BS [12]. In such scenarios, a low complexity algorithm such as MF for the non-interference limited users may result in satisfactory BER performance. Figure 4.11 shows the average BER obtained with MF detection in a system with 16 users, of which 8 users have around 18 dB higher signal power. The performance of the same system with ZF for all 16 users is depicted in Figure 4.12, showing a small BER improvement for the strong users and a much larger improvement for the users with lower signal power.

Figure 4.11: Performance of MF detection in a massive MIMO system. Horizontal axis: per receive antenna SNR of interference limited users (in dB).

Figure 4.12: Performance of ZF detection in a massive MIMO system. Horizontal axis: per receive antenna SNR of users with lower power (in dB).

Figure 4.13: Adaptive detection for massive MIMO.

Thus, taking user distribution into account may allow the BS to choose between MF and ZF detection, providing an algorithmic level control knob to reduce computation count and save power. However, the mobility of users can result in constantly changing scenarios with respect to interference which may change the number of users who require MF or ZF. Thus, a reconfigurable detector is required not only to switch between the two detection schemes but also to handle a variable number of users for each detection scheme. Furthermore, the outputs of the MF detector can also be reused by the ZF detector. The hardware architecture in Figure 4.13(a) is proposed as a solution. The users close to the BS are detected by the user selection unit (User Sel.). This may be performed by examining the entries of the Gram matrix, whose off-diagonal entries indicate the co-interference between users and the diagonal elements indicate the signal strength from corresponding users. Reducing the number of users who need ZF detection also lowers the computation count of Gram matrix generation. The architecture in Figure 4.13(b) is proposed to incorporate reconfigurability for handling a variable number of users for ZF detection. The CD preprocessor is optimized for 8 users and the block decomposition algorithm is used to find the CD of larger $16 \times 16$ Gram matrices in scenarios where all users are interference limited. The block decomposition algorithm provides a way to construct the CD of larger matrices with results from the CD of smaller sub-matrices. Let the Gram matrix for 16 users be defined as

$$\boldsymbol{H}^{H}\boldsymbol{H} = \begin{bmatrix} \boldsymbol{A} & \boldsymbol{B}^{H} \\ \boldsymbol{B} & \boldsymbol{D} \end{bmatrix}, \tag{4.6}$$

and let

$$\boldsymbol{S} = \boldsymbol{D} - \boldsymbol{B}\boldsymbol{A}^{-1}\boldsymbol{B}^H, \tag{4.7}$$

be the Schur complement obtained with smaller $8 \times 8$ submatrices $\boldsymbol{A}$, $\boldsymbol{B}$ and $\boldsymbol{D}$. Then, using the block decomposition algorithm, the CD of the Gram matrix is

$$\operatorname{CD}(\boldsymbol{H}^{H}\boldsymbol{H}) = \begin{bmatrix} \boldsymbol{L}_{A} & \mathbf{0} \\ \boldsymbol{B}\boldsymbol{L}_{A}^{-H} & \boldsymbol{L}_{S} \end{bmatrix}, \tag{4.8}$$

where  $L_A$  and  $L_S$  are lower triangular matrices obtained from the CD of A and S respectively. It can be noted that the block decomposition algorithm requires  $A^{-1}$ . However, the properties of a massive MIMO system can be used to approximate  $A^{-1}$  as

$$\boldsymbol{A}^{-1} \approx \gamma \boldsymbol{I}, \qquad \gamma = \mathbb{E}\left[a_{jj}^{-1}\right], \tag{4.9}$$

where $a_{jj}$ are the diagonal elements of $\boldsymbol{A}$. The Schur complement can then be simplified to

$$\boldsymbol{S} \approx \boldsymbol{D} - \gamma \boldsymbol{B} \boldsymbol{B}^H. \tag{4.10}$$

These additional algorithmic level approximations allow a hardware efficient CD unit for 8 users to be reused for 16 user ZF detection scenarios. The throughput requirements on the CD processor can also be reduced by using linear interpolation with a distance  $D_{sc}$ , dependent on the coherence bandwidth and coherence time of the operating channel, similar to the interpolation strategy employed to reduce QRD throughput in small scale MIMO systems.
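As a consistency check, the block factor in (4.8) can be multiplied by its Hermitian transpose. Using $\boldsymbol{A} = \boldsymbol{L}_A \boldsymbol{L}_A^H$, $\boldsymbol{S} = \boldsymbol{L}_S \boldsymbol{L}_S^H$ and the exact Schur complement in (4.7), the product returns the partitioned Gram matrix in (4.6):

$$\begin{bmatrix} \boldsymbol{L}_{A} & \mathbf{0} \\ \boldsymbol{B}\boldsymbol{L}_{A}^{-H} & \boldsymbol{L}_{S} \end{bmatrix} \begin{bmatrix} \boldsymbol{L}_{A}^{H} & \boldsymbol{L}_{A}^{-1}\boldsymbol{B}^{H} \\ \mathbf{0} & \boldsymbol{L}_{S}^{H} \end{bmatrix} = \begin{bmatrix} \boldsymbol{L}_{A}\boldsymbol{L}_{A}^{H} & \boldsymbol{B}^{H} \\ \boldsymbol{B} & \boldsymbol{B}\boldsymbol{A}^{-1}\boldsymbol{B}^{H} + \boldsymbol{L}_{S}\boldsymbol{L}_{S}^{H} \end{bmatrix} = \begin{bmatrix} \boldsymbol{A} & \boldsymbol{B}^{H} \\ \boldsymbol{B} & \boldsymbol{D} \end{bmatrix}.$$

The lower right block uses $\boldsymbol{L}_A^{-H}\boldsymbol{L}_A^{-1} = (\boldsymbol{L}_A \boldsymbol{L}_A^H)^{-1} = \boldsymbol{A}^{-1}$. When the approximation in (4.10) replaces the exact Schur complement, this identity holds only approximately.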

The architecture of the configurable ZF detector in Figure 4.13(b) requires an $8\times 8$ matrix multiplier (Mat. Mul.) that can be time multiplexed with the Gram matrix generation unit, saving silicon area. Further area reduction is obtained by performing both forward and backward substitution (FBS.) operations with the same hardware block. On a circuit level, the CD unit, the interpolation unit for the $\boldsymbol{L}$ matrices, and the forward/backward substitution unit are operated on different clock frequencies to optimize power dissipation. The adaptive ZF detector is implemented in 28 nm FD-SOI technology and requires 141 k gates of logic. Post-layout simulations indicate a requirement of around 1.4 nJ/CD with a BER loss lower than 1 dB when compared to the performance of a floating point implementation. Furthermore, using interpolations instead of accurate CD computations reduces the power dissipation of the CD unit by an additional 50% to 90%, corresponding to a 25% to 50% reduction in the power requirement of the ZF detector.

The adaptive detection scheme is tailored for massive MIMO BSs to balance power dissipation and performance with a capability to handle a variable number of users. Algorithm selection between MF and ZF reduces computational complexity. The use of digital feedback on channel properties such as coherence bandwidth to replace CD computations by interpolations provides an easily configurable knob to control power dissipation. The hardware is designed to support these algorithmic level approximations and additional architectural level techniques are used to increase hardware efficiency. Circuit level simulations indicate that the proposed ZF detector is highly energy efficient and reinforce the advantages of using digital feedback to lower power dissipation. The following items sum up the techniques used to realize the adaptive massive MIMO signal detector:

- On the algorithmic level, spatial distribution of users is exploited to switch between two algorithms to lower computation count. A block decomposition algorithm is used to reduce hardware cost while supporting scenarios where the more complex ZF is needed. Linear interpolation based on channel properties is used to further reduce computation count.
- On the architectural level, the ZF detector uses a folded implementation to increase hardware utilization. A single block is used to perform two tasks, forward and backward substitution, increasing hardware efficiency.
- On the circuit level, a multi-clock domain design approach is adopted to lower power dissipation. The design will also benefit from the chosen 28 nm FD-SOI technology, where back gate biasing can also be used to reduce power dissipation.

# Chapter 5

# Paper Summary and Discussion

This chapter presents a summary of the papers included in the second part of this compilation. By doing so, it briefly discusses the relevance of this work to the research field. The chapter also considers possible directions for the future to improve the results presented in this dissertation.

#### 5.1 Research Contributions

I am the main author of all the included papers. The co-authors of the papers helped with valuable feedback over the course of the research work. My supervisor's ideas, suggestions, feedback and guidance, from the initial research stage to the final manuscript writing, have immensely helped in improving the quality of the research presented in the papers.

#### 5.1.1 Paper I

Several key blocks in a wireless receiver include analog circuits that may have residual non-linearities, leading to signal intermodulation and distortion. Consequently, a digital technique to detect these non-linearities and calibrate analog components to improve signal quality is proposed in this paper. An architecture for digital calibration based on an auxiliary ADC and an adaptive algorithm is presented, with simulation results for performance evaluation and hardware analysis. A simplified model of a tunable CSF is used as the analog component in an LTE system, with the LMS algorithm for non-linearity detection. Parameters such as auxiliary ADC resolution, adaptive filter length and multiplier widths are evaluated to understand the cost of the proposed architecture. The results obtained show that a low resolution auxiliary ADC and a few other low complexity blocks can detect the non-linearities.

**Contribution:** I contributed with background research and set up the full system simulation model, performed simulations, analyzed the results and wrote the manuscript.
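As a reminder of how the LMS algorithm mentioned above adapts its coefficients, the following minimal Python sketch shows the textbook LMS update used to identify an unknown FIR response. It is an illustration under assumptions, not the architecture of Paper I: the function name `lms_identify`, the step size and the test signal are hypothetical, and the auxiliary ADC, the CSF model and fixed-point word lengths are not modeled.

```python
def lms_identify(x, d, num_taps, mu):
    """Adapt FIR weights w so that the filtered input x tracks the reference d."""
    w = [0.0] * num_taps
    for n in range(len(x)):
        # Most recent sample first; samples before the record start are zero.
        window = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, window))
        e = d[n] - y                                   # instantaneous error
        w = [wk + mu * e * xk for wk, xk in zip(w, window)]
    return w
```

The filter length `num_taps` and the precision of the products in the update are the kind of parameters whose hardware cost the paper evaluates.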

#### 5.1.2 Paper II

Paper II builds upon the concepts developed in Paper I. A full digital non-linearity calibration system for a tunable CSF chip designed in 65 nm CMOS is implemented on a Xilinx FPGA. One of the key components required for the hardware implementation, namely the decimation filter chain, is analyzed and optimized for hardware with a multistage half band structure. The performance of the calibration technique is tested with real life signals. Experiments and measurement results show that the digital tuning loop can be realized with low hardware cost and is able to improve the performance of the CSF. Digital feedback is used to re-calibrate the CSF towards its optimal operating point, which leads to a 14 dB reduction in IM3 distortion.

**Contribution:** I performed system simulations, designed all the digital hardware blocks, integrated the system for FPGA implementation, performed measurements and wrote the manuscript.
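The multistage half band structure mentioned above can be sketched in a few lines of Python. This is an illustrative example only: the 7-tap coefficient set `HALF_BAND` is a common textbook half-band filter chosen here as an assumption, not the filter used in Paper II, and the function names are hypothetical.

```python
# Illustrative 7-tap half-band filter: apart from the center tap, every
# second coefficient is zero, so those multiplications can be skipped.
HALF_BAND = [-1 / 32, 0.0, 9 / 32, 16 / 32, 9 / 32, 0.0, -1 / 32]

def half_band_decimate(x):
    """Filter x with HALF_BAND and keep every second output sample."""
    taps = len(HALF_BAND)
    out = []
    for n in range(taps - 1, len(x), 2):  # start once the window is full
        out.append(sum(HALF_BAND[k] * x[n - k] for k in range(taps)))
    return out

def multistage_decimate(x, stages):
    """Cascade `stages` half-band decimators for a total factor of 2**stages."""
    for _ in range(stages):
        x = half_band_decimate(x)
    return x
```

Cascading simple decimate-by-two stages keeps most of the arithmetic at the reduced sample rates, which is the usual motivation for a multistage structure.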

#### 5.1.3 Paper III

Small scale MIMO systems often use non-linear detection algorithms to improve BER performance. To support these detectors, a QRD processor is used to decompose the channel estimates  $\boldsymbol{H}$  into two matrices. The QRD accuracy in fixed point hardware depends on channel correlations, and the error in the decomposition process has a direct effect on BER performance. Mobile UEs with closely spaced antennas can experience increased correlation, and **Paper III** analyzes two variants of the QRD algorithm with respect to decomposition accuracy in correlated scenarios. The results show that HHT has higher resilience, and hence it is chosen for hardware implementation. A real-valued version of the HHT, suitable for non-linear detectors operating with real numbers, is implemented. HLS is used for architectural analysis, and two designs with a throughput of  $72\,\mathrm{M}\,\mathrm{QRD/s}$  for supporting five-band CA are presented.

**Contribution:** I performed the background research, algorithmic analysis, performance simulations, design implementation and wrote the manuscript.
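For readers unfamiliar with the Householder-based QRD referred to above, the following Python sketch shows a floating point Householder QR decomposition of a real-valued matrix, together with the standard mapping of a complex matrix to its real-valued equivalent. It is a textbook formulation under assumptions, not the fixed point micro-architecture of Paper III; the names `real_valued` and `householder_qr` are hypothetical.

```python
import math

def real_valued(h):
    """Map a complex M x N matrix to its 2M x 2N real-valued equivalent."""
    top = [[z.real for z in row] + [-z.imag for z in row] for row in h]
    bottom = [[z.imag for z in row] + [z.real for z in row] for row in h]
    return top + bottom

def householder_qr(a):
    """Return (q, r) with a = q r, q orthogonal and r upper triangular."""
    m, n = len(a), len(a[0])
    r = [list(map(float, row)) for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:
            continue  # column already zero below the diagonal
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])  # sign chosen to avoid cancellation
        vtv = sum(t * t for t in v)
        # r <- H r with H = I - 2 v v^T / (v^T v), acting on rows k..m-1.
        for j in range(n):
            dot = sum(v[i] * r[k + i][j] for i in range(len(v)))
            for i in range(len(v)):
                r[k + i][j] -= 2.0 * v[i] * dot / vtv
        # q <- q H, acting on columns k..m-1.
        for i in range(m):
            dot = sum(q[i][k + j] * v[j] for j in range(len(v)))
            for j in range(len(v)):
                q[i][k + j] -= 2.0 * dot * v[j] / vtv
    return q, r
```

A complex channel matrix would first be passed through `real_valued` and the result decomposed with `householder_qr`, which doubles both matrix dimensions but keeps all arithmetic real.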

#### 5.1.4 Paper IV

A peak throughput of 72 M QRD/s is required when a single UE is assigned five bands of 20 MHz bandwidth in channels with high frequency and time selectivity. However, channel conditions are not always at their worst, and the full bandwidth is rarely assigned to a single user in a cellular communication system. Taking these into account, **Paper IV** proposes a framework to reduce QRD throughput by using a channel aware methodology. An adaptive approach based on operating conditions such as coherence bandwidth and the received signal SNR is presented, where low complexity interpolation approximations replace accurate QRD computations. Three different channel models are analyzed to find interpolation distances that result in an uncoded BER performance loss of less than 1 dB. The selective QRD methodology not only reduces computation count but also allows optimizations on the channel estimator. One such implementation with a windowed FFT based channel estimation is also proposed in the paper. The adaptive framework provides the flexibility required to balance power and performance in a MIMO receiver, resulting in a 40% to 80% reduction in multiplications.

**Contribution:** I did the background research, performance simulations, evaluated the results and wrote the manuscript.
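The selective decomposition idea summarized above can be sketched as follows: an exact decomposition is computed only at every `distance`-th subcarrier, and the outputs in between are filled in by linear interpolation. This is a minimal Python sketch under assumptions; the function names are hypothetical, `decompose` stands in for whichever exact computation is being replaced, and which decomposition outputs are actually interpolated in Paper IV is not restated here.

```python
def lerp_matrix(a, b, t):
    """Element-wise linear interpolation (1 - t) * a + t * b."""
    return [[(1 - t) * x + t * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def selective_decompose(channels, distance, decompose):
    """Decompose every `distance`-th matrix exactly and interpolate the rest.

    Returns (results, exact_calls). The last subcarrier is always computed
    exactly so that each interpolated point lies between two exact anchors.
    """
    n = len(channels)
    if n == 0:
        return [], 0
    anchors = sorted(set(range(0, n, distance)) | {n - 1})
    exact = {i: decompose(channels[i]) for i in anchors}
    results = []
    for i in range(n):
        if i in exact:
            results.append(exact[i])
            continue
        lo = max(a for a in anchors if a < i)
        hi = min(a for a in anchors if a > i)
        t = (i - lo) / (hi - lo)
        results.append(lerp_matrix(exact[lo], exact[hi], t))
    return results, len(anchors)
```

With nine subcarriers and a distance of 4, only three exact decompositions are needed; the distance is the knob that would be set from the coherence bandwidth and SNR feedback.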

#### 5.1.5 Paper V

Paper V presents the hardware implementation of an adaptive QRD processor based on the analysis performed in **Paper IV**. The interpolation concept is extended to the time domain by examining the coherence time properties of LTE-A channels. Two system level architectures are proposed to support interpolation by factors of 4 and 8, and a micro-architecture for the real-value decomposition based HHT is presented. The framework from Paper IV is extended to exploit additional levels of adaptability, such as clock gating and back gate biasing, available at the circuit level. The full system with one QRD processor and two rotation banks is fabricated in 28 nm CMOS technology. Measurement results show that the QRD processor requires 1.3 nJ/QRD, which is the lowest reported number in the literature, to the best of the author's knowledge. Furthermore, the adaptive system can operate in interpolation modes, which reduce the QRD energy requirement by an additional 70% to 80%. The design is able to produce a throughput of 22 MQRD/s and can handle five-band CA scenarios even in highly frequency selective and time varying channel conditions. The back gate biasing feature is used to increase the operational speed in wide bandwidth assignment scenarios, showcasing the benefits of FD-SOI technology. This paper presents optimizations performed on all levels of the digital design flow with the goal of reducing power dissipation, and the end result highlights that a combination of all these techniques is essential to achieve energy efficiency.

**Contribution:** I did the algorithmic and architectural analysis, designed all the hardware blocks, did the chip implementation, testing, measurements and wrote the manuscript.

#### 5.1.6 Paper VI

Massive MIMO is one of the leading candidates for the next generation of communication systems. A large number of antennas at the BS are used to reduce the complexity of signal detection in battery operated UEs while improving spectral efficiency. However, the computational complexity at the BS is higher than in small scale MIMO. Paper VI examines ways to reduce this complexity and presents a reconfigurable detector capable of switching between two linear detection schemes. MF is used whenever possible and ZF, which is the more complex of the two, is used only for interference dominated users, leading to a reduction in the total computation cost. A hardware implementation of the ZF detector based on CD is presented, and the block decomposition algorithm is used to detect up to 16 users. A linear interpolation method, similar to the one in **Paper V**, helps further reduce complexity by replacing CD computations by interpolation approximations. BER simulations of the proposed solution show a small performance loss compared to a floating point implementation. The full ZF detector is implemented in 28 nm FD-SOI technology and requires around 281 k gates. Post-layout power simulations indicate an energy requirement of 1.4 nJ/CD, which can be reduced by 50% to 80% with interpolations, translating to a reduction of 25% to 50% in the power dissipation of the ZF detector. The building blocks of the ZF detector are co-optimized, leading to a high degree of hardware reuse, resulting in an energy and area efficient signal detector for massive MIMO systems.

**Contribution:** I did the study to evaluate the performance of the proposed solution, designed all the blocks, performed simulations, hardware implementation and wrote the manuscript.
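The idea of using MF wherever possible and reserving ZF for interference dominated users can be illustrated with a simple partitioning rule based on the Gram matrix. The Python sketch below is an assumption made for illustration: the interference metric, the `threshold` parameter and the function name `split_users` are hypothetical and are not taken from Paper VI.

```python
def split_users(h, threshold):
    """Partition user indices into (mf_users, zf_users) for an M x K channel h.

    A user is marked as interference dominated when the summed squared
    off-diagonal Gram entries of its row, normalized by the squared
    diagonal entry, exceed `threshold`.
    """
    m, k = len(h), len(h[0])
    gram = [[sum(h[r][i].conjugate() * h[r][j] for r in range(m))
             for j in range(k)] for i in range(k)]
    mf_users, zf_users = [], []
    for i in range(k):
        interference = sum(abs(gram[i][j]) ** 2 for j in range(k) if j != i)
        ratio = interference / abs(gram[i][i]) ** 2
        (zf_users if ratio > threshold else mf_users).append(i)
    return mf_users, zf_users
```

Only the users in the second list would need the Cholesky-based ZF path, so the size of the matrix to be decomposed shrinks as more users become separable by MF alone.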

### Scientific Contribution

The contributions of this dissertation to the research field are summarized as follows:

- Demonstrated that different levels of optimizations on algorithmic, architectural and circuit level are invaluable to lower power dissipation, and that co-optimization over these leads to energy efficient digital designs.
- Proposed methods to digitally assist a radio receiver to improve performance by the use of two feedback strategies, namely Dig.→Ana. and Dig.→Dig. The former method is suitable for tunable analog circuits while the latter is beneficial for digital baseband circuits.
- Demonstrated the effectiveness of a low cost digital tuning scheme to mitigate non-linearities in analog circuits. The advantages of the scheme include lower power dissipation, improved receiver performance and increased robustness to PVT variations. The scheme is especially useful for receivers fabricated with newer technologies.

- Presented a framework for an adaptive interpolation strategy for the QRD process in LTE-A that can be applied to other baseband processing blocks and communication standards. Established that a channel and QoS aware methodology for optimizing hardware is both practical and productive. Designing a digital baseband block for the average use case with adaptive capability for extreme scenarios is cost effective and reduces area.
- Confirmed that many of these concepts can be applied to receivers intended for the next generation of communication systems. Showed that the flexibility that comes with configurable hardware is highly beneficial and essential in massive MIMO systems.

#### 5.2 Discussion and Future Work

The focus of this work is to find ways to improve the performance and energy efficiency of circuits used in wireless receivers. Technology scaling is beneficial for digital designers, enabling the implementation of complex algorithms at reasonable silicon cost and resulting in battery operated devices with computing power similar to desktop computers from a decade ago. However, the design of analog circuits at lower supply voltages is challenging, especially in terms of linearity. Paper I and Paper II of this compilation present a method for calibrating a reconfigurable CSF towards higher linearity, which has shown promising results. Nonetheless, a full system integrated on a chip would be useful to evaluate the actual benefits. The same concept can be extended to other components in the analog front end, such as the LNA, and it would be interesting to see a digital receiver capable of calibrating different components to optimize full system energy efficiency.

Digital baseband processing blocks for MIMO systems are the other topic discussed in this dissertation. Adaptive processing based on the operating channel conditions is essential to reduce area and power dissipation. The wide range of scenarios in which a typical UE operates calls for easily reconfigurable hardware. One way of achieving a good balance between reconfigurability and accuracy is to use low cost linear interpolation approximations. This work presents hardware architectures for two decomposition algorithms, the QRD for small scale MIMO and the CD for massive MIMO, where the adaptive interpolation concept has been used to lower power dissipation. This interpolation scheme also enables a way to optimize blocks around it, such as the CE, which can be designed to produce only the minimum required estimates. This is briefly discussed in Paper IV, but a more thorough analysis with a better channel estimator is needed. Uncoded BER is used for evaluating the performance of most of the proposed solutions. Practical communication systems will employ some form of coding to improve reliability, which may provide another control knob to adapt system configurations. Further analysis is needed to understand the tradeoffs when operating in such systems. The hardware implementation of the adaptive QRD processor with multiple clock domains proposed in Paper V will be beneficial to reduce area and needs further investigation. The rotation banks used in the QRD processor occupy a significant portion of the design and can also be improved to reduce silicon area. On the physical design level, partitioning with multiple power domains will contribute towards reducing power dissipation. This is especially important in FD-SOI technology, where forward body biasing provides the benefit of increasing operational frequency but also increases static leakage current.

Several problems present in small scale MIMO systems are inherently handled in massive MIMO systems. One of the important advantages of massive MIMO is the simplification of the receiver at the UE. This, however, moves the complexity to the BS, which, coupled with a large number of antennas, requires adaptive channel aware processing to achieve high energy efficiency. The proposed detector based on the CD is optimized for a 20 MHz bandwidth system in the uplink, but can also be used as part of the precoder in the downlink, which needs further investigation. Time domain correlation of channels can also be leveraged to reduce CD throughput. A full system with MF and Gram matrix generator should also be implemented to examine and optimize the design. The analysis in Paper VI has shown that most of the power reduction is obtained by using interpolation distances in the range of 4 to 20. Thus, the interpolation block can be improved by replacing full multipliers with constant multipliers, which can lead to a significant reduction of silicon area. The block decomposition algorithm is implemented for  $8 \times 8$  matrices, but can also be realized with smaller  $4 \times 4$  matrices. Such an approach will reduce silicon area but requires a more complicated schedule to ensure efficient hardware resource utilization. The proposed solution is capable of processing 16 users, but can also be extended to a higher number of users with the building blocks presented. Some scenarios may arise in real life environments that require more advanced detection schemes. The detector provides a possibility to interface to advanced tree search based detectors, but this interface has not been fully implemented and tested. A fascinating project would be to design a full system with all these different detection schemes and interface it to the Lund University massive MIMO testbed to evaluate its performance in a real outdoor environment.

## References

- [1] G. E. Moore, "Cramming More Components Onto Integrated Circuits," *Proceedings of the IEEE*, vol. 86, no. 1, pp. 82–85, January 1998.

- [2] J. Rabaey, Low Power Design Essentials. Springer Science & Business Media, 2009.
- [3] A. Nejdel, "Flexible Receivers in CMOS for Wireless Communication," Ph.D. dissertation, Lund University, Faculty of Engineering LTH, Lund, October 2015.
- [4] M. Abdulaziz, M. Törmänen, and H. Sjöland, "A 4th Order Gm-C Filter with 10MHz Bandwidth and 39dBm IIP3 in 65nm CMOS," in 40th European Solid State Circuits Conference, September 2014, pp. 367–370.
- [5] M. Abdulaziz, "Linearity Enhancements of Receiver Front-end Circuits for Wireless Communication," Ph.D. dissertation, Lund University, Faculty of Engineering LTH, Lund, April 2016.
- [6] V. Aparin, G. J. Ballantyne, C. J. Persico, and A. Cicalini, "An integrated LMS adaptive filter of TX leakage for CDMA receiver front ends," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 5, pp. 1171–1182, May 2006.
- [7] 3GPP. LTE. [Online]. Available: http://www.3gpp.org/technologies/keywords-acronyms/98-lte
- [8] 3GPP. LTE-Advanced. [Online]. Available: http://www.3gpp.org/technologies/keywords-acronyms/97-lte-advanced
- [9] T. L. Marzetta, "Noncooperative Cellular Wireless with Unlimited Numbers of Base Station Antennas," *IEEE Transactions on Wireless Communications*, vol. 9, no. 11, pp. 3590–3600, November 2010.
- [10] R. Qureshi, "Ericsson Mobility Report," June 2016. [Online]. Available: https://www.ericsson.com/mobility-report
- [11] Federal Communications Commission, "Auction 97: Advanced Wireless Services." [Online]. Available: http://wireless.fcc.gov/auctions/default.htm?job=auction\_summary&id=97
- [12] A. F. Molisch, Wireless communications. John Wiley & Sons, 2007.
- [13] 3GPP, "Evolved Universal Terrestrial Radio Access (E-UTRA); User Equipment (UE) radio transmission and reception," 3rd Generation Partnership Project (3GPP), TS 36.101 V12.5.0, November 2014.

- [14] E. Dahlman, S. Parkvall, and J. Sköld, LTE/LTE-Advanced for Mobile Broadband. Academic Press, 2011.

- [15] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless Communications. Cambridge University Press, 2003.
- [16] L. Verma, M. Fakharzadeh, and S. Choi, "WiFi on Steroids: 802.11AC and 802.11AD," *IEEE Wireless Communications*, vol. 20, no. 6, pp. 30–35, December 2013.
- [17] G. J. Foschini, "Layered Space-Time Architecture for Wireless Communication in a Fading Environment When Using Multi-Element Antennas," *Bell Labs Technical Journal*, vol. 1, no. 2, pp. 41–59, Autumn 1996.
- [18] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, "Scaling Up MIMO: Opportunities and Challenges with Very Large Arrays," *IEEE Signal Processing Magazine*, vol. 30, no. 1, pp. 40–60, January 2013.
- [19] J. Vieira, S. Malkowsky, K. Nieman, Z. Miers, N. Kundargi, L. Liu, I. Wong, V. Öwall, O. Edfors, and F. Tufvesson, "A Flexible 100-Antenna Testbed for Massive MIMO," in *IEEE Globecom Workshops*, Dec 2014, pp. 287–293.
- [20] 3GPP, "Evolved Universal Terrestrial Radio Access (E-UTRA); Radio Frequency (RF) system scenarios," 3rd Generation Partnership Project (3GPP), TS 36.942 V10.3.0, June 2012.
- [21] J. Vieira, F. Rusek, and F. Tufvesson, "Reciprocity calibration methods for Massive MIMO based on antenna coupling," in *IEEE Global Communications Conference*, December 2014, pp. 3708–3712.
- [22] H. Q. Ngo, E. G. Larsson, and T. L. Marzetta, "Energy and Spectral Efficiency of Very Large Multiuser MIMO Systems," *IEEE Transactions on Communications*, vol. 61, no. 4, pp. 1436–1449, April 2013.
- [23] M. Andersson, M. Anderson, L. Sundström, S. Mattisson, and P. Andreani, "A Filtering  $\Delta\Sigma$  ADC for LTE and Beyond," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 7, pp. 1535–1547, July 2014.
- [24] J. J. van de Beek, M. Sandell, and P. O. Borjesson, "ML Estimation of Time and Frequency Offset in OFDM Systems," *IEEE Transactions on Signal Processing*, vol. 45, no. 7, pp. 1800–1805, July 1997.
- [25] R. E. Crochiere and L. R. Rabiner, "Interpolation and Decimation of Digital Signals A Tutorial Review," *Proceedings of the IEEE*, vol. 69, no. 3, pp. 300–331, March 1981.

- [26] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Prentice-Hall, Inc., 1993.

- [27] AD9361 RF Agile Transceiver, Analog Devices, 2016, rev. F.
- [28] D. Chu, "Polyphase Codes With Good Periodic Correlation Properties (Corresp.)," *IEEE Transactions on Information Theory*, vol. 18, no. 4, pp. 531–532, July 1972.
- [29] Y. Yang, W. Che, N. Yan, X. Tan, and H. Min, "Efficient implementation of primary synchronisation signal detection in 3GPP LTE downlink," *Electronics Letters*, vol. 46, no. 5, pp. 376–377, March 2010.
- [30] 3GPP, "Evolved Universal Terrestrial Radio Access (E-UTRA); Physical channels and modulation," 3rd Generation Partnership Project (3GPP), TS 36.211 V12.8.0, June 2016.
- [31] A. Nejdel, M. Törmänen, and H. Sjöland, "A 0.7 to 3 GHz wireless receiver front end in 65-nm CMOS with an LNA linearized by positive feedback," *Analog Integrated Circuits and Signal Processing*, vol. 74, no. 1, pp. 49–57, January 2013.
- [32] M. Abdulaziz, M. Törmänen, and H. Sjöland, "A Compensation Technique for Two-Stage Differential OTAs," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 61, no. 8, pp. 594–598, August 2014.
- [33] P. H. Moose, "A Technique for Orthogonal Frequency Division Multiplexing Frequency Offset Correction," *IEEE Transactions on Communications*, vol. 42, no. 10, pp. 2908–2914, October 1994.
- [34] K. P. Pun, J. E. Franca, C. Azeredo-Leme, C. F. Chan, and C. S. Choy, "Correction of frequency-dependent I/Q mismatches in quadrature receivers," *Electronics Letters*, vol. 37, no. 23, pp. 1415–1417, November 2001.
- [35] J. Tubbax, A. Fort, L. V. der Perre, S. Donnay, M. Engels, M. Moonen, and H. D. Man, "Joint Compensation of IQ Imbalance and Frequency Offset in OFDM systems," in *IEEE Global Telecommunications Conference*, vol. 4, December 2003, pp. 2365–2369.
- [36] H. François and A. Bordoux, Digital Compensation for Analog Front-Ends. John Wiley & Sons, 2008.
- [37] 3GPP, "Evolved Universal Terrestrial Radio Access (E-UTRA); Base Station (BS) radio transmission and reception," 3rd Generation Partnership Project (3GPP), TS 36.101 V12.11.0, March 2016.

- [38] R. Krishnan, "On the Impact of Phase Noise in Communication Systems – Performance Analysis and Algorithms," Ph.D. dissertation, Institutionen för signaler och system, Kommunikationssystem, Chalmers Tekniska Högskola, April 2015.

- [39] S. He and M. Torkelson, "A New Approach to Pipeline FFT Processor," in *Proceedings of International Conference on Parallel Processing*, April 1996, pp. 766–770.
- [40] J. Löfgren and P. Nilsson, "On Hardware Implementation of Radix 3 and Radix 5 FFT Kernels for LTE systems," in *NORCHIP*, November 2011, pp. 1–4.
- [41] J. Löfgren, L. Liu, O. Edfors, and P. Nilsson, "Improved Matching-Pursuit Implementation for LTE Channel Estimation," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 61, no. 1, pp. 226–237, January 2014.
- [42] O. Edfors, M. Sandell, J. J. van de Beek, S. K. Wilson, and P. O. Borjesson, "OFDM Channel Estimation by Singular Value Decomposition," *IEEE Transactions on Communications*, vol. 46, no. 7, pp. 931–939, July 1998.
- [43] J. J. van de Beek, O. Edfors, M. Sandell, S. K. Wilson, and P. O. Borjesson, "On Channel Estimation in OFDM Systems," in *IEEE 45th Vehicular Technology Conference*, vol. 2, July 1995, pp. 815–819.
- [44] S. F. Cotter and B. D. Rao, "Sparse Channel Estimation via Matching Pursuit With Application to Equalization," *IEEE Transactions on Communications*, vol. 50, no. 3, pp. 374–377, March 2002.
- [45] C. Studer, P. Blosch, P. Friedli, and A. Burg, "Matrix Decomposition Architecture for MIMO Systems: Design and Implementation Trade-offs," in *The 41st Asilomar Conference on Signals, Systems and Computers*, November 2007, pp. 1986–1990.
- [46] L. Trefethen and D. Bau, Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
- [47] H. Prabhu, O. Edfors, J. Rodrigues, L. Liu, and F. Rusek, "Hardware Efficient Approximative Matrix Inversion for Linear Pre-coding in Massive MIMO," in *IEEE International Symposium on Circuits and Systems*, June 2014, pp. 1700–1703.
- [48] S. J. Bellis, W. P. Marnane, and P. J. Fish, "Alternative systolic array for non-square-root Cholesky decomposition," *IEE Proceedings - Computers and Digital Techniques*, vol. 144, no. 2, pp. 57–64, March 1997.
- [49] L. G. Barbero and J. S. Thompson, "Fixing the Complexity of the Sphere Decoder for MIMO Detection," *IEEE Transactions on Wireless Communications*, vol. 7, no. 6, pp. 2131–2142, June 2008.

- [50] Z. Guo and P. Nilsson, "Algorithm and Implementation of the K-best Sphere Decoding for MIMO detection," *IEEE Journal on Selected Areas in Communications*, vol. 24, no. 3, pp. 491–503, March 2006.
- [51] J. Ketonen, M. Juntti, and J. R. Cavallaro, "Performance-Complexity Comparison of Receivers for a LTE MIMO-OFDM System," *IEEE Transactions on Signal Processing*, vol. 58, no. 6, pp. 3360–3372, June 2010.
- [52] M. Wu, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, "Large-Scale MIMO Detection for 3GPP LTE: Algorithms and FPGA Implementations," *IEEE Journal of Selected Topics in Signal Processing*, vol. 8, no. 5, pp. 916–929, Oct 2014.
- [53] N. Costa and S. Haykin, Multiple-Input Multiple-Output Channel Models: Theory and Practice. John Wiley & Sons, 2007.
- [54] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, "Massive MIMO for Next Generation Wireless Systems," *IEEE Communications Magazine*, vol. 52, no. 2, pp. 186–195, February 2014.
- [55] V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, V. Zyuban, P. N. Strenski, and P. G. Emma, "Optimizing pipelines for power and performance," in *Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture*, November 2002, pp. 333–344.
- [56] D. Markovic, V. Stojanovic, B. Nikolic, M. A. Horowitz, and R. W. Brodersen, "Methods for True Energy-Performance Optimization," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 8, pp. 1282–1293, August 2004.
- [57] D. Markovic, B. Nikolic, and R. W. Brodersen, "Power and Area Minimization for Multidimensional Signal Processing," *IEEE Journal of Solid-State Circuits*, vol. 42, no. 4, pp. 922–934, April 2007.
- [58] P. Hoeher, S. Kaiser, and P. Robertson, "Two-dimensional Pilot-symbol-aided Channel Estimation by Wiener Filtering," in *IEEE International Conference on Acoustics, Speech, and Signal Processing*, vol. 3, April 1997, pp. 1845–1848.
- [59] M. C. Necker and G. L. Stuber, "Totally Blind Channel Estimation for OFDM on Fast Varying Mobile Radio Channels," *IEEE Transactions on Wireless Communications*, vol. 3, no. 5, pp. 1514–1525, September 2004.

- [60] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-GS/s FFT Processor for OFDM-Based WPAN Applications," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 57, no. 6, pp. 451–455, June 2010.

- [61] S. Haykin, Adaptive Filter Theory. Prentice Hall, 1996.
- [62] K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley & Sons, 1999.
- [63] D. A. Parker and K. K. Parhi, "Low-Area/Power Parallel FIR Digital Filter Implementations," Journal of VLSI signal processing systems for signal, image and video technology, vol. 17, no. 1, pp. 75–92, 1997.
- [64] K. K. Parhi, C. Y. Wang, and A. P. Brown, "Synthesis of Control Circuits in Folded Pipelined DSP Architectures," *IEEE Journal of Solid-State Circuits*, vol. 27, no. 1, pp. 29–43, January 1992.
- [65] M. Ayinala, M. Brown, and K. K. Parhi, "Pipelined Parallel FFT Architectures via Folding Transformation," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 20, no. 6, pp. 1068–1081, June 2012.
- [66] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital integrated circuits: a design perspective. Pearson Education, 2003.
- [67] S. A. Butt, S. Schmermbeck, J. Rosenthal, A. Pratsch, and E. Schmidt, "System Level Clock Tree Synthesis for Power Optimization," in *Design, Automation Test in Europe Conference Exhibition*, April 2007, pp. 1–6.
- [68] F. Li, Y. Lin, and L. He, "Vdd Programmability to Reduce FPGA Interconnect Power," in IEEE/ACM International Conference on Computer Aided Design, November 2004, pp. 760–765.
- [69] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," in *Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture*, December 2003, pp. 7–18.
- [70] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and S. Kulkarni, "Pushing ASIC Performance in a Power Envelope," in *Proceedings of Design Automation Conference*, June 2003, pp. 788–793.
- [71] B. Keller, M. Cochet, B. Zimmer, Y. Lee, M. Blagojevic, J. Kwak, A. Puggelli, S. Bailey, P. F. Chiu, P. Dabbelt, C. Schmidt, E. Alon, K. Asanović, and B. Nikolić, "Sub-microsecond Adaptive Voltage Scaling in a 28nm FD-SOI Processor SoC," in 42nd European Solid-State Circuits Conference, September 2016, pp. 269–272.

- [72] ARM Cortex-A Series, ARM, 2015, ver. 1.
- [73] N. Planes, O. Weber, V. Barral, S. Haendler, D. Noblet, D. Croain, M. Bocat, P. O. Sassoulas, X. Federspiel, A. Cros, A. Bajolet, E. Richard, B. Dumont, P. Perreau, D. Petit, D. Golanski, C. Fenouillet-Béranger, N. Guillot, M. Rafik, V. Huard, S. Puget, X. Montagner, M. A. Jaud, O. Rozeau, O. Saxod, F. Wacquant, F. Monsieur, D. Barge, L. Pinzelli, M. Mellier, F. Boeuf, F. Arnaud, and M. Haond, "28nm FDSOI Technology Platform for High-Speed Low-Voltage Digital Applications," in Symposium on VLSI Technology, June 2012, pp. 133–134.
- [74] P. Flatresse, B. Giraud, J. P. Noel, B. Pelloux-Prayer, F. Giner, D. K. Arora, F. Arnaud, N. Planes, J. L. Coz, O. Thomas, S. Engels, G. Cesana, R. Wilson, and P. Urard, "Ultra-Wide Body-Bias Range LDPC Decoder in 28nm UTBB FDSOI Technology," in *IEEE International Solid-State Circuits Conference*, February 2013, pp. 424–425.
- [75] D. Jacquet, F. Hasbani, P. Flatresse, R. Wilson, F. Arnaud, G. Cesana, T. D. Gilio, C. Lecocq, T. Roy, A. Chhabra, C. Grover, O. Minez, J. Uginet, G. Durieu, C. Adobati, D. Casalotto, F. Nyer, P. Menut, A. Cathelin, I. Vongsavady, and P. Magarshack, "A 3 GHz Dual Core Processor ARM Cortex<sup>TM</sup>-A9 in 28 nm UTBB FD-SOI CMOS With Ultra-Wide Voltage Range and Energy Efficiency Optimization," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 4, pp. 812–826, April 2014.
- [76] TriQuint. Duplexers. [Online]. Available: http://www.triquint.com/products/all/filters/duplexers

# Papers

# Paper I


## A Digitally Assisted Non-Linearity Suppression Scheme for RF front ends

R. Gangarajaiah, M. Abdulaziz, L. Liu, and H. Sjöland, "A Digitally Assisted Non-Linearity Suppression Scheme for RF front ends," © 2014 IEEE, reprinted from the Proceedings of IEEE 25th Annual International Symposium on Personal, Indoor and Mobile Radio Communications, Washington DC, USA, September 2014, pp. 623–627.

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Lund university's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications\_standards/publications/rights\_link.html to learn how to obtain a License from RightsLink.

# Digitally Assisted Adaptive Non-Linearity Suppression Scheme for RF front ends

Rakesh Gangarajaiah, Mohammed Abdulaziz, Liang Liu and Henrik Sjöland
Department of Electrical and Information Technology, Lund University, Sweden
Email: {rakesh.gangarajaiah,mohammed.abdulaziz,liang.liu,henrik.sjoland}@eit.lth.se

Abstract—This paper presents a robust and low-complexity non-linearity suppression scheme for radio frequency (RF) transceiver building blocks to efficiently mitigate intermodulation distortion. The scheme consists of tunable RF components assisted by an auxiliary path equipped with an adaptive digital signal processing algorithm to provide the tuning control. This proposed concept of digitally-assisted tuning is capable of handling a large range of non-linear behaviours without any complexity increase in the expensive RF circuitry and is robust to process, voltage and temperature variations. A case study on the third order intermodulation of the channel select filter for a full 10 MHz Long Term Evolution (LTE) reception bandwidth is used to demonstrate the feasibility and effectiveness of the technique.

Index Terms—Adaptive signal processing, Interference cancellation, Intermodulation distortion, Nonlinear circuits.

#### I. INTRODUCTION

The dramatic increase of wireless connections within limited spectrum has forced many radio devices to operate close in frequency. This poses very stringent requirements for building blocks like the low noise amplifier (LNA), mixer and channel select filter (CSF) in a wireless transceiver, where high linearity is needed to avoid intermodulation distortion and the resulting interference between devices operating in close proximity. However, implementing such high-linearity building blocks is very expensive in terms of power consumption. Moreover, the constantly reduced feature size of complementary metal oxide semiconductor (CMOS) technology leads to reduced oxide thickness and lower supply voltages, making non-linearity an even more critical issue. There is thus an urgent demand for a robust and cost-efficient scheme to tackle the non-linearity in radio frequency (RF) circuits and reduce the interference in wireless reception.

Several techniques have been reported in the literature addressing the intermodulation interference problem. One way of achieving high linearity is to implement tunable RF components and optimize the tuning so that the non-linearities cancel at the source [1]–[6]. For example, the authors in [1] used multiple transistors for linearization of a main transistor by controlling parameters such as gate width and overdrive voltage. Unfortunately, the optimal tuning point of the analog circuit shifts significantly depending on process, voltage and temperature (PVT) variations, requiring frequent and inconvenient re-tuning to provide high linearity. Recently, the concept of using an auxiliary path to recreate and cancel non-linearities has been proposed to increase the robustness.

An adaptive interference cancellation technique operating in the analog domain was introduced in [7]. This method requires additional power hungry analog building blocks which are themselves subject to non-linearities and PVT variations. In [8], [9], intermodulation generation is performed in the analog domain and powerful digital signal processing is employed to provide robust intermodulation cancellation. A drawback is that the alternate paths need to operate continuously leading to increased power consumption.

To address the aforementioned problems in existing nonlinearity suppression techniques, we propose to exploit the advantages of both tunable RF components and adaptive digital signal processing. More specifically, we make RF components tunable to provide non-linearity suppression by eliminating this imperfection at its source. Furthermore, a digital auxiliary path is introduced, implementing adaptive signal processing algorithms for detecting errors in the tuning point of the RF components and performing the corresponding adjustments. This digitally-assisted adjustment scheme provides robustness to PVT variations. Furthermore, unlike the approach in [8], [9], this method does not need to operate continuously and can thus be powered down when the optimal operating point has been reached, leading to significant power savings. To verify the effectiveness of the proposed method, we designed a CSF for a Long Term Evolution (LTE) receiver with 10 MHz baseband bandwidth, where a bias voltage could be tuned to mitigate the third order non-linearities. Fixed-point simulation results demonstrate that the proposed method can use a low resolution analog to digital converter (ADC) and a very low complexity adaptive algorithm in the auxiliary path. Furthermore, the auxiliary path design is fast enough to tune the CSF once every Orthogonal Frequency Division Multiplexing (OFDM) symbol, achieving tuning to optimal linearity within a time period of a few OFDM symbols.

#### II. BACKGROUND

A typical LTE Frequency Division Duplexing (FDD) receiver is shown in Fig. 1, where the duplexer is used to separate the transmit (Tx) and receive (Rx) bands. A non-ideal duplexer results in Tx power leakage into the Rx, which is typically the strongest source of interference in FDD receivers. The figure also shows an external interferer which can be from another device such as a nearby cellular phone.

Fig. 1: Typical LTE-A receiver

The LNA and mixer are used to amplify and down convert the received signal into baseband, and a CSF is then employed to suppress the Tx leakage and the external interference. An ideal analog front end is linear, and should, except for the frequency translation performed by the mixer, not output any signals at frequencies not present in its input. Non-linear systems, however, produce harmonics of the input signal as well. For example, if the input signal is a sinusoid with frequency $f_1$, then the output will have components not only at $f_1$, but also at $2f_1, 3f_1, 4f_1$, etc. If the input signal has rich frequency content, non-linearities will also cause intermodulation (IM). For example, if an input signal consists of two sinusoids with different frequencies $f_1$ and $f_2$, then it can be shown that second and third order IM terms will be produced at [10]:

$$f_1 - f_2,\quad f_1 + f_2,\quad 2f_1 - f_2,\quad 2f_1 + f_2,\quad 2f_2 - f_1,\quad 2f_2 + f_1.$$

Some of the intermodulation terms may appear at frequencies that fall inband and contaminate the signal. The CSF is typically a bottleneck for receiver linearity, and elimination of its IM distortion would thus improve the receiver performance significantly.
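The third order terms listed above can be checked with a short expansion. As a sketch, assume a memoryless cubic non-linearity acting on two unit-amplitude tones, with the shorthand $a = 2\pi f_1 t$ and $b = 2\pi f_2 t$ (notation introduced here for illustration only):

$$(\cos a + \cos b)^3 = \tfrac{9}{4}(\cos a + \cos b) + \tfrac{1}{4}(\cos 3a + \cos 3b) + \tfrac{3}{4}\left[\cos(2a+b) + \cos(2a-b) + \cos(2b+a) + \cos(2b-a)\right].$$

The bracketed terms are the third order IM products at $2f_1 \pm f_2$ and $2f_2 \pm f_1$. When $f_1$ and $f_2$ are close together, the difference terms $2f_1 - f_2$ and $2f_2 - f_1$ land next to the original tones, which is why they are difficult to remove by filtering.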

Several authors have proposed different solutions for reducing inband IM to enhance Rx signal fidelity. Since the Tx leakage is often the main source of interference, previous works such as [11] have used digital cancellation techniques, where the path delay of the IM due to Tx leakage is estimated and compensated for, in the digital baseband. This method will not be able to cancel IM created due to external interferers and hence is not suitable for scenarios with strong external signals.

The author in [9] utilizes auxiliary paths to generate  $3^{\rm rd}$  and  $5^{\rm th}$  order IM by using an analog cubic term generator to avoid using a high resolution wideband ADC, which would consume more power than the original circuits. Digital adaptive algorithms are then used to synchronize the auxiliary path with the main path and cancel the IM terms in the main path by using digital subtraction. This approach requires the auxiliary paths to operate continuously for IM cancellation and is aimed at non-tunable analog components.

Fig. 2: LMS system with algorithm

The use of adaptive filters [12], in particular the least mean squares (LMS) algorithm, is very common in interference cancellation techniques utilizing an auxiliary path. The LMS algorithm tunes the adaptive filter coefficients by observing the error between the estimated and the actual signal with the goal of minimizing the mean square error. The decision on tuning the filter coefficients is solely based on the error at the current time, and the rate of the filter update is configurable, providing a trade-off between convergence time and the variance of the error. Fig. 2a shows a system employing an LMS filter to predict the output signal y(n) and Fig. 2b details the operation of the normalized LMS algorithm. The error signal e(n) between the predicted output z(n) and the actual output y(n) is used as a parameter to tune the update vector $\mathbf{c}(n)$. One of the main advantages of the normalized LMS algorithm is its capability to handle a wider amplitude range of the input x(n), enabling a stable implementation even without complete knowledge of the input signal statistics.
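The normalized LMS update described above can be sketched in a few lines of Python. This is an illustrative floating-point sketch and not the implementation used in the paper: the step size `mu`, the regularization constant `eps`, and the 3-tap test system `h` are assumptions chosen here, while the 9-tap default follows the filter length reported later in Table I.

```python
import random


def nlms_filter(x, y, num_taps=9, mu=0.5, eps=1e-8):
    """Predict y(n) from x(n) with a normalized LMS filter.

    Returns the prediction z(n), the error e(n) = y(n) - z(n),
    and the final coefficient vector c. mu and eps are assumed values.
    """
    c = [0.0] * num_taps      # adaptive coefficients c(n)
    buf = [0.0] * num_taps    # most recent inputs, newest first
    z_out, e_out = [], []
    for xn, yn in zip(x, y):
        buf = [xn] + buf[:-1]
        z = sum(ci * bi for ci, bi in zip(c, buf))    # prediction z(n)
        e = yn - z                                    # error e(n)
        power = sum(b * b for b in buf) + eps         # input power (normalization)
        step = mu * e / power
        c = [ci + step * bi for ci, bi in zip(c, buf)]
        z_out.append(z)
        e_out.append(e)
    return z_out, e_out, c


# Usage: identify a hypothetical 3-tap linear system from a random input.
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
h = [0.5, -0.3, 0.1]
y = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     for n in range(len(x))]
z, e, c = nlms_filter(x, y)
```

Dividing the update by the instantaneous input power is what makes the step size insensitive to the amplitude of x(n), which is the property highlighted in the text.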

#### III. DIGITALLY ASSISTED NON-LINEARITY CANCELLER

Some analog circuits can be linearized by tuning the operating point so that the non-linearities cancel. The proposed solution is aimed at such circuits and uses a digital adaptive algorithm to provide the required IM detection enabling on chip adaptive tuning of the bias to reach linearity. To the best of our knowledge, a complete solution for digitally assisted linearization has not yet been published.

The analog baseband (one of the I and Q channels) of a direct-conversion LTE receiver with a CSF is depicted in Fig. 3a. The received signal x(t), obtained by down converting the RF signal is passed through the CSF to suppress out of channel interference before the analog to digital conversion. Fig. 3a also shows the frequency spectrum with the desired signal and a strong out of band interferer from the Tx operating in FDD. IM due to non-linearities falls in channel, corrupting the Rx signal. The CSF attenuates the Tx interferer, and further filtering in the main ADC and the decimator removes any remaining out of band Tx interference. The in channel IM interference, however, is unaffected and is passed onto the digital baseband.

The operation of an ideal linear CSF used for LTE channel selection can be modelled as

$$y(t) = \mathbf{h} * x(t), \tag{1}$$



(b) Proposed non-linearity suppression receiver with tunable CSF non-linearities  $(\alpha, \beta, \gamma)$ 

Fig. 3: Receiver implementation with CSF

where y(t) is the output, produced by the convolution of  $\mathbf{h}$ , the filter impulse response, and the input x(t). Third order non-linearities in the CSF produce distortion in the output signal y(t), which can be modelled in Matlab using

$$y(t) = \mathbf{h} * x(t) + (\alpha \mathbf{h}_1 + \beta \mathbf{h}_2 + \gamma \mathbf{h}_3) * (x(t))^3, \tag{2}$$

where y(t) is the filtered output,  ${\bf h}$  the linear component of the impulse response, and  $(\alpha {\bf h}_1 + \beta {\bf h}_2 + \gamma {\bf h}_3)$  represents the third order non-linearities. A tunable LTE CSF aimed at the full 10 MHz baseband channel, with parameters  $\alpha$ ,  $\beta$  and  $\gamma$  controlling the amount of distortion introduced is depicted in Fig. 3b. In practice the parameters  $\alpha$ ,  $\beta$  and  $\gamma$  correspond to tunable bias voltages.
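A discrete-time version of (2) can be sketched as follows. The paper models the filters in Matlab with Butterworth responses; here the impulse responses are passed in as plain FIR coefficient lists, which is a simplification assumed for illustration, and the function names `convolve` and `csf_output` are hypothetical.

```python
def convolve(h, x):
    """Causal FIR convolution, truncated to len(x) output samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def csf_output(x, h, h1, h2, h3, alpha, beta, gamma):
    """Eq. (2) in discrete time: linear path plus weighted cubic paths."""
    length = max(len(h1), len(h2), len(h3))

    def pad(g):
        # Zero-pad an impulse response so the three can be summed tap by tap.
        return list(g) + [0.0] * (length - len(g))

    h_nl = [alpha * a + beta * b + gamma * c
            for a, b, c in zip(pad(h1), pad(h2), pad(h3))]
    cubed = [v ** 3 for v in x]                 # (x(n))^3
    linear = convolve(h, x)                     # h * x
    distortion = convolve(h_nl, cubed)          # (a*h1 + b*h2 + g*h3) * x^3
    return [lin + d for lin, d in zip(linear, distortion)]


# Usage with placeholder coefficients (not values from the paper).
x = [1.0, 2.0, 0.0, -1.0]
y_linear = csf_output(x, [1.0, 0.5], [1.0], [1.0], [1.0], 0.0, 0.0, 0.0)
y_distorted = csf_output(x, [1.0, 0.5], [1.0], [1.0], [1.0], 0.1, 0.0, 0.0)
```

Setting all three weights to zero recovers the ideal linear filter of (1), which corresponds to the operating point the tuning loop tries to reach.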

The proposed control structure uses a wideband auxiliary ADC to capture the interference signals surrounding the Rx signal. Digital multiplication is then used to generate IM distortion and in this paper we focus on the 3rd order. However, this technique can be used on other higher order non-linearities as well. Decimators are then used to reduce the sample rate to match the baseband signal y(n). A finite impulse response (FIR) implementation of the normalized LMS algorithm is utilized to produce z(n), an estimate of the output signal y(n), and the error signal e(n) is used to update the LMS filter coefficients. The LMS adaptive filter is linear and is capable of estimating the linear signal components with a much higher degree of accuracy compared to the non-linear ones. Hence the error e(n) can be used as a measure of the non-linearities in the output y(n). The output k(n) of the digital cubing unit contains the non-linear components, which are correlated with e(n) to measure the amount of inband IM present in y(n). The output of the correlator is then used to tune the CSF towards linearity. This process is repeated iteratively until an optimal point is found, after which the auxiliary path is powered down.

Fig. 4: FFT of CSF input, output and LMS error signals
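The detection step described in this section can be sketched as a cubing unit followed by a correlator. The paper does not give the exact correlation formula, so the normalized zero-lag correlation below is an assumption, as are the synthetic test signals; decimation is omitted for brevity.

```python
import math


def cube(samples):
    """Digital cubing unit: recreate third order distortion k(n)."""
    return [v ** 3 for v in samples]


def correlation(e, k):
    """Magnitude of the normalized zero-lag correlation of e(n) and k(n).

    This particular normalization is an assumed choice for illustration.
    """
    num = sum(a * b for a, b in zip(e, k))
    den = math.sqrt(sum(a * a for a in e) * sum(b * b for b in k))
    return abs(num) / den if den > 0 else 0.0


# Usage: an error signal carrying more cubic distortion correlates more with k(n).
x = [math.sin(0.1 * n) for n in range(2000)]
k = cube(x)
residual = [0.1 * math.cos(0.37 * n) for n in range(2000)]  # unrelated error part
low = correlation([0.01 * kv + r for kv, r in zip(k, residual)], k)
high = correlation([0.5 * kv + r for kv, r in zip(k, residual)], k)
```

In this sketch `high` exceeds `low`, mirroring the idea that the correlator output shrinks as the CSF bias approaches its most linear setting.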

#### IV. SIMULATION SETUP

We aim at compensating 10 MHz CSFs for LTE receivers. A sampling rate of 30.72 MHz is used as the baseband operating frequency and a high resolution linear ADC is assumed to operate with an oversampling ratio of 8 in the main path in Fig. 3, corresponding to an output rate of 245.76 Msamples/s. The system is modelled using Matlab and the CSF is implemented using a $4^{th}$ order Butterworth filter to produce the linear component of (1). The non-linearities are generated by filtering the cubed input signal $(x(n))^3$ through three Butterworth filters of different orders to mimic the effect of non-linearities in a multistage analog CSF. Digital decimation filters provided by Matlab were used to perform the different decimation operations. The root mean square (RMS) power of the distortion was controlled by the parameters $\alpha$, $\beta$ and $\gamma$ in (2) to simulate different distortion levels. OFDM signalling was used for generating the input signal and 2048 subcarriers with 15 kHz spacing were assumed for data transmission [13].
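The sample rates quoted above follow directly from the subcarrier count, the subcarrier spacing and the oversampling ratio. A quick check, with variable names chosen here for illustration:

```python
SUBCARRIERS = 2048        # FFT size assumed for data transmission
SPACING_HZ = 15_000       # subcarrier spacing
OSR = 8                   # oversampling ratio of the main-path ADC

baseband_rate = SUBCARRIERS * SPACING_HZ          # baseband samples per second
adc_rate = baseband_rate * OSR                    # main-path ADC samples per second
useful_symbol_s = SUBCARRIERS / baseband_rate     # OFDM symbol without cyclic prefix

print(baseband_rate / 1e6, "MHz baseband rate")
print(adc_rate / 1e6, "Msamples/s ADC rate")
print(useful_symbol_s * 1e6, "us useful symbol duration")
```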

Fig. 4a shows a two tone test performed on the model in Fig. 3b with two subcarriers at 2.91 MHz and 7.5 MHz respectively, and two out of band interference signals at $20.01\,\mathrm{MHz}$ and $38.01\,\mathrm{MHz}$. The Fast Fourier Transform (FFT) of the output y(n), which is affected by inband third order intermodulation (IM3) distortion, is also shown. The error floor is significantly increased due to the delay introduced in the Matlab model of the CSF and also due to the fact that windowing was not used when performing the FFT, resulting in spectral leakage.

Fig. 4b shows the FFT of the LMS error e(n) and the signal k(n) produced from the cubing unit in Fig. 3b. It has to be noted that the LMS operates with the baseband operating frequency of 30.72 MHz corresponding to a full 10 MHz baseband configuration and hence Fig. 4b depicts bins up to 15.36 MHz. The LMS filter is capable of producing the inband Rx component of y(n) more accurately than the inband IM signals, which can be seen by observing the level of e(n) at the inband subcarrier frequencies of 2.91 MHz and 7.5 MHz. The power of the IM due to out of channel interferers present at $2 \times 20.01 \,\mathrm{MHz} - 38.01 \,\mathrm{MHz} = 2.01 \,\mathrm{MHz}$ is significantly higher than the error at the inband subcarrier frequencies. Similarly it can be noticed that the errors at 1.68 MHz due to IM3 of the in channel tones, and at 5 MHz due to IM3 of in channel and out of channel interferers are significantly higher than the error at in channel subcarrier frequencies. The LMS filter is linear and thus performs a much better estimation of linear components of the signal. Hence the error signal e(n) contains a higher error component at the IM frequencies, and a correlation of signals e(n) and k(n) can be used as a measure of inband IM.
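The IM3 frequencies named above can be reproduced by enumerating the products of the form $|2f_a - f_b|$ over the four test tones. The sketch below covers only these two-tone products (three-tone terms such as $f_a + f_b - f_c$ are left out), and the helper name `im3_products` is hypothetical.

```python
from itertools import permutations


def im3_products(tones_mhz, band_mhz):
    """Map each ordered tone pair (fa, fb) to |2*fa - fb| if it falls in band."""
    hits = {}
    for fa, fb in permutations(tones_mhz, 2):
        f = round(abs(2 * fa - fb), 6)
        if f <= band_mhz:
            hits[(fa, fb)] = f
    return hits


# Two inband subcarriers and two out of band interferers from the two tone test.
tones = [2.91, 7.5, 20.01, 38.01]
inband = im3_products(tones, band_mhz=15.36)
for pair, freq in sorted(inband.items(), key=lambda item: item[1]):
    print(pair, "->", freq, "MHz")
```

The pair (2.91, 7.5) gives the 1.68 MHz product of the in channel tones, (20.01, 38.01) gives the 2.01 MHz product of the interferers, and (7.5, 20.01) gives a mixed product close to 5 MHz.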

One of the main advantages of the proposed digital control structure is that it needs to be active only over a short period of time to calibrate the CSF, after which it can be turned off resulting in higher power savings compared to the methods proposed in [8], [9].

#### V. RESULTS

To verify the functionality and feasibility of the proposed technique, fixed point simulations were performed to obtain an estimate of the hardware cost of the proposed technique. A very difficult scenario was considered with a full bandwidth of 10 MHz for the CSF with strong continuous wave blockers present at 20 MHz and 38 MHz. Random data was used to generate a series of OFDM symbols which are processed by the CSF and the auxiliary path. Fig. 5 shows the frequency spectrum of the input and output signals of the CSF with two strong interferers and the inband distortion generated due to IM3.

Fig. 5: Input and Output signal spectrum of CSF

The digital control structure has several parameters such as correlation symbol length, LMS filter length and tap word lengths, downsampling ratio, and precision of the auxiliary ADC. The effects of these parameters were studied by performing simulations with input signals with a frequency spectrum shown in Fig. 5. The RMS power of the distortion was controlled by tuning the parameters $\alpha$, $\beta$ and $\gamma$ of Fig. 3b. A correlation symbol length of 2000 samples was chosen, enabling the digital loop to perform bias adjustments once every OFDM symbol, with a margin of 48 samples left for the LMS convergence. This enables the control structure to tune the CSF to the optimal point within a few OFDM symbols. Shorter LMS filter lengths are favourable as they provide faster convergence and require fewer hardware multipliers. Fig. 6a depicts the effect of changing the LMS filter length on the correlation obtained for different levels of distortion. The distortion is measured by a ratio of RMS powers of the inband distortion to the desired Rx signal, with negative distortion levels indicating IM which has a 180° phase shift with respect to the Rx signal. It is seen that a minimum correlation is obtained when the RMS power of distortion is minimum and hence the correlation can be used as a measure of in channel IM. Furthermore, filter lengths as low as 9 can be chosen to detect very low levels of IM. The LMS coefficients were also analyzed and a width of 10 bits was chosen.

A wideband auxiliary ADC is required to capture Tx interference signals and blockers which are far away from the Rx signal. The hardware cost of implementing a high precision wideband ADC is significant, and the power consumption of such an ADC would make the digital LMS loop an inefficient approach compared to the ones proposed in [9]. Simulations were thus performed to obtain the minimum required resolution for the auxiliary ADC, and Fig. 6b shows the correlation coefficient obtained for different resolutions. A precision of 6 bits was chosen as it enables correct detection of the distortion in the inband Rx signal even at low distortion levels. Several implementations of low power ADCs have been presented in [14]-[17], and such an implementation can be utilized in the auxiliary path. It has to be noted that some of these implementations target a very large bandwidth, and the power consumption can be further reduced by implementing lower bandwidth ADCs. Furthermore, the proposed auxiliary path will need to run only when the re-tuning of the analog part is required.
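The effect of a low resolution auxiliary ADC can be explored with a simple quantizer model. The sketch below assumes an ideal uniform mid-rise quantizer with clipping and a normalized full scale of 1.0; these modelling choices are made here for illustration and are not taken from the paper.

```python
import math


def quantize(samples, bits, full_scale=1.0):
    """Ideal uniform mid-rise quantizer clipping to [-full_scale, full_scale)."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    out = []
    for v in samples:
        idx = math.floor(v / step)
        # Clip to the available codes of a signed converter.
        idx = max(-(levels // 2), min(levels // 2 - 1, idx))
        out.append((idx + 0.5) * step)
    return out


# Usage: 6-bit conversion of a few samples, including two that clip.
adc_out = quantize([0.0, 0.5, 2.0, -2.0], bits=6)
```

Feeding the quantized samples into the cubing unit instead of the ideal ones is one way to repeat the resolution sweep of Fig. 6b in a software model.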

Table I provides details of precision required for the signals in the auxiliary path to be able to effectively detect and tune an LTE CSF. These parameters are obtained by performing a similar analysis as the one used to obtain the LMS filter length and auxiliary ADC width. A higher precision multiplier is needed for the correlator unit to enable detection of very low level distortion.



(a) LMS filter order and correlation



Fig. 6: LMS filter length and ADC width analysis

TABLE I: Configuration of the auxiliary path components

| Component              | Configuration |
|------------------------|---------------|
| Auxiliary ADC          | 6 bits        |
| LMS filter length      | 9 taps        |
| LMS coefficient width  | 10 bits       |
| LMS output width       | 10 bits       |
| Digital cubing unit    | 6 bits        |
| Correlator multipliers | 16 bits       |

#### VI. CONCLUSION

A digitally assisted non-linearity suppression scheme capable of tuning a non-linear analog building block for optimal performance is proposed. As a proof of concept, simulations have been performed with an LTE signal and a non-linear CSF with 10 MHz baseband bandwidth. The proposed technique is capable of effectively tuning the filter for minimum distortion. Scenarios with strong channel interference and full bandwidth Rx signal have been investigated to obtain an estimate of the hardware overhead of the auxiliary digital control structure. The results show that the proposed scheme can be implemented with simple components resulting in a low hardware cost. Furthermore, the auxiliary control structure can be powered down once tuning is achieved resulting in power savings.

#### ACKNOWLEDGMENT

This work is a part of the DARE project and the authors would like to thank the Swedish Foundation for Strategic Research for funding.

#### REFERENCES

- [1] B. Kim, J.-S. Ko, and K. Lee, "A new linearization technique for MOSFET RF amplifier using multiple gated transistors," *IEEE Microw. and Guided Wave Lett.*, vol. 10, pp. 371–373, Sep 2000.
- [2] K.-H. Liang, C.-H. Lin, H.-Y. Chang, and Y.-J. Chan, "A New Linearization Technique for CMOS RF Mixer Using Third-Order Transconductance Cancellation," *IEEE Microw. and Wireless Compon. Lett.*, vol. 18, no. 5, pp. 350–352, May 2008.
- [3] W.-H. Chen, G. Liu, B. Zdravko, and A. Niknejad, "A Highly Linear Broadband CMOS LNA Employing Noise and Distortion Cancellation," *IEEE J. Solid-State Circuits*, vol. 43, pp. 1164–1176, May 2008.
- [4] A. Nejdel, M. Törmänen, and H. Sjöland, "A 0.7 to 3 GHz wireless receiver front end in 65-nm CMOS with an LNA linearized by positive feedback," *Analog Integr. Circuits and Signal Process.*, vol. 74, pp. 47–57, 2013.
- [5] W. Huang and E. Sanchez-Sinencio, "Robust highly linear high-frequency CMOS OTA with IM3 below -70 dB at 26 MHz," *IEEE Trans. Circuits Syst. I*, vol. 53, pp. 1433–1447, July 2006.
- [6] A. Lewinski and J. Silva-Martinez, "OTA linearity enhancement technique for high frequency applications with IM3 below -65 dB," *IEEE Trans. Circuits Syst. II*, vol. 51, pp. 542–548, Oct 2004.
- [7] V. Aparin, G. Ballantyne, C. Persico, and A. Cicalini, "An integrated LMS adaptive filter of TX leakage for CDMA receiver front ends," *IEEE J. Solid-State Circuits*, vol. 41, pp. 1171–1182, May 2006.
- [8] E. Keehr and A. Hajimiri, "A rail-to-rail input receiver employing successive regeneration and adaptive cancellation of intermodulation products," in *IEEE Radio Freq. Integr. Circuits Symp.*, May 2010, pp. 47–50.
- [9] E. Keehr and A. Hajimiri, "Equalization of Third-Order Intermodulation Products in Wideband Direct Conversion Receivers," *IEEE J. Solid-State Circuits*, vol. 43, pp. 2853–2867, Dec 2008.
- [10] T. H. Lee, The Design of CMOS Radio-Frequency Integrated Circuits, 2nd ed. Cambridge University Press, 2004.
- [11] D. Filipovic and C. Komninakis, "Intermodulation distortion detection and mitigation," Jan. 2011, US Patent 7,876,867.
- [12] S. Haykin, Adaptive filter theory, 4th ed. Prentice Hall, 2002.
- [13] E. Dahlman, S. Parkvall, and J. Skold, 4G: LTE/LTE-Advanced for Mobile Broadband, 1st ed. Academic Press, 2011.
- [14] C. Sandner, M. Clara, A. Santner, T. Hartig, and F. Kuttner, "A 6-bit 1.2-GS/s low-power flash-ADC in 0.13-$\mu$m digital CMOS," *IEEE J. Solid-State Circuits*, vol. 40, pp. 1499–1505, July 2005.
- [15] C.-Y. Chen, M. Le, and K. Y. Kim, "A Low Power 6-bit Flash ADC With Reference Voltage and Common-Mode Calibration," *IEEE J. Solid-State Circuits*, vol. 44, pp. 1041–1046, April 2009.
- [16] M. Chahardori, M. Sharifkhani, and S. Sadughi, "A 4-Bit, 1.6 GS/s Low Power Flash ADC, Based on Offset Calibration and Segmentation," *IEEE Trans. Circuits Syst. I*, vol. 60, pp. 2285–2297, Sept 2013.
- [17] B. Verbruggen, J. Craninckx, M. Kuijk, P. Wambacq, and G. Van der Plas, "A 2.2 mW 1.75 GS/s 5 Bit Folding Flash ADC in 90 nm Digital CMOS," *IEEE J. Solid-State Circuits*, vol. 44, pp. 874–882, March 2009.

# Paper II


### A Digitally Assisted Non-Linearity Mitigation System for Tunable Channel Select Filters

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Lund university's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications\_standards/publications/rights\_link.html to learn how to obtain a License from RightsLink.

R. Gangarajaiah, M. Abdulaziz, H. Sjöland, P. Nilsson, and L. Liu, "A Digitally Assisted Non-Linearity Mitigation System for Tunable Channel Select Filters," © 2015 IEEE, reprinted from the *IEEE Transactions on Circuits and Systems-II: Express Briefs*, January 2016.

# A Digitally Assisted Nonlinearity Mitigation System for Tunable Channel Select Filters

Rakesh Gangarajaiah, Student Member, IEEE, Mohammed Abdulaziz, Student Member, IEEE, Henrik Sjöland, Senior Member, IEEE, Peter Nilsson, Senior Member, IEEE, and Liang Liu, Member, IEEE

Abstract—This brief presents a low-complexity system for digitally assisting a channel select filter (CSF) to mitigate both even- and odd-order nonlinearities. The proposed solution is scalable and can be utilized for nonlinearity mitigation in different analog transceiver blocks. The system consists of an auxiliary path with a low-resolution analog to digital converter (ADC) enabling digital recreation and measurement of the distortion in the main path and relies on an adaptive digital signal processing algorithm to detect and tune the analog components to their optimal settings. The system provides robustness against process, voltage, and temperature variations, and the digital part requires an equivalent logic of only 42 k gates in CMOS technology, enabling cost-efficient implementation on integrated circuits. The operation of the system has been verified by using a tunable CSF capable of receiving a 10-MHz baseband signal interfaced to an external ADC. The results demonstrate that the proposed system is capable of tuning the CSF to its optimal bias voltage, providing a third-order intermodulation reduction of 14.5 dB.

Index Terms—Adaptive signal processing, interference cancellation, intermodulation (IM) distortion, nonlinear circuits.

#### I. INTRODUCTION

TECHNOLOGY scaling in modern complementary metal-oxide-semiconductor (CMOS) processes has resulted in high-performance and low-power digital signal processing both due to the smaller feature size of transistors and the capability to operate at lower supply voltages. While scaling has immensely benefited digital circuits, it has created both advantages and problems in their analog counterparts, problems mainly in terms of linearity. Furthermore, an increase in wireless connections within a limited spectrum has forced many radio devices to operate close in frequency, resulting in powerful interference in close proximity to the signal of interest. This has resulted in increased linearity requirements on the analog building blocks of a wireless transceiver, such as the low-noise amplifier (LNA), mixer, and channel select filter (CSF), and these requirements are expected to increase with newer 5G technologies. The straightforward method of implementing high-linearity analog circuits results in increased area and power, both of which have to be kept low for a cost-effective wireless device. Consequently, an increasing number of receiver operations are performed in the digital baseband where linearity is guaranteed. Nevertheless, some of the analog building blocks are essential, and designers are looking at techniques to reduce distortion due to interference.

Manuscript received July 14, 2015; revised September 11, 2015; accepted November 21, 2015. Date of publication November 26, 2015; date of current version December 22, 2015. This brief was recommended by Associate Editor B. Chi.

The authors are with the Department of EIT, Lund University, 22100 Lund, Sweden (e-mail: rakesh.gangarajaiah@eit.lth.se).

Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSII.2015.2504272

Several techniques have been reported in the literature addressing the intermodulation (IM) mitigation problem, and one strategy is to implement radio frequency (RF) components which can be calibrated to operate at their optimum bias points [1]-[4]. Unfortunately, this bias point shifts significantly depending on process, voltage, and temperature (PVT) variations, requiring frequent recalibration to continue operating in the high-linearity mode. Recently, the concept of using an auxiliary path to recreate and cancel nonlinearities has been proposed to increase the robustness of analog circuits. An adaptive interference cancellation technique operating in the analog domain was introduced in [5]. One of the aspects of this method is the use of analog components to perform calibration, which themselves might suffer from PVT variation. In [6] and [7], IM generation is performed in the analog domain, and powerful digital signal processing is employed for IM cancellation. In order to minimize the hardware in the analog circuitry and to better generate all of the required IM terms for cancellation in an LNA, a mixed domain approach is suggested in [6]. The authors in [8] develop a complete analytical model for different nonlinearities in RF receivers and verify the effectiveness of adaptive algorithms for feedforward cancellation on a software-defined radio. A method suitable for an analog to digital converter (ADC) using digital downsampling is presented in [9]. In all of these methods, the auxiliary paths need to operate continuously, leading to increased power consumption. A method which can satisfy the conflicting demands of high linearity and low power consumption is needed. In [10], the authors use the advantages of tunable analog circuits which remove IM at the source and harness the power of adaptive digital signal processing capable of tuning the analog circuits toward optimal operation. Unlike the approach in [6]-[9], this method does not need to operate continuously and can be powered down when the optimal bias point has been reached, leading to significant power savings. Furthermore, this method can be applied to RF blocks which can be calibrated to mitigate IM distortion, provided that a low-resolution linear alternate path can be realized.

Based on this concept, this brief presents the complete hardware design and implementation of a system for adaptive nonlinearity tuning of an analog component. A fully functional system is built around a tunable CSF with 10-MHz baseband signal bandwidth [11]. The performance of the proposed system, configured to mitigate the third-order IM, is evaluated under different input conditions, and measurement results are presented. The system is implemented in a 65-nm CMOS process, and postlayout simulation results are shown to evaluate the overhead in terms of area and power.



Fig. 1. Conventional receiver.

In detail, the digital part of the system performing this calibration consists of an alternate-path ADC and a digital least mean squares (LMS) filter. The system has been implemented on a Xilinx Kintex-7 field-programmable gate array (FPGA) and interfaced to the CSF using an external ADC. Several experiments have been performed by using different resolutions on the auxiliary ADC, and the results show that the proposed method is capable of tuning the CSF toward optimal operation even when operating with a 4-bit ADC. The system is capable of detecting and tuning the CSF for mitigating both even- and odd-order IMs using the same hardware blocks with minimal reconfiguration. Tuning can be achieved within 20 ms, and the digital part can be turned off once the desired level of linearity has been reached, leading to significant power savings.

The remainder of this brief is organized as follows. Section II gives a background of the system, and Section III presents the implementation aspects of the digital tuning method. Different test scenarios and performance results are discussed in Section IV, and the conclusion is presented in Section V.

#### II. BACKGROUND

A traditional receiver chain in a wireless device is shown in Fig. 1, with the analog module corresponding to any or all of the building blocks in the RF front-end. The analog modules are generally prone to nonlinearities and are nontunable, which, in the presence of a strong interference and PVT variations, results in IM distortion, which can corrupt the wanted signal. The problems due to nonlinearities are especially aggravated in analog front-ends operating in the frequency division duplexing mode when the receive (Rx) and transmit (Tx) bands are closely located in frequency [12]. These problems are expected to be more severe in devices aiming to achieve 5G data rates when using wideband signal reception, and hence, a low IM distortion is a necessity.

Even-order nonlinearities in the analog domain are usually well suppressed through cancellation by the use of differential signals, whereas cancellation of odd-order nonlinearities is a challenging task. A method of using parallel CMOS devices operating in a subthreshold region for canceling a third-order IM (IM3) is presented in [1]. Another technique of cancellation in the transconductance stage is presented in [11], where the control voltage of the different stages in a CSF can be tuned for optimum cancellation of IM3. The schematic of this CSF along with the control signal for IM3 tuning is shown in Fig. 2. However, supply and temperature variation shift the optimal control voltage, and so, regular retuning of the filter is needed. In this brief, we present the details of the hardware implementation of a digital loop capable of performing this tuning and demonstrate its effectiveness in different test scenarios. The proposed system, together with a digital to analog converter (DAC), is capable of performing run-time tuning of the filter.



Fig. 2. Tunable four-stage CSF and schematic with bias control of a single stage.



Fig. 3. Proposed nonlinearity suppression receiver with a tunable CSF.

# III. DIGITALLY ASSISTED NONLINEARITY TUNING SYSTEM

#### A. System Overview

Fig. 3 depicts a run-time tuning mechanism capable of calibrating an analog module, in this example a CSF. The main path of the receiver is assumed to consist of different RF components such as an LNA and a mixer, which produce the amplified and downconverted signal x(t) that is fed into the tunable CSF. An ADC operating with an oversampling ratio (OSR) of 8 is assumed to digitize the CSF output, which is then downsampled by a decimator to produce y(n) at the required digital baseband rate. The basic idea of the tuning mechanism is as follows. An auxiliary ADC is used to capture the in-band (IB) and out-of-band (OOB) interferences, enabling the digital re-creation of any Xth-order distortion which we aim to cancel. A linear adaptive algorithm, which minimizes the first-order error between the main-path signal y(n) and the signal x'(n), is used to extract the error signal e(n), which mainly contains the nonlinear components of the main-path signal. The level of correlation between the digitally recreated IM distortion k(n) and error e(n) is used as a measure of the nonlinearity, with a higher correlation value indicating larger nonlinearities. The system controller, along with the DAC, can then tune the analog component toward optimal operation. Note that the error signal e(n) contains both even- and odd-order nonlinear components in the main-path signal, and hence, by reconfiguring the IM generator to produce either the even- or odd-order IM, the corresponding nonlinearity can be tuned by minimizing the correlation value. The procedure for tuning a component is as follows. When calibration is started, the digital loop performs a scan over a predefined range of bias voltages, where the performance of the RF component is evaluated and the correlation values are stored. Once the scan is completed, the system controller chooses the bias voltage which minimizes IM distortion.
A new calibration scan is initiated every few minutes or when significant changes in operating temperature or supply voltage are detected. For example, if IM3 and IM5 have to be tuned, the digital loop is first configured to tune for IM3, followed by tuning for IM5. The system controller will store the correlation values for these runs and will make a decision by comparing the values to determine the strongest IM. The bias voltage which minimizes the total IM distortion is then chosen at the end of the calibration scans. If another module such as the LNA is to be tuned, the auxiliary-path reference signal could be obtained from a highly linear but noisy measurement receiver, commonly found in modern-day transceiver chips. The system is capable of adaptive detection of blockers and can tune an analog component to its optimal bias region, after which it can be turned off to save power.
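The scan-and-select procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: `measure_correlation` stands in for the whole digital loop, `toy_correlation` is an invented V-shaped curve with an invented optimum at 0.62 V, and the ten-step bias range is hypothetical (the text states the per-step and total scan times, not the step count).

```python
import numpy as np

def calibration_scan(measure_correlation, bias_voltages):
    """Scan the bias range, store one correlation value per step,
    and return the bias voltage with the smallest correlation."""
    correlations = np.array([measure_correlation(v) for v in bias_voltages])
    best_index = int(np.argmin(correlations))
    return bias_voltages[best_index], correlations

# Hypothetical stand-in for the hardware loop: the correlation is modelled
# as a V-shaped curve whose minimum sits at an invented optimum of 0.62 V.
def toy_correlation(v_bias, v_opt=0.62):
    return abs(v_bias - v_opt) + 0.01

bias_range = np.linspace(0.50, 0.68, 10)   # assumed scan range and step count
best_bias, stored = calibration_scan(toy_correlation, bias_range)
```

In a multi-order scan (for example IM3 followed by IM5), the same function could be called once per IM generator configuration and the stored curves compared afterwards.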

The digital loop customized for IM3 tuning can be mathematically described as

$$\begin{aligned}
\mathbf{z}(\mathbf{n}) &= F(\mathbf{x}'(\mathbf{n}), \mathbf{e}(\mathbf{n})) \\
\mathbf{e}(\mathbf{n}) &= \mathbf{y}(\mathbf{n}) - \mathbf{z}(\mathbf{n}) \\
\mathbf{k}(\mathbf{n}) &= \mathbf{x}'(\mathbf{n})^3 \\
\mathbf{C} &= \mathbf{k}(\mathbf{n}) \star \mathbf{e}(\mathbf{n})
\end{aligned} \tag{1}$$

where the function F() describes the operation of the adaptive filter,  $\mathbf{x}'(\mathbf{n})$  is the reference signal from the auxiliary ADC, and  $\mathbf{e}(\mathbf{n})$  is the error signal which is correlated with the output of the IM generator  $\mathbf{k}(\mathbf{n})$  to produce the correlation value  $\mathbf{C}$  for the current tuning step.
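A floating-point sketch of (1) may help to see why the correlation value tracks the nonlinearity. It is not the fixed-point hardware: the main path is modelled here as a memoryless `y = x + a3 * x**3` (an assumption), the reference is white uniform noise (an assumption), and the correlation is taken at zero lag only. The filter length of 17 and the update factor of 0.25 follow the values given later in Section III-C.

```python
import numpy as np

def tuning_metric(x_ref, y_main, taps=17, mu=0.25):
    """Return |C| for one tuning step: LMS filter, error, correlation with x'^3."""
    w = np.zeros(taps)
    e = np.zeros(len(y_main))
    for n in range(taps - 1, len(y_main)):
        window = x_ref[n - taps + 1:n + 1][::-1]   # most recent sample first
        z = w @ window                             # z(n): adaptive filter output
        e[n] = y_main[n] - z                       # e(n) = y(n) - z(n)
        w += mu * e[n] * window                    # standard LMS weight update
    k = x_ref ** 3                                 # k(n) = x'(n)^3
    settled = slice(len(y_main) // 2, None)        # ignore the convergence phase
    return abs(np.mean(k[settled] * e[settled]))   # zero-lag correlation value

rng = np.random.default_rng(0)
x = 0.3 * rng.uniform(-1.0, 1.0, 20000)            # toy auxiliary-path reference
c_linear = tuning_metric(x, x)                     # perfectly linear main path
c_nonlinear = tuning_metric(x, x + 0.5 * x**3)     # assumed cubic term, a3 = 0.5
```

With a linear main path the LMS filter removes essentially all of `y`, so the correlation collapses toward zero; with the cubic term present, the part of `x**3` that a linear filter cannot model remains in `e` and correlates with `k`.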

From the hardware resource perspective, the main components of the proposed system are an auxiliary ADC, an IM generator, two decimators to match the sample rates of the auxiliary-path and main-path signals, an adaptive filter, a correlation unit, and a system controller along with a DAC. In the current implementation, an off-the-shelf ADC is used in performing the tasks of the main- and auxiliary-path ADCs. The IM generator is implemented using digital multipliers and requires two multiplication units to generate IM3.

#### B. Low-Complexity Decimator Design

Decimation is an essential operation in many signal processing systems, with the main goals of providing filtering along with sample rate reduction. An efficient method of choosing between different implementations is presented in [13]. One of the requirements of the proposed method is the presence of a "reference" signal in the auxiliary path, which may have a different sample rate than the signal in the main receiver path. In order to synchronize and enable low-power implementation of the adaptive algorithm, a decimation filter chain is employed to match the sample rates of the auxiliary and main paths.

Fig. 4. Three-stage half-band decimation filter.

Fig. 5. Tapped correlator implementation.

Two decimators are needed in the digital tuning loop, one for the IM generator output and one for the adaptive filter input, regardless of the order of IM being tuned. The same decimators can be reused when tuning either an even-order or odd-order IM, provided that we perform tuning over only a single-order IM in any given calibration scan. These filters are implemented using a three-stage half-band filter (HBF) chain. The stop band attenuation is set to about 40 dB to handle a strong OOB interference and to enable performance measurement of the system with different auxiliary-path ADC resolutions ranging from 8 to 4 bits. Fig. 4 shows the architecture of the proposed decimator chain. The first stage operates at a higher frequency and is implemented with a lower order, whereas the third stage with a higher order enables a sharp cutoff. The HBF chain requires 13 nontrivial multiplications, which is equivalent to the response of a 36th-order finite impulse response (FIR) decimator with 10-bit coefficients. We choose to implement the decimator with HBF, as each of the three stages works at half the clock frequency of the previous stage, resulting in an overall lower power consumption. It has to be noted that alternate implementations of the ADCs in a single chip system may result in different decimation filters depending on the OSR of the auxiliary ADC.
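The structure of such a chain can be illustrated with a small floating-point sketch. The filters below are generic windowed-sinc half-band designs chosen for brevity; their lengths (7, 11, and 15 taps) and the Hamming window are assumptions, so they reproduce neither the 13-multiplication design nor the 40-dB stop band of the implemented chain.

```python
import numpy as np

def halfband_taps(num_side):
    """Windowed-sinc half-band FIR: every second tap (except the centre) is zero."""
    n = np.arange(-(2 * num_side - 1), 2 * num_side)   # odd length, symmetric
    return 0.5 * np.sinc(n / 2) * np.hamming(len(n))

def decimate_by_two(x, h):
    """Filter with h, then keep every second sample."""
    return np.convolve(x, h, mode="same")[::2]

def three_stage_decimator(x, sides=(2, 3, 4)):
    """Decimate by 8 in three half-band stages; later stages use longer filters."""
    for num_side in sides:
        x = decimate_by_two(x, halfband_taps(num_side))
    return x

def rms(x):
    return np.sqrt(np.mean(x ** 2))

n = np.arange(4096)
in_band = three_stage_decimator(np.cos(2 * np.pi * n / 64))        # tone at fs/64
out_of_band = three_stage_decimator(np.cos(2 * np.pi * 0.45 * n))  # tone at 0.45 fs
```

Because roughly half of the taps are zero and the rest are symmetric, each stage needs only a few multiplications, and each stage runs at half the rate of the one before it, which is the property the text relies on for low power.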

#### C. Adaptive Filter and Correlation Unit

A linear adaptive filter is capable of producing an error signal  $\mathbf{e}(\mathbf{n})$  in Fig. 3, containing the nonlinear terms of the main-path signal as detailed in Section III-A. The LMS algorithm is the simplest adaptive algorithm which aims to minimize the mean-square error. The results in [10] show that a normalized LMS algorithm is capable of providing the required IM separation with a filter length of 9. We have chosen to use the standard LMS algorithm to avoid the division operation required by the normalized LMS and to use a slower update factor  $\mu=0.25$  with a filter length of 17. An unrolled FIR structure with 10-bit coefficients is implemented.

TABLE I HARDWARE RESOURCE UTILIZATION

| Component                    | Clock (MHz) | Area (FPGA slices) | Area (ASIC gates) | Power (mW) |
|------------------------------|-------------|--------------------|-------------------|------------|
| IM3 Generator                | 250         | 195                | 0.7 k             | 0.18       |
| Decimator                    |             |                    |                   | 1.56       |
| Stage 1                      | 250         | 193                | 1 k               | 0.58       |
| Stage 2                      | 125         | 301                | 1.4 k             | 0.56       |
| Stage 3                      | 62.5        | 408                | 2.4 k             | 0.42       |
| Adaptive Filter + Correlator | 32.25       | 6665               | 32 k              | 4.90       |
| Total                        |             | 8664               | 42.3 k            | 8.20       |

TABLE II COMPARISON OF AREA AND POWER

| Component           | ASIC area, 65 nm (mm$^2$) | Power (mW) |
|---------------------|---------------------------|------------|
| Tunable CSF [11]    | 0.19                      | 4.2        |
| Digital tuning loop | 0.086                     | 8.2        |
| ADC [15]            | 0.018                     | 2.0        |

The error signal $\mathbf{e}(\mathbf{n})$ and the output of the IM generator $\mathbf{k}(\mathbf{n})$ are generally not time synchronized, which can result in incorrect correlation values. To overcome this and better capture the correlation peaks, a bank of five correlators is used, as shown in Fig. 5, each performing correlation over different orthogonal frequency division multiplexing (OFDM) symbol samples. The maximum correlation value obtained is then used to control the DAC, which, in turn, tunes the CSF toward higher linearity. A reset signal is applied every 2048 samples, corresponding to one OFDM symbol period, to enable correlation calculations to restart for the next symbol. In the proposed system, we have chosen to average the correlation values of 30 OFDM symbols, corresponding to a period of 2 ms per step, to increase the robustness of the estimated correlation values. The time spent at each tuning voltage step can be programmed by configuring counters to provide a corresponding slower or faster scan over a bias voltage range.
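The correlator bank with per-symbol reset and 30-symbol averaging can be sketched as follows. How the five correlators are offset against each other is not spelled out in the text, so the sketch assumes they use consecutive sample lags 0 to 4 of k(n) relative to e(n); the test signals are synthetic.

```python
import numpy as np

SYMBOL_LEN = 2048     # samples between correlator resets (from the text)
NUM_SYMBOLS = 30      # symbols averaged per tuning step (from the text)
NUM_LAGS = 5          # correlators in the bank (from the text)

def step_correlation(k, e):
    """Average, over NUM_SYMBOLS symbols, the largest |correlation| in the bank."""
    peaks = []
    for s in range(NUM_SYMBOLS):
        start = s * SYMBOL_LEN                      # reset at each symbol boundary
        e_sym = e[start:start + SYMBOL_LEN]
        bank = [abs(np.dot(k[start + lag:start + lag + SYMBOL_LEN], e_sym))
                for lag in range(NUM_LAGS)]         # assumed lags 0..4
        peaks.append(max(bank))
    return float(np.mean(peaks))

rng = np.random.default_rng(1)
k = rng.standard_normal(NUM_SYMBOLS * SYMBOL_LEN + NUM_LAGS)
e = 0.5 * k[2:2 + NUM_SYMBOLS * SYMBOL_LEN]         # e(n) = 0.5 k(n+2): misaligned by 2
bank_value = step_correlation(k, e)
single_lag0 = abs(np.dot(k[:SYMBOL_LEN], e[:SYMBOL_LEN]))   # one unaligned correlator
```

With a two-sample misalignment, a single zero-lag correlator sees almost nothing, while the bank recovers the peak. One tuning step covers 30 × 2048 = 61 440 samples, so the quoted 2 ms per step corresponds to a sample rate of roughly 31 MHz.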

#### D. Hardware Implementation of the Digital Tuning Loop

In order to evaluate the hardware cost of the proposed system, the digital tuning loop customized to mitigate IM3 of the CSF was implemented on a Xilinx Kintex-7 [14] FPGA. Table I shows the resource utilization on the FPGA, the area required by a corresponding implementation in 65-nm CMOS technology, and the average power consumption of the proposed tuning loop. The digital loop is implemented to cover different PVT corners in the 65-nm CMOS technology, resulting in higher robustness and guaranteed functionally correct operation. Table II gives a broader picture of the area and power cost of the proposed system. We have chosen a state-of-the-art ADC for comparison, but a full chip implementation of the proposed system could have more relaxed requirements than the ADC in [15]. Since the calibration scans are typically initiated once every few minutes and complete within 20 ms, the total power overhead for the digital loop is negligible. The adaptive filter with the correlation unit occupies 75% of the area required for the digital tuning loop. Further area reductions can be obtained by implementing a folded LMS filter at the cost of a higher operating frequency. Nevertheless, the total area, which includes one IM3 generation unit, two decimators, and the unfolded adaptive filter with the correlation unit, is half that of the CSF [11].
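The totals in Table I can be reproduced from the individual rows once the decimator is counted twice, as the text states. The short check below also estimates the duty-cycled average power; the 120 s interval between scans is an assumption standing in for "every few minutes".

```python
# Values copied from Table I; kgates is the ASIC gate count in thousands.
im3_generator = {"slices": 195, "kgates": 0.7, "mw": 0.18}
decimator_stages = [
    {"slices": 193, "kgates": 1.0, "mw": 0.58},
    {"slices": 301, "kgates": 1.4, "mw": 0.56},
    {"slices": 408, "kgates": 2.4, "mw": 0.42},
]
adaptive_filter_correlator = {"slices": 6665, "kgates": 32.0, "mw": 4.90}
NUM_DECIMATORS = 2            # one per path, as described in Section III-B

def total(key):
    """Sum one column of Table I, counting the decimator twice."""
    one_decimator = sum(stage[key] for stage in decimator_stages)
    return (im3_generator[key]
            + NUM_DECIMATORS * one_decimator
            + adaptive_filter_correlator[key])

filter_share = adaptive_filter_correlator["kgates"] / total("kgates")

SCAN_TIME_S = 20e-3           # from the text
SCAN_INTERVAL_S = 120.0       # assumption: one scan every two minutes
avg_power_uw = total("mw") * 1e3 * SCAN_TIME_S / SCAN_INTERVAL_S
```

Under that assumed interval the loop averages on the order of a microwatt, which is what makes the overhead negligible next to the 4.2 mW of the CSF.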

#### IV. MEASUREMENT RESULTS

#### A. Verification System Setup

Fig. 6. Schematic and photograph of the test setup.

Fig. 7. CSF output before and after calibration.

The performance of the digital tuning system was tested by connecting it to the CSF using an external ADC as shown in Fig. 6. Three Rohde and Schwarz (R&S) SMIQ06B signal generators (SIG) were used to generate one IB and two OOB blockers, with both modulated and nonmodulated signals. The output and input signals of the CSF were fed into two separate channels on a 4DSP FMC125 ADC [16]. The ADC operated at 1.25 GHz, and the samples produced were digitally downsampled by dropping three out of every four samples to yield an effective sample rate of 312.5 MHz, which is slightly higher than the sample rate considered in [10]. The samples from the ADC were synchronized by using FIFOs, and a clock generator module from Xilinx was used to generate the different clock signals. Due to a limit on the number of external pins on the Kintex-7 FPGA, the DAC was replaced by a display implemented using ChipScope, and the CSF control voltage was tuned manually. The FSEA spectrum analyzer (SA) from R&S was used to measure the actual level of IM3 at the output of the CSF, which was then compared against the values obtained from the ChipScope display.

#### B. Test Results and Discussion

Fig. 8. Control voltage versus correlation values for different auxiliary ADC resolutions together with measured IM3 values from the spectrum analyzer (SA).

Fig. 9. 16-QAM signal reception with different levels of IM3 distortion. (a) IM3 at -26 dBc, EVM = 0.0716. (b) IM3 at -40 dBc, EVM = 0.0140.

A two-tone test was performed with the setup of Fig. 6 with two OOB continuous wave blockers at 49 MHz $(F_1)$ and 25 MHz $(F_2)$, resulting in an IB IM3 at 1 MHz $(2F_2-F_1)$. A single-tone signal at 2.36 MHz was also provided as an input to check the effectiveness of the adaptive algorithm to reject IB signals. A more difficult scenario with an IB modulated signal of 2-MHz bandwidth along with two OOB blockers was also tested. The full resolution of the main-path ADC was used, the OOB blockers were set to be at -7 dBFS, and a wideband signal at -31 dBFS was used as the IB signal. The screenshots from the spectrum analyzer with this setup are shown in Fig. 7 with the CSF output before and after calibration. It can be seen that the IM3 level can be reduced by 14.5 dB from -42.5 to -57 dBm.
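The location of the in-band IM3 product in the two-tone test can be checked numerically. The sketch below passes the two blocker tones through a memoryless cubic nonlinearity and reads the spectrum at $2F_2-F_1$. The tone amplitude `amp` and the cubic coefficient `a3` are invented values, and the CSF is of course not a memoryless polynomial; only the tone frequencies and the 312.5-MHz sample rate come from the text.

```python
import numpy as np

fs = 312.5e6                  # effective sample rate of the test setup
n_samples = 625               # gives a 0.5 MHz bin spacing, all tones on-bin
f1, f2 = 49e6, 25e6           # OOB blocker frequencies from the text
amp, a3 = 0.5, 0.1            # hypothetical tone amplitude and cubic coefficient

t = np.arange(n_samples) / fs
x = amp * (np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t))
y = x + a3 * x ** 3           # toy third-order nonlinearity

def tone_amplitude(signal, freq):
    """Amplitude of the spectral line at freq (freq must fall on an FFT bin)."""
    spectrum = np.fft.rfft(signal)
    bin_index = int(round(freq * n_samples / fs))
    return 2.0 * abs(spectrum[bin_index]) / n_samples

im3_freq = 2 * f2 - f1                        # 1 MHz, inside the 10 MHz baseband
im3_amp = tone_amplitude(y, im3_freq)
expected_amp = 0.75 * a3 * amp ** 3           # (3/4) a3 A^3 for equal-amplitude tones
```

The linear input has no energy at 1 MHz, so everything measured there is the third-order product that the tuning loop tries to minimize.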

The performance and cost of the digital tuning loop are mainly determined by the auxiliary ADC. A high-resolution ADC, while providing a better tuning capability, would result in increased area and power consumption, which might overshadow the improvements obtained by tuning the analog component. On the other hand, a low-resolution ADC will not provide the required degree of IM detection. Measurements with different resolutions on the auxiliary ADC were performed to determine the minimum required resolution by simple digital dropping of the least significant bits from the auxiliary ADC. Fig. 8 shows the normalized correlation values against the control voltage of the CSF obtained from the digital control loop when operating with 8-, 6-, and 4-bit resolutions. The plot for the measured IM3 corresponds to the values read out from the spectrum analyzer and indicates the true level of IM3 distortion. It can be seen that the plot with an 8-bit ADC follows the true performance of the CSF to a high degree and also that a 4-bit ADC is capable of detecting the control voltage for close to optimum biasing. A performance improvement of about 14 dB can be obtained by using the proposed system, and the simulation results from a 16-QAM 10-MHz baseband Long-Term Evolution signal are shown in Fig. 9. IM3 levels of -26 and -40 dBc are chosen to highlight the error vector magnitude (EVM) improvement pictorially, with a 14-dB change in IM3 resulting in an EVM improvement of around five times. Several low-resolution ADCs are presented in [15], [17], and [18], and one such solution could be implemented with the digital control loop for a single chip solution.
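Dropping least significant bits to emulate a lower-resolution auxiliary ADC amounts to a shift-and-restore operation on the sample words. The sketch below assumes signed 8-bit samples held in a wider integer type; the sample values are invented. It also compares the EVM ratio from Fig. 9 with the amplitude ratio that a 14-dB change implies.

```python
import numpy as np

def drop_lsbs(samples, full_bits=8, kept_bits=4):
    """Emulate a lower-resolution ADC by zeroing the least significant bits."""
    shift = full_bits - kept_bits
    return (samples >> shift) << shift

# Invented signed 8-bit sample values, stored as int16 for headroom.
samples = np.array([-128, -3, 0, 5, 100, 127], dtype=np.int16)
four_bit = drop_lsbs(samples, kept_bits=4)
six_bit = drop_lsbs(samples, kept_bits=6)

evm_ratio = 0.0716 / 0.0140          # EVM values quoted in Fig. 9
amplitude_ratio = 10 ** (14 / 20)    # a 14 dB change expressed as an amplitude ratio
```

Both ratios come out close to five, which is consistent with the EVM in this scenario being dominated by the IM3 amplitude.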

#### V. CONCLUSION

This brief has presented a low-complexity digital system for assisting a tunable CSF. It is capable of detecting and tuning the CSF to its optimal setting for minimum distortion, after which it can be shut down, resulting in minimum power consumption overhead. The proposed solution has been verified both by simulations and hardware implementation on a Xilinx Kintex-7 FPGA interfaced to the tunable CSF, and the results obtained show that the algorithm can be implemented with a low-resolution 4-bit ADC. The proposed system requires a total of only 42 k gates and is robust to PVT variations mainly due to the digital nature of the tuning circuit.

#### ACKNOWLEDGMENT

The authors would like to thank the Swedish Foundation for Strategic Research. This work is part of the DARE project.

#### REFERENCES

- [1] B. Kim, J.-S. Ko, and K. Lee, "A new linearization technique for MOSFET RF amplifier using multiple gated transistors," *IEEE Microw. Guided Wave Lett.*, vol. 10, no. 9, pp. 371–373, Sep. 2000.
- [2] W.-H. Chen, G. Liu, B. Zdravko, and A. M. Niknejad, "A highly linear broadband CMOS LNA employing noise and distortion cancellation," *IEEE J. Solid-State Circuits*, vol. 43, no. 5, pp. 1164–1176, May 2008.
- [3] A. Nejdel, M. Törmänen, and H. Sjöland, "A 0.7 to 3 GHz wireless receiver front end in 65-nm CMOS with an LNA linearized by positive feedback," *Analog Integr. Circuits Signal Process.*, vol. 74, no. 1, pp. 49–57, Jan. 2013.
- [4] W. Huang and E. Sanchez-Sinencio, "Robust highly linear high-frequency CMOS OTA with IM3 below -70 dB at 26 MHz," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 53, no. 7, pp. 1433–1447, Jul. 2006.
- [5] V. Aparin, G. J. Ballantyne, C. J. Persico, and A. Cicalini, "An integrated LMS adaptive filter of TX leakage for CDMA receiver front ends," *IEEE J. Solid-State Circuits*, vol. 41, no. 5, pp. 1171–1182, May 2006.
- [6] E. Keehr and A. Hajimiri, "Successive regeneration and adaptive cancellation of higher order intermodulation products in RF receivers," *IEEE Trans. Microw. Theory Tech.*, vol. 59, no. 5, pp. 1379–1396, May 2011.
- [7] E. Keehr and A. Hajimiri, "Equalization of third-order intermodulation products in wideband direct conversion receivers," *IEEE J. Solid-State Circuits*, vol. 43, no. 12, pp. 2853–2867, Dec. 2008.
- [8] M. Grimm, M. Allen, J. Marttila, M. Valkama, and R. Thoma, "Joint mitigation of nonlinear RF and baseband distortions in wideband direct-conversion receivers," *IEEE Trans. Microw. Theory Tech.*, vol. 62, no. 1, pp. 166–182, Jan. 2014.
- [9] M. Gande, H. Venkatram, L. Ho-Young, J. Guerber, and U.-K. Moon, "Blind calibration algorithm for nonlinearity correction based on selective sampling," *IEEE J. Solid-State Circuits*, vol. 49, no. 8, pp. 1715–1724, Aug. 2014.
- [10] R. Gangarajaiah, M. Abdulaziz, L. Liu, and H. Sjoland, "Digitally assisted adaptive nonlinearity suppression scheme for RF front ends," in *Proc. IEEE 25th Annu. Int. Symp. Pers., Indoor Mobile Radio Commun.*, Aug. 2014, pp. 623–627.
- [11] M. Abdulaziz, M. Tormanen, and H. Sjoland, "A 4th order Gm-C filter with 10 MHz bandwidth and 39 dBm IIP3 in 65 nm CMOS," in *Proc. IEEE Eur. Solid State Circuits Conf.*, Sep. 2014, pp. 367–370.
- [12] Evolved Universal Terrestrial Radio Access (E-UTRA) User Equipment (UE) Radio Transmission and Reception, 3GPP Std. 36.101, 2011.
- [13] R. Crochiere and L. Rabiner, "Interpolation and decimation of digital signals—A tutorial review," *Proc. IEEE*, vol. 69, no. 3, pp. 300–331, Mar. 1981.
- [14] Xilinx Kintex-7 FPGA KC705 Evaluation Kit. [Online]. Available: http://www.xilinx.com/products/boards-and-kits/ek-k7-kc705-g.html
- [15] Y.-Z. Lin, S.-J. Chang, Y.-T. Liu, C.-C. Liu, and G.-Y. Huang, "A 5 b 800 MS/s 2 mW asynchronous binary-search ADC in 65 nm CMOS," in *Proc. IEEE Int. Solid-State Circuits Conf.*, Feb. 2009, pp. 80–81.
- [16] 4DSP FMC125 Multi-Channel Multi-Mode 8-bit ADC. [Online]. Available: http://www.4dsp.com/fmc125.php
- [17] B. Verbruggen, J. Craninckx, M. Kuijk, P. Wambacq, and G. Van der Plas, "A 2.2 mW 1.75 GS/s 5 bit folding flash ADC in 90 nm digital CMOS," *IEEE J. Solid-State Circuits*, vol. 44, no. 3, pp. 874–882, Mar. 2009.
- [18] M. Chahardori, M. Sharifkhani, and S. Sadughi, "A 4-bit, 1.6 GS/s low power flash ADC, based on offset calibration and segmentation," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 60, no. 9, pp. 2285–2297, Sep. 2013.

# Paper III

### A High Speed QR Decomposition Processor for Carrier-Aggregated LTE-A Downlink Systems

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Lund university's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications\_standards/publications/rights\_link.html to learn how to obtain a License from RightsLink.

R. Gangarajaiah, P. Nilsson, L. Liu, and O. Edfors, "A High Speed QR Decomposition Processor for Carrier-Aggregated LTE-A Downlink Systems," © 2013 IEEE, reprinted from the *Proceedings of IEEE European Conference on Circuit Theory and Design*, Dresden, Germany, September 2013, pp. 1–4.

# A High-Speed QR Decomposition Processor for Carrier-Aggregated LTE-A Downlink Systems

Rakesh Gangarajaiah, Liang Liu, Michal Stala, Peter Nilsson, and Ove Edfors Department of Electrical and Information Technology, Lund University, Sweden Email: {rakesh.gangarajaiah,liang.liu,michal.stala,peter.nilsson,ove.edfors}@eit.lth.se

Abstract—This paper presents a high-speed QR decomposition (QRD) processor targeting the carrier-aggregated $4\times 4$ Long Term Evolution-Advanced (LTE-A) receiver. The processor provides robustness in spatially correlated channels with reduced complexity by using modifications to the Householder transform, such as decomposing-target redefinition and matrix real-valued decomposition. In terms of hardware design, we extensively explore flexibilities in systolic architectures using a high-level synthesis tool to achieve area-power efficiency. In a 65 nm CMOS technology, the processor occupies a core area of 0.77 mm$^2$ and produces 72 MQRD per second, the highest reported throughput. The power consumed in the proposed processor is 127 mW.

#### I. INTRODUCTION

The requirement of high speed wireless connections over limited spectrum has made the use of Multiple-Input Multiple-Output (MIMO) technique a necessity. To fully utilize the potential of MIMO systems, sophisticated signal processing is required at the receiver. QR decomposition (QRD) is one of the key operations used to correctly decode multiple streams of data affected by noise and interference [1].

Several standards have been introduced to meet requirements of high data rate applications. For example, the 3GPP Long Term Evolution-Advanced (LTE-A) delivers rates of over 1 Gbps using techniques such as enhanced MIMO and Carrier Aggregation (CA). This poses critical design challenges on the implementation of baseband processing algorithms. In one of the extreme use cases of LTE-A, where five frequency bands are aggregated into a 100 MHz data bandwidth, the QRD processor needs to compute up to 72 MQRD/s under fluctuating channel conditions. Insufficient antenna spacing in hand-held devices creates further complications such as spatial correlation, resulting in an ill-conditioned channel matrix H in the baseband. Numerical stability of algorithms working on such matrices is critical, and previous studies have proved that the fixed-point implementation of the Householder Transform (HT) is more numerically stable than the Gram-Schmidt (GS) method [2]. Moreover, HT works with columns instead of scalar elements, and thus is better suited to data-level parallelism. However, previous studies have suggested that the computational complexity of HT is very high, preventing it from being used in hardware implementation [3].

In this work, we leverage the high numerical stability of the HT to produce accurate QRD even in correlated MIMO channels. Two techniques are used to reduce the complexity while achieving high throughput with reasonable hardware resources. First, we redefine the QRD target based on the requirement of a tree-search symbol detector and avoid unnecessary matrix multiplications. Later, methods to exploit the symmetry and orthogonality properties in the Real Valued Decomposition (RVD) of H to further reduce complexity are detailed. We develop a scalable systolic VLSI architecture to implement the modified HT and utilize Calypto's Catapult tool to obtain optimized designs. This high-level synthesis tool translates C++ code into Register Transfer Level (RTL) and enables the designer to explore the effects of word widths, folding and pipelining against area and power consumption. Post-synthesis simulation results using 65 nm CMOS technology show that the proposed QRD processor achieves 72 MQRD/s, the highest reported throughput, with a gate count of 378 k gates.

#### II. BACKGROUND

Consider a MIMO system with N transmitter (Tx) and N receiver (Rx) antennas. If the transmit vector is represented as  $\mathbf{x} = \begin{bmatrix} x_1, x_2, ..., x_N \end{bmatrix}^T$ , the receive vector as  $\mathbf{y} = \begin{bmatrix} y_1, y_2, ..., y_N \end{bmatrix}^T$  with a channel  $\mathbf{H} \in \mathbb{C}^{N \times N}$ , then the system affected by random noise  $\mathbf{n}$ , can be described by

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}. \tag{1}$$

To achieve a low bit error rate (BER), the MIMO symbol detector has to minimize the error $\parallel \mathbf{y} - \mathbf{H}\tilde{\mathbf{x}} \parallel_2$, where $\tilde{\mathbf{x}}$ is the estimate of the transmit vector. Efficient symbol detectors require $\mathbf{H}$ to be decomposed into the product $\mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ is a unitary matrix and $\mathbf{R}$ is an upper triangular matrix [1]. The module which decomposes $\mathbf{H}$ into this product is called a QRD processor. These processors can be classified into two broad categories: one which works by rotating submatrices, like the Givens rotation (GR) method, and the other which works on columns, namely the GS method and the HT.

The conventional HT converts  $\mathbf{H}$  into a product of N unitary  $\mathbf{Q}$  matrices and an upper triangular  $\mathbf{R}$  as shown in

$$\mathbf{H} = \mathbf{Q_1}\mathbf{Q_2}...\mathbf{Q_{N-1}}\mathbf{Q_N}\mathbf{R},\tag{2}$$

where each of the  $\mathbf{Q}_i$  matrices are of the form

$$\mathbf{Q}_i = \left(\mathbf{I} - \frac{\mathbf{v}_i \mathbf{v}_i^*}{\mathbf{v}_i^* \mathbf{z}_i}\right) \tag{3}$$

and $\mathbf{z}_i$ is the vector to be transformed, with $\mathbf{v}_i$ being the difference vector from $\mathbf{z}_i$ to one of the columns of the identity matrix $\mathbf{I}$ [4]. Unfortunately, multiplying all the components $\mathbf{Q}_1\mathbf{Q}_2\cdots\mathbf{Q}_N$ to produce $\mathbf{Q}$ leads to a high computational complexity in the order of $N^4$, and a straightforward implementation would require unnecessarily high effort.
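A small numerical check of the reflector in (3) may be useful. The sketch is restricted to real vectors (an assumption made for brevity; the complex case needs a phase factor for $\alpha$) and takes $\mathbf{v}$ as the difference between $\mathbf{z}$ and a scaled first column of the identity matrix. The function name and the example vector are illustrative.

```python
import numpy as np

def householder_reflector(z):
    """Build Q = I - v v^T / (v^T z) for a real vector z, so that Q z = alpha * e_1."""
    z = np.asarray(z, dtype=float)
    alpha = -np.copysign(np.linalg.norm(z), z[0])   # sign chosen to avoid cancellation
    v = z.copy()
    v[0] -= alpha                                   # v = z - alpha * e_1
    Q = np.eye(len(z)) - np.outer(v, v) / (v @ z)
    return Q, alpha

z = np.array([3.0, 4.0, 0.0, 0.0])
Q, alpha = householder_reflector(z)
reflected = Q @ z                                   # all energy moved into the first entry
```

For this choice of $\mathbf{v}$, $\mathbf{v}^T\mathbf{v} = 2\,\mathbf{v}^T\mathbf{z}$, so the expression in (3) coincides with the textbook form $\mathbf{I} - 2\mathbf{v}\mathbf{v}^T/(\mathbf{v}^T\mathbf{v})$.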

#### III. PROPOSED QR DECOMPOSITION

In this section we present techniques to reduce the complexity of the HT. First we look at the modified representation of the linear system. Then we discuss the RVD and detail the methods of exploiting symmetry and orthogonality to reduce complexity. Later we discuss the gains obtained by implementing HT using these properties.

#### A. Modified linear system

Using the QRD of H, the system in (1) can be written as

$$\mathbf{Q}^*\mathbf{y} = \mathbf{R}\mathbf{x} + \mathbf{Q}^*\mathbf{n}. \tag{4}$$

As mentioned before, tree search based detectors accept  $\mathbf{QR}$  instead of  $\mathbf{H}$  and work by solving equations of the form (4), hence the task of the QRD processor working with a tree based detector can be relaxed to that of producing  $\mathbf{Q^*y}$  and  $\mathbf{Rx}$ . One of the requirements for (4) to hold good is that the error  $\parallel (\mathbf{Q^*Q}) - \mathbf{I} \parallel_2$  is minimal, or in other words,  $\mathbf{Q}$  is highly unitary. Since the  $\mathbf{Q}$  produced by the HT is of the form shown in (2) and using the property that  $\mathbf{Q}$  is unitary, the product  $\mathbf{Q^*y}$  can be rewritten as

$$\mathbf{Q}^*\mathbf{y} = \mathbf{Q}_N \mathbf{Q}_{N-1} \cdots \mathbf{Q}_1 \mathbf{y}. \tag{5}$$

Using the structure of the component  $Q_i$  matrices from (3), the above equation can be rewritten as

$$\mathbf{Q}^{*}\mathbf{y} = \mathbf{Q_{N}Q_{N-1}...}\left(\mathbf{y} - \frac{\mathbf{v_{1}}\left(\mathbf{v_{1}^{*}y}\right)}{\mathbf{v_{1}^{*}z_{1}}}\right). \tag{6}$$

By calculating  $\mathbf{Q}_i\mathbf{y}$  at each stage, the problem of computing  $\mathbf{Q}^*\mathbf{y}$  by N full rank matrix multiplications followed by a matrix-vector computation, reduces to a vector-vector multiplication at each stage of the transform. The fact that the HT is inherently an iterative process computing  $\mathbf{Q}_1$  before  $\mathbf{Q}_2$  enables us to produce a highly pipelined and hardware efficient QRD processor. The complexity of computing  $\mathbf{Q}^*\mathbf{y}$  is in the order of  $N^3$  as compared to  $N^4$  for direct implementation, which results in a huge reduction in computational cost, especially as N, the number of antennas, increases. Once the product  $\mathbf{Q}_1^*\mathbf{y}$  is computed, the  $\mathbf{v}_1$  vector can be discarded or, in hardware implementation, the same registers can be reused to store and process the ensuing  $\mathbf{v}_i$  vectors, resulting in reduced storage area.
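The stage-by-stage evaluation of (6) can be sketched for a real-valued matrix as follows. Each stage applies the current reflector to the remaining columns and to $\mathbf{y}$ using only vector products, so no $\mathbf{Q}_i$ and no accumulated $\mathbf{Q}$ is ever formed. The function name, the random test matrix, and the restriction to real arithmetic are assumptions of this sketch, not the processor's fixed-point datapath.

```python
import numpy as np

def householder_qr_apply(H, y):
    """Return (R, Q^T y) for a real square H without forming Q explicitly."""
    R = np.array(H, dtype=float)
    qy = np.array(y, dtype=float)
    n = R.shape[0]
    for i in range(n):                           # N stages, as in (2)
        z = R[i:, i].copy()                      # column to be transformed
        norm = np.linalg.norm(z)
        if norm == 0.0:
            continue                             # nothing to reflect
        alpha = -np.copysign(norm, z[0])
        v = z.copy()
        v[0] -= alpha                            # v_i = z_i - alpha * e_1
        denom = v @ z                            # v_i^T z_i from (3)
        # Apply Q_i to the trailing columns and to y with vector products only.
        R[i:, i:] -= np.outer(v, v @ R[i:, i:]) / denom
        qy[i:] -= v * (v @ qy[i:]) / denom
    return np.triu(R), qy                        # v_i is discarded after each stage

rng = np.random.default_rng(3)
H = rng.standard_normal((4, 4))
x_true = rng.standard_normal(4)
y = H @ x_true                                   # noise-free system for checking
R, qy = householder_qr_apply(H, y)
x_hat = np.linalg.solve(R, qy)                   # solves R x = Q^T y, as in (4)
```

Only one `v` is alive at any time, which mirrors the register reuse mentioned in the text.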

#### B. Complexity reduction due to Real Valued decomposition

Fig. 1: Householder Real valued decomposition

Tree search based detectors prefer RVD due to the easy enumeration of possible child nodes [5]. Any matrix in $\mathbb{C}^{N\times N}$ can be represented by an equivalent matrix in $\mathbb{R}^{2N\times 2N}$. One of the methods to do this is to represent each complex valued entry by an equivalent $2\times 2$ real valued entry as shown in Fig. 1. It has to be noted that not only is each of the $2\times 2$ submatrices orthogonal, but the columns of the transformed matrix are also pairwise orthogonal, as highlighted in Fig. 1. Applying the HT on the real valued matrix results in reducing the first column into a real valued entry $\alpha_1$, which is the length of the first column, along with modifying all the other columns as indicated in Fig. 1. Due to the property of the transform, the first element in the second column is also reduced to zero. It should be noted that, since the HT is equivalent to multiplication by an orthonormal matrix, the orthogonal properties of the columns and the $2\times 2$ submatrices remain unchanged. The second iteration of the HT only modifies the smaller $3\times 3$ submatrix in the example shown in Fig. 1 and changes the first element in the second column into a real entry representing the length of the second column. By construction, the second column has the same length as the first column of the original matrix. Hence the second iteration also produces the real element $\alpha_1$, which does not need to be computed again. Utilizing these properties, only half the number of columns in the real valued representation of the matrix need to be transformed.
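The structural properties used above can be verified on a small example. The sketch assumes the common block convention in which a complex entry $a + jb$ becomes $\begin{bmatrix} a & -b \\ b & a \end{bmatrix}$; the exact sign layout of Fig. 1 is not recoverable from the text, and the $2\times 2$ complex test matrix is invented.

```python
import numpy as np

def real_valued_decomposition(H):
    """Replace each complex entry a + jb by the 2x2 real block [[a, -b], [b, a]]."""
    n_rows, n_cols = H.shape
    H_R = np.zeros((2 * n_rows, 2 * n_cols))
    H_R[0::2, 0::2] = H.real
    H_R[0::2, 1::2] = -H.imag
    H_R[1::2, 0::2] = H.imag
    H_R[1::2, 1::2] = H.real
    return H_R

H = np.array([[1 + 2j, 0.5 - 1j],
              [-0.3 + 0.7j, 2 + 0.1j]])          # invented 2x2 complex channel
H_R = real_valued_decomposition(H)               # 4x4 real matrix

# One Householder step on the first column (real case).
z = H_R[:, 0].copy()
alpha = -np.copysign(np.linalg.norm(z), z[0])
v = z.copy()
v[0] -= alpha
A1 = H_R - np.outer(v, v @ H_R) / (v @ z)

first_of_second_col = A1[0, 1]                   # zeroed by orthogonality
second_col_length = np.linalg.norm(A1[1:, 1])    # equals |alpha|
```

Since the first two columns are orthogonal and of equal length, one reflection already zeroes the top entry of the second column, and the remaining part of that column has length $|\alpha_1|$, so the second iteration does not need a new norm computation.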

#### C. Algorithm analysis

The algorithm to perform the QRD of a real matrix $\mathbf{H}_R$ in $\mathbb{R}^{2N \times 2N}$ using the HT is shown in Table I along with the number of real domain operations required. The first two columns of $\mathbf{H}_R$ are essentially the same data, repeated in a systematic way, and hence the Householder vectors $\mathbf{v_1}$ and $\mathbf{v_2}$ corresponding to the first two columns can be computed in parallel. These parallel computations enable two iterations of the HT to be performed in one run, thereby enabling 2N columns to be processed in N runs.

- 1) Operation count analysis: Using these modifications to the algorithm, the number of multiplications required for one QRD using the modified HT is in the order of  $\frac{8N^3}{3}$  whereas the GS method requires more than  $4N^3$  operations while the direct HT implementation requires  $N^4$  operations [5]. The total number of operations including square roots, divisions and additions required to implement the transform compared to the GS method and the direct HT for different matrix sizes is shown in Fig. 2. It can be seen that the computational effort required to perform QRD using the proposed HT is not only lower than the direct HT method but also significantly lower than the corresponding GS method for matrices with large N.
- 2) Stability analysis in Correlated channels: Insufficient diversity in the channel or small antenna spacing creates correlated channels, resulting in a nearly rank deficient H. The ability of the QRD processor to orthonormalize a channel under such conditions determines the performance of the

TABLE I: Complexity Analysis of RVD

| Algorithm                                                                                                             | Add.                                                                   | Mul.                                 |
|-----------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|--------------------------------------|
| for $i = 2N : -2 : 0$                                                                                                 |                                                                        |                                      |
| $\mathbf{x1} = \mathbf{H_{i:1,i}}$                                                                                    |                                                                        |                                      |
| $\alpha = sign(x_{i,i}) \parallel \mathbf{x_1} \parallel_2$                                                           | i-1                                                                    | i                                    |
| $\mathbf{v_1} = \alpha \mathbf{e_i} + \mathbf{x1_{i:1,i}}$                                                            | 1                                                                      |                                      |
| $\parallel \mathbf{v_1^* v_1} \parallel^2 = \alpha^2 + \alpha x 1_{i,i}$                                              | 1                                                                      | 1                                    |
| $\beta_1 = \frac{\mathbf{v_1}}{\parallel \mathbf{v_1^*}\mathbf{v_1} \parallel^2}$                                     |                                                                        |                                      |
| $\Gamma_1 = \boldsymbol{\beta}_1(\mathbf{v_1}^*\mathbf{H_{i:1,i+2:1}})$                                               | $(i-1)(\frac{i-2}{2})$                                                 | $(i + i) \left(\frac{i-2}{2}\right)$ |
| $\mathbf{H1} = \mathbf{H_{i:1,i+2:1}} - \boldsymbol{\Gamma_1}$                                                        | $i\left(\frac{i-2}{2}\right)$                                          | , ,                                  |
| $G = \frac{-x1(2)}{\alpha + x1(1)}$                                                                                   | , ,                                                                    | 1                                    |
| $\mathbf{x2} = \mathbf{H_{i:1,i+1}}$                                                                                  |                                                                        |                                      |
| $v2 = x2 - x1G + \alpha e_{i+1}$                                                                                      | i                                                                      | i-1                                  |
| $\beta_2 = \frac{\mathbf{v_2}}{\parallel \mathbf{v_2^*}\mathbf{v_2} \parallel^2} = \frac{\mathbf{v_2}}{\parallel \mathbf{v_1^*}\mathbf{v_1} \parallel^2}$ |                                                                        |                                      |
| $\boldsymbol{\Gamma}_2 = \boldsymbol{\beta}_2 (\mathbf{v_2}^*\mathbf{H1_{i+1:1,i+2:1}})$                              | $(i-2)\left(\frac{i-2}{2}\right)$<br>$(i-1)\left(\frac{i-2}{2}\right)$ | $2(i-1)(\frac{i-2}{2})$              |
| $\mathbf{H} = \mathbf{H}1_{\mathbf{i+1}:1,\mathbf{i+2}:1} - \mathbf{\Gamma}_2$                                        | $(i-1)(\frac{i-2}{2})$                                                 | , ,                                  |
| $\mathbf{y} - 2\boldsymbol{\beta}_1 \mathbf{v_1^*y}$                                                                  | 2i-1                                                                   | 2i                                   |
| $\mathbf{y} - 2\boldsymbol{\beta}_2 \mathbf{v}_2^* \mathbf{y}$                                                        | 2i - 3                                                                 | 2(i-1)                               |
| end for                                                                                                               |                                                                        |                                      |
| Total for each iteration                                                                                              | $2i^2 + 1$                                                             | $2i^2 + i + 1$                       |
| Corrected Total for each iteration                                                                                    | $8i^2 + 1$                                                             | $8i^2 + 2i + 1$                      |



Fig. 2: Operation count for HT and GS

whole MIMO system. Fixed point simulations with different channel models show that 13 bits of normalized channel data are sufficient for the QRD processor to obtain near floating point performance in uncorrelated channel conditions. The mean square error (MSE) in producing a unitary ${\bf Q}$ using the HT and GS algorithms, implemented using 13 bits for a $4\times 4$ complex valued ${\bf H}$ with different condition numbers, is shown in Table II. The results show that the HT is significantly better at producing an orthonormalized ${\bf Q}$, especially as the condition number of the matrix increases. The effects of channel correlation on the BER using both floating point and fixed point QRD processors, along with the setup used for the experiment, are shown in Fig. 3. Due to the degradation in BER performance in correlated channels, 4QAM is used as the modulation alphabet along with coding to get an acceptable performance. It can

TABLE II: Gain in inversion accuracy

| Condition number (H) | 10     | 200    | 400    | 600    | 800    |
|----------------------|--------|--------|--------|--------|--------|
| MSE of HT            | 0.0039 | 0.0040 | 0.0040 | 0.0041 | 0.0041 |
| MSE of GS            | 0.0078 | 0.5482 | 1.4291 | 2.5380 | 3.7312 |



Fig. 3: BER curves for 4QAM in correlated channel

be seen that the performance of the HT is within 1 dB of the full floating point QRD, whereas the GS method fails to achieve acceptable BER even with high signal to noise ratio.

#### IV. HARDWARE IMPLEMENTATION

In this section, the basic architecture of the HT is presented. Later the methodology used to obtain different implementations of the QRD processor using the high level synthesis tool is discussed. Finally the hardware synthesis and power results are presented and a comparison is done with previously published QRD processors.

#### A. Architecture

Fig. 4 shows a high level architecture of the HT. The transform contains multiple arithmetic units such as multipliers, adders, square root, and division units, represented by AU in the architecture. Since the algorithm is sequential, the systolic array architecture is well suited for hardware implementation. Furthermore, the operations performed in each stage are essentially the same, as explained in Table I, and techniques such as folding and pipelining can be used to reduce area and increase throughput. Folding enables reuse of arithmetic units, reducing area, but increases power consumption as the circuit needs to run at a higher frequency to meet fixed throughput requirements. The number of arithmetic units required is reduced at later stages of the QRD as the HT operates on a lower number of elements in each successive column. Choosing an optimal number of multipliers and other combinational units is not an easy task and a flexible solution which enables Power-Area trade-offs for different technologies and throughput requirements is needed.

#### B. Methodology

Coding of the algorithm is done in C++ and fixed point libraries are used to translate the code into RTL using Catapult.



Fig. 4: High level architecture

Constraints such as synthesis technology, clock frequency, area and latency requirements are provided to the tool. The tool finds a feasible schedule to implement the algorithm and optimizes the design to find the best combination of hardware resources to meet the constraints. Constraints on clock frequency, area and pipelining can be used to explore the design space to find an optimal solution to meet the design goal. The design is then synthesized using Design Compiler and power estimates are obtained using Primetime.

#### C. Experimental setup

Two designs of a 4 × 4 MIMO system were considered for implementing the QRD using the HT. The first design is synthesized to produce a throughput of 15 MQRD/s, which would correspond to an LTE-A system running at 20 MHz bandwidth without CA, and the second to produce 72 MQRD/s, which corresponds to an LTE-A system running with five-band CA. The designs are taken through the flow described in the previous section and the resulting normalized values of Power, Area, and their product PA for different folding factors are shown in Fig. 5. The absolute values for these parameters can be obtained by using Table III. The design with a throughput requirement of 15 MQRD/s is synthesized for different frequencies ranging from 15 MHz to 135 MHz. Area reduces as folding increases since multipliers and other combinational units are reused, but the power consumption increases due to the higher operating frequency. Similar trends are seen for the design producing 72 MQRD/s.



Fig. 5: PA analysis for different designs

#### D. Results

Two designs capable of producing a throughput of 72 MQRD/s, highlighted in Fig. 5, are presented in Table III along with previously published designs. One of the designs is synthesized from the fully unfolded RTL and the other one is obtained from a version with a folding factor of three. The

TABLE III: Comparison with previous works

| Items                 | Patel [6]   | Huang [5] | Miyaoka [7] | This Work         |                 |
|-----------------------|-------------|-----------|-------------|-------------------|-----------------|
| Matrix type           | 4x4 complex | 8x8 Real  | 4x4 complex | 8x8 Real Unfolded | 8x8 Real Folded |
| Technology            | 130nm       | 180nm     | 90 nm       | 65nm              | 65 nm           |
| Max Freq              | 270MHz      | 100MHz    | 300MHz      | 72MHz             | 225MHz          |
| Gate count            | 36k         | 152  k    | 334  k      | 378  k            | 264 k           |
| Throughput [MQRD/s]   | 6.75        | 25        | 50          | 72                | 72              |
| Normalized Throughput | 13.5        | 69        | 69          | 72                | 72              |
| N.H.E.                | 375         | 454       | 150         | 190               | 272             |
| Normalized Power      | -           | 60mW      | -           | 127mW             | 252mW           |
| Energy/MQRD           | -           | 2.4mJ     | -           | 1.7mJ             | 3.36 mJ         |

power numbers are obtained by post synthesis simulations. The results are normalized to 65nm technology and a power supply of 1 Volt. The Normalized Hardware Efficiency (NHE) is defined as the ratio of normalized throughput over the gate count [5]. The energy consumption is calculated as the ratio of normalized power over the throughput. The current work has the highest reported throughput while consuming lower energy than the design presented in [5] in the fully unfolded configuration.
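The two figures of merit defined above can be reproduced from the table entries. The sketch below is illustrative only: the function and dictionary names are hypothetical, and only the rows for [5] and the unfolded design of this work are evaluated, using the values listed in Table III. With throughput in QRD/s and gate count in gates, the NHE comes out in QRD/s per gate; dividing power in mW by throughput in MQRD/s gives mJ per million QRDs.

```python
def normalized_hardware_efficiency(norm_throughput_mqrd_s, gate_count_k):
    """NHE = normalized throughput / gate count, in QRD/s per gate."""
    return (norm_throughput_mqrd_s * 1e6) / (gate_count_k * 1e3)

def energy_per_mqrd_mj(norm_power_mw, throughput_mqrd_s):
    """Energy = normalized power / throughput; mW / (MQRD/s) = mJ per MQRD."""
    return norm_power_mw / throughput_mqrd_s

# Values copied from Table III (hypothetical container, for illustration).
designs = {
    "Huang [5]": dict(throughput=25, norm_throughput=69, gates_k=152, power_mw=60),
    "This work (unfolded)": dict(throughput=72, norm_throughput=72, gates_k=378, power_mw=127),
}

results = {}
for name, d in designs.items():
    results[name] = (
        normalized_hardware_efficiency(d["norm_throughput"], d["gates_k"]),
        energy_per_mqrd_mj(d["power_mw"], d["throughput"]),
    )

for name, (nhe, energy) in results.items():
    print(f"{name}: NHE = {nhe:.0f}, energy = {energy:.2f} mJ/MQRD")
```

The unfolded design evaluates to an NHE of about 190 and roughly 1.76 mJ per MQRD, against about 454 and 2.4 mJ per MQRD for [5].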

#### V. CONCLUSION

In this paper, modifications to the standard Householder Transform (HT) are proposed which enable QRD to be performed with lower computational complexity than the Gram-Schmidt (GS) method. The proposed design is able to meet the requirements of a full CA LTE-A system, producing a throughput of 72 MQRD/s. Simulation results have also shown that using the HT instead of the GS method results in a performance gain of over 2 dB at Signal to Noise Ratio (SNR) levels of around 25 dB in correlated channels. RTL implementation results show that the high level synthesis tool is very effective in evaluating designs for Area-Power trade-offs. The implemented design has the highest reported throughput, while consuming comparable energy and area.

#### ACKNOWLEDGMENT

This work is a part of the DARE project and the authors would like to thank Lund University and the funding organization, Stiftelsen för Strategisk Forskning.

#### REFERENCES

- [1] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mbps," in *Proc. IEEE Int. Symp. Circuits Syst.*, ISCAS 2006.
- [2] G. H. Golub and C. F. Van Loan, *Matrix computations* (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
- [3] Y. T. Hwang and W. D. Chen, "Design and implementation of a high-throughput fully parallel complex-valued QR factorisation chips," *IET Circuits, Devices Syst.*, vol. 5, no. 5, pp. 424–432, 2011.
- [4] K.-L. Chung and W.-M. Yan, "The complex Householder transform," IEEE Trans. Signal Process., vol. 45, no. 9, sep 1997.
- [5] Z.-Y. Huang and P.-Y. Tsai, "Efficient Implementation of QR Decomposition for Gigabit MIMO-OFDM Systems," *IEEE Trans. Circuits Syst. I*, vol. 58, no. 10, pp. 2531–2542, oct. 2011.
- [6] D. Patel, M. Shabany, and P. Gulak, "A low-complexity high-speed QR decomposition implementation for MIMO receivers," in *Proc. IEEE Int. Symp. Circuits Syst.*, ISCAS 2009.
- [7] Y. Miyaoka, Y. Nagao, M. Kurosaki, and H. Ochi, "Sorted QR decomposition for high-speed MMSE MIMO detection based wireless communication systems," in *Proc. IEEE Int. Symp. Circuits Syst., ISCAS* 2012.

# Paper IV

Low Complexity Adaptive Channel Estimation and QR Decomposition for an LTE-A Downlink

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Lund university's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications\_standards/publications/rights\_link.html to learn how to obtain a License from RightsLink.

R. Gangarajaiah, P. Nilsson, O. Edfors, and L. Liu, "Low Complexity Adaptive Channel Estimation and QR Decomposition for an LTE-A Downlink," © 2014 IEEE, reprinted from the *Proceedings of IEEE 25th Annual International Symposium on Personal, Indoor and Mobile Radio Communications*, Washington DC, USA, September 2014, pp. 459–463.

# Low Complexity Adaptive Channel Estimation and QR Decomposition for an LTE-A Downlink

Rakesh Gangarajaiah, Peter Nilsson, Ove Edfors and Liang Liu
Department of Electrical and Information Technology, Lund University, Sweden
Email: {rakesh.gangarajaiah,peter.nilsson,ove.edfors,liang.liu}@eit.lth.se

Abstract—This paper presents a link adaptive processor to perform low-complexity channel estimation and QR decomposition (QRD) in Long Term Evolution-Advanced (LTE-A) receivers. The processor utilizes frequency domain correlation of the propagation channel to adaptively avoid unnecessary computations in the received signal processing, achieving significant complexity reduction with negligible performance loss. More specifically, a windowed Discrete Fourier transform (DFT) algorithm is used to detect channel conditions and to compute a minimum number of sparse subcarrier channel estimates required for low complexity linear QRD interpolation. Furthermore, the sparsity of subcarrier channel estimates can be adaptively changed to handle different channel conditions. Simulation results demonstrate a reduction of 40%-80% in computational complexity for different channel models specified in the LTE-A standard.

Keywords—Wireless communication, MIMO, Channel estimation, OFDM, Adaptive signal processing.

#### I. INTRODUCTION

The requirement for high speed wireless communication, limited frequency bands and fluctuating channel conditions has made Multiple-Input Multiple-Output (MIMO) a widely adopted technique in many radio standards, including the 3GPP Long Term Evolution-Advanced (LTE-A). To fully utilize the capabilities of LTE-A systems, sophisticated signal processing operations are required. Among others, accurate channel estimation and the subsequent channel matrix QR decomposition (QRD) are indispensable for advanced MIMO signal detectors, e.g., the K-Best detector [1], to recover transmitted information from noisy received signals. Different algorithms for channel estimation are presented in [2] [3] and the most frequently used QRD algorithms are detailed in [4]. These essential signal processing operations of a MIMO system come at the price of high computational complexity, preventing accurate implementation of the corresponding algorithms in power and area limited handheld devices.

MIMO is usually combined with Orthogonal Frequency Division Multiplexing (OFDM) to provide high spectral efficiency. As a result, channel estimation and QRD have to be performed on a tone-by-tone basis, making the complexity a more critical issue. To alleviate this complexity problem, authors in [5] present an interpolation technique, where channel estimation and QRD are performed only on pilot tones and the QRD for tones in between pilots is obtained by interpolation. However, the algorithm is implemented for a fixed interpolation distance and also requires a translation into another domain for performing QRD interpolation. Moreover, such a static interpolation strategy may degrade performance in highly

frequency selective channels while resulting in unnecessary computations in channels with low frequency selectivity.

Mobile devices operate in fluctuating channel scenarios depending on their surroundings and the LTE-A standard classifies wireless channels into three main categories based on the frequency selectivity, namely the EPA, EVA and ETU. The EPA channel has a very low frequency selectivity and requires fewer subcarrier channel estimates than the highly frequency selective ETU channel to reach a target system bit error rate (BER). Hence, a fixed solution such as the tone-by-tone approach or the one presented in [5] is not efficient in terms of the number of computations performed. To alleviate the aforementioned problem, we propose an adaptive solution which takes into account the dynamic nature and frequency selectivity of the wireless channel. In detail, the proposed solution utilizes a windowed Discrete Fourier transform (DFT) based channel estimator to produce only the required number of subcarrier channel estimates, enabling very low complexity linear QRD interpolation. The windowed DFT method also provides a simple way of detecting operating channel conditions, enabling the adaptive processor to optimize the interpolation distance, measured in subcarriers, to reach a desired BER with the lowest computational effort. To verify the proposed scheme, we simulated a simplified LTE-A downlink system with a $4 \times 4$ MIMO setup. Simulations performed with the EVA and ETU channels show that the proposed method offers significant complexity savings over the traditional tone-by-tone method, with minor performance loss.

#### II. BACKGROUND

A MIMO system with M transmitter (Tx) and receiver (Rx) antennas can be modelled as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{y} = [y_1, y_2, ..., y_M]^T$ is the received data vector, $\mathbf{H} \in \mathbb{C}^{M \times M}$ is the channel gain matrix between the antennas, $\mathbf{x} = [x_1, x_2, ..., x_M]^T$ is the transmit data vector and $\mathbf{n}$ is the additive white Gaussian noise (AWGN). To achieve a low BER, the MIMO symbol detector has to minimize the error $\|\mathbf{y} - \tilde{\mathbf{H}}\tilde{\mathbf{x}}\|_2$, where $\tilde{\mathbf{x}}$ is the estimate of the transmit vector and $\tilde{\mathbf{H}}$ is the estimated channel matrix obtained from a set of predefined pilot tones. Fig. 1 shows the structure of these predefined pilots in a $4 \times 4$ LTE-A system; the redundancy in pilots enables good performance even under high frequency selectivity.

Estimation of the channel gain matrix  $\tilde{\mathbf{H}}$  is the first major step performed in order to recover data from (1) and



Fig. 1: Pilot structure of a  $4 \times 4$  LTE-A system

accurate estimation enables data detection with a low BER. Furthermore, since there are  $M^2$  paths between the Tx and Rx in  $M \times M$  MIMO systems, estimators with lower complexity are preferred. Once the estimate  $\tilde{\mathbf{H}}$  is obtained, advanced decoders such as the K-best decoder solve (1) by decomposing the channel gain matrix  $\tilde{\mathbf{H}}$  into a product of a unitary  $\mathbf{Q}$  and an upper triangular matrix  $\mathbf{R}$ . The unitary matrix is used to rotate the received vector and the resulting linear system is solved by using tree search techniques as

$$\begin{aligned}
\mathbf{y} &= \mathbf{Q}\mathbf{R}\mathbf{x} + \mathbf{n} \\
\mathbf{Q}^*\mathbf{y} &= \mathbf{R}\mathbf{x} + \mathbf{Q}^*\mathbf{n} \\
\hat{\mathbf{y}} &= \mathbf{R}\mathbf{x} + \hat{\mathbf{n}}
\end{aligned} \tag{2}$$

where  $\mathbf{Q}^*$  is the Hermitian transpose of the unitary  $\mathbf{Q}$  matrix. The complexity of popular QRD algorithms, measured in number of multiplications, is  $\mathcal{O}(M^3)$  [4]. Hence, in LTE-A receivers, algorithms producing accurate channel estimation and QRD with low complexity are needed to utilize the full potential of MIMO systems.
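The rotation in (2) can be illustrated with a small noiseless example. The sketch below uses a modified Gram-Schmidt QRD purely for brevity; the matrix values, the symbol vector and all function names are assumptions made for illustration, and back-substitution stands in for the tree search, which is only needed once noise is present.

```python
import math

def conj_transpose(A):
    return [[A[r][c].conjugate() for r in range(len(A))] for c in range(len(A[0]))]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def qr_gram_schmidt(H):
    """Modified Gram-Schmidt QRD of a square complex matrix with independent columns."""
    m = len(H)
    cols = [[H[r][c] for r in range(m)] for c in range(m)]
    q_cols = []
    R = [[0j] * m for _ in range(m)]
    for j in range(m):
        v = cols[j][:]
        for i in range(j):
            # Projection of the running residual onto the i-th orthonormal column.
            R[i][j] = sum(q.conjugate() * x for q, x in zip(q_cols[i], v))
            v = [x - R[i][j] * q for x, q in zip(v, q_cols[i])]
        R[j][j] = math.sqrt(sum(abs(x) ** 2 for x in v))
        q_cols.append([x / R[j][j] for x in v])
    Q = [[q_cols[c][r] for c in range(m)] for r in range(m)]
    return Q, R

H = [[1 + 1j, 0.5 - 0.2j],
     [0.3 - 0.7j, -1 + 0.4j]]       # illustrative channel matrix
x = [1 + 1j, -1 + 1j]               # two unnormalized 4QAM symbols
y = matvec(H, x)                    # noiseless received vector (n = 0)

Q, R = qr_gram_schmidt(H)
y_hat = matvec(conj_transpose(Q), y)    # rotated received vector Q^* y
Rx = matvec(R, x)                       # should match y_hat when n = 0

# Back-substitution on the upper triangular system recovers x.
x_rec = [0j] * len(x)
for i in reversed(range(len(x))):
    acc = y_hat[i] - sum(R[i][k] * x_rec[k] for k in range(i + 1, len(x)))
    x_rec[i] = acc / R[i][i]
```

Because $\mathbf{R}$ is upper triangular, the last symbol is decided first and each earlier symbol depends only on those already decided, which is the structure the tree search exploits.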

The traditional approach for solving the signal detection problem has been to consider channel estimation and QRD as separate entities, as shown in Fig. 2(a). Channel estimation is performed as an independent operation followed by QRD for all the data tones. Several methods of channel estimation such as the Least Squares (LS), Robust Minimum Mean Square Estimator (RMMSE), DFT and Matching Pursuit (MP) [2] [3] have been proposed. DFT based estimators are a class of low complexity estimators which utilize the DFT and inverse DFT operations to perform noise filtering, but suffer from a high noise floor due to spectral leakage [6]. Authors in [7] suggest a method to utilize windows to weigh the data to minimize the spectral leakage, enabling better performance. The advantage of DFT based estimators is their efficient hardware implementation through Fast Fourier Transform (FFT) algorithms. The complexity of this implementation measured in terms of total multiplications is $\mathcal{O}\left(M^2\left(N_p \log_2\left(N_p\right)+N_d \log_2\left(N_d\right)\right)\right)$, where $N_p$ is the number of pilot tones and $N_d$ is the number of channel estimates produced for an $M \times M$ MIMO system.

QRD, which is performed on the channel estimates, can be implemented by several techniques such as the Givens rotation (GR) [8], the Gram-Schmidt (GS) method [9] or the Householder transform, but all have a complexity of $\mathcal{O}(M^3N_d)$. Even though both channel estimation and QRD are computationally intensive, it should be noted that a typical LTE-A system using DFT based channel estimators needs more computations for channel estimation than for the QRD.
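As a rough illustration of how the two complexity orders compare, the sketch below evaluates both expressions with all constant factors dropped. The sizes $M = 4$, $N_p = 64$ and $N_d = 512$ are assumptions chosen for the example, not figures taken from the paper, so the result indicates relative magnitude only.

```python
import math

def dft_estimator_mults(M, Np, Nd):
    """Order-of-magnitude count M^2 (Np log2 Np + Nd log2 Nd), constants dropped."""
    return M ** 2 * (Np * math.log2(Np) + Nd * math.log2(Nd))

def qrd_mults(M, Nd):
    """Order-of-magnitude count M^3 Nd, constants dropped."""
    return M ** 3 * Nd

# Assumed sizes for illustration only.
M, Np, Nd = 4, 64, 512
est = dft_estimator_mults(M, Np, Nd)
qrd = qrd_mults(M, Nd)
print(f"channel estimation ~ {est:.0f}, QRD ~ {qrd} multiplications")
```

Under these assumptions the channel estimation term is the larger of the two, in line with the observation above.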



Fig. 2: Data detection flow: Traditional Vs Proposed

To reduce the high complexity of the above tone-by-tone approach, interpolation techniques have been used as shown in Fig. 2(a). The channel estimates at the pilot positions are estimated, followed by QRD interpolation for the data tones in between these pilot positions. A theoretical background when using the GS method to obtain lossless QRD interpolation is provided in [10] and a hardware implementation for an LTE-A frame structure utilizing only pilot positions to perform QRD interpolation is presented in [5]. These techniques rely on mapping the Q and R matrices into a subspace where polynomial interpolation can be applied and then de-mapping the interpolated matrices. Even though these methods are more efficient, they do not utilize channel properties such as frequency selectivity to further optimize the number of subcarrier channel estimates and QRDs performed. The fixed architecture of interpolating over pilot positions offers little flexibility and results in many redundant computations in channels with low frequency selectivity and a performance loss in highly frequency selective channels.

#### III. LINK ADAPTIVE QR DECOMPOSITION

Mobile devices experience high frequency selectivity in rich multipath environments requiring channel estimation at data tones which are closely spaced whereas in low frequency selectivity environments subcarrier channel estimates can be produced at tones which are farther apart. Moreover, when operating in higher received Signal to Noise Ratios (SNRs), a more accurate channel estimation and QRD is required whereas in lower SNRs the gain obtained by performing accurate channel estimation and QRD is lost due to the high noise levels. Hence, by examining the current channel conditions, a reduction in the total number of computations can be obtained by tuning the channel estimator and QRD processor to execute only the minimum number of required computations to reach a desired level of system performance.

Fig. 2(b) depicts such an adaptive approach where the operating channel conditions such as the frequency selectivity are obtained by examining the LS estimates of the pilot tones with the Signal to Noise Ratio (SNR) estimates obtained from external processing units [11]. Based on these channel parameters, an adaptive channel estimator is utilized to produce estimates at tones much farther apart than the pilots when operating in an EPA channel or at tones much closer than the pilot positions when operating in the ETU channel. This estimated channel data is used by a low complexity QRD



Fig. 3: Average power in taps of a 5 MHz LTE-A downlink

interpolation method to approximate the Q and R matrices of the intermediate data tones. The proposed link adaptive processor incorporates these ideas and the following sections introduce the different components of this processor along with the methodology used to select the optimal distances for the low complexity QRD interpolation. A DFT based channel estimator is used in the link adaptive processor due to its reconfigurability and efficient hardware implementation and a method of mitigating the spectral leakage is discussed next.

#### A. Windowed DFT channel estimator

DFT based estimators perform an inverse DFT operation on the pilot tones, weigh the taps according to a predefined strategy aimed at filtering noise and perform a DFT to produce the channel estimates at the data positions. Such an estimator producing  $N_d$  estimates with  $N_p$  pilots can be represented as

$$\mathbf{h_d} = \mathbf{F_d} \mathbf{W} \mathbf{F_p^H} \mathbf{h_p}, \tag{3}$$

where  $\mathbf{h_d}$  is the vector of channel estimates at data tones,  $\mathbf{F_d}$  is a  $N_d \times N_d$  DFT matrix,  $\mathbf{W}$  is the  $N_d \times N_p$  filtering matrix,  $\mathbf{F_p}$  is the  $N_p \times N_p$  DFT matrix and  $\mathbf{h_p}$  is the vector of channel estimates at pilot tones. These estimators perform well at lower SNRs and are hardware efficient when implemented using the FFT algorithm but suffer from a high noise floor due to spectral leakage at higher SNRs [6].
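The structure of (3) can be sketched with a plain DFT under idealized assumptions: a noiseless, sample-spaced channel with fewer taps than pilots, and a filtering step $\mathbf{W}$ that simply keeps the first taps and zero pads. The tap values, sizes and function names below are hypothetical and chosen only to make the example exact.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[m] * cmath.exp(2j * cmath.pi * k * m / n) for m in range(n)) / n
            for k in range(n)]

def dft_channel_estimate(h_p, n_d, n_taps):
    """h_d = F_d W F_p^H h_p, with W keeping n_taps taps and zero padding to n_d."""
    taps = idft(h_p)                                # inverse DFT of the pilot estimates
    kept = taps[:n_taps] + [0j] * (n_d - n_taps)    # W: truncate, then zero pad
    return dft(kept)                                # DFT to all n_d data positions

# Assumed sample-spaced channel with three taps, 16 subcarriers, pilots every 4th tone.
true_taps = [1.0, 0.5j, -0.25]
n_d, spacing = 16, 4
H_full = dft(true_taps + [0j] * (n_d - len(true_taps)))
pilots = H_full[::spacing]                          # N_p = 4 noiseless pilot estimates

h_d = dft_channel_estimate(pilots, n_d, n_taps=3)
max_err = max(abs(a - b) for a, b in zip(h_d, H_full))
```

In this idealized case the estimates at all 16 tones are recovered up to rounding. With non sample-spaced channels the tap energy leaks across the inverse DFT output, which is the spectral leakage problem mentioned above.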

The basic requirement of the proposed link adaptive processor is the ability to recognize the operating channel conditions, which can be achieved by either analyzing the cyclic prefix or by an inverse DFT operation on the pilot tones. Fig. 3 shows the average power in the different taps of LTE-A channels obtained by using a 64 point inverse FFT operation on the pilot tones in a 5 MHz bandwidth downlink. We notice that by analyzing the distribution of power in the first few taps of the inverse FFT output, the current channel conditions can be detected. Furthermore, the inverse DFT operation is part of the DFT based estimator represented by (3) and enables the detection of channel conditions at no additional cost.
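One simple way to quantify the distribution of power in the first few taps is the fraction of total energy they contain. The sketch below is only an illustration of that idea: the two synthetic tap profiles, the choice of two leading taps and the function names are assumptions, and the paper's actual detection rule is not reproduced here.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[m] * cmath.exp(2j * cmath.pi * k * m / n) for m in range(n)) / n
            for k in range(n)]

def tap_power_profile(pilot_estimates):
    """Power in each tap of the inverse DFT of the pilot estimates."""
    return [abs(t) ** 2 for t in idft(pilot_estimates)]

def energy_fraction(profile, n_first):
    """Share of the total tap energy contained in the first n_first taps."""
    return sum(profile[:n_first]) / sum(profile)

# Synthetic, sample-spaced tap profiles (illustrative values only).
short_taps = [1.0, 0.3, 0, 0, 0, 0, 0, 0]         # low frequency selectivity
long_taps = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0, 0]  # high frequency selectivity

frac_short = energy_fraction(tap_power_profile(dft(short_taps)), 2)
frac_long = energy_fraction(tap_power_profile(dft(long_taps)), 2)
```

A channel with a short delay spread concentrates nearly all of its energy in the leading taps, whereas a more dispersive channel spreads it out, so a threshold on such a fraction could separate the channel classes.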

Fig. 3 also shows the spectral leakage due to the non sample spaced channels which causes degraded performance at higher SNRs. Utilizing windows to reduce spectral leakage is a well-known method and [7] describes a method to improve the performance of DFT based estimators at higher SNRs by using Hann windows. An inverse window operation is required to remove the effects of windowing and [7] uses a division by the Hann window to achieve this. The overlap-add method



Fig. 4: Overlap add method to reduce spectral leakage

is another way of removing windowing effects and the link adaptive processor uses this method as depicted in Fig. 4. The data from the first and last pilots in the 5 MHz spectrum is extended to produce 128 points and the Hann windowing reduces spectral leakage. Finally, the windowing effects on the channel response are removed by using the overlap-add method to produce interpolated channel estimates at the data tones.
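The property that makes overlap-add suitable for undoing a Hann weighting is that two periodic Hann windows shifted by half their length sum to one. The sketch below checks only this property for a 128-point window; the exact segmentation used in Fig. 4 is not reproduced, and the function name is an assumption.

```python
import math

def periodic_hann(n):
    """Periodic Hann window w[k] = 0.5 * (1 - cos(2*pi*k/n))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / n)) for k in range(n)]

n = 128
w = periodic_hann(n)
half = n // 2

# Sum of the window and its copy shifted by half the window length.
overlap_sum = [w[k] + w[k + half] for k in range(half)]
max_deviation = max(abs(s - 1.0) for s in overlap_sum)
```

Since the overlapped weights add up to a constant, summing half-overlapped windowed segments restores the original amplitudes without the division by the window used in [7].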

#### B. QR interpolation

The windowed DFT based channel estimator enables the detection of channel conditions such as frequency selectivity, which can be used to estimate the distance, measured in subcarriers, for QRD interpolation. The effects of interpolation errors on the system BER can be minimized by adaptively changing the interpolation distance depending on channel conditions, for example, by choosing subcarriers which are close when operating in highly frequency selective channels.

All unitary  $\mathbf{Q}$  matrices of size  $M \times M$  are part of the unitary group  $\mathbf{U}(M)$  and any unitary matrix  $\mathbf{Q}_1$  can be transformed into another unitary matrix  $\mathbf{Q}_2$  by using a rotation matrix of the form  $\mathbf{Q}_2\mathbf{Q}_1^*$ . Authors in [12] present a method of obtaining the intermediate  $\mathbf{Q}$  matrices between  $\mathbf{Q}_1$  and  $\mathbf{Q}_2$  using

$$\mathbf{Q}(s) = (\mathbf{Q}_2 \mathbf{Q}_1^*)^s \mathbf{Q}_1, \quad s \in \mathbb{R},\ 0 \le s \le 1. \tag{4}$$

If  $\parallel \mathbf{Q}_1 - \mathbf{Q}_2 \parallel_F = \epsilon$ , where  $\epsilon$  is a small constant, a linear interpolation of the form

$$\mathbf{Q}(s) = (1 - s) \times \mathbf{Q}_1 + s \times \mathbf{Q}_2,\tag{5}$$

can be used to approximate (4). The corresponding  $\mathbf{R}(s)$  matrix can also be approximated by

$$\mathbf{R}(s) = (1-s) \times \mathbf{R}_1 + s \times \mathbf{R}_2. \tag{6}$$

These approximations lead to errors in QRD interpolation and a strategy to choose the correct interpolation distances, to minimize the effect of these errors on BER for different channel scenarios is presented in the next section.
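The size of the error introduced by (5) can be illustrated for the special case of real $2 \times 2$ rotation matrices, where (4) has a closed form: the exact intermediate matrix is the rotation by the interpolated angle. This restriction, the angle values and the function names are assumptions made to keep the example short; general unitary matrices would require a matrix power.

```python
import math

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def lerp(A, B, s):
    """Entry-wise linear interpolation (1 - s) * A + s * B, as in equation (5)."""
    return [[(1 - s) * a + s * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def frobenius_distance(A, B):
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))

def midpoint_error(theta1, theta2):
    """Return (||Q1 - Q2||_F, ||Q_linear(0.5) - Q_exact(0.5)||_F)."""
    q1, q2 = rotation(theta1), rotation(theta2)
    # For rotations, equation (4) reduces to a rotation by the interpolated angle.
    exact = rotation(theta1 + 0.5 * (theta2 - theta1))
    linear = lerp(q1, q2, 0.5)
    return frobenius_distance(q1, q2), frobenius_distance(linear, exact)

eps_small, err_small = midpoint_error(0.3, 0.4)   # closely spaced matrices
eps_large, err_large = midpoint_error(0.3, 1.3)   # widely spaced matrices
```

For the closely spaced pair the midpoint error is roughly two orders of magnitude below $\epsilon$, while it grows quickly as the matrices move apart, which is why the interpolation distance has to follow the channel correlation.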

#### C. Coherence Bandwidth and Interpolation error

The coherence bandwidth $(B_{coh})$ of a wireless channel is defined as the bandwidth over which the correlation of channel gains is higher than a specified limit [13]. QRD of channel gain matrices $\mathbf{H}_1$ and $\mathbf{H}_2$ of two correlated subcarriers results in $\mathbf{Q}_1$ and $\mathbf{Q}_2$ such that $\parallel \mathbf{Q}_1 - \mathbf{Q}_2 \parallel_F = \epsilon$, where $\epsilon$ is a small constant. This enables $B_{coh}$ to be used as a parameter



Fig. 5: Interpolation error, Subcarrier distances and Correlation

to evaluate the interpolation error due to approximations in (6). The LTE-A standard uses three main channel models and each channel exhibits a different $B_{coh}$, enabling the link adaptive processor to choose between different bandwidths over which QRD interpolation can be performed. Fig. 5(b) shows the distances in subcarriers for different levels of correlation for the three models. In the EPA model, gain matrices $\mathbf{H}_1$ and $\mathbf{H}_2$ which are 24 subcarriers apart show a correlation of 75% whereas the EVA and ETU models show 75% correlation for channels around 15 subcarriers apart.

The uncoded BER of a wireless system operating in a frequency selective channel affected by AWGN is inversely proportional to the SNR available at the receiver. The total noise n in a wireless receiver employing interpolation techniques can be expressed as

$$n = n_{sys} + n_{interp} \tag{7}$$

where  $n_{sys}$  is the system noise and  $n_{interp}$  is the noise introduced due to linear interpolation. A parameter

$$\gamma = \frac{n_{interp}}{n_{sys}},\tag{8}$$

can be used to decide the amount of interpolation error that can be introduced depending on the receiver SNR. The effects of interpolation error on BER can be minimized by keeping $\gamma$ small, which enables the link adaptive processor to increase interpolation distances at lower SNRs and to adaptively lower the distances at higher SNRs. Fig. 5(a) shows the dependence of the error $n_{interp}$ obtained by interpolating $\mathbf{Q}$ matrices using (5) between subcarriers spaced at different correlation levels for the three channel models used in LTE-A. The average error due to interpolation in an EPA channel model is on the order of $10^{-3}$ at correlation levels of 75%, whereas the error in the fast fading ETU model is almost double that of the EPA model.

The following example, based on Fig. 5(a) and Fig. 5(b), illustrates how a link adaptive processor can be used to reduce total computational complexity. An SNR of 20 dB at the receiver in a $4\times4$ MIMO system receiving 16 QAM data would correspond to an $n_{sys}$ of $10^{-2}$. A link adaptive processor configured to operate with a value of $\gamma=0.1$ can operate with $n_{interp}=10^{-3}$. If the current operating channel is detected to be EPA, the processor will choose a correlation value of 75% in Fig. 5(a) corresponding to an interpolation distance of 24



Fig. 6: BER of an uncoded 16QAM system

subcarriers in Fig. 5(b). Using a similar strategy, if operating in an ETU channel, correlation of 85% is chosen leading to interpolation distances of around 10 subcarriers. A lower value of SNR will result in higher  $n_{sys}$  enabling the link adaptive processor to choose subcarriers which have a lower correlation value leading to increased interpolation distances.
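The first step of this strategy, turning a receiver SNR and a chosen $\gamma$ into an admissible interpolation error, can be sketched as follows. The sketch assumes unit average signal power, so that the system noise equals $10^{-\mathrm{SNR_{dB}}/10}$; the function name is hypothetical. Mapping the resulting error budget to a correlation level and a subcarrier distance still requires the measured curves of Fig. 5, which are not reproduced here.

```python
def noise_budget(snr_db, gamma):
    """Return (n_sys, n_interp) for a given SNR in dB and ratio gamma from (8).

    Assumes unit signal power, so n_sys = 10^(-SNR/10).
    """
    n_sys = 10.0 ** (-snr_db / 10.0)
    n_interp = gamma * n_sys          # rearranged from gamma = n_interp / n_sys
    return n_sys, n_interp

if __name__ == "__main__":
    for snr in (10, 20, 30):
        n_sys, n_interp = noise_budget(snr, gamma=0.1)
        print(f"SNR {snr} dB: n_sys = {n_sys:.0e}, allowed n_interp = {n_interp:.0e}")
```

A lower SNR yields a larger admissible $n_{interp}$, which is what allows longer interpolation distances at low SNR.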

#### IV. RESULTS

#### A. Performance

The methodology described in the previous sections enables us to choose interpolation distances adaptively to reduce complexity while maintaining the required level of BER. Table I shows an example implementation with different interpolation distances chosen for the three LTE-A channel models depending on the SNR available at the receiver. The interpolation distances are chosen so that minimal loss to BER is introduced when compared to the performance with perfect channel state information (CSI).

TABLE I: Interpolation distances measured in subcarriers

| Model / SNR (dB)  | ≤ 10 | 11 - 15 | 16-20 | 21 - 25 | 26-30 | > 30 |
|-------------------|------|---------|-------|---------|-------|------|
| EPA               | 48   | 32      | 24    | 16      | 8     | 8    |
| EVA               | 32   | 24      | 16    | 8       | 8     | 4    |
| ETU               | 32   | 24      | 16    | 8       | 4     | 2    |
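A link adaptive controller could hold Table I as a small lookup structure. The following is a sketch only: the names `SNR_UPPER_EDGES_DB`, `DISTANCES` and `interpolation_distance` are hypothetical, and the handling of non-integer SNR values (assigned to the next higher column) is an assumption, since the table lists integer ranges.

```python
import bisect

# Upper edges (in dB) of the first five SNR columns of Table I; the last column is "> 30".
SNR_UPPER_EDGES_DB = [10, 15, 20, 25, 30]

# Interpolation distances in subcarriers, copied from Table I.
DISTANCES = {
    "EPA": [48, 32, 24, 16, 8, 8],
    "EVA": [32, 24, 16, 8, 8, 4],
    "ETU": [32, 24, 16, 8, 4, 2],
}

def interpolation_distance(model, snr_db):
    """Return the Table I interpolation distance for a channel model and SNR."""
    # bisect_left keeps an SNR equal to a column edge inside that column.
    column = bisect.bisect_left(SNR_UPPER_EDGES_DB, snr_db)
    return DISTANCES[model][column]

if __name__ == "__main__":
    for model in DISTANCES:
        print(model, [interpolation_distance(model, snr) for snr in (5, 15, 20, 35)])
```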

Fig. 6 shows two sets of uncoded BER curves for a 4 × 4 MIMO system with a K-Best decoder with K=10 operating in the EVA channel. The first set of curves is obtained by using perfect CSI and full channel QRD along with the proposed adaptive QRD (AQRD) with distances chosen from Table I. These curves enable us to analyze the effects of $n_{interp}$ on BER and it can be seen that the performance loss is negligible when choosing the proposed interpolation distances. The second set of curves is obtained by using different channel estimation techniques and QRD interpolation methods. The DFT based estimator [2] with the proposed adaptive QRD shows significant degradation, with an error floor visible at higher SNRs due to spectral leakage. Use of the proposed windowed DFT based estimator and adaptive QRD improves the performance at higher SNRs and is on par with the performance of a receiver employing the RMMSE estimator [14] with the QRD interpolator from [5].

#### B. Complexity analysis

A DFT based channel estimator used in the traditional flow of Fig. 2(a) for an $M \times M$ MIMO system would require $(N_p \log_2 N_p + N_d \log_2 N_d) M^2$ multiplications. Using the proposed windowed DFT instead, and choosing the number of channel estimates $X$, changes the multiplication count of channel estimation to $(3 N_p \log_2 N_p + X \log_2 X) M^2$. Furthermore, the complexity of the QRDs computed also reduces from $N_d M^3$ to $X M^3$.
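These expressions can be evaluated directly to estimate the relative saving. The sketch below treats the $N_d M^3$ and $X M^3$ terms as multiplication counts and simply adds them to the channel estimation counts, which is an assumption made for illustration. The parameter values in the usage example ($N_p = 128$, $N_d = 512$, $X = 16$, $M = 4$) are hypothetical and are not taken from the measurements behind Fig. 7.

```python
import math

def traditional_cost(n_p, n_d, m):
    """DFT based channel estimation plus one QRD per data tone."""
    ce = (n_p * math.log2(n_p) + n_d * math.log2(n_d)) * m ** 2
    qrd = n_d * m ** 3
    return ce + qrd

def adaptive_cost(n_p, x, m):
    """Windowed DFT channel estimation at X positions plus X QRDs."""
    ce = (3 * n_p * math.log2(n_p) + x * math.log2(x)) * m ** 2
    qrd = x * m ** 3
    return ce + qrd

def relative_saving(n_p, n_d, x, m):
    """Fraction of multiplications saved by the adaptive flow."""
    return 1.0 - adaptive_cost(n_p, x, m) / traditional_cost(n_p, n_d, m)

if __name__ == "__main__":
    # Hypothetical parameters, chosen only to exercise the formulas.
    print(f"saving: {relative_saving(128, 512, 16, 4):.1%}")
```

Because the windowed DFT triples the pilot-side term, the saving depends on how small $X$ can be made relative to $N_d$, which is set by the interpolation distance.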

Fig. 7 shows the savings obtained by using the link adaptive processor over the traditional tone by tone approach, with interpolation distances chosen from Table I. The link adaptive processor is designed to closely follow the BER obtained when operating with perfect CSI. Higher savings are obtained at lower SNRs, as longer interpolation distances can be chosen, whereas this gain reduces at higher SNRs. Furthermore, EPA channels need a significantly lower number of computations, resulting in higher savings when compared to ETU channels. The shaded region in Fig. 7 indicates the range of possible reductions when operating in different channel conditions.

#### C. Hardware requirements

The main advantage of DFT based channel estimators is their efficient hardware implementation. A typical $N_d$ point radix-2 FFT is implemented using a pipelined structure [15]. The output samples of the pipelined decimation in frequency (DIF) FFT algorithm appear in bit-reversed order: the first $X=2^t$ outputs, where $t\in\mathbb{Z}$, are bins spaced at distances of $\frac{N_d}{X}$, and the next $X$ outputs fill in the remaining bins at a resolution of $\frac{N_d}{2X}$. This enables selective channel estimation at only the desired frequency bins. For example, when operating in the EPA channel with an SNR of 15 dB, from Table I, channel estimates which are spaced 32 subcarriers apart are needed. The first 16 outputs from a pipelined 512 point radix-2 DIF FFT are at the bins $[0,32,64\ldots]$ corresponding to the desired channel estimates. This enables circuit level optimizations such as clock gating, which can be activated once the desired channel estimates are calculated, leading to power savings.
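The bit-reversed output order can be checked with a short sketch. It only computes the bin index produced at each output position of a radix-2 DIF FFT and does not model the pipelined datapath; the function names are hypothetical.

```python
def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of the integer k."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (k & 1)
        k >>= 1
    return out

def first_output_bins(n_fft, count):
    """Frequency bins delivered by the first `count` outputs of a bit-reversed FFT."""
    bits = n_fft.bit_length() - 1      # n_fft is assumed to be a power of two
    return [bit_reverse(k, bits) for k in range(count)]

if __name__ == "__main__":
    bins = first_output_bins(512, 16)
    print("output order:", bins)
    print("sorted bins :", sorted(bins))
```

For a 512 point FFT, the first 16 outputs cover exactly the multiples of 32, although not in ascending order, so the remaining pipeline stages can be gated once these outputs have been produced.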

QRD interpolation can be implemented using the methods described in [8], [9], and the proposed linear QRD interpolation can be performed using only fixed multiplications and additions, leading to a negligible increase in complexity.

#### V. CONCLUSION

The proposed link adaptive processor is capable of decreasing the complexity of both channel estimation and QRD, which are two important baseband signal processing operations. The processor utilizes a windowed DFT based channel estimator which not only suppresses the error floor due to spectral leakage but is also capable of producing channel estimates at specified interpolation distances. Channel properties are used to identify these interpolation distances, which add a minimal loss to BER while enabling the use of a simple QRD interpolation technique. The BER performance of the proposed link adaptive processor is on par with existing systems while providing a reduction in complexity of up to 40% at higher SNRs and 80% at lower SNRs. Hence, the link adaptive processor is an attractive solution for adaptive processing in varying channel conditions for mobile LTE-A receivers.



Fig. 7: Range of possible reduction in multiplications

#### ACKNOWLEDGMENT

This work is a part of the DARE project and the authors would like to thank Lund University and the Swedish Foundation for Strategic Research.

#### REFERENCES

- [1] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mbps," in *Proc. IEEE Int. Symp. on Circuits Syst.*, May 2006, pp. 1151–1154.
- [2] J.-J. van de Beek, O. Edfors, M. Sandell, S. Wilson, and P. Ola Borjesson, "On channel estimation in OFDM systems," in *IEEE 45th Veh. Technology Conf.*, Jul 1995, pp. 815–819.
- [3] J. Löfgren, L. Liu, O. Edfors, and P. Nilsson, "Improved Matching-Pursuit Implementation for LTE Channel Estimation," *IEEE Trans. Circuits Syst. I*, vol. 61, pp. 226–237, Jan 2014.
- [4] G. H. Golub and C. F. Van Loan, Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
- [5] P.-L. Chiu, L.-Z. Huang, L.-W. Chai, and Y.-H. Huang, "Interpolation-Based QR Decomposition and Channel Estimation Processor for MIMO-OFDM System," *IEEE Trans. Circuits Syst. I*, vol. 58, pp. 1129–1141, May 2011.
- [6] O. Edfors, J.-J. van de Beek, M. Sandell, S. Wilson, and P. Ola Borjesson, "Analysis of DFT-based channel estimators for OFDM," Lulea Univ. Technol., Tech. Rep. vol. 17, 1996.
- [7] B. Yang, Z. Cao, and K. Letaief, "Analysis of low-complexity windowed DFT-based MMSE channel estimator for OFDM systems," *IEEE Trans. Commun.*, vol. 49, pp. 1977–1987, Nov 2001.
- [8] Z.-Y. Huang and P.-Y. Tsai, "Efficient Implementation of QR Decomposition for Gigabit MIMO-OFDM Systems," *IEEE Trans. Circuits Syst. I*, vol. 58, Oct 2011.
- [9] P. Luethi, C. Studer, S. Duetsch, E. Zgraggen, H. Kaeslin, N. Felber, and W. Fichtner, "Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison," in *IEEE Asia Pacific Conf. Circuits Syst.*, Nov 2008, pp. 830–833.
- [10] D. Cescato and H. Bolcskei, "QR Decomposition of Laurent Polynomial Matrices Sampled on the Unit Circle," *IEEE Trans. Inf. Theory*, vol. 56, pp. 4754–4761, Sept 2010.
- [11] L. Wilhelmsson, I. Diaz, T. Olsson, and V. Owall, "Analysis of a novel low complex SNR estimation technique for OFDM systems," in *IEEE Wireless Commun. and Networking Conf.*, March 2011, pp. 1646–1651.
- [12] N. Czink, B. Bandemer, C. Oestges, T. Zemen, and A. Paulraj, "Subspace Modeling of Multi-User MIMO Channels," in *IEEE Veh. Technology Conf.*, Sept 2011, pp. 1–5.
- [13] A. Molisch, Wireless Communications. Wiley, 2005.
- [14] F. Foroughi, J. Lofgren, and O. Edfors, "Channel estimation for a mobile terminal in a multi-standard environment (LTE and DVB-H)," in 3rd Int. Conf. Signal Process. and Commun. Syst., Sept 2009, pp. 1–9.
- [15] S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in *IEEE Proc. Custom Integrated Circuits Conf.*, May 1998, pp. 131–134.

## Paper V

### An Adaptive QR Decomposition Processor for Carrier Aggregated LTE-A in 28 nm FD-SOI

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Lund university's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications\_standards/publications/rights\_link.html to learn how to obtain a License from RightsLink.

R. Gangarajaiah, O. Edfors, and L. Liu, "An Adaptive QR Decomposition Processor for Carrier Aggregated LTE-A in 28 nm FD-SOI," © 2017 IEEE, accepted for publication in the *IEEE Transactions on Circuits and Systems-I: Regular Papers*, 2017.

# An Adaptive QR Decomposition Processor for Carrier Aggregated LTE-A in 28nm FD-SOI

Rakesh Gangarajaiah, Student Member, IEEE, Ove Edfors, Member, IEEE, and Liang Liu, Member, IEEE

Abstract-This paper presents an adaptive QR decomposition (QRD) processor for five-band carrier aggregated LTE-A downlinks. The design uses time and frequency correlation properties of wireless channels to reduce QRD computations while maintaining an uncoded bit error rate loss below 1 dB. An analysis on the performance of a linear interpolating QRD is presented and optimum distances for different channel conditions are suggested. The Householder transform suited for spatially correlated scenarios is chosen and modified for concurrent vector rotations resulting in high throughput. Based on these, a parallel hardware architecture suitable for easy reconfigurability and low power is developed and fabricated in 28 nm FD-SOI technology. The QRD unit occupies 205 k gates of logic and has a maximum throughput of 22 M QRD/s while consuming 29 mW of power. On a circuit level, the back gate feature is leveraged to double operational frequency in low time-frequency correlation channels or to lower power consumption to 1.9 mW in favorable conditions. The proposed system provides designers with multiple levels of adaptive control from architectural to circuit level for power-performance trade-offs and is well suited for mobile devices operating on limited battery energy.

Index Terms-LTE-A, QRD, Adaptive processing.

#### I. INTRODUCTION

Spatial multiplexing by the multiple-input multiple-output (MIMO) technique is supported by the 3GPP Long Term Evolution-Advanced (LTE-A), either to increase effective signal to noise ratio (SNR) at the user equipment (UE) or to create parallel channels from the evolved NodeB (eNB). To further increase communication speeds, carrier aggregation (CA) has been introduced, where up to five bands of 20 MHz bandwidth (BW) are combined to increase downlink capacity, resulting in theoretical speeds of around 3 Gbps [1], [2]. While the standard provides the necessary framework to achieve these high data rates, complex signal processing is required in a mobile UE to fully utilize the potential of these wireless technologies.

One of the prerequisites for high quality MIMO communication is efficient signal detection, accomplished by algorithms ranging from the zero-forcing (ZF) linear detector to the more complicated non-linear sphere decoder and QRD-M detectors [3]. A common component to facilitate the detection process is a QR decomposition (QRD) processor which transforms the estimated channel matrix into a product of a unitary matrix and an upper triangular matrix. There are three main algorithms for QRD, namely the Gram-Schmidt (GS), the Givens rotation (GR) and the Householder transform (HT) [4]. The last is better suited for correlated matrices [5], which become more common in $8\times 8$ MIMO configurations.

The QRD is computationally expensive, has a significant effect on system performance, and requires a large throughput proportional to the UE bandwidth. This raises critical design challenges when implementing a high-throughput, high-accuracy QRD processor that is also energy efficient. In wireless communication, the achievable data rates and system performance are affected by channel conditions. A fixed design aimed at the worst case, e.g., for a wide BW and low SNR scenario, results in an inefficient solution. One approach to deal with the conflicting requirements on power and performance is adaptive channel-aware signal processing with the corresponding reconfigurable hardware. This will enable the UE to maximize energy efficiency by performing only the minimum computations required to reach a desired quality of service (QoS), while retaining the scalability to handle wider BW allocations in good channel conditions.

We present such an adaptive processor supporting the maximum resource blocks (RB) assignment with five-band CA, but optimize it for normal operating scenarios. We use linear interpolation to reduce computation count and control the magnitude of introduced errors by adaptively changing the interpolation distance based on channel conditions and SNR. A more complicated algorithm to perform lossless QRD interpolation was proposed in [6], [7]. Although these methods reduce computation count compared with the brute force method of performing a QRD for each tone of every symbol, they still result in a high number of computations, especially in CA scenarios due to their non-adaptive nature. Consequently, a method using the frequency domain (FD) correlation of LTE-A channels to further reduce complexity was presented in [8]. Expanding on this, we present an analysis into time domain (TD) correlations of LTE-A channels and propose an architecture for a linear interpolating QRD processor. On an algorithmic level, we choose the HT based QRD and modify the implementation to produce parallel vector rotations facilitating high throughput. On the architectural level, optimizations for hardware re-use are explored and an implementation favoring higher power reduction with state-of-the-art 28 nm Fully-Depleted silicon on insulator (FD-SOI) technology is chosen [9]. On a circuit level, forward body biasing (FBB) to adaptively increase throughput in high RB assignment scenarios together with clock gating in good channel conditions are used to conserve power. The full system consists of one QRD unit capable of decomposing  $4 \times 4$  complex matrices, two rotation banks, an interpolation unit, a functional control unit and test logic. The design requires 1.2 mm<sup>2</sup> of core area and the QRD unit occupies 205 k gates. 
Measurement results show that the QRD unit consumes 29 mW when producing 22 M QRD/s and can decode CA signals with a power varying from 1.9 mW in good channels to 24.3 mW in fast varying channels.

The rest of the paper is organized as follows. In Section II, the system model with LTE-A channels and RB structure are introduced. Section III presents the adaptive QRD scheme with its performance evaluation. Section IV details the VLSI architectures and hardware aspects. Measurement results are discussed in Section V followed by conclusions in Section VI.

#### II. BACKGROUND

In this section we introduce a simplified MIMO system model and present the need for QRD followed by a brief discussion on the LTE-A resource block structure and channel models.

#### A. System Model

A typical MIMO system with 4 antennas at both the eNB and the mobile UE is shown in Fig. 1. Spatially multiplexed data transmitted from the eNB, propagates through a noisy multi-path channel and is finally received by the UE. The analog front end (AFE) digitizes the signal which is synchronized, equalized and demodulated for the application layer. Figure 1 also shows where the QRD unit operates in a typical LTE-A receiver chain, processing data from the channel estimation (CE) unit and feeding data to the MIMO decoder. The operation of such a system with M transmitter (Tx) and receiver (Rx) antennas can be modeled as

$$\mathbf{y}_i = \mathbf{H}_i \mathbf{x}_i + \mathbf{n}_i, \quad \forall i \in \{1, 2, \dots, BW_{sc}\} \tag{1}$$

where  $\mathbf{y}_i = [y_1, y_2, ..., y_M]^\mathsf{T}$  is the received data vector,  $\mathbf{H}_i \in \mathbb{C}^{M \times M}$  is the channel matrix,  $\mathbf{x}_i = [x_1, x_2, ..., x_M]^\mathsf{T}$  the transmit data vector, and  $\mathbf{n}_i$  the additive white Gaussian noise on sub-carrier i in an orthogonal frequency division multiplexing (OFDM) system with  $BW_{sc}$  sub-carriers.

A maximum-likelihood (ML) MIMO detector tries to minimize  $\|\mathbf{y}_i - \hat{\mathbf{H}}_i \hat{\mathbf{x}}_i\|_2$ , where  $\hat{\mathbf{x}}_i$  is the transmit vector estimate and  $\hat{\mathbf{H}}_i$  is the channel matrix estimated from a set of pilot tones. In practice, the K-Best detector and the sphere decoder are widely used to achieve ML performance with lower complexity. To simplify implementation of the K-Best detector,  $\hat{\mathbf{H}}_i$  is decomposed to a unitary matrix  $\mathbf{Q}_i$  and an upper triangular matrix  $\mathbf{R}_i$  by the process of QRD. The received data vector is then rotated by the conjugate transpose of  $\mathbf{Q}$ , the matrix  $\mathbf{Q}^*$ , thereby transforming (1) into

$$\begin{aligned} \mathbf{y} &= \mathbf{Q}\mathbf{R}\mathbf{x} + \mathbf{n} \\ \mathbf{Q}^*\mathbf{y} &= \mathbf{R}\mathbf{x} + \mathbf{Q}^*\mathbf{n} \\ \mathbf{y}_{rot} &= \mathbf{R}\mathbf{x} + \widehat{\mathbf{n}}, \end{aligned} \tag{2}$$

where the sub-carrier index i has been dropped for brevity of notation. While the QRD enables efficient hardware realization of sphere decoders, its accuracy affects the detection performance to a large extent. For instance, the orthonormality of  ${\bf Q}$  is crucial to preserve the i.i.d. properties of  $\widehat{\bf n}$ , which becomes particularly challenging for ill-conditioned channel matrices. The QRD of a rank p matrix has complexity in the order of  $\mathcal{O}(p^3)$ , and calculating it on a per-tone per-symbol basis results in a very high number of complex computations.



Figure 1: A typical LTE-A MIMO system with a QRD unit

#### B. LTE-A Resource Block and Channel Models

A typical LTE-A RB structure for one Tx antenna port is shown in Fig. 2. Every RB has 7 OFDM symbols corresponding to a time period of 0.5 ms and 12 sub-carriers. Pilot tones are added at regular intervals in both time and frequency for CE and synchronization. The minimum downlink resource assignment is two RBs and multiple RBs can be allocated depending on service requirements. LTE-A channels are classified into the Extended Typical Urban (ETU), Extended Vehicular A (EVA), and Extended Pedestrian A (EPA) models [10]. The ETU channel has a large delay spread and thus high frequency selectivity, whereas the EPA channel with a short delay spread has low frequency selectivity. The EVA model is intended for channel scenarios with a medium delay spread, extending to around half the length of the cyclic prefix. The channels are also categorized based on the Doppler shift, namely the 5 Hz model for low mobility of 2.5 km/h, 70 Hz model corresponding to a UE mobility of around 36 km/h and a high speed scenario with a Doppler shift of 300 Hz at speeds of 150 km/h when operating at a carrier frequency of 2.1 GHz.
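The correspondence between the Doppler shifts and UE speeds quoted above follows from the standard maximum Doppler relation $f_D = v f_c / c$. The following sketch, with a hypothetical function name, reproduces the approximate speeds for a 2.1 GHz carrier.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def ue_speed_kmh(doppler_hz, carrier_hz):
    """UE speed in km/h that produces a given maximum Doppler shift."""
    speed_ms = doppler_hz * SPEED_OF_LIGHT / carrier_hz
    return speed_ms * 3.6

if __name__ == "__main__":
    for f_d in (5, 70, 300):
        print(f"{f_d:>3} Hz -> {ue_speed_kmh(f_d, 2.1e9):.1f} km/h")
```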

The pilot spacing in Fig. 2 is designed to handle worst case fluctuations in both time and frequency which can lead to a high degree of redundancy when operating in channels with low frequency selectivity and Doppler shift. The redundancy can be leveraged by the UE to adaptively reduce computation load in favorable channels, e.g., by performing the minimum required computations to reach a desired QoS. This adaptive channel aware methodology can be applied to different processing blocks for power saving, including CE, QRD, and MIMO detection.

In this paper, we apply this concept to the QRD processor design, where the frequency and time correlation properties of wireless channels are exploited to perform adaptive QRD that can significantly reduce power consumption with a small performance loss.

#### III. PROPOSED ADAPTIVE QRD SCHEME

In this section, we propose a hardware friendly adaptive QRD scheme, based on linear interpolation, which can conveniently tune the trade-off between computational complexity and processing accuracy with a single parameter. Then we present a framework to obtain a suitable tuning parameter under different channel conditions, system setups, and QoS requirements. The proposed scheme is evaluated and compared to the brute force approach to investigate the computation-performance gains.



Figure 2: LTE-A Resource Block structure with data flow in the proposed method when operating in the interpolation by 4 mode

#### A. QRD with Adaptive Linear Interpolation

Interpolation-based QRD [6] has been proposed to reduce the computational complexity and a hardware implementation of this concept was discussed in [11], where a fixed interpolation distance was adopted. These methods rely on mapping and demapping a set of intermediate matrices to different domains, which may introduce extra hardware cost and are non-adaptive in nature. To enable low complexity channel adaptation, this paper takes advantage of the fact that wireless systems are error tolerant (to some extent) and adopts a lossy, but much simpler linear interpolation strategy.

A unitary matrix  $\mathbf{Q}_1$  from the QRD of channel estimate  $\mathbf{H}_1$  in Fig. 2, can be transformed into another unitary matrix  $\mathbf{Q}_5$  corresponding to  $\mathbf{H}_5$  with a rotation matrix  $\mathbf{Q}_5\mathbf{Q}_1^*$ . Authors in [12] present a method of obtaining the intermediate  $\mathbf{Q}$  matrices between  $\mathbf{Q}_1$  and  $\mathbf{Q}_5$  using

$$\mathbf{Q}(\alpha) = (\mathbf{Q}_5\mathbf{Q}_1^*)^{\alpha}\mathbf{Q}_1, \quad \alpha \in \mathbb{R},\ 0 \le \alpha \le 1. \tag{3}$$

If  $\parallel {\bf Q}_1 - {\bf Q}_5 \parallel_F = \epsilon$ , where  $\epsilon$  is a small constant, (3) can be approximated as

$$\mathbf{Q}(\alpha) = (1 - \alpha) \cdot \mathbf{Q}_1 + \alpha \cdot \mathbf{Q}_5. \tag{4}$$

The  $\mathbf{R}(\alpha)$  matrix can also be calculated as

$$\mathbf{R_{interp}}(\alpha) = (1 - \alpha) \cdot \mathbf{R}_1 + \alpha \cdot \mathbf{R}_5. \tag{5}$$

The errors introduced due to the linear approximation can be controlled by  $\alpha$ , related to the distance D (can be in both time and frequency domain) between the two channel estimates  $\mathbf{H_1}$  and  $\mathbf{H_5}$ , used to produce  $\mathbf{Q_1}$  and  $\mathbf{Q_5}$ . Instead of using (2), the received vector  $\mathbf{y}$ , at any given intermediate position, can be rotated using

$$\mathbf{y_{rot}} = (1 - \alpha) \cdot \mathbf{Q_1^* y} + \alpha \cdot \mathbf{Q_5^* y}. \tag{6}$$

Figure 2 depicts the data flow through such an interpolating scheme operating in the FD. Channel estimation and the subsequent QRDs are performed only on a selected subset of sub-carriers. The generated QRDs are linearly combined using (4) and (5) to produce the  $\mathbf{Q}$  and  $\mathbf{R}$  matrices at the intermediate positions. If the final result required is the rotated vector in (6), e.g., for K-Best detector, the computation in (4)



Figure 3: Flow chart describing the proposed framework

can be skipped. When operating with an interpolation distance D, each $\mathbf{Q}$ matrix rotates at most 2D-1 received data vectors. Figure 2 shows the case with D=4 when rotating the data vector $\mathbf{y}_2$. Partial CE at sub-carriers 1 and 5 on symbol 13 is performed to produce $\mathbf{H}_1$ and $\mathbf{H}_5$. The QRDs of these two estimates are used to rotate the data tones highlighted. In particular, $\mathbf{y}_2$ is rotated both by $\mathbf{Q}_1$ and $\mathbf{Q}_5$. Since the position of $\mathbf{y}_2$ is closer to estimate $\mathbf{H}_1$, a linear weight of 0.75 is applied for $\mathbf{Q}_1^*\mathbf{y}_2$ and 0.25 for $\mathbf{Q}_5^*\mathbf{y}_2$ to produce the final rotated vector shown to the right. Adaptability is achieved by changing the interpolation distance by tracking channel variations and SNR, leading to a reduced QRD throughput when interpolating by longer distances and vice versa.
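The weighting in (6) for the D=4 example can be sketched as follows. This is a floating-point illustration with hypothetical function names. It assumes that $\alpha$ is the normalized position of the data tone between the two estimates, which is consistent with the 0.75 and 0.25 weights quoted above, and that matrices are given as nested lists of complex numbers.

```python
def interpolation_weights(pos, left, right):
    """Weights applied to the rotations by Q_left^* and Q_right^* in (6)."""
    alpha = (pos - left) / (right - left)
    return 1.0 - alpha, alpha

def hermitian_times_vector(q, y):
    """Compute Q^* y for a square matrix q and a vector y."""
    n = len(y)
    return [sum(q[r][c].conjugate() * y[r] for r in range(n)) for c in range(n)]

def rotate_interpolated(q_left, q_right, y, pos, left, right):
    """Eq. (6): linearly combine the two rotated versions of y."""
    w_left, w_right = interpolation_weights(pos, left, right)
    a = hermitian_times_vector(q_left, y)
    b = hermitian_times_vector(q_right, y)
    return [w_left * u + w_right * v for u, v in zip(a, b)]

if __name__ == "__main__":
    # Sub-carrier 2 between the estimates at sub-carriers 1 and 5 (D = 4).
    print(interpolation_weights(2, 1, 5))
```

Since only two matrix-vector products and a weighted sum are needed per data tone, the interpolated $\mathbf{Q}$ matrix in (4) never has to be formed explicitly when the rotated vector is the only required output.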

#### B. Householder Transform based QRD

The next step is to select an appropriate algorithm for performing the QRD. There are three main methods. Hardware implementations based on the GS and GR methods are presented in [13]–[15], whereas [16] presents an approach that combines the Householder transform (HT) and the GR algorithm. Although any of the three algorithms can be used in the proposed adaptive QRD framework, the HT is chosen in this work for the following reasons. Mobile UEs with small form factors operating in poor scattering channels result in



Figure 4: Interpolation error, Sub-carrier distances and Correlation

increased spatial correlation. QRD performed on such ill-conditioned channel matrices often leads to instability, which adversely affects the system performance. The HT based QRD is better suited to decompose such correlated channels in fixed point implementations when compared with the GS method [5]. Furthermore, HT operates on columns of the input matrix and thus enables parallel vector rotation, as opposed to the GR method. Additionally, HT with a sphere decoder can result in lower complexity in higher MIMO configurations [17]. This work adopts the real valued decomposition (RVD) version of the HT as it simplifies hardware and provides at least the same gain as direct complex valued decomposition [18], [19]. A complex valued matrix $\mathbf{H}$ can be represented by an equivalent but larger real valued matrix $\mathbf{H}_{real}$ by replacing each complex element in $\mathbf{H}$ by a $2\times 2$ real matrix [17].
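The element-wise replacement can be sketched as follows. The sketch assumes one common convention, in which a complex element $a + jb$ becomes the block $\begin{bmatrix} a & -b \\ b & a \end{bmatrix}$ and vectors are stored with interleaved real and imaginary parts; the exact arrangement used in [17] may differ, and the function names are hypothetical.

```python
def real_valued_decomposition(h):
    """Replace each complex element a+jb of h by the 2x2 real block [[a, -b], [b, a]]."""
    out = []
    for row in h:
        top, bottom = [], []
        for z in row:
            z = complex(z)
            top += [z.real, -z.imag]
            bottom += [z.imag, z.real]
        out += [top, bottom]
    return out

def interleave(x):
    """Real representation of a complex vector: [Re x1, Im x1, Re x2, Im x2, ...]."""
    out = []
    for z in x:
        z = complex(z)
        out += [z.real, z.imag]
    return out

if __name__ == "__main__":
    for row in real_valued_decomposition([[1 + 2j, 3 - 1j], [0.5j, 2]]):
        print(row)
```

With this convention, multiplying $\mathbf{H}_{real}$ by the interleaved real vector gives the interleaved real form of the complex product, so the $M \times M$ complex problem maps to a $2M \times 2M$ real one.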

The HT operates columnwise on the channel matrix  $\mathbf{H}$  and produces a series of reflection matrices. Left-multiplying the real valued equivalent representation,  $\mathbf{H}_{real}$ , by these matrices results in a real valued upper triangular matrix  $\mathbf{R}_{real}$  [17]. This orthogonalization process is described by

$$\mathbf{V}_{N}\mathbf{V}_{N-1}\cdots\mathbf{V}_{2}\mathbf{V}_{1}\mathbf{H}_{real} = \mathbf{R}_{real},\tag{7}$$

where each of the $\mathbf{V}_i$ matrices is of the form

$$\mathbf{V}_{i} = \left(\mathbf{I} - 2\frac{\mathbf{v}_{i}\mathbf{v}_{i}^{\mathsf{T}}}{\mathbf{v}_{i}^{\mathsf{T}}\mathbf{v}_{i}}\right),\tag{8}$$

and $\mathbf{v}_i$ is the difference vector from the column of $\mathbf{H}_{real}$ that is being orthogonalized to one of the columns of the identity matrix $\mathbf{I}$. The transpose operation is denoted by $(.)^{\mathsf{T}}$. Since the final goal is to solve (2), the rotated real valued equivalent data vector $\mathbf{Q}^{\mathsf{T}}\mathbf{y}$ can be constructed, without explicitly calculating the full $\mathbf{Q}$ matrix, as

$$\begin{aligned} \mathbf{Q}^{\mathsf{T}} \mathbf{y} &= \mathbf{V}_{N} \cdots \mathbf{V}_{1} \mathbf{y} \\ &= \left( \mathbf{I} - 2 \frac{\mathbf{v}_{N} \mathbf{v}_{N}^{\mathsf{T}}}{\mathbf{v}_{N}^{\mathsf{T}} \mathbf{v}_{N}} \right) \cdots \left( \mathbf{I} - 2 \frac{\mathbf{v}_{1} \mathbf{v}_{1}^{\mathsf{T}}}{\mathbf{v}_{1}^{\mathsf{T}} \mathbf{v}_{1}} \right) \mathbf{y}. \end{aligned} \tag{9}$$

Examining (9), we notice that the difference vector  $\mathbf{v}_1$ , corresponding to the first column of  $\mathbf{H}_{real}$  performs rotations on data vector  $\mathbf{y}$ , before the difference vectors from other columns. Thus, an implementation where the rotations on received vector  $\mathbf{y}$  are started before the full QRD is finished can be realized, leading to a low-latency pipelined circuit.
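The structure of (7) to (9) can be sketched in software as follows. This is a minimal floating-point illustration for real square matrices, not the fixed-point hardware datapath of this work. The function names are hypothetical, and the sign used when forming each $\mathbf{v}_i$ follows the usual numerically stable choice, which is an assumption not stated in the text.

```python
import math

def reflect(v, y):
    """Apply (I - 2 v v^T / (v^T v)) to y, as in (8), without forming the matrix."""
    scale = 2.0 * sum(a * b for a, b in zip(v, y)) / sum(a * a for a in v)
    return [yi - scale * vi for yi, vi in zip(y, v)]

def householder(h):
    """Return (R, vs): the upper triangular factor and the reflection vectors v_1, v_2, ..."""
    n = len(h)
    cols = [[h[i][j] for i in range(n)] for j in range(n)]   # column-wise working copy
    vs = []
    for k in range(n - 1):
        # Part of column k on and below the diagonal; rows above are left untouched.
        x = [cols[k][i] if i >= k else 0.0 for i in range(n)]
        norm_x = math.sqrt(sum(e * e for e in x))
        if norm_x == 0.0:
            continue
        v = x[:]
        v[k] += math.copysign(norm_x, x[k])                  # difference to a scaled unit vector
        vs.append(v)
        cols = [reflect(v, c) for c in cols]
    r = [[cols[j][i] for j in range(n)] for i in range(n)]
    return r, vs

def rotate(vs, y):
    """Eq. (9): apply the reflections to y in the order v_1, v_2, ..."""
    for v in vs:
        y = reflect(v, y)
    return y

if __name__ == "__main__":
    h = [[4.0, 1.0, 2.0], [2.0, 3.0, 0.0], [1.0, -1.0, 5.0]]
    r, vs = householder(h)
    print("R =", [[round(e, 4) for e in row] for row in r])
    print("Q^T y =", [round(e, 4) for e in rotate(vs, [1.0, 2.0, 3.0])])
```

Because `rotate` consumes the reflection vectors in the order in which `householder` produces them, a hardware implementation can begin rotating data vectors once $\mathbf{v}_1$ is known, which is the pipelining opportunity noted above.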



Figure 5: Interpolation error, Symbol distances and Correlation

#### C. Interpolation Distances and Performance Evaluation

By adjusting the interpolation distance (D), the proposed linear interpolating QRD provides an adaptive trade-off between computational complexity and decomposition accuracy. Figure 3 describes a methodology to select D, where the available SNR and channel conditions are jointly considered to find a maximum allowed QRD error. From this limit, the interpolation distance measured in sub-carriers in the FD $(D_{sc})$ is selected from a set of predefined values, computed by analyzing the frequency correlation properties of different channels. Figure 4(b) shows the FD correlation in different channel models. Adjacent sub-carriers are highly correlated and the correlation level decreases as the distance between sub-carriers increases. The rate at which correlation reduces is directly proportional to frequency selectivity. For example, at a distance of 15 sub-carriers, the frequency selective ETU channel has an average correlation of around 75%, whereas the EPA channel shows 90% correlation. If interpolation in the TD is desired, the distance measured in symbols $(D_{sym})$, obtained from the TD correlation properties in Fig. 5(b), can be used. The correlation level between symbols reduces at a faster rate in the fast fading 300 Hz channel compared to the other two slow fading channels. The proposed framework in Fig. 3 shows that drastic changes in channel conditions will trigger a recalculation of $D_{sym}$ and $D_{sc}$, whereas BW changes can be handled at circuit level with parameters such as clock frequency and $V_{DD}$. In order to find $D_{sc}$ or $D_{sym}$, the following method is proposed. Assuming that the channel noise and interpolation error are uncorrelated, the noise variance in a receiver using interpolation techniques can be expressed as

$$\sigma_n^2 = \sigma_{nsys}^2 + \sigma_{ninterp}^2,\tag{10}$$

where  $\sigma_{nsys}^2$  is the channel noise variance. The interpolation error variance is denoted by  $\sigma_{ninterp}^2$  and can be computed by taking the average value of the mean square error in the elements of the interpolated  ${\bf Q}$  matrix. A parameter

$$\gamma = \frac{\sigma_{ninterp}^2}{\sigma_{nsys}^2},\tag{11}$$

can be used to decide the amount of interpolation error introduced, depending on the receiver SNR [8]. For the proposed framework, we set  $\sigma_{ninterp}^2$  to be an order of magnitude lower

Table I: Interpolation distances for different channels

| Channel   | $D_{sc}$ (FD only), SNR ≤ 10 dB | $D_{sc}$ (FD only), SNR ≤ 20 dB | $D_{sc}$ (FD only), SNR ≤ 30 dB | $D_{sym}$ (TD only), SNR ≤ 10 dB | $D_{sym}$ (TD only), SNR ≤ 20 dB | $D_{sym}$ (TD only), SNR ≤ 30 dB |
|-----------|------|------|------|-------|-------|-------|
| EPA-5Hz   | 48   | 24   | 8    | > 140 | > 140 | > 140 |
| EVA-70Hz  | 32   | 16   | 8    | 32    | 16    | 12    |
| ETU-300Hz | 32   | 16   | 4    | 8     | 4     | 2     |



Figure 6: Performance evaluation of interpolating QRD with K(10)-Best SD

than  $\sigma_{nsys}^2$, ensuring that interpolation error is not the limiting factor on system bit error rate (BER) performance. Figure 4(a) shows the dependence of the error  $\sigma_{ninterp}^2$  obtained by interpolating  $\mathbf{Q}$  matrices using (4) between sub-carriers spaced at different correlation levels for the three channel models used in LTE-A. Using Fig. 4(a) and Fig. 4(b), the following example illustrates how a link adaptive processor can be used to reduce total computational complexity. An SNR of  $20~\rm dB$  at the receiver with  $16~\rm QAM$  data would correspond to a  $\sigma_{nsys}^2$  of  $10^{-2}$  and  $\sigma_{ninterp}^2=10^{-3}~(\gamma=0.1)$. This corresponds to a correlation requirement of 75% to 85% in Fig. 4(b) and allows FD interpolation over 24 sub-carriers in the EPA channel or 12 sub-carriers in the ETU channel. A second example with TD interpolation of 16 QAM at 30 dB SNR is highlighted in Fig. 5, requiring around 90% correlation, which translates to a range of a few to tens of symbols depending on the channel. Examining Fig. 4 and Fig. 5, we can conclude that for a certain admissible error, a wide range of channel dependent interpolation distances is available, and that adaptively tuning the distance to suit the current UE conditions results in a significant reduction of computations.
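The noise budget in (10) and (11) can be sketched numerically as follows. This is a minimal illustration that assumes unit average signal power, so that $\sigma_{nsys}^2 = 10^{-\mathrm{SNR_{dB}}/10}$; the function names are hypothetical and not part of the original framework.

```python
import math


def noise_budget(snr_db, gamma=0.1, signal_power=1.0):
    """Return (sigma_nsys^2, sigma_ninterp^2, sigma_n^2) following (10) and (11).

    Assumption: unit signal power unless stated otherwise, so the channel
    noise variance follows directly from the receiver SNR.
    """
    sigma_nsys2 = signal_power / (10.0 ** (snr_db / 10.0))
    sigma_ninterp2 = gamma * sigma_nsys2        # (11) solved for the interpolation error
    sigma_n2 = sigma_nsys2 + sigma_ninterp2     # (10), uncorrelated error sources
    return sigma_nsys2, sigma_ninterp2, sigma_n2


def effective_snr_loss_db(gamma):
    """SNR degradation caused by the extra interpolation error term."""
    return 10.0 * math.log10(1.0 + gamma)


# The 20 dB example from the text: noise 1e-2, allowed interpolation error 1e-3.
sys2, interp2, total2 = noise_budget(20.0, gamma=0.1)
```

Under these assumptions, choosing $\gamma = 0.1$ costs roughly 0.4 dB of effective SNR, which is one way to read the requirement that interpolation error should not limit BER performance.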

Using the above framework, the maximum distances in either FD or TD interpolation modes leading to an uncoded BER loss of  $\leq 1\,\mathrm{dB}$  are listed in Table I. Distances lower than the ones in Table I can be used if higher QRD accuracy is needed. Spatial multiplexing is generally adopted in the medium to high SNR range, where interpolation distances from 16 to 4 are suitable, according to Table I. The performance of the proposed method is verified with uncoded BER simulations under various channel conditions by using different  $D_{sc}$  and  $D_{sym}$. As shown in Fig. 6, if interpolating in only the TD, the proposed method results in almost no performance loss

Table II: Required number of QRD/s with FD Interpolation for 16 QAM

| Rx BW, SNR     | 5 MHz, ≤ 10 dB | 5 MHz, ≤ 20 dB | 5 MHz, ≤ 30 dB | Five-band 20 MHz CA, ≤ 10 dB | Five-band 20 MHz CA, ≤ 20 dB | Five-band 20 MHz CA, ≤ 30 dB |
|----------------|---------|-------|-------|--------|-------|------|
| EPA $D_{sc}$   | 48      | 24    | 8     | 48     | 24    | 8    |
| EPA QRD/s      | 75 k    | 150 k | 450 k | 1.5 M  | 3 M   | 9 M  |
| ETU $D_{sc}$   | 32      | 16    | 4     | 32     | 16    | 4    |
| ETU QRD/s      | 112.5 k | 225 k | 900 k | 2.25 M | 4.5 M | 18 M |



Figure 7: Data flow in the proposed Adaptive interpolating QRD

in the EVA-5 Hz channel due to a high level of correlation. The Doppler shift in the EVA-70 Hz channel is around 0.5% of the sub-carrier spacing and leads to a loss of 0.5 dB (at an uncoded BER of  $10^{-3}$) compared to the simulation without interpolation. Performing interpolation in both FD and TD with a distance of 12 and 8 sub-carriers at SNR  $\leq 20\,\mathrm{dB}$  and  $\leq 30\,\mathrm{dB}$, respectively, results in a loss of around 1.2 dB in the EVA-70 Hz channel. This loss can be further reduced by using smaller values of  $D_{sc}$  and  $D_{sym}$.
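The selection of a maximum interpolation distance from Table I can be sketched as a simple lookup. The dictionary layout and function name are illustrative assumptions. The entries listed as "> 140" are stored as 140, a conservative lower bound, and the fallback to a distance of 1 (no interpolation) above 30 dB is also an assumption, since Table I does not cover that range.

```python
# Maximum interpolation distances from Table I, ordered by SNR limit (10, 20, 30 dB).
SNR_LIMITS_DB = (10, 20, 30)
TABLE_I = {
    "EPA-5Hz":   {"fd": (48, 24, 8), "td": (140, 140, 140)},  # TD entries are "> 140"
    "EVA-70Hz":  {"fd": (32, 16, 8), "td": (32, 16, 12)},
    "ETU-300Hz": {"fd": (32, 16, 4), "td": (8, 4, 2)},
}


def max_interpolation_distance(channel, snr_db, domain="fd"):
    """Return the largest tabulated distance for the first SNR limit not exceeded.

    domain is "fd" (sub-carriers, D_sc) or "td" (symbols, D_sym).
    """
    for limit, distance in zip(SNR_LIMITS_DB, TABLE_I[channel][domain]):
        if snr_db <= limit:
            return distance
    return 1  # assumption: no interpolation when the SNR exceeds the tabulated range


d_sc = max_interpolation_distance("ETU-300Hz", 25.0)           # falls in the <= 30 dB column
d_sym = max_interpolation_distance("EVA-70Hz", 15.0, "td")     # falls in the <= 20 dB column
```

Smaller distances than the returned value remain valid choices whenever higher QRD accuracy is needed.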

# IV. VLSI ARCHITECTURE AND HARDWARE IMPLEMENTATION

This section describes the VLSI architecture of the developed adaptive QRD processor, which supports  $4\times4$  complex-valued matrices and uses the HT algorithm. Architectural level optimizations that support high decomposition throughput, extensive resource sharing, and easy reconfiguration are elaborated. Moreover, the unique features provided by the  $28\,\mathrm{nm}$  FD-SOI technology are exploited at the architecture level to achieve low power consumption.

# A. Top Level Architectures

Figure 7 shows the top level block diagram of the QRD processor, which includes an HT based QRD unit (HQRDU), a rotation unit (RU) and an interpolation unit (IU). The partial channel estimates and the received data vectors are fed into HQRDU and RU, respectively. The IU combines the data from these two units by using a linear weight  $\alpha$  and writes the outputs to the signal detector memory.

The throughput of the proposed QRD processor adapts to the bandwidth  $(BW_{sc})$  assigned to the UE and to the interpolation distance  $(D_{sc}, D_{sym})$, which is decided based on the current channel scenario, system setup, and QoS requirement. Furthermore, the throughputs of the functional units in Fig. 7 differ from one another. For instance,



Figure 8: Top level architecture for minimum area



Figure 9: Top level architecture for single clock domain

the minimum throughput required from the HQRDU can be computed by

$$\text{QRD/s} = \frac{BW_{sc} \left(N_{sym} - N_p/P_{space}\right) \cdot 1000}{D_{sc}\, D_{sym}}, \tag{12}$$

where  $N_{sym}$  is the number of OFDM symbols in 1 ms,  $N_p$  is the number of OFDM symbols carrying pilots, and  $P_{space}$  is the pilot/reserved tone spacing. The corresponding throughputs of the rotation and interpolation units are  $(2D_{sc}-1)(2D_{sym}-1)$  times that of the HQRDU. For simplicity, we discuss the throughput requirements for FD only interpolation by setting  $D_{sym}$  to 1 and use Table I for the values of  $D_{sc}$  under the corresponding channel conditions. The minimum required throughput of the HQRDU from (12) is listed in Table II. The wide range of throughput requirements as well as the speed mismatch between different functional units motivates the need for an easily reconfigurable architecture. In the following subsection, two such architectures to support adaptation are presented: one for minimum area and one optimized for power reduction.
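As a rough check of (12) against Table II, the sketch below evaluates the required HQRDU rate. It assumes 300 sub-carriers for a 5 MHz LTE carrier, 1200 per 20 MHz carrier, and 14 OFDM symbols per millisecond. The defaults `n_p=4` and `p_space=2` are hypothetical values chosen only so that $N_{sym} - N_p/P_{space} = 12$, which reproduces the tabulated numbers; the values actually used for Table II are not stated in the text.

```python
def hqrdu_rate(bw_sc, d_sc, d_sym=1, n_sym=14, n_p=4, p_space=2):
    """Minimum QRD/s required from the HQRDU according to (12).

    n_p and p_space are illustrative assumptions (see lead-in).
    """
    return bw_sc * (n_sym - n_p / p_space) * 1000 / (d_sc * d_sym)


def rotation_factor(d_sc, d_sym=1):
    """Ratio between RU/IU throughput and HQRDU throughput."""
    return (2 * d_sc - 1) * (2 * d_sym - 1)


SC_5MHZ = 300            # sub-carriers in one 5 MHz carrier
SC_5X20MHZ = 5 * 1200    # sub-carriers in a five-band 20 MHz CA setup

rate_epa_low_snr = hqrdu_rate(SC_5MHZ, d_sc=48)      # EPA, SNR <= 10 dB
rate_etu_high_snr = hqrdu_rate(SC_5X20MHZ, d_sc=4)   # ETU, SNR <= 30 dB, CA
```

With these assumptions the two example calls give 75 kQRD/s and 18 MQRD/s, matching the corresponding corners of Table II.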

#### B. Reconfigurability for Adaptive Interpolation Distance

The first architecture is based on the folding concept and supports different  $D_{sc}$  with variable folding factors. As shown in Fig. 8, the HQRDU operates in one clock domain  ${\bf Clk1}$  at a rate configured by  $D_{sc}$. To handle the multiple rotations for each QRD output, one instance each of an RU and an IU is time multiplexed and operates in the second clock domain  ${\bf Clk2}$, which runs at  $(2D_{sc}-1)$  times the clock of the HQRDU. The variable length shift registers at the top store the intermediate results for interpolation and the bottom shift register simplifies read access to the received data  $\mathbf{y}$. A control unit is needed to generate the signals for this complicated scheduling. The design shown in Fig. 8 can perform interpolation by a factor of either 4 or 8. When  $D_{sc}=4$, each HQRDU output rotates 7 received data vectors, requiring  ${\bf Clk2}$  to be 7 times faster than  ${\bf Clk1}$. Similarly, when  $D_{sc}=8$,  ${\bf Clk2}$  should be 15 times faster than  ${\bf Clk1}$. Even though this architecture has a lower area, it introduces complicated control logic as well as a clock domain crossing mechanism. Moreover, a higher clock frequency is needed, potentially resulting in increased power consumption.

**Algorithm 1** HT based QRD for the real valued representation of an  $M \times M$  complex channel matrix

```
procedure HTQRD(H)
    ▷ Orthogonalize first column of H
    [a, b, c, d] ← FRMT(H)                  ▷ Format to real matrix
    α_1 ← MUL(a_1, a_1)                     ▷ a_1 = a[1]
    α_2 ← IPG(a, a)                         ▷ All of a except a_1
    R_11 ← SQRT(α_1 + α_2)
    v_1 ← v_1 + R_11                        ▷ v vector generated
    len_v ← α_2 + MUL(v_1, v_1)
    D ← DIV(v_1)
    G ← MUL(D, -a_2)
    w ← VSM(a, G)
    len_w ← IPG(w, w)
    v_div ← GENROT(v, len_v)
    ▷ Reflect b
    b_temp ← REFLECTCOL(v, v_div, b)
    w_div ← GENROT(w, len_w)
    b_refl ← REFLECTCOL(w, w_div, b_temp)
    return R_11, v, v_div, w, w_div
    ▷ Reflect c, d
    return b_refl, c_refl, d_refl
    ▷ Orthogonalize other columns from above results
end procedure

procedure GENROT(v, len_v)
    ▷ Sub-procedure to generate rotation vector
    return VSM(v, DIV(len_v))
end procedure

procedure REFLECTCOL(v, v_div, x)
    ▷ Sub-procedure to reflect a vector
    return x - VSM(v_div, IPG(v, x))
end procedure
```

A second architecture leverages the parallel processing concept to achieve low power adaptation. As shown in Fig. 9, multiple RUs and IUs are instantiated and the throughput difference to the HQRDU is handled by activating a different number of RUs and IUs. This resolves the problems associated



Figure 10: Architecture of the RVD based HT with a detailed structure of Stage 1 together with NR-DIV unit

with clock domain crossing, but requires higher area than the architecture in Fig. 8. However, it enables easy reconfiguration and simplifies scheduling control, where the unused blocks can be deactivated with clock gating. Adaptability for different  $D_{sc}$  is achieved by changing the clock frequency of the whole system and activating a corresponding number of RUs and IUs. This parallel architecture is able to support high throughput with relatively low clock frequency, allowing for  $V_{\rm DD}$  scaling and thus reducing dynamic power consumption significantly.

The parallel architecture is favored by the FD-SOI technology used for chip fabrication. Traditional  $V_{\rm DD}$  scaling may suffer from significant performance degradation in terms of maximum operational frequency. Designs implemented in FD-SOI have a higher operating speed when compared to designs in bulk technologies, allowing the parallel implementation to operate at low  $V_{\rm DD}$  values [9]. Furthermore, FD-SOI provides access to back gates, enabling the critical blocks to be forward body biased for increased throughput and allowing even more aggressive  $V_{\rm DD}$  reduction [20], [21]. This provides system designers with another control knob to trade off power and performance. Hence, the fabricated adaptive circuit is based on this parallel architecture and has two modes of operation, namely the interpolation by 4 mode (IP4) when  $D_{sc}=4$, and the interpolation by 8 mode (IP8) with  $D_{sc}=8$.

#### C. Modified HT QRD Unit

This subsection describes the algorithm and architectural details of the HQRDU. The proposed design uses an RVD based QRD and supports  $4\times 4$  complex-valued matrices. The

Table III: Design space exploration with HLS with final design highlighted

| Clock in MHz | FF <sup>†</sup> | Multipliers | Comb* | Cell Area‡ |
|--------------|-----------------|-------------|-------|------------|
| 25           | 1               | 299         | 93    | 1.00       |
| 100          | 4               | 86          | 70    | 0.60       |
| **125**      | **5**           | **77**      | **68** | **0.57**   |
| 250          | 10              | 48          | 57    | 0.54       |
| 500          | 20              | 23          | 48    | 0.50       |

<sup>†</sup> Folding Factor: Clock cycles per QRD

pseudo code for this process is shown in Alg. 1. Bold upper case letters denote matrices, bold lower case letters denote vectors, and non-bold characters denote scalars.

The top level architecture of this HT is shown in Fig. 10 with an expanded view of the first stage. The variables and sub procedures used to construct Alg. 1 are highlighted in the figure for easy cross referencing. The HQRDU is divided into 4 stages, with each stage orthogonalizing two columns of the real valued gain matrix using (7). This orthogonalization operation on a given column b of H can be written as

$$\begin{aligned}
\mathbf{Q}^{\mathsf{T}}\mathbf{b} &= \mathbf{V}_{8} \cdots \mathbf{V}_{1}\mathbf{b} \\
&= \cdots \left(\mathbf{I} - 2\frac{\mathbf{w}\mathbf{w}^{\mathsf{T}}}{\mathbf{w}^{\mathsf{T}}\mathbf{w}}\right) \left(\left(\mathbf{I} - 2\frac{\mathbf{v}\mathbf{v}^{\mathsf{T}}}{\mathbf{v}^{\mathsf{T}}\mathbf{v}}\right)\mathbf{b}\right) \\
&= \cdots \left(\mathbf{I} - 2\frac{\mathbf{w}\mathbf{w}^{\mathsf{T}}}{len_{w}}\right) \left(\mathbf{b} - 2\left(\frac{\mathbf{v}}{len_{v}}\right)(\mathbf{v}^{\mathsf{T}}\mathbf{b})\right) \\
&= \cdots \left(\mathbf{I} - 2\left(\mathbf{w}_{div}\right)\mathbf{w}^{\mathsf{T}}\right) \left(\mathbf{b} - 2\left(\mathbf{v}_{div}\right)(\mathbf{v}^{\mathsf{T}}\mathbf{b})\right),
\end{aligned}\tag{13}$$

where v and w are the difference vectors. By grouping the operations as shown in (13), the matrix-vector multiplications

<sup>\*</sup> Percentage of combinational area in total area

<sup>‡</sup> Cell area normalized to the area at FF = 1



Figure 11: BER performance of the RTL model with K(10)-best decoder



Figure 12: Data flow through the Interpolating QRD processor

 $\mathbf{V}_1\mathbf{b}$  can be converted into vector-vector multiplications performed by an inner product generator (IPG) unit. The  $\mathbf{v}$  and  $\mathbf{v}_{div}$  vectors are also reused by the RUs to rotate the received data  $\mathbf{y}$, corresponding to (9).
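The regrouping in (13) can be illustrated with a small sketch in which a Householder reflection is applied using only an inner product and a vector-scalar multiplication. The helper names mirror the IPG and VSM units but are otherwise hypothetical, floating point is used instead of the fixed point hardware format, and the factor 2 of (13) is applied explicitly. With the choice $\mathbf{v} = \mathbf{a} + \lVert\mathbf{a}\rVert\mathbf{e}_1$ used here, the reflected first column becomes $-\lVert\mathbf{a}\rVert\mathbf{e}_1$; how the hardware handles this sign is not covered by the sketch.

```python
import math


def ipg(u, x):
    """Inner product of two equally long vectors (role of the IPG unit)."""
    return sum(ui * xi for ui, xi in zip(u, x))


def vsm(u, s):
    """Vector-scalar multiplication (role of the VSM unit)."""
    return [ui * s for ui in u]


def householder_vector(a):
    """Return the difference vector v = a + ||a|| e_1 and the norm ||a||."""
    norm_a = math.sqrt(ipg(a, a))
    v = list(a)
    v[0] += norm_a
    return v, norm_a


def reflect(v, v_div, x):
    """Compute x - 2 * v_div * (v^T x), the innermost factor of (13)."""
    scale = 2.0 * ipg(v, x)
    return [xi - di * scale for xi, di in zip(x, v_div)]


a = [3.0, 4.0, 0.0, 12.0]            # example column with norm 13
v, norm_a = householder_vector(a)
v_div = vsm(v, 1.0 / ipg(v, v))      # v / len_v, computed once and reused
a_refl = reflect(v, v_div, a)        # approximately [-13, 0, 0, 0]
b_refl = reflect(v, v_div, [1.0, 2.0, 3.0, 4.0])
```

Because `v` and `v_div` are computed once per column, the same pair can be reused for every further column and for the received vector, which is the reuse the text attributes to the RUs.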

The architecture of stage 1 indicates that a few common blocks such as the IPG, vector-scalar multiplier (VSM), division (DIV) and multiplication (MUL) are used multiple times and the design partitioning in Fig. 10 allows hardware sharing by time multiplexing. One of the important blocks of the design is the DIV unit used to generate the rotation vectors  $\mathbf{v}_{div}$ . The performance of the DIV unit is critical at higher SNR values and a three iteration Newton-Raphson method is used. As shown in Fig. 10, this unit consists of a scaling block which converts the inputs to the range between 0.5 and 1 and uses linear approximation to compute the initial seed value, eliminating the need for a look up table (LUT). The output obtained from the third iteration is modified by the rescale unit and passed onto the other blocks. The last two stages of the HQRDU have an additional output called the condition indicator (CI), used to flag an ill conditioned H. This is detected by examining the values of the diagonal elements of R, with low values indicating a high condition number for H. Fixed point implementation of the QRD processor has a limited accuracy which may lead to incorrect computation of the Q and R matrices in ill conditioned channels. Performing interpolation on these matrices will further increase the error, leading to higher BER loss. This can be partly combated by lowering  $D_{sc}$ , reducing interpolation error. In channels where lowering  $D_{sc}$  still does not improve BER performance, the eNB can be informed about bad channel conditions to change the modulation alphabet or switch to diversity schemes.
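The division scheme described above can be sketched as follows: the input is scaled into the range 0.5 to 1, a linear seed replaces the look up table, and three Newton-Raphson iterations refine the reciprocal before rescaling. The seed coefficients 48/17 and 32/17 are the standard textbook choice for this interval and are an assumption here, since the coefficients of the fabricated NR-DIV unit are not given. The sketch uses floating point and handles positive inputs only.

```python
import math


def nr_reciprocal(x, iterations=3):
    """Approximate 1/x with Newton-Raphson iterations and a linear seed."""
    if x <= 0.0:
        raise ValueError("this sketch handles positive inputs only")
    # Scale: x = m * 2**e with 0.5 <= m < 1 (role of the scaling block).
    m, e = math.frexp(x)
    # Linear seed for 1/m on [0.5, 1); coefficients are an assumed textbook choice.
    y = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        y = y * (2.0 - m * y)        # Newton-Raphson step for f(y) = 1/y - m
    # Rescale: 1/x = (1/m) * 2**(-e) (role of the rescale unit).
    return math.ldexp(y, -e)


approx = nr_reciprocal(7.5)
```

Each iteration roughly squares the relative error, so with this seed (relative error at most 1/17) three iterations are far more accurate than a 13 bit datapath requires; the hardware precision is instead limited by the word length.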



Figure 13: Chip microphotograph

Figure 14: Area breakdown

#### D. Architectural Exploration with High-Level Synthesis

Based on the high-level architecture in Fig. 10, a high level synthesis (HLS) flow was adopted for architecture exploration, where a fixed point C++ model is converted into register transfer level (RTL) code [22]. The flow reduces design time and enables exploration of different combinations of pipelining and time sharing while providing a platform to easily test the effects of fixed point hardware. Table III shows the area of the design implemented with different clock speeds and folding factors (FF). Increasing the FF results in significant reduction in area, as the tool optimizes sharing of multipliers. Even though the design at 500 MHz has the lowest total area, increasing FF from 4 to 20 does not result in significant change in area. Hence, for final physical implementation a clock speed of 125 MHz and a FF of 5 was chosen to ease back-end design and chip testing, resulting in an area reduction of 43% over the fully unfolded design.

Extensive BER simulations were performed to select the input and output word lengths. They were quantized to 13 bits with 9 fractional bits, 3 integer bits and a sign bit. Most of the internal variables in Alg. 1 are also quantized to this format to enable multiplier sharing, resulting in the BER curves presented in Fig. 11. The uncoded BER loss in the ETU channel due to fixed point hardware is around 1 dB when compared to the floating point implementation. This loss can be lowered by decreasing  $D_{sc}$  to 4.
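The 13 bit format (1 sign bit, 3 integer bits, 9 fractional bits) can be emulated in a floating point model with a short quantizer. Two's complement representation, round-half-up and saturation on overflow are assumptions of this sketch; the rounding and overflow behaviour of the actual hardware is not specified in the text.

```python
import math

INT_BITS = 3
FRAC_BITS = 9
STEP = 2.0 ** -FRAC_BITS                       # resolution of one LSB
MAX_CODE = 2 ** (INT_BITS + FRAC_BITS) - 1     # largest positive code (4095)
MIN_CODE = -(2 ** (INT_BITS + FRAC_BITS))      # most negative code (-4096)


def quantize(x):
    """Quantize x to the signed 13 bit format and return the represented value."""
    code = math.floor(x / STEP + 0.5)          # round to nearest, ties upwards
    code = max(MIN_CODE, min(MAX_CODE, code))  # saturate instead of wrapping
    return code * STEP


q = quantize(0.3)   # nearest representable value to 0.3
```

Under these assumptions the representable range is $[-8,\ 8 - 2^{-9}]$ with a maximum rounding error of half an LSB inside that range.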

Figure 12 shows the timing schedule of the design with an FF of 5, implying that a new  $\mathbf{H}$  can be fed into the QRD processor every five clock cycles. The full design has a latency of 63 clock cycles (CC), with an initial latency of 20 CC in stage 1. The output I1 in Fig. 12 is fed into stage 2, which produces J1, and so forth. The first data tone  $y_1$  is then rotated using the outputs from the four stages to produce the rotated tone  $\mathbf{y}_{1rot}$. After this initial latency, rotated vectors and outputs from each stage are produced every five clock cycles, leading to a throughput of 25 MQRD/s. The final implementation is based on the architecture in Fig. 9 with a single clock domain and two rotation banks consisting of 7 and 8 rotation units, respectively. In interpolation by 4 mode, the design is clocked at 125 MHz, with all the required rotations produced by the first bank while the second bank is clock gated. When interpolation by 8 is desired, the clock frequency is reduced to 62.5 MHz, but both rotation banks are operated. Hence, depending on the operation mode, the clock frequency of the whole design is



Figure 15: Power consumption at different core supply voltages: (a) Operation without FBB. (b) Operation with FBB.



Figure 16: (a) Power consumption in IP8 mode with different FBB. (b) Breakdown of power consumption in IP4 and IP8 Mode.

decreased, lowering the dynamic power consumption, while still maintaining the required throughput.
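The two operating modes can be compared with a small model of the parallel architecture. The sketch assumes that every active rotation unit completes one rotation per folding period of five clock cycles; the dictionary layout and function name are illustrative only.

```python
FOLDING_FACTOR = 5          # clock cycles per QRD
BANK_SIZES = (7, 8)         # rotation units in bank 1 and bank 2

MODES = {
    "IP4": {"d_sc": 4, "clock_hz": 125e6, "active_banks": 1},
    "IP8": {"d_sc": 8, "clock_hz": 62.5e6, "active_banks": 2},
}


def mode_throughput(mode):
    """Return (QRD/s, active rotation units, rotations/s, sub-carriers/s)."""
    cfg = MODES[mode]
    qrd_rate = cfg["clock_hz"] / FOLDING_FACTOR
    active_rus = sum(BANK_SIZES[:cfg["active_banks"]])
    rotation_rate = qrd_rate * active_rus        # assumption: one rotation per RU per QRD slot
    subcarrier_rate = qrd_rate * cfg["d_sc"]     # sub-carriers covered per second
    return qrd_rate, active_rus, rotation_rate, subcarrier_rate


ip4 = mode_throughput("IP4")
ip8 = mode_throughput("IP8")
```

In this model both modes cover the same number of sub-carriers per second, while IP8 does so at half the clock frequency with 15 instead of 7 active rotation units.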

#### V. CHIP MEASUREMENT RESULTS

The designed QRD processor is fabricated in the STMicroelectronics  $28\,\mathrm{nm}$  FD-SOI technology with the low  $V_\mathrm{th}$  digital cell library. The chip microphotograph with the different functional units is shown in Fig. 13; the design requires about 1.2 mm<sup>2</sup> of core area, corresponding to around 1500 k gates. Figure 14 shows the area breakdown of the full system. The QRD unit occupies 205 k gates with each rotation unit requiring around 45 k gates. To facilitate chip measurement, an on-chip test module has been integrated, which accepts serialized data from the RTL equivalent model through a pattern generator, feeds the functional units, and sends the processed data to a logic analyzer for comparison against the golden RTL output. A control state machine (CSM) generates the read/write signals to test memories and switches between the interpolation by 4 mode (IP4), the interpolation by 8 mode (IP8), and the power testing mode. Clock gates are also configured by the CSM to save power by deactivating unused blocks

such as rotation bank 2 in the IP4 mode and for dynamic power measurements. The following subsection reports the performance of the fabricated chip with different measurement and configuration setups.

# A. Chip Performance with Different Tuning Parameters

The chip is powered by a global  $V_{\rm DD}$  and dedicated pads are used for FBB. By tuning these two parameters, we demonstrate the optimal trade-off between processing speed and power consumption. Figure 15 (a) shows the total power consumption in IP4 mode against clock frequency without FBB at different values of  $V_{\rm DD}$ . The maximum frequency achieved is  $55\,\rm MHz$  corresponding to a rate of  $11\,\rm MQRD/s$ . From the QRD throughput requirements specified in Table II, we see that a five-band 20 MHz CA downlink can be handled if operating in the EPA channel. However, the more difficult ETU channel cannot be decoded in all scenarios.
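The comparison against Table II can be sketched by converting a clock frequency into an achievable QRD rate through the folding factor of 5 and checking it against the five-band 20 MHz CA requirements. The data layout and function names are illustrative.

```python
# Required HQRDU rates for five-band 20 MHz CA (Table II), keyed by SNR limit in dB.
REQUIRED_QRD_PER_S = {
    "EPA": {10: 1.5e6, 20: 3e6, 30: 9e6},
    "ETU": {10: 2.25e6, 20: 4.5e6, 30: 18e6},
}


def achievable_rate(clock_hz, folding_factor=5):
    """QRD/s delivered at a given clock frequency."""
    return clock_hz / folding_factor


def supported(channel, clock_hz):
    """Map each SNR limit to True if the requirement is met at this clock."""
    rate = achievable_rate(clock_hz)
    return {snr: req <= rate for snr, req in REQUIRED_QRD_PER_S[channel].items()}


epa_ok = supported("EPA", 55e6)   # all SNR ranges are covered
etu_ok = supported("ETU", 55e6)   # the <= 30 dB range needs 18 MQRD/s and is not covered
```

At 55 MHz the model yields 11 MQRD/s, which covers every EPA entry but not the 18 MQRD/s ETU entry.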

To boost the operation frequency for ETU channels, FBB can be used and Fig. 15 (b) shows the doubling in frequency to  $110\,\mathrm{MHz}$  obtained with  $1.15\,\mathrm{V}$  FBB. With FBB, we can also optimize the power consumption for a target throughput

Table IV: Comparison of previous QRD processors with proposed design

| Items                              | Luethi [13]          | Miyaoka [23]         | Shabany [16]         | Huang [15]        | Chiu [7]             | Zhang [24]                | Mohamed [25]         | This work                                      |
|------------------------------------|----------------------|----------------------|----------------------|-------------------|----------------------|---------------------------|----------------------|------------------------------------------------|
| Matrix type                        | $4 \times 4$ Complex | $4 \times 4$ Complex | $4 \times 4$ Complex | $8 \times 8$ Real | $4 \times 4$ Complex | $4 \times 4$ Complex      | $4 \times 4$ Complex | $4 \times 4$ Complex                           |
| Algorithm                          | Modified GS          | Modified GS          | Hybrid: HT, GR       | GR                | GS                   | GR                        | Programmable         | Modified HT                                    |
| Adaptive                           | ×                    | ×                    | ×                    | ×                 | ×                    | ×                         | ×                    | ✓                                              |
| Technology                         | 180 nm               | 90 nm                | 130 nm               | 180 nm            | 90 nm                | 65 nm                     | 65 nm                | 28 nm                                          |
| Max Freq                           | 162 MHz              | 300 MHz              | 270 MHz              | 100 MHz           | 114 MHz              | 500 MHz                   | 166 MHz              | 110 MHz                                        |
| Gate Count<sup>\*</sup>            | 61.8 k               | 334 k                | 36 k                 | 152 k             | 505 k                | 362 k                     | 469 k                | 205 k                                          |
| QRD/s                              | 1.56 M               | 50 M                 | 6.7 M                | 25 M              | 28.5 M               | 69 M                      | 8.3 M                | 22 M                                           |
| Norm. QRD/s<sup>†</sup>            | 10 M                 | 160 M                | 31 M                 | 160 M             | 91 M                 | 160 M                     | 19 M                 | 22 M                                           |
| Gate Efficiency<sup>‡</sup>        | 25                   | 150                  | 186                  | 164               | 56                   | 190                       | 18                   | 107                                            |
| Norm. Gate Efficiency<sup>□</sup>  | 161                  | 479                  | 861                  | 1053              | 180                  | 442                       | 41                   | 107                                            |
| Power (mW)                         | NA                   | NA                   | 48.2 @ 1.32 V        | 319 @ 1.8 V       | 56.8 @ 1 V           | 195 @ 1.1 V<sup>△</sup>   | 300 @ 1.0 V<sup>\*</sup> | 29 @ 1.1 V                                 |
| Norm. Power (mW)<sup>\*</sup>      | NA                   | NA                   | 33.5                 | 119               | 68.7                 | 195                       | 248                  | 29 @ 1.1 V                                     |
| Energy per QRD (nJ)                | NA                   | NA                   | 7.2                  | 12.8              | 2                    | 1.08                      | 36                   | 1.3 (No IP<sup>⊚</sup>), 0.4 (IP4), 0.2 (IP8)  |
| Norm. Energy per QRD (nJ)          | NA                   | NA                   | 1.08                 | 0.75              | 0.75                 | 1.22                      | 13                   | 1.3 (No IP<sup>⊚</sup>), 0.4 (IP4), 0.2 (IP8)  |

<sup>†</sup> Normalized throughput calculated using QRD/s ×  $\frac{\text{Technology}}{28\,\text{nm}}$

<sup>\*</sup> Programmable chip

<sup>⊚</sup> No interpolation
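The normalizations behind Table IV can be reproduced with a few helper functions. The sketch reads the gate efficiency footnote as the ratio of QRD/s to gate count, an interpretation that matches the tabulated values, and takes energy per QRD as power divided by throughput; the function names are illustrative.

```python
def normalized_rate(qrd_per_s, tech_nm, ref_nm=28.0):
    """Throughput scaled linearly with the technology node."""
    return qrd_per_s * tech_nm / ref_nm


def gate_efficiency(qrd_per_s, gate_count):
    """QRD/s delivered per gate."""
    return qrd_per_s / gate_count


def normalized_power(power_mw, vdd, ref_vdd=1.1):
    """Power scaled quadratically to the reference supply voltage."""
    return power_mw * (ref_vdd / vdd) ** 2


def energy_per_qrd_nj(power_mw, qrd_per_s):
    """Energy per decomposition in nanojoules."""
    return power_mw * 1e-3 / qrd_per_s * 1e9


luethi_norm_rate = normalized_rate(1.56e6, 180)      # about 10 M
shabany_norm_power = normalized_power(48.2, 1.32)    # about 33.5 mW
this_work_energy = energy_per_qrd_nj(29, 22e6)       # about 1.3 nJ without interpolation
```

These checks agree with the entries for [13], [16] and this work to within the rounding used in the table.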

Table V: Suggested Biasing and QRD Power Consumption

| Channel | Parameter    | 5 MHz, SNR ≤ 10 dB | 5 MHz, SNR ≤ 20 dB | 5 MHz, SNR ≤ 30 dB | 5-Band 20 MHz CA, SNR ≤ 10 dB | 5-Band 20 MHz CA, SNR ≤ 20 dB | 5-Band 20 MHz CA, SNR ≤ 30 dB |
|---------|--------------|---------|-------|-------|--------|-------|------|
| EPA     | QRD/s        | 75 k    | 150 k | 450 k | 1.5 M  | 3 M   | 9 M  |
| EPA     | $V_{DD}$ (V) | 0.5     | 0.5   | 0.5   | 0.6    | 0.6   | 0.8  |
| EPA     | FBB (V)      | 0.0     | 0.0   | 0.2   | 0.0    | 0.1   | 0.8  |
| EPA     | P (mW)       | 0.24    | 0.24  | 0.96  | 1.9    | 2.8   | 13   |
| ETU     | QRD/s        | 112.5 k | 225 k | 900 k | 2.25 M | 4.5 M | 18 M |
| ETU     | $V_{DD}$ (V) | 0.5     | 0.5   | 0.5   | 0.6    | 0.7   | 1.1  |
| ETU     | FBB (V)      | 0.0     | 0.0   | 0.2   | 0.0    | 0.0   | 0.9  |
| ETU     | P (mW)       | 0.24    | 0.24  | 0.96  | 1.9    | 5.0   | 24.3 |



Figure 17: Power reduction in QRD by switching to interpolation mode

by reducing the  $V_{\rm DD}$. For instance,  $10\,\mbox{M\,QRD/s}$  (50 MHz clock) can be reached with both  $V_{\rm DD}$  and FBB set to  $0.8\,V$  as opposed to just using a  $V_{\rm DD}$  of 1.1 V. The corresponding power consumption is reduced to 40 mW from 65 mW. From the measurement results in Fig. 15, we confirmed that FD-SOI technology with FBB tuning at circuit level allows the UE to adaptively adjust throughputs to meet requirements and also enables power optimization by lowering the system level  $V_{DD}$  when using FBB on only the critical blocks.

When operating in low mobility or high coherence bandwidth channels, the designed QRD processor can be configured to IP8 mode, where the interpolation distance is 8. In this mode, the HQRDU operates less frequently while the RUs perform the same number of operations as in the IP4 mode. According to the parallel architecture in Fig. 9, both rotation banks are activated in IP8 mode and the whole QRD processor operates at a lower frequency to save power. The combined throughput of these two rotation banks when clocked at  $f_1$  Hz in the IP8 mode is nearly the same as the throughput of a single rotation bank clocked at  $2f_1$  Hz in the IP4 mode. Thus, the QRD processor is able to produce a required rotation throughput at a lower clock frequency in the IP8 mode. Figure 16(a) shows the power consumption in the IP8 mode at different  $V_{\rm DD}$  and FBB. Comparing Fig. 15(b) with Fig. 16(a), we observe that the IP8 mode at 40 MHz consumes around 40 mW

power, while the IP4 mode needs 100 mW to achieve the same rotation throughput (80 MHz clock). This reduction in power by a factor of 2.5 underlines the benefits of using a parallel architecture and switching to the IP8 mode under suitable channel conditions. Figure 16(b) shows the power breakdown in both the supported modes at the corresponding maximum power/frequency points. The HQRDU requires around 29 mW and rotation bank 1 takes 57 mW of power in the IP4 mode.

These measurements show that the proposed QRD processor is flexible and provides a wide range of performance and power trade-offs via V<sub>DD</sub> scaling, FBB, and interpolation modes. Table V summarizes the minimal power consumption of the HQRDU and the corresponding parameter settings when the throughput and performance requirements are met for different channel conditions. It can be seen that the power consumption varies from 1.9 mW to 24.3 mW for CA scenarios. This highlights the advantages of the proposed adaptive concept, resulting in up to 92% power saving in good channel conditions.

#### B. Comparison and Discussion

Table IV compares the HQRDU in the proposed design against previously published QRD circuits. The proposed system is considered the base reference and all other implementations are normalized using the formulas in Table IV. Among

<sup>‡</sup> Calculated as  $\frac{\text{QRD/s}}{\text{Gate Count}}$

<sup>\*</sup> Normalized to the area of a 2-input NAND gate

<sup>\*</sup> Normalized by using Power ×  $\left(\frac{1.1}{V_{DD}}\right)^2$

<sup>△</sup> Post layout simulation



Figure 18: Interpolating QRD performance with coding and K(10)-Best SD

all the measured implementations, the hybrid system in [16] has the highest gate efficiency while the design in [7], with the highest throughput, requires 505 k gates (with detector). The presented system provides a good balance between [16] and [7] with a throughput comparable to the GR implementation in [15]. The full multipliers required for the HT, as opposed to the CORDIC architecture in [15], result in a slightly higher gate count, but the HQRDU handles LTE-A channels with high mobility and frequency selectivity, provides higher stability in correlated scenarios and enables parallel received vector rotation. This work requires between 1.9 mW and 24.3 mW of power to decode five-band 20 MHz CA signals, which is the lowest among all the implementations. The parallel architecture and FBB facilitate high throughput, resulting in the best energy efficiency (energy per QRD metric among measured chips). Though technology scaling leads to a linear increase in operational frequency and reduction in power consumption in simple circuits, it is not always straightforward to achieve this in complex circuits such as the QRD processor. However, we have tabulated the normalized throughput, power and energy to highlight the competitiveness of the HQRDU implementation. The FBB of 1.15 V at the maximum throughput causes a significant increase in static leakage [26], and using the traditional dynamic power scaling does not provide a full picture for comparison. The main advantage of the proposed system is the adaptability to channel conditions, where QRD computations can be replaced by low power interpolation operations. Figure 17 shows the reduction in power dissipation obtained by using the proposed processor. In the IP4 mode, 75% of the QRD computations can be replaced by low power interpolations. Figure 16(b) indicates that the power ratio between interpolation and QRD operation is around 1/16.
Thus, the IP4 mode allows up to 70% power reduction and the IP8 mode allows up to 82% when compared to operation without interpolation. With simultaneous TD and FD interpolation, three other modes can also be reached with the proposed system. However, this would require additional control logic to modify the order of data reads from the channel estimator and received data buffers. The proposed system with a competitive HQRDU implementation and adaptive interpolation enables power efficient channel pre-processing in LTE-A systems.
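The quoted savings can be checked with simple bookkeeping. The sketch below assumes that an interpolation factor F leaves one exact QRD per F channel matrices and that every interpolation costs 1/16 of a QRD; the function name and this two-term power model are illustrative assumptions, not the measurement procedure used for the chip.

```python
def relative_power(factor, interp_to_qrd_ratio=1 / 16):
    """Power relative to computing an exact QRD for every channel matrix.

    Assumed model: one exact QRD per `factor` matrices, interpolation for the rest.
    """
    qrd_share = 1 / factor          # fraction of matrices that still need a QRD
    interp_share = 1 - qrd_share    # fraction replaced by interpolation
    return qrd_share + interp_share * interp_to_qrd_ratio


for mode, factor in (("IP4", 4), ("IP8", 8)):
    saving = 1 - relative_power(factor)
    print(f"{mode}: about {saving:.0%} lower power")
```

Under these assumptions the model gives savings of roughly 70% for IP4 and 82% for IP8, in line with the figures stated above.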

#### C. Notes on Design Choices

Several options were available for a few design parameters and in this section we motivate some of the choices made for the current implementation. Interpolation either in FD, TD, or simultaneously in both is supported by the proposed solution. However, TD interpolation over longer time periods will require large buffers to store received data, increasing latency. Thus, the BER results in Fig. 6 use a $D_{sum}$ of 12 corresponding to the maximum length of interpolation in two RBs. Uncoded BER is used as the metric for performance evaluation in Fig. 6 in order to estimate the effects of raw interpolation error. However, a practical wireless receiver will incorporate channel coding to increase link reliability and robustness. The BER performance of the interpolating QRD processor, operating in a system that uses turbo decoding with hard decisions and six iterations, is depicted in Fig. 18. A code rate of 0.5 with a code size of 5376 and a quadratic polynomial permutation interleaver with parameters from [27] is used for the simulation. The performance loss of the proposed QRD processor is around $0.1\,\mathrm{dB}$ at a BER level of $10^{-4}$. Channel coding presents an additional performance tuning option that can be combined with $D_{sc}$ to optimize power consumption in the receiver. Another interesting choice is circuit biasing for a particular QRD throughput. Although FBB increases frequency, it has the adverse effect of increasing static leakage current. Hence, in some scenarios, such as at $70\,\mathrm{MHz}$, it is better to operate with a $V_\mathrm{DD}$ of $0.9\,\mathrm{V}$ and FBB of $1.0\,\mathrm{V}$ rather than at $0.8\,\mathrm{V}$ $V_\mathrm{DD}$ and a $1.5\,\mathrm{V}$ FBB. The interpolation distances, another implementation choice, were chosen to be factors of 4 for two reasons: firstly, for hardware friendly linear interpolation implemented with just bit shifts, and secondly, for operating with a partial channel estimator, such as the one based on a pipelined N point decimation in frequency fast Fourier transform [28]. In such estimators, the first two output samples are from frequency bins 0 and N/2 followed by the next two samples at bins N/4 and 3N/4, allowing pruning to be performed [29] to produce only estimates at distances corresponding to Table II.
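Both reasons can be illustrated with a few lines of integer arithmetic. The sketch below is not the hardware implementation: `interpolate_shift` and `bit_reversed_order` are hypothetical helper names, the fixed-point format is plain Python integers, and the output ordering assumes a radix-2 decimation in frequency FFT with natural-order input, whose outputs appear in bit-reversed order.

```python
def interpolate_shift(r_start, r_end, k, log2_distance=2):
    """Fixed-point linear interpolation at step k of a distance 2**log2_distance.

    The division by the distance is a right shift; for small k the product
    k * diff reduces to shifts and adds in hardware.
    """
    diff = r_end - r_start
    return r_start + ((k * diff) >> log2_distance)


def bit_reversed_order(n_points):
    """Output bin order of a radix-2 decimation in frequency FFT (assumed)."""
    bits = n_points.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n_points)]


# Three intermediate values between two computed entries at distance 4.
print([interpolate_shift(0, 64, k) for k in (1, 2, 3)])
# The first four output bins of a 16-point transform: 0, N/2, N/4, 3N/4.
print(bit_reversed_order(16)[:4])
```

Because the earliest outputs are already spaced N/4 apart, a pruned transform that stops after these bins yields channel estimates at a regular spacing, which is what an interpolation distance based on 4 requires.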

An HLS based flow was used for its advantages in design space exploration and the ease of implementing complex circuits dealing with fixed point implementation. One of the aspects that was not explored in this work was a mixed design approach, where some blocks such as the square root unit or the division unit could have been designed in traditional HDL and included in the HLS flow. In the physical design stage, multiple power domains would have been beneficial to reduce overall power dissipation. Static leakage increases significantly in the 28 nm FD-SOI technology at higher FBB values, which could have been controlled by power gating unused blocks such as some of the rotation units.

#### VI. CONCLUSIONS

This paper presents an adaptive QRD processor capable of decoding five-band carrier aggregated LTE-A signals. Both time and frequency correlation properties of different channels are analyzed and distances suitable for linear QRD interpolation are proposed to lower complexity. Accurate QRD computations are replaced with approximations from a reconfigurable interpolation unit, reducing power dissipation in good channel conditions. A modified HT based algorithm results in a high throughput design, and a simple hardware architecture enables adaptive switching between two interpolation factors. The back gate feature of FD-SOI is leveraged to reduce power consumption at low QRD throughputs and to double operational frequency in low correlation channels. Measurement results indicate that the processor requires a power ranging from 1.9 mW in EPA channels to 24.3 mW when operating in the more difficult ETU channels, which is further reduced by 70% to 80% with interpolations. The proposed system with a parallel architecture, clock gating and $\rm V_{DD}$ scaling with FBB allows multiple levels of adaptability and hence, is suitable for battery powered high performance mobile devices.

#### ACKNOWLEDGMENT

This work is a part of the DARE project and the authors thank the Swedish Foundation for Strategic Research for funding and STMicroelectronics for chip fabrication. We would also like to thank Anders Nejdel and Oskar Andersson for their help and support.

#### REFERENCES

- [1] 3rd Generation Partnership Project. LTE Release 10. [Online]. Available: http://www.3gpp.org/specifications/releases/70-release-10
- [2] E. Dahlman et al., 4G: LTE/LTE-Advanced for Mobile Broadband. Oxford, UK: Elsevier Press, 2011.
- [3] M. Mohaisen et al., "Adaptive Parallel and Iterative QRDM Algorithms for Spatial Multiplexing MIMO Systems," in *IEEE 70th Vehicular Technology Conf.*, Sept 2009, pp. 1–5.
- [4] G. H. Golub and C. F. Van Loan, Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
- [5] L. Trefethen and D. Bau, Numerical Linear Algebra. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1997.
- [6] D. Cescato and H. Bolcskei, "QR Decomposition of Laurent Polynomial Matrices Sampled on the Unit Circle," *IEEE Trans. Inf. Theory*, vol. 56, pp. 4754–4761, Sept 2010.
- [7] P. L. Chiu et al., "A 684Mbps 57mW joint QR decomposition and MIMO processor for 4x4 MIMO-OFDM systems," in *IEEE Asian Solid State Circuits Conf.*, Nov 2011, pp. 309–312.
- [8] R. Gangarajaiah et al., "Low complexity adaptive channel estimation and QR decomposition for an LTE-A downlink," in *IEEE 25th Annu. Int. Symp. Personal, Indoor, and Mobile Radio Commun.*, Sept 2014, pp. 459–463.
- [9] F. Arnaud et al., "Switching energy efficiency optimization for advanced CPU thanks to UTBB technology," in *IEEE Int. Electron Devices Meeting*, Dec 2012, pp. 3.2.1–3.2.4.
- [10] 3rd Generation Partnership Project. 3GPP TS 36.104 Base Station radio transmission and reception. [Online]. Available: http://www.3gpp.org/dynareport/36104.htm
- [11] P. L. Chiu et al., "Interpolation-Based QR Decomposition and Channel Estimation Processor for MIMO-OFDM System," *IEEE Trans. Circuits Syst. I*, vol. 58, pp. 1129 – 1141, May 2011.
- [12] N. Czink et al., "Subspace Modeling of Multi-User MIMO Channels," in *IEEE Veh. Technology Conf.*, Sept 2011, pp. 1–5.
- [13] P. Luethi et al., "Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison," in *IEEE Asia Pacific Conf. Circuits Syst.*, Nov 2008, pp. 830–833.
- [14] P. Luethi et al., "VLSI Implementation of a High-Speed Iterative Sorted MMSE QR Decomposition," in *IEEE Int. Symp. Circuits Syst.*, May 2007, pp. 1421–1424.
- [15] Z.-Y. Huang and P.-Y. Tsai, "Efficient Implementation of QR Decomposition for Gigabit MIMO-OFDM Systems," *IEEE Trans. Circuits Syst. I*, vol. 58, Oct 2011.

- [17] R. Gangarajaiah et al., "A high-speed QR decomposition processor for carrier-aggregated LTE-A downlink systems," in European Conf. Circuit Theory and Design, Sept 2013, pp. 1–4.
- [18] R. F. H. Fischer and C. Windpassinger, "Real versus complex-valued equalisation in V-BLAST systems," Elect. Lett., vol. 39, pp. 470–471, 2003.
- [19] M. Wenk et al., "K-best MIMO detection VLSI architectures achieving up to 424 Mbps," in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 1151–1154.
- [20] F. Abouzeid et al., "28nm FD-SOI technology and design platform for sub-10pJ/cycle and SER-immune 32bits processors," in European Solid-State Circuits Conf., Sept 2015, pp. 108–111.
- [21] N. Planes et al., "28nm FDSOI technology platform for high-speed low-voltage digital applications," in Symp. VLSI Technology, June 2012, pp. 133–134.
- [22] Mentor Graphics. High Level Synthesis. [Online]. Available: https://www.mentor.com/hls-lp/
- [23] Y. Miyaoka et al., "Sorted QR decomposition for high-speed MMSE MIMO detection based wireless communication systems," in Proc. IEEE Int. Symp. Circuits Syst., May 2012, pp. 2857–2860.
- [24] C. Zhang et al., "Energy Efficient Group-Sort QRD Processor with Online Update for MIMO Channel Pre-processing," *IEEE Trans. Circuits Syst. I*, vol. 62, no. 5, pp. 1220–1229, May 2015.
- [25] M. I. A. Mohamed et al., "Energy Efficient Programmable MIMO Decoder Accelerator Chip in 65-nm CMOS," IEEE Trans. Very Large Scale Integration Syst., vol. 22, no. 7, pp. 1481–1490, July 2014.
- [26] B. Mohammadi et al., "A 128 kb single-bitline 8.4 fJ/bit 90MHz at 0.3V 7T sense-amplifierless SRAM in 28 nm FD-SOI," in IEEE 42nd European Solid-State Circ. Conf., Sept 2016, pp. 429–432.
- [27] 3rd Generation Partnership Project. 3GPP TS 36.212 Multiplexing and channel coding. [Online]. Available: http://www.3gpp.org/dynareport/36212.htm
- [28] B. Yang, Z. Cao, and K. Letaief, "Analysis of low-complexity windowed DFT-based MMSE channel estimator for OFDM systems," *IEEE Trans. Commun.*, vol. 49, pp. 1977–1987, Nov 2001.
- [29] H. V. Sorensen and C. S. Burrus, "Efficient computation of the DFT with only a subset of input or output points," *IEEE Trans. Signal Process.*, vol. 41, pp. 1184–1200, Mar 1993.



Rakesh Gangarajaiah received the M.Sc. degree in electrical engineering from Lund University, Sweden in 2010. He is currently working towards a Ph.D. degree in digital circuit design at the Department of Electrical and Information Technology, Lund University, Sweden.

His research interests include signal processing and hardware design for baseband circuits in wireless communication systems.



Ove Edfors received the M.Sc. degree in computer science and electrical engineering in 1990 and the Ph.D. degree in signal processing in 1996, both from Luleå University of Technology, Sweden. In July 1997 he joined the staff at the Department of Electrical and Information Technology, Lund University, Sweden, where he since 2002 is professor of Radio Systems.

His research interests include radio systems, statistical signal processing and low-complexity algorithms with applications in telecommunication. His current focus is on massive MIMO and energy efficient communications.



Liang Liu received his B.S. and Ph.D. degrees from the Department of Electronics Engineering (2005) and Micro-electronics (2010) at Fudan University, China. In 2010, he was with the Rensselaer Polytechnic Institute (RPI), USA as a visiting researcher. He joined Lund University as a Post-doc in 2010. Since 2016, he has been an Associate Professor at Lund University.

His research interests include wireless communication systems and digital integrated circuit design. He is a board member of the IEEE Swedish SSC/CAS chapter. He is also a member of the technical committees of VLSI systems and applications and CAS for communications of the IEEE Circuits and Systems Society.

# Paper VI

# A Cholesky Decomposition based Massive MIMO Uplink Detector with Adaptive Interpolation

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Lund university's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications\_standards/publications/rights\_link.html to learn how to obtain a License from RightsLink.

R. Gangarajaiah, H. Prabhu, O. Edfors, and L. Liu, "A Cholesky Decomposition based Massive MIMO Uplink Detector with Adaptive Interpolation," © 2017 IEEE, accepted for publication in the *IEEE International Symposium on Circuits and Systems*, 2017.

# A Cholesky Decomposition based Massive MIMO Uplink Detector with Adaptive Interpolation

Rakesh Gangarajaiah, Hemanth Prabhu, Ove Edfors, and Liang Liu Department of Electrical and Information Technology, Lund University, Sweden Email: {rakesh.gangarajaiah, hemanth.prabhu, ove.edfors, liang.liu}@eit.lth.se

Abstract—An adaptive uplink detection scheme for a Massive MIMO (MaMi) base station serving up to 16 users is presented. Considering user distribution in a cell, selective matched filtering (MF) is proposed for non-interference limited users and a Cholesky decomposition (CD) based zero-forcing (ZF) detector is implemented for the remaining users. Channel conditions such as coherence bandwidth are exploited to lower computational complexity by interpolating CD outputs. Performance evaluations on measured MaMi channels indicate a reduction in computation count by 60 times with a less than 1 dB loss at an uncoded bit error rate of  $10^{-3}$ . For the CD, a reconfigurable processor optimized for 8×8 matrices with block decomposition extension to support up to  $16 \times 16$  matrices is presented. Circuit level optimizations in 28 nm FD-SOI resulted in an energy of 1.4 nJ/CD at 400 MHz, and post-layout simulations indicate a 50% reduction in power dissipation when operating with the proposed interpolation based detection scheme compared to traditional ZF detection.

#### I. INTRODUCTION

Massive MIMO (MaMi) is a promising candidate for the next generation wireless systems, achieving high spectral efficiency by using a large number of antennas at the base station (BS) [1]. This not only simplifies processing at the mobile user equipment with downlink precoding [2], but also provides good performance in the uplink with linear detection algorithms such as matched filtering (MF) and zero-forcing (ZF). However, the complexity of these algorithms increases linearly with the number of BS antennas (M), and quadratically with the number of active users (K). Furthermore, the propagation scenarios in a MaMi system can vary more than in traditional small-scale multiple-input multiple-output (MIMO) systems due to the distributed nature of several single antenna users. Hence, an adaptive signal detector which considers factors such as the number of active users and channel conditions is needed to achieve hardware and energy efficiency.

In this paper, a detector which employs MF detection for $K_1$ non-interference limited users close to the BS, and ZF detection for the other $K_2$ interference limited users is presented. The varying positions of mobile users over time result in changing values of $K_1$ and $K_2$, which indicates the need for reconfigurable MF and ZF detectors. ZF detection, which is the more complex of the two, can be performed with Cholesky decomposition (CD) of the Gram matrix ($H^HH$) followed by forward backward substitution (FBS). Furthermore, the block CD algorithm can be used to construct the CD of larger matrices from the CD of smaller sub-matrices, enabling the implementation of a variable size CD processor. Nonetheless, CD has a complexity in the order of $\mathcal{O}(K^3)$, and if calculated on each tone in an orthogonal frequency division multiplexing (OFDM) system, a throughput proportional to the bandwidth is required. This traditional per-tone approach is neither scalable nor efficient for wide-band systems. Fortunately, channel conditions such as coherence bandwidth $(B_{coh})$ can be exploited to compute the CD of a few selected tones, and linear interpolation can be used to approximate the CD of intermediate tones. This enables a drastic reduction in computation count, in proportion to the interpolation distance measured in tones $(D_{sc})$, at the cost of a slight increase in bit error rate (BER).

The proposed detection scheme with combined MF and CD based ZF is evaluated in a MaMi system with 128 antennas at the BS. Effects of CD interpolation are analyzed with measurement data from [3] and the hardware architecture for an interpolation based ZF detector is presented, reducing computations by up to 60 times. To reduce silicon area and power, the CD unit is optimized for 8 users and block decomposition is employed when all 16 users require ZF detection. The ZF detector is implemented in 28 nm FD-SOI with a 141 k-gates CD unit and has a peak throughput of 20 M CD/s. The adaptive interpolation feature is controlled by a single parameter  $D_{sc}$ , which simplifies hardware implementation and provides a wide tuning range.

#### II. BACKGROUND

The baseband signal in the uplink of an OFDM system with N tones can be modeled as

$$\boldsymbol{y}_i = \boldsymbol{H}_i \boldsymbol{x}_i + \boldsymbol{n}_i, \quad \forall i \in \{1, 2, \dots, N\}, \tag{1}$$

where the subscript i is the tone index, $\boldsymbol{y}_i = \left[y_1,...,y_M\right]^T$ is the received data, $\boldsymbol{H}_i \in \mathbb{C}^{M \times K}$ the channel gain matrix, $\boldsymbol{x}_i = \left[x_1,...,x_K\right]^T$ the combined transmit data of users and $\boldsymbol{n}_i$ the additive white Gaussian noise. Signal detection with MF increases signal to noise ratio (SNR), providing good performance for non-interference limited users, and is achieved by

$$\widehat{\boldsymbol{x}}_{MF} = \boldsymbol{H}^H \boldsymbol{y} = \boldsymbol{H}^H \boldsymbol{H} \boldsymbol{x} + \boldsymbol{H}^H \boldsymbol{n}, \tag{2}$$

where  $\boldsymbol{H}^H$  is the Hermitian conjugate of the channel estimate. However, in interference limited scenarios, the more advanced ZF detector is required which operates on the MF data by

$$\widehat{\boldsymbol{x}}_{ZF} = \left(\boldsymbol{H}^H \boldsymbol{H}\right)^{-1} \widehat{\boldsymbol{x}}_{MF}. \tag{3}$$

Typically, users are distributed over a cell, resulting in conditions where MF provides good BER performance for some users but others require ZF. This distribution varies over time due to mobility, sometimes leading to the worst case scenario where all users require ZF detection. Thus, reconfigurability in the signal detector for handling variable channel conditions is desired to achieve energy efficiency.

Operating channel conditions do not change drastically in low mobility scenarios. Therefore, in situations where the  $B_{coh}$  is in the order of several OFDM tones, different interpolation techniques can be used to lower ZF computations [4]. However, low complexity linear interpolation of the inverses with  $D_{sc}$  around 60 results in significant performance degradation.



Fig. 1: (a) Signal flow in hybrid detector, (b) Block CD based ZF detector

#### III. ADAPTIVE DETECTION WITH INTERPOLATION

Even though a typical BS is equipped with a direct source of power, energy efficiency is important due to the large scale of MaMi systems. Hence, implementing a full ZF detector for the worst case scenario, where all users are interference limited, will result in a suboptimal design. In the following, a hybrid detection scheme is analyzed which, coupled with an interpolation strategy, lowers complexity and power with minimal performance loss.

#### A. Hybrid detection with MF and ZF

From a power consumption perspective MF is preferred over ZF but results in inferior performance for interference limited users. To balance these conflicting demands of power and performance, a hybrid detector with user selection (User Sel.) capability is presented in Fig. 1(a). The interference limited users are detected by comparing the diagonal and off-diagonal entries in the full Gram matrix. In limited mobility scenarios, where interference conditions do not change drastically, the computationally expensive full Gram matrix is calculated occasionally. The users with low interference are selected for MF with channel estimates  $H_{P1}$  followed by hard detection to produce  $x_{MF}$ . Next, interference cancelation (IC) using  $H_{P1}$  and  $x_{MF}$  is done on the received data y to calculate a lower dimension vector  $y_{IC}$ . The reduced Gram matrix for the remaining users is computed with  $H_{P2}$ , followed by ZF detection of the interference free vector  $y_{IC}$ , resulting in  $x_{ZF}$ .
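The signal flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implemented detector: the selection rule (largest off-diagonal Gram entry relative to the diagonal, with an arbitrary threshold), the QPSK slicer, the per-user MF normalization, and the function names are all hypothetical, and $y_{IC}$ is taken here to be the $K_2$-dimensional matched-filtered residual.

```python
import numpy as np


def select_mf_users(gram, threshold=0.1):
    """Flag users whose largest off-diagonal Gram entry is small relative to
    their diagonal entry (assumed criterion and threshold)."""
    diag = np.real(np.diag(gram))
    off = np.abs(gram - np.diag(np.diag(gram)))
    return off.max(axis=1) / diag < threshold


def hard_decision_qpsk(z):
    """Slice soft values to unit-energy QPSK symbols (assumed constellation)."""
    return (np.sign(z.real) + 1j * np.sign(z.imag)) / np.sqrt(2)


def hybrid_detect(H, y, mf_mask):
    """MF for the masked users, interference cancellation, ZF for the rest."""
    H1, H2 = H[:, mf_mask], H[:, ~mf_mask]
    # Matched filter with per-user normalization, followed by hard detection.
    z1 = (H1.conj().T @ y) / np.sum(np.abs(H1) ** 2, axis=0)
    x_mf = hard_decision_qpsk(z1)
    # Interference cancellation, then projection onto the remaining users.
    y_ic = H2.conj().T @ (y - H1 @ x_mf)
    # ZF on the reduced Gram matrix of the remaining users.
    x_zf = np.linalg.solve(H2.conj().T @ H2, y_ic)
    return x_mf, x_zf
```

When the hard decisions for the MF users are correct, the cancellation step removes their contribution exactly and the ZF stage only has to invert a $K_2 \times K_2$ matrix.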

The average BER of 16 users in a system with 128 BS antennas operating with the proposed detection scheme is compared against full ZF of all users in Fig. 2(a). In the simulations, eight users $(K_1)$ with a higher SNR are detected with MF, whereas the other eight users $(K_2)$ require ZF detection. When the SNR (per receive antenna) of $K_1$ is just $12\,\mathrm{dB}$ higher than $K_2$, significant degradation in BER is seen due to strong interference, resulting in poor MF detection. However, when $K_1$ users have 18 dB or higher SNR, MF works well and the hybrid scheme performs slightly better than ZF on all users. This can be attributed to the smaller ZF matrix size $(K_2 = 8)$ and the resulting lower noise enhancement in the hybrid scheme. Another advantage of the proposed scheme is the reduced dynamic range of the Gram matrix entries generated from $H_{P2}$, as only users with similar SNR are chosen for ZF. Thus, the hardware implementation of the ZF detector in the hybrid scheme can be optimized to use lower precision when compared to a ZF detector for all users.



Fig. 2: BER simulations for hybrid detector and interpolating CD detector

#### B. Adaptive Interpolation of Cholesky Decomposition

The proposed hybrid scheme reduces the number of users requiring ZF detection and hence the computation count when some users are not limited by interference. Further reductions are obtained by exploiting the  $B_{coh}$  of the channel. Instead of direct matrix inversion of Gram matrices at each OFDM tone, the Gram matrix inverse of two tones which are at a distance of  $D_{sc}$  apart are interpolated for the intermediate tones. However, linear interpolation of these inverses causes significant degradation in performance [4]. As an alternative, consider ZF detection based on CD of the Gram matrix with

$$\widehat{\boldsymbol{x}}_{ZF} = \left(\boldsymbol{H}^{H}\boldsymbol{H}\right)^{-1}\widehat{\boldsymbol{x}}_{MF} = \left(\boldsymbol{L}\boldsymbol{L}^{H}\right)^{-1}\widehat{\boldsymbol{x}}_{MF}, \qquad (4)$$

followed by forward backward substitution (FBS) with,

$$\begin{aligned} \mathbf{y}_F &= \mathbf{L}^{-1} \widehat{\mathbf{x}}_{MF} \\ \widehat{\mathbf{x}}_{ZF} &= \mathbf{L}^{-H} \mathbf{y}_F, \end{aligned} \tag{5}$$

where  $\boldsymbol{L}$  is a lower triangular matrix. To lower complexity, linear interpolation of the  $\boldsymbol{L}$  matrices is proposed instead of Gram matrix inverse interpolation. Furthermore, the interpolation distance  $D_{sc}$  is adaptively varied based on the  $B_{coh}$  of the operating channel to maintain low BER loss, similar to the method considered in [5] for reducing QR decomposition (QRD) computation cost. If  $\boldsymbol{L}_1$  and  $\boldsymbol{L}_N$  correspond to the CD components from tones which are  $D_{sc}$  apart, then the linear interpolated  $\boldsymbol{L}_i$  of intermediate tones are computed with a factor  $\alpha$  dependent on  $D_{sc}$  using,

$$\mathbf{L}_i = (1 - \alpha)\mathbf{L}_1 + \alpha\mathbf{L}_N = \mathbf{L}_1 + \alpha(\mathbf{L}_N - \mathbf{L}_1). \tag{6}$$

The error introduced by this approximation is reduced by decreasing  $D_{sc}$  whereas computation count is reduced by increasing  $D_{sc}$ , enabling run time trade-off between power and performance. The effects of CD interpolation on BER in a MaMi system are evaluated with actual channel measurement data from [3]. A tone spacing of  $15\,\mathrm{kHz}$  is used with different values of  $D_{sc}$  and the resulting BER is depicted in Fig. 2(b), showing a loss of less than  $1\,\mathrm{dB}$  at  $10^{-3}$  for  $D_{sc}$  as high as 60. This highlights the potential of replacing CD computations by low complexity linear interpolation from (6).
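Equations (5) and (6) translate directly into code. The following is a hedged sketch rather than the hardware datapath: `forward_backward_substitution` and `interpolate_cd` are illustrative names, the mapping from tone position to $\alpha$ is left to the caller, and NumPy's general solver stands in for dedicated triangular substitution.

```python
import numpy as np


def forward_backward_substitution(L, x_mf):
    """Solve (L L^H) x = x_mf with two solves, following (5)."""
    y_f = np.linalg.solve(L, x_mf)            # forward:  L y_F = x_MF
    return np.linalg.solve(L.conj().T, y_f)   # backward: L^H x_ZF = y_F


def interpolate_cd(L_first, L_last, alpha):
    """Linearly interpolated Cholesky factor for 0 <= alpha <= 1, following (6)."""
    return L_first + alpha * (L_last - L_first)


# Example: exact ZF on one tone from its Cholesky factor.
rng = np.random.default_rng(1)
H = rng.standard_normal((32, 4)) + 1j * rng.standard_normal((32, 4))
gram = H.conj().T @ H
L = np.linalg.cholesky(gram)
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
x_zf = forward_backward_substitution(L, gram @ x)
print(np.allclose(x_zf, x))
```

Interpolating $L$ keeps the result lower triangular, so the same substitution routine serves both the exact and the interpolated tones.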

#### C. Complexity analysis

Fig. 3: Complexity analysis of detection schemes: (a) Comparison of detection schemes, (b) Operation count comparison

The comparison of three linear detection methods with multiplications as the metric is shown in Fig. 3 (a). QRD based detection is often used in small-scale MIMO systems but is not beneficial in terms of complexity in MaMi systems. The CD based full ZF detection has lower storage requirements and complexity than QRD, but the proposed hybrid scheme requires the smallest number of multiplications. Fig. 3 (b) shows the significant reduction obtained with the hybrid scheme when the number of MF users $(K_1)$ increases. A further reduction by a factor of $(K_2 + 2)/3$ is obtained by replacing CD with interpolation from (6). As an example, in a system with M = 128, K = 16 and $K_1 = 4$, the hybrid detection scheme reduces complexity by 35%.
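The $(K_2 + 2)/3$ factor is consistent with a simple multiplication count. The sketch below assumes $K(K+1)(K+2)/6$ multiplications for a $K \times K$ CD and one multiplication per lower-triangular entry for the interpolation in (6); both counts are inferred for illustration and are not stated in this form in the text.

```python
from fractions import Fraction


def cd_multiplications(k):
    """Assumed multiplication count of a K x K Cholesky decomposition."""
    return k * (k + 1) * (k + 2) // 6


def interpolation_multiplications(k):
    """One multiplication per lower-triangular entry of L in (6) (assumed)."""
    return k * (k + 1) // 2


def reduction_factor(k):
    """Ratio of CD to interpolation multiplications, as an exact fraction."""
    return Fraction(cd_multiplications(k), interpolation_multiplications(k))


for k2 in (8, 16):
    print(k2, cd_multiplications(k2), reduction_factor(k2), Fraction(k2 + 2, 3))
```

For $K_2 = 8$ the assumed CD count evaluates to 120 multiplications and the ratio to 10/3, which equals $(K_2 + 2)/3$.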

#### IV. VLSI ARCHITECTURE

The top level signal flow diagram for the proposed hybrid detector is shown in Fig. 1 (a). The received data y and the columns $H_{P1}$ corresponding to users selected for MF are forwarded to the MF and IC unit. The estimates $H_{P2}$ for $K_2$ users are used to compute the reduced Gram matrix for ZF detection. In the following, optimizations for the ZF detector, which consists of a CD unit and an FBS unit, are discussed.

#### A. Block Based Cholesky decomposition

The hybrid detection scheme requires a reconfigurable CD unit for detecting a variable number of users with ZF, depending on interference conditions. A systolic array based CD unit can be obtained by modifying the design in [7], which has around 60% multiplier utilization. The current implementation is based on a similar structure and the internal connections with input-output matrices for a part of the CD unit are shown in Fig. 4 (a). A combination of different types of multipliers for scaling (CHM), finding the absolute value (SM) and for full complex multiplication (CFM) is used. To save silicon area, the CD unit is optimized for 8 users and the block CD algorithm is used to handle scenarios where more than 8 users need ZF detection. Consider the Gram matrix for 16 users

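$$\boldsymbol{H}^{H}\boldsymbol{H} = \begin{bmatrix} \boldsymbol{A} & \boldsymbol{B}^{H} \\ \boldsymbol{B} & \boldsymbol{D} \end{bmatrix} \text{ and } \boldsymbol{S} = \boldsymbol{D} - \boldsymbol{B}\boldsymbol{A}^{-1}\boldsymbol{B}^{H}, \tag{7}$$

where $S$ is the Schur complement obtained with the smaller $8 \times 8$ sub-matrices A, B and D. With the block decomposition algorithm, the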

$$\mathrm{CD}(\boldsymbol{H}^{H}\boldsymbol{H}) = \begin{bmatrix} \boldsymbol{L}_{\boldsymbol{A}} & \boldsymbol{0} \\ \boldsymbol{B}\boldsymbol{L}_{\boldsymbol{A}}^{-H} & \boldsymbol{L}_{\boldsymbol{S}} \end{bmatrix}, \tag{8}$$

where $L_A$ and $L_S$ are the CD components of A and S respectively. This method requires $A^{-1}$, which in a MaMi system can be approximated by

$$A^{-1} \approx \mathbb{E}\left[a_{jj}^{-1}\right] = \gamma, \text{ and } S \approx D - \gamma B B^H, \tag{9}$$

where $a_{jj}$ are the diagonal elements of $\boldsymbol{A}$ and the inversion operation of $a_{jj}$ is obtained with a look up table (LUT). The $\boldsymbol{B}\boldsymbol{L}_{\boldsymbol{A}}^{-H}$ product is computed with the special $\boldsymbol{L}_{A_j}^{-H}$ matrices [6] generated during the CD of $\boldsymbol{A}$ from

$$BL_A^{-H} = B\left(L_{A_1}^{-H}L_{A_2}^{-H}\cdots L_{A_j}^{-H}\right), \quad \forall j \in \{1,\dots,8\}, \tag{10}$$

which by construction modify only the corresponding j-th column of B. This operation has the same complexity and structure as $BB^H$ in (9) and can be computed by the same hardware unit. Furthermore, these operations can be shared with the Gram matrix unit. Fig. 1 (b) shows the connections in the block CD implementation, using an $8\times 8$ CD unit and an FBS unit together with a matrix multiplier (from the Gram matrix unit). Fig. 4 (b) shows the BER performance of the block CD based detector, indicating a loss of less than 1 dB at $10^{-3}$ when compared to floating point ZF. The timing schedule in Fig. 5 describes the execution order for 16 users, where the matrix multiplier operates in three modes, producing the Schur complement, the CD component $BL_A^{-H}$, and performing parts of forward substitution (Vec). When detecting 8 or fewer users, only the CD unit, the interpolation unit (IU) and the FBS unit are active, and if 12 users require ZF, block decomposition with zero padding is used.

Fig. 4: Architecture of CD unit and performance comparison of Block CD: (a) Internal connections in part of CD unit with an example transformation, (b) BER of Block CD and RTL
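The block decomposition in (7) and (8), together with the scalar approximation in (9), can be checked numerically. The sketch below is a floating point illustration under assumptions: `block_cholesky` is a hypothetical name, and the $BL_A^{-H}$ block is obtained with a general linear solve instead of the per-column $L_{A_j}^{-H}$ products used in the hardware.

```python
import numpy as np


def block_cholesky(gram, approximate_inverse=False):
    """Cholesky factor of a 2n x 2n Gram matrix built from n x n blocks."""
    n = gram.shape[0] // 2
    A, B, D = gram[:n, :n], gram[n:, :n], gram[n:, n:]
    L_A = np.linalg.cholesky(A)
    # B L_A^{-H} is the conjugate transpose of the solution of L_A X = B^H.
    BLA = np.linalg.solve(L_A, B.conj().T).conj().T
    if approximate_inverse:
        # Scalar stand-in for A^{-1}: mean of the inverted diagonal, as in (9).
        gamma = np.mean(1.0 / np.real(np.diag(A)))
        S = D - gamma * (B @ B.conj().T)
    else:
        S = D - BLA @ BLA.conj().T            # equals D - B A^{-1} B^H
    L = np.zeros_like(gram)
    L[:n, :n], L[n:, :n], L[n:, n:] = L_A, BLA, np.linalg.cholesky(S)
    return L


rng = np.random.default_rng(2)
H = (rng.standard_normal((128, 16)) + 1j * rng.standard_normal((128, 16))) / np.sqrt(2)
gram = H.conj().T @ H
L_exact = block_cholesky(gram)
L_approx = block_cholesky(gram, approximate_inverse=True)
print(np.allclose(L_exact @ L_exact.conj().T, gram))
print(np.linalg.norm(L_approx @ L_approx.conj().T - gram) / np.linalg.norm(gram))
```

With many more antennas than users, the $A$ block is close to a scaled identity in this synthetic i.i.d. channel, which is why the scalar $\gamma$ introduces only a small reconstruction error here; measured channels may behave differently.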



Fig. 5: Timing schedule for K = 16 and K = 8



Fig. 6: Power reduction with adaptive ZF detector and implementation details

#### B. Interpolation unit and Forward Backward Substitution unit

The IU unit stores two consecutive outputs $L_1$ and $L_N$ from the CD unit in the "Store1" and "Store2" cycles as shown in Fig. 5 and produces the $L_i$ matrices using (6) for the intermediate tones. The parameter $D_{sc}$ determines the $\alpha$ weights and the unit is implemented to produce a new $L_i$ every two FBS runs. The FBS unit operates on $y_{IC}$ with the $L_i$ matrix from the IU unit and runs twice for each input $x_{MF}$, performing forward followed by backward substitution. However, when 16 user ZF detection is required, the block is run 4 times as depicted in Fig. 5. The vector $y_{FP1}$ from the forward substitution of $x_{MF1}$ is multiplied with $BL_A^{-H}$ and subtracted from $x_{MF2}$ using the Vec. Sub. unit, followed by the second forward substitution to produce $y_{FP2}$. The same procedure is repeated to obtain the final result $x_{ZF}$.

#### V. IMPLEMENTATION AND RESULTS

The current implementation adopts a high level synthesis [8] flow which enables rapid design space exploration to optimize hardware reuse. The inputs and internal variables of the CD unit are quantized to 13 bits and a target throughput of 20 M CD/s is chosen to support wide-band systems and full ZF detection without interpolation. The CD unit for 8×8 matrices requires 120 (most of them complex) multiplication operations corresponding to 288 real multipliers. To save silicon area, the current design uses a folding factor (FF) of 20 and a clock of 400 MHz, resulting in an implementation with 19 real multipliers operating at 80% utilization. In order to match the throughputs, the FBS unit is implemented with a clock of 600 MHz and a FF of 15. The operations performed in the FBS unit are sensitive to division accuracy and a high precision unit is implemented and reused. When the ZF detector operates in the interpolation mode, the IU unit clocked at 40 MHz generates a new output $L_i$ every 2 clock cycles to be used in the FBS unit.
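The clock and folding choices can be cross-checked with a short calculation. The sketch assumes that the folding factor equals the number of clock cycles a unit spends per result and that the FBS unit needs two runs (forward and backward) per detection; both are readings of the text rather than stated definitions.

```python
def results_per_second(clock_hz, cycles_per_result):
    """Throughput of a unit that needs a fixed number of cycles per result."""
    return clock_hz / cycles_per_result


cd_rate = results_per_second(400e6, 20)        # CD unit, folding factor 20
fbs_rate = results_per_second(600e6, 15) / 2   # FBS unit, two runs per detection
iu_rate = results_per_second(40e6, 2)          # IU, one L_i every 2 cycles

print(cd_rate, fbs_rate, iu_rate)
```

Under these assumptions all three units sustain the same rate, matching the 20 M CD/s target.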

The BER performance of the implementation is compared with floating point models in Fig. 4 (b), indicating a marginal increase in BER for the fixed-point implementation. The components of the ZF detector are synthesized and implemented in 28 nm FD-SOI technology and the implementation details are shown in Table I. Though the addition of the IU unit adds 19% area overhead, switching to interpolation mode allows CDs to be replaced by low power interpolations, e.g., with a $D_{sc}$ of 20, the detector would compute 2 CDs and 18 interpolations instead of 20 CDs. The reduction in power dissipation for different values of $D_{sc}$ is depicted in Fig. 6 for both the CD unit and the ZF detector (which includes the FBS unit). A comparison of the CD unit with other implementations from literature is presented in Table II. Compared to small-scale MIMO decomposition implementations in [9], [10], the proposed design has higher energy efficiency with comparable gate efficiency, and can be further improved by using the adaptive interpolation strategy.

TABLE II: Comparison of Decomposition Processors

| Item                                               | NS [2] | LDL [9] | LU [10]         | This work (CD) |
|----------------------------------------------------|--------|---------|-----------------|----------------|
| Matrix Dimension (N)                               | 16×16  | 4×4     | 4×4             | 8×8            |
| Technology [nm]                                    | 65     | 90      | 90              | 28             |
| Gate Count [k]                                     | 104    | 90      | 68              | 141            |
| Throughput [M ops/s]                               | 0.5    | 30      | 31.5            | 20             |
| Norm. Throughput <sup>\*</sup> [M ops/s]           | 8      | 16      | 17              | 20             |
| Norm. Gate Efficiency <sup>†</sup> [ops/s/gate]    | 77     | 178     | 250             | 141            |
| Power [mW]                                         | -      | -       | 35<sup>♦</sup>  | 28             |
| Norm. Energy <sup>‡</sup> [nJ]                     | -      | -       | 3<sup>♦</sup>   | 1.4            |

<sup>\*</sup> Normalized (Norm.) throughput = Throughput $\times \frac{\text{Technology}}{28} \times \frac{(N)(N+1)(N+2)}{(8)(9)(10)}$
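As a cross-check, the normalized rows of Table II can be recomputed from the raw rows using the throughput normalization given in the table footnote and the gate efficiency defined as normalized throughput divided by gate count. The sketch below is illustrative only: the function names are ours, and the rounding of the normalized throughput before computing the gate efficiency is an assumption that appears to match the tabulated values to within one unit.

```python
def norm_throughput(throughput_mops: float, tech_nm: float, n: int) -> float:
    """Throughput scaled to 28 nm and to an 8x8 matrix, in M ops/s."""
    return throughput_mops * (tech_nm / 28) * (n * (n + 1) * (n + 2)) / (8 * 9 * 10)


def gate_efficiency(norm_tp_mops: float, gate_count_k: float) -> float:
    """Normalized throughput per gate, in ops/s per gate."""
    return norm_tp_mops * 1e6 / (gate_count_k * 1e3)


# (throughput [M ops/s], technology [nm], N, gate count [k]) taken from Table II.
TABLE_II = {
    "NS [2]": (0.5, 65, 16, 104),
    "LDL [9]": (30, 90, 4, 90),
    "LU [10]": (31.5, 90, 4, 68),
    "This work (CD)": (20, 28, 8, 141),
}

results = {}
for name, (tp, tech, n, gates) in TABLE_II.items():
    ntp = round(norm_throughput(tp, tech, n))  # assumption: rounded as in the table
    results[name] = (ntp, gate_efficiency(ntp, gates))
    print(f"{name:15s} norm. throughput = {ntp:2d} M ops/s, "
          f"gate efficiency = {results[name][1]:.1f} ops/s/gate")
```

The cubic term $N(N+1)(N+2)$ dominates the normalization: the 16×16 design in [2] gains a factor of 6.8 from matrix size alone, while the 4×4 designs in [9] and [10] are scaled down by a factor of 6.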

#### VI. CONCLUSION

This paper presents an adaptive uplink detector for MaMi systems using a hybrid detection scheme. A CD interpolation strategy exploiting $B_{coh}$ is employed to lower computational complexity and is verified with measured channel data. A reconfigurable architecture based on a CD unit and block decomposition, serving up to 16 users, is presented. The ZF detector is implemented in 28 nm FD-SOI technology, has a peak throughput of 20 M CD/s and requires 1.4 nJ/CD. The CD interpolation scheme reduces the power dissipation of the ZF detector by 50% with a performance loss of less than 1 dB, resulting in a highly energy-efficient signal detector.

#### REFERENCES

- [1] F. Rusek et al., "Scaling up MIMO: Opportunities and challenges with very large arrays," *IEEE Signal Process. Mag.*, vol. 30, no. 1, pp. 40–60, January 2013.
- [2] H. Prabhu et al., "Hardware efficient approximative matrix inversion for linear pre-coding in massive MIMO," in *IEEE Int. Symp. Circ. and Syst.*, June 2014, pp. 1700–1703.
- [3] X. Gao et al., "Measured propagation characteristics for very-large MIMO at 2.6 GHz," in *The 46th Asilomar Conf. Signals, Syst. and Comput.*, November 2012, pp. 295–299.
- [4] MAMMOET project deliverable. (2016) D3.2-Distributed and centralized baseband processing algorithms, architectures, and platforms. [Online]. Available: https://mammoet-project.eu/downloads/publications/deliverables/MAMMOET-D3.2-baseband-processing-PU-M24.pdf
- [5] R. Gangarajaiah et al., "Low complexity adaptive channel estimation and QR decomposition for an LTE-A downlink," in *IEEE 25th Annu. Int. Symp. Personal, Indoor, and Mobile Radio Commun.*, September 2014, pp. 459–463.
- [6] L. Trefethen and D. Bau, *Numerical Linear Algebra*. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1997.
- [7] S. J. Bellis et al., "Alternative systolic array for non-square-root Cholesky decomposition," *IEEE Proc. Comput. Digital Techn.*, vol. 144, no. 2, pp. 57–64, March 1997.
- [8] Mentor Graphics. (2016) High Level Synthesis. [Online]. Available: https://www.mentor.com/hls-lp/
- [9] D. Auras et al., "Efficient VLSI architectures for matrix inversion in soft-input soft-output MMSE MIMO detectors," in *IEEE Int. Symp. Circ. and Syst.*, June 2014.
- [10] C. Studer et al., "ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation," *IEEE J. Solid-State Circ.*, vol. 46, no. 7, pp. 1754–1765, July 2011.

<sup>†</sup> Norm. Gate Efficiency = Norm. Throughput / Gate Count

<sup>♦</sup> Measurement result (includes forward substitution)

<sup>‡</sup> Norm. Energy = Energy per matrix operation $\times \frac{(8)(9)(10)}{(N)(N+1)(N+2)} \times \left(\frac{0.8}{V_{dd}}\right)^2$
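The energy normalization can be illustrated with the power and throughput figures of Table II, taking the energy per matrix operation as power divided by throughput. This is a minimal sketch with hypothetical function names. The supply voltages are not listed in the table: 0.8 V for this work follows from the normalized and unnormalized energy both being 1.4 nJ, while 1.2 V for the LU design in [10] is purely an assumption, chosen because it reproduces the tabulated value of about 3 nJ.

```python
def energy_per_op_nj(power_mw: float, throughput_mops: float) -> float:
    """Energy per matrix operation in nJ (mW divided by M ops/s gives nJ)."""
    return power_mw / throughput_mops


def norm_energy_nj(power_mw: float, throughput_mops: float, n: int, vdd: float) -> float:
    """Energy scaled to an 8x8 matrix and to a 0.8 V supply, in nJ."""
    size_scale = (8 * 9 * 10) / (n * (n + 1) * (n + 2))
    voltage_scale = (0.8 / vdd) ** 2
    return energy_per_op_nj(power_mw, throughput_mops) * size_scale * voltage_scale


# This work: 28 mW at 20 M CD/s, N = 8, Vdd = 0.8 V (inferred, see lead-in).
this_work = norm_energy_nj(28, 20, 8, 0.8)

# LU [10]: 35 mW at 31.5 M ops/s, N = 4; Vdd = 1.2 V is an ASSUMPTION, not from the table.
lu = norm_energy_nj(35, 31.5, 4, 1.2)

print(f"This work: {this_work:.2f} nJ")
print(f"LU [10]:   {lu:.2f} nJ")
```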