
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Regularization Methods and High Dimensional Data: A Comparative Study Based on Frequentist and Bayesian Methods

Gerholm, Markus LU and Sörstadius, Johan LU (2024) STAH11 20232
Department of Statistics
Abstract
As high dimensional data becomes increasingly accessible and common, the need for reliable methods to combat problems such as overfitting and multicollinearity increases. Models need to be able to manage large data sets where predictor variables often outnumber the observations. In this study the frequentist and Bayesian frameworks are tested against each other in three simulated situations: one where the predictor variables greatly outnumber the observations, one where the simulated data has high correlation between variables, and one where the coefficients to be estimated are known beforehand, which enables comparisons between true and estimated values. Three approaches are used from each of the statistical frameworks. The frequentist models are Ridge regression, least absolute shrinkage and selection operator (LASSO) regression, and the combined model, Elastic net regression. The Bayesian models are three regressions with different prior beliefs regarding the coefficients’ probability distributions: the Normal distribution, the Cauchy distribution and the Horseshoe distribution. To compare the frameworks, different loss functions have been used, such as predictability on new data, amount of explained variance and the number of unnecessary predictor variables the model successfully regularizes. The results of the study show that the Bayesian Horseshoe model has the greatest overall performance regarding predictability, variable selection and parameter estimation. The LASSO regression performs better variable selection on highly correlated data than all of the other models. The frequentist models are also more easily computed, which matters if computational power or time is a limited resource; in other cases the Horseshoe model is preferable.
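For orientation, the six models named in the abstract can be written in their common textbook forms. This is a sketch under assumptions, not the thesis's own notation: the symbols $y$, $X$, $\beta$, the penalty weight $\lambda$, the mixing weight $\alpha$, and the scale parameters $\sigma_\beta$, $s$, $\omega_j$, $\tau$ are all introduced here for illustration, and the exact parameterizations and hyperpriors used in the thesis may differ.

```latex
% Frequentist penalized least squares (textbook forms; scaling conventions vary by software)
\begin{align*}
\hat\beta_{\text{ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\hat\beta_{\text{lasso}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\hat\beta_{\text{enet}}  &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2
      + \lambda \bigl( \alpha \lVert \beta \rVert_1 + (1-\alpha) \lVert \beta \rVert_2^2 \bigr),
      \qquad 0 \le \alpha \le 1
\end{align*}

% Bayesian counterparts: the shrinkage comes from the prior placed on each coefficient
\begin{align*}
\text{Normal:}    \quad & \beta_j \sim \mathcal{N}(0, \sigma_\beta^2) \\
\text{Cauchy:}    \quad & \beta_j \sim \operatorname{Cauchy}(0, s) \\
\text{Horseshoe:} \quad & \beta_j \mid \omega_j, \tau \sim \mathcal{N}(0, \omega_j^2 \tau^2),
      \quad \omega_j \sim \mathrm{C}^{+}(0, 1)
\end{align*}
```

Here $\mathrm{C}^{+}$ denotes the half-Cauchy distribution, $\omega_j$ is a per-coefficient (local) scale and $\tau$ a shared (global) scale, which in a full model is given its own prior. With a Gaussian likelihood, the Normal prior corresponds to Ridge regression in the sense that the posterior mode coincides with the ridge estimate for a matching $\lambda$. The heavier-tailed Cauchy and Horseshoe priors shrink small coefficients strongly while leaving large ones comparatively untouched.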
author
Gerholm, Markus LU and Sörstadius, Johan LU
supervisor
organization
course
STAH11 20232
year
type
M2 - Bachelor Degree
subject
keywords
Linear regression, high dimensional data, regularization, Bayesian methods
language
English
id
9145377
date added to LUP
2024-01-24 12:27:44
date last changed
2024-01-24 12:27:44
@misc{9145377,
  abstract     = {{As the amount of high dimensional data becomes increasingly accessible and common, the need for reliable methods to combat problems such as overfitting and multicollinearity increases. Models need to be able to manage large data sets where predictor variables often outnumber the amount of observations. In this study the frequentist and Bayesian framework is tested against each other based on three different simulated situations. One where the amount of predictor variables greatly outnumber the observations, one where the simulated data has a high correlation between variables and one where a situation is created where the coefficients to be estimated are known beforehand. This enables comparisons between true values and estimated values. Three different approaches are used from both of the statistical frameworks. The frequentist models consist of Ridge regression, least absolute shrinkage and selection operator (LASSO) regression as well as the combined model Elastic net regression. The Bayesian models consist of three regressions with different prior beliefs regarding the coefficients’ probability distributions. The Normal distribution, the Cauchy distribution and the Horseshoe distribution were chosen in this thesis. To compare the different frameworks, different loss functions have been used such as predictability on new data, amount of explained variance and the amount of unnecessary predictor variables the model successfully regularizes. The results of the study show that the Bayesian Horseshoe model has the greatest overall performance regarding predictability, variable selection and parameter estimation. The LASSO regression performs better variable selection on highly correlated data than all of the other models. The frequentist models are also more easily computed if computational power or time is a limited resource, in the other cases the Horseshoe model is to prefer.}},
  author       = {{Gerholm, Markus and Sörstadius, Johan}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Regularization Methods and High Dimensional Data: A Comparative Study Based on Frequentist and Bayesian Methods}},
  year         = {{2024}},
}