Regularization Methods and High Dimensional Data: A Comparative Study Based on Frequentist and Bayesian Methods

Gerholm, Markus; Sörstadius, Johan

Gerholm, Markus and Sörstadius, Johan (2024) STAH11 20232
Department of Statistics

Abstract: As high dimensional data becomes increasingly accessible and common, the need for reliable methods to combat problems such as overfitting and multicollinearity increases. Models need to be able to manage large data sets where predictor variables often outnumber the observations. In this study the frequentist and Bayesian frameworks are tested against each other in three different simulated situations: one where the number of predictor variables greatly exceeds the number of observations, one where the simulated data has a high correlation between variables, and one where the coefficients to be estimated are known beforehand, which enables comparisons between true values and estimated values. Three different approaches are used from each of the statistical frameworks. The frequentist models consist of Ridge regression, least absolute shrinkage and selection operator (LASSO) regression, and the combined model Elastic net regression. The Bayesian models consist of three regressions with different prior beliefs regarding the coefficients’ probability distributions: the Normal distribution, the Cauchy distribution and the Horseshoe distribution. To compare the frameworks, different loss functions have been used, such as predictability on new data, the amount of explained variance, and the number of unnecessary predictor variables the model successfully regularizes. The results of the study show that the Bayesian Horseshoe model has the greatest overall performance regarding predictability, variable selection and parameter estimation. The LASSO regression performs better variable selection on highly correlated data than all of the other models. The frequentist models are also more easily computed if computational power or time is a limited resource; in other cases the Horseshoe model is preferable.
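The three frequentist penalties named in the abstract can be illustrated with a small sketch. This is not the thesis's code: the sample sizes, the penalty strength `lam`, the mixing weight `alpha`, the true coefficients and the helper names are all assumptions chosen for illustration. It fits Ridge, LASSO and Elastic net by coordinate descent on simulated data where predictors outnumber observations and the true coefficients are known.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def elastic_net_cd(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for the objective
    (1/(2n))*||y - X b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2).
    alpha=0 gives a Ridge penalty, alpha=1 a LASSO penalty.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # per-column mean squares
    resid = y.astype(float).copy()      # y - X @ beta, with beta = 0
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            resid += X[:, j] * (beta[j] - new)  # keep the residual in sync
            beta[j] = new
    return beta


# Hypothetical simulation settings (assumed, not taken from the thesis).
rng = np.random.default_rng(0)
n, p = 50, 200                          # more predictors than observations
beta_true = np.zeros(p)
beta_true[:5] = 3.0                     # only five predictors matter
X = rng.standard_normal((n, p))
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 0.3
b_ridge = elastic_net_cd(X, y, lam, alpha=0.0)
b_lasso = elastic_net_cd(X, y, lam, alpha=1.0)
b_enet = elastic_net_cd(X, y, lam, alpha=0.5)

for name, b in [("ridge", b_ridge), ("lasso", b_lasso), ("elastic net", b_enet)]:
    print(name, "nonzero coefficients:", np.count_nonzero(b))
```

The count of nonzero coefficients shows the qualitative difference: the L1 part of the penalty sets coefficients exactly to zero, while a pure Ridge penalty only shrinks them. Comparing each estimate with `beta_true` mirrors the known-coefficient scenario the abstract describes.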

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9145377

author

Gerholm, Markus and Sörstadius, Johan

supervisor

Farrukh Javed

organization

Department of Statistics

course

STAH11 20232

year

2024

type

M2 - Bachelor Degree

subject

Mathematics and Statistics

keywords

Linear regression, high dimensional data, regularization, Bayesian methods

language

English

id

9145377

date added to LUP

2024-01-24 12:27:44

date last changed

2024-01-24 12:27:44

@misc{9145377,
  abstract     = {{As high dimensional data becomes increasingly accessible and common, the need for reliable methods to combat problems such as overfitting and multicollinearity increases. Models need to be able to manage large data sets where predictor variables often outnumber the observations. In this study the frequentist and Bayesian frameworks are tested against each other in three different simulated situations: one where the number of predictor variables greatly exceeds the number of observations, one where the simulated data has a high correlation between variables, and one where the coefficients to be estimated are known beforehand, which enables comparisons between true values and estimated values. Three different approaches are used from each of the statistical frameworks. The frequentist models consist of Ridge regression, least absolute shrinkage and selection operator (LASSO) regression, and the combined model Elastic net regression. The Bayesian models consist of three regressions with different prior beliefs regarding the coefficients’ probability distributions: the Normal distribution, the Cauchy distribution and the Horseshoe distribution. To compare the frameworks, different loss functions have been used, such as predictability on new data, the amount of explained variance, and the number of unnecessary predictor variables the model successfully regularizes. The results of the study show that the Bayesian Horseshoe model has the greatest overall performance regarding predictability, variable selection and parameter estimation. The LASSO regression performs better variable selection on highly correlated data than all of the other models. The frequentist models are also more easily computed if computational power or time is a limited resource; in other cases the Horseshoe model is preferable.}},
  author       = {{Gerholm, Markus and Sörstadius, Johan}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Regularization Methods and High Dimensional Data: A Comparative Study Based on Frequentist and Bayesian Methods}},
  year         = {{2024}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES