Regularization Methods and High Dimensional Data: A Comparative Study Based on Frequentist and Bayesian Methods
(2024) STAH11 20232
Department of Statistics
- Abstract
- As high dimensional data becomes increasingly accessible and common, the need for reliable methods to combat problems such as overfitting and multicollinearity increases. Models need to be able to handle large data sets in which predictor variables often outnumber the observations. In this study the frequentist and Bayesian frameworks are tested against each other in three simulated situations: one where the predictor variables greatly outnumber the observations, one where the simulated variables are highly correlated, and one where the coefficients to be estimated are known beforehand, which enables comparisons between true and estimated values. Three approaches are used from each statistical framework. The frequentist models are Ridge regression, least absolute shrinkage and selection operator (LASSO) regression, and the combined model, Elastic net regression. The Bayesian models are three regressions with different prior beliefs about the coefficients’ probability distributions: the Normal, Cauchy and Horseshoe distributions. To compare the frameworks, different loss functions are used, such as predictability on new data, the amount of explained variance, and the number of unnecessary predictor variables that the model successfully regularizes. The results show that the Bayesian Horseshoe model has the best overall performance in terms of predictability, variable selection and parameter estimation. LASSO regression performs better variable selection on highly correlated data than all of the other models. The frequentist models are also easier to compute when computational power or time is limited; in other cases the Horseshoe model is preferable.
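The setting described in the abstract, with more predictors than observations and known true coefficients, can be illustrated with a minimal sketch. This is not the thesis's code: the function names (`simulate`, `ridge`, `lasso`), the sample sizes, the penalty values and the use of NumPy are all assumptions made for illustration. It fits a closed-form Ridge estimate and a coordinate-descent LASSO estimate and counts how many coefficients each sets exactly to zero.

```python
import numpy as np

def simulate(n=50, p=200, k=5, noise=0.5, seed=0):
    """Simulate y = X @ beta + noise where only the first k coefficients are nonzero."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:k] = 3.0  # hypothetical "known" true coefficients
    y = X @ beta + noise * rng.standard_normal(n)
    return X, y, beta

def ridge(X, y, lam):
    """Ridge estimate: solve (X'X + lam * I) beta = X'y. Well defined for lam > 0 even if p > n."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso(X, y, lam, n_sweeps=200):
    """LASSO by cyclic coordinate descent on (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]        # residual with predictor j removed
            rho = X[:, j] @ resid / n          # correlation of predictor j with that residual
            # Soft-thresholding sets small coefficients exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_scale[j]
            resid -= X[:, j] * beta[j]        # put the updated contribution back
    return beta

X, y, beta_true = simulate()
b_ridge = ridge(X, y, lam=1.0)
b_lasso = lasso(X, y, lam=0.5)

# Ridge shrinks coefficients but does not zero them; LASSO performs variable selection.
n_zero_ridge = int(np.sum(b_ridge == 0.0))
n_zero_lasso = int(np.sum(b_lasso == 0.0))
print("exact zeros, ridge:", n_zero_ridge, "lasso:", n_zero_lasso)
```

Comparing `b_ridge` and `b_lasso` against `beta_true` gives the kind of true-versus-estimated comparison the abstract mentions. The Bayesian models with Normal, Cauchy and Horseshoe priors would need a sampler and are not sketched here.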
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9145377
- author
- Gerholm, Markus LU and Sörstadius, Johan LU
- supervisor
- organization
- course
- STAH11 20232
- year
- 2024
- type
- M2 - Bachelor Degree
- subject
- keywords
- Linear regression, high dimensional data, regularization, Bayesian methods
- language
- English
- id
- 9145377
- date added to LUP
- 2024-01-24 12:27:44
- date last changed
- 2024-01-24 12:27:44
@misc{9145377,
  abstract     = {{As high dimensional data becomes increasingly accessible and common, the need for reliable methods to combat problems such as overfitting and multicollinearity increases. Models need to be able to handle large data sets in which predictor variables often outnumber the observations. In this study the frequentist and Bayesian frameworks are tested against each other in three simulated situations: one where the predictor variables greatly outnumber the observations, one where the simulated variables are highly correlated, and one where the coefficients to be estimated are known beforehand, which enables comparisons between true and estimated values. Three approaches are used from each statistical framework. The frequentist models are Ridge regression, least absolute shrinkage and selection operator (LASSO) regression, and the combined model, Elastic net regression. The Bayesian models are three regressions with different prior beliefs about the coefficients’ probability distributions: the Normal, Cauchy and Horseshoe distributions. To compare the frameworks, different loss functions are used, such as predictability on new data, the amount of explained variance, and the number of unnecessary predictor variables that the model successfully regularizes. The results show that the Bayesian Horseshoe model has the best overall performance in terms of predictability, variable selection and parameter estimation. LASSO regression performs better variable selection on highly correlated data than all of the other models. The frequentist models are also easier to compute when computational power or time is limited; in other cases the Horseshoe model is preferable.}},
  author       = {{Gerholm, Markus and Sörstadius, Johan}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Regularization Methods and High Dimensional Data: A Comparative Study Based on Frequentist and Bayesian Methods}},
  year         = {{2024}},
}