Generalization in Linear Regression: A Simulation Study of OLS, Ridge, and Lasso
(2026) STAH11 20252
Department of Statistics
- Abstract
- Linear regression is a standard tool for numerical prediction, but there are several estimators to choose from, and this choice can greatly affect prediction accuracy. This thesis empirically explores how the out-of-sample predictive performance of ordinary least squares (OLS), ridge regression (Ridge), and the lasso (Lasso) is affected by varying key model conditions, given a baseline setting. For full control over all mechanisms at play, all experiments are implemented and simulated in R. The data-generating process is linear-Gaussian, where training sample size, data dimensionality, sparsity, signal-to-noise ratio, and predictor collinearity are the relevant factors of study. Out-of-sample predictive performance is evaluated using mean squared prediction error (MSPE) on independently generated test sets and summarized as mean MSPE across replications. The experiments show that there is no universally best method for prediction, as every method was outperformed by another in at least one experiment. Ridge near-universally had better predictive performance than OLS, and outperformed Lasso in dense regimes. Lasso clearly outperformed both other methods in highly sparse regimes. The performance of OLS was competitive with the regularized methods in overdetermined and stable regimes, but even modest deviations from that state could strongly favor one of the regularized methods instead. The Moore-Penrose pseudoinverse solution to OLS was also considered in overparameterized settings, where it was competitive. In regimes where OLS clearly lagged behind, the most salient decider of whether Ridge or Lasso would be the best-performing method was whether strong sparsity was present or not. The thesis gives controlled method comparisons across relevant factors with tabular and graphical presentations of the results. The exclusive focus of the thesis is prediction; model interpretability and in-sample fit were not considered.
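The evaluation protocol described in the abstract (simulate training data from a linear-Gaussian model, fit each estimator, score it by MSPE on an independent test set, and average over replications) can be illustrated with a minimal sketch. This is not the thesis code, which is written in R. Everything below is an assumption made for illustration: the function names, the uncorrelated standard-normal predictors, the absence of an intercept, the sample sizes, and the fixed penalty values (the thesis's tuning procedure is not stated in this record). It uses only the Python standard library.

```python
import random

def simulate(n, p, beta, sigma, rng):
    """Draw (X, y) from y = X beta + eps with iid N(0, 1) predictors, N(0, sigma^2) noise."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma) for row in X]
    return X, y

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ridge(X, y, lam):
    """Minimise ||y - Xb||^2 + lam * ||b||^2; lam = 0 is OLS when X'X is invertible."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(A, b)

def lasso(X, y, lam, sweeps=100):
    """Minimise 0.5 * ||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = y[:]  # residual y - Xb, starting from b = 0
    col_ss = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # inner product of column j with the residual that excludes b[j]'s contribution
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * b[j]
            new = max(abs(rho) - lam, 0.0) / col_ss[j]  # soft-thresholding
            if rho < 0:
                new = -new
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                b[j] = new
    return b

def mspe(X, y, b):
    """Mean squared prediction error of coefficient vector b on (X, y)."""
    return sum((yi - sum(x * bj for x, bj in zip(row, b))) ** 2
               for row, yi in zip(X, y)) / len(y)

def mean_mspe(fit, n, p, beta, sigma, reps, seed, n_test=200):
    """Average test-set MSPE of `fit` over `reps` independent train/test draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        X_train, y_train = simulate(n, p, beta, sigma, rng)
        X_test, y_test = simulate(n_test, p, beta, sigma, rng)
        total += mspe(X_test, y_test, fit(X_train, y_train))
    return total / reps

# Hypothetical sparse setting: 2 of 10 true coefficients are nonzero.
beta = [2.0, -1.5] + [0.0] * 8
fits = {
    "OLS": lambda X, y: ridge(X, y, 0.0),
    "Ridge": lambda X, y: ridge(X, y, 5.0),   # penalty chosen arbitrarily
    "Lasso": lambda X, y: lasso(X, y, 5.0),   # penalty chosen arbitrarily
}
# The same seed gives every method identical training and test sets.
results = {name: mean_mspe(fit, n=30, p=10, beta=beta, sigma=1.0, reps=20, seed=1)
           for name, fit in fits.items()}
print(results)
```

Varying `n`, `p`, the number of nonzero entries in `beta`, or `sigma` mirrors the sample-size, dimensionality, sparsity, and signal-to-noise factors named in the abstract. Collinearity and the pseudoinverse solution for p > n are not covered by this sketch, and the printed numbers say nothing about the thesis's actual results.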
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/student-papers/record/9225258
- author
- Enbom, David LU
- supervisor
- Ivan Hejny LU
- organization
- course
- STAH11 20252
- year
- 2026
- type
- M2 - Bachelor Degree
- subject
- keywords
- linear regression, generalization, prediction, ordinary least squares, ridge regression, lasso, regularization, simulation study, sample size, dimensionality, sparsity, signal-to-noise ratio, collinearity, mean squared prediction error, statistical learning
- language
- English
- id
- 9225258
- date added to LUP
- 2026-04-20 14:20:31
- date last changed
- 2026-04-20 14:24:01
@misc{9225258,
abstract = {{Linear regression is a standard tool for numerical prediction, but there are several estimators to choose from, and this choice can greatly affect prediction accuracy. This thesis empirically explores how the out-of-sample predictive performance of ordinary least squares (OLS), ridge regression (Ridge), and the lasso (Lasso) is affected by varying key model conditions, given a baseline setting. For full control over all mechanisms at play, all experiments are implemented and simulated in R. The data-generating process is linear-Gaussian, where training sample size, data dimensionality, sparsity, signal-to-noise ratio, and predictor collinearity are the relevant factors of study. Out-of-sample predictive performance is evaluated using mean squared prediction error (MSPE) on independently generated test sets and summarized as mean MSPE across replications. The experiments show that there is no universally best method for prediction, as every method was outperformed by another in at least one experiment. Ridge near-universally had better predictive performance than OLS, and outperformed Lasso in dense regimes. Lasso clearly outperformed both other methods in highly sparse regimes. The performance of OLS was competitive with the regularized methods in overdetermined and stable regimes, but even modest deviations from that state could strongly favor one of the regularized methods instead. The Moore-Penrose pseudoinverse solution to OLS was also considered in overparameterized settings, where it was competitive. In regimes where OLS clearly lagged behind, the most salient decider of whether Ridge or Lasso would be the best-performing method was whether strong sparsity was present or not. The thesis gives controlled method comparisons across relevant factors with tabular and graphical presentations of the results. The exclusive focus of the thesis is prediction; model interpretability and in-sample fit were not considered.}},
author = {{Enbom, David}},
language = {{eng}},
note = {{Student Paper}},
title = {{Generalization in Linear Regression: A Simulation Study of OLS, Ridge, and Lasso}},
year = {{2026}},
}