LUP Student Papers

LUND UNIVERSITY LIBRARIES

Generalization in Linear Regression: A Simulation Study of OLS, Ridge, and Lasso

Enbom, David LU (2026) STAH11 20252
Department of Statistics
Abstract
Linear regression is a standard tool for numerical prediction, but there are several estimators to choose from, and this choice can greatly affect prediction accuracy. This thesis empirically explores how the out-of-sample predictive performance of ordinary least squares (OLS), ridge regression (Ridge), and the lasso (Lasso) is affected by varying key model conditions, given a baseline setting. For full control over all mechanisms at play, all experiments are implemented and simulated in R. The data-generating process is linear-Gaussian, where training sample size, data dimensionality, sparsity, signal-to-noise ratio, and predictor collinearity are the relevant factors of study. Out-of-sample predictive performance is evaluated using mean squared prediction error (MSPE) on independently generated test sets and summarized as mean MSPE across replications. The experiments show that there is no universally best method for prediction, as every method was outperformed by another in at least one experiment. Ridge near-universally had better predictive performance than OLS, and outperformed Lasso in dense regimes. Lasso clearly outperformed both other methods in highly sparse regimes. The performance of OLS was competitive with the regularized methods in overdetermined and stable regimes, but even modest deviations from that state could strongly favor one of the regularized methods instead. The Moore-Penrose pseudoinverse solution to OLS was also considered in overparameterized settings, where it was competitive. In regimes where OLS clearly lagged behind, the most salient decider of whether Ridge or Lasso would be the best-performing method was whether strong sparsity was present or not. The thesis gives controlled method comparisons across relevant factors with tabular and graphical presentations of the results. The exclusive focus of the thesis is prediction; model interpretability and in-sample fit were not considered.
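The experimental design described in the abstract can be illustrated with a minimal sketch. This is not the thesis code, which is written in R; it is a Python illustration under several assumptions that the abstract does not state: an AR(1)-style predictor correlation rho^|i-j|, unit-valued nonzero coefficients, no intercept, and fixed penalty values rather than tuned ones. All function names and default settings are hypothetical.

```python
import numpy as np

def make_dgp(p, s, snr, rho):
    """Set up a linear-Gaussian model with s nonzero coefficients (assumed design)."""
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])  # assumed AR(1)-style collinearity
    beta = np.zeros(p)
    beta[:s] = 1.0                                    # assumed: unit-valued signal
    sigma = np.sqrt(beta @ cov @ beta / snr)          # noise sd implied by the SNR
    return beta, cov, sigma

def draw(n, beta, cov, sigma, rng):
    """Draw n observations (X, y) from the model."""
    X = rng.multivariate_normal(np.zeros(len(beta)), cov, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

def fit_ols(X, y):
    # lstsq returns the minimum-norm least-squares solution, which is the
    # Moore-Penrose pseudoinverse solution when p > n.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_ridge(X, y, lam):
    # Closed-form ridge estimate: (X'X + lam * I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def fit_lasso(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                          # current residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]           # remove coordinate j from the fit
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]  # soft-threshold
            r -= X[:, j] * b[j]           # add the updated coordinate back
    return b

def mspe(b, X_test, y_test):
    """Mean squared prediction error on an independent test set."""
    return float(np.mean((y_test - X_test @ b) ** 2))

def run(n=50, p=20, s=3, snr=2.0, rho=0.5, reps=20, n_test=1000,
        ridge_lam=1.0, lasso_lam=0.1, seed=0):
    """Mean MSPE across replications for each method (penalties fixed, not tuned)."""
    rng = np.random.default_rng(seed)
    beta, cov, sigma = make_dgp(p, s, snr, rho)
    out = {"ols": [], "ridge": [], "lasso": []}
    for _ in range(reps):
        X, y = draw(n, beta, cov, sigma, rng)
        Xt, yt = draw(n_test, beta, cov, sigma, rng)
        out["ols"].append(mspe(fit_ols(X, y), Xt, yt))
        out["ridge"].append(mspe(fit_ridge(X, y, ridge_lam), Xt, yt))
        out["lasso"].append(mspe(fit_lasso(X, y, lasso_lam), Xt, yt))
    return {k: float(np.mean(v)) for k, v in out.items()}

if __name__ == "__main__":
    print(run())
```

Varying one argument of run at a time (n, p, s, snr, or rho) while holding the others at a baseline mirrors the one-factor-at-a-time structure the abstract describes. A realistic comparison would also select the penalties by cross-validation, which this sketch omits.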
author
Enbom, David LU
organization
Department of Statistics
course
STAH11 20252
year
2026
type
M2 - Bachelor Degree
keywords
linear regression, generalization, prediction, ordinary least squares, ridge regression, lasso, regularization, simulation study, sample size, dimensionality, sparsity, signal-to-noise ratio, collinearity, mean squared prediction error, statistical learning
language
English
id
9225258
date added to LUP
2026-04-20 14:20:31
date last changed
2026-04-20 14:24:01
@misc{9225258,
  abstract     = {{Linear regression is a standard tool for numerical prediction, but there are several estimators to choose from, and this choice can greatly affect prediction accuracy. This thesis empirically explores how the out-of-sample predictive performance of ordinary least squares (OLS), ridge regression (Ridge), and the lasso (Lasso) is affected by varying key model conditions, given a baseline setting. For full control over all mechanisms at play, all experiments are implemented and simulated in R. The data-generating process is linear-Gaussian, where training sample size, data dimensionality, sparsity, signal-to-noise ratio, and predictor collinearity are the relevant factors of study. Out-of-sample predictive performance is evaluated using mean squared prediction error (MSPE) on independently generated test sets and summarized as mean MSPE across replications. The experiments show that there is no universally best method for prediction, as every method was outperformed by another in at least one experiment. Ridge near-universally had better predictive performance than OLS, and outperformed Lasso in dense regimes. Lasso clearly outperformed both other methods in highly sparse regimes. The performance of OLS was competitive with the regularized methods in overdetermined and stable regimes, but even modest deviations from that state could strongly favor one of the regularized methods instead. The Moore-Penrose pseudoinverse solution to OLS was also considered in overparameterized settings, where it was competitive. In regimes where OLS clearly lagged behind, the most salient decider of whether Ridge or Lasso would be the best-performing method was whether strong sparsity was present or not. The thesis gives controlled method comparisons across relevant factors with tabular and graphical presentations of the results. The exclusive focus of the thesis is prediction; model interpretability and in-sample fit were not considered.}},
  author       = {{Enbom, David}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Generalization in Linear Regression: A Simulation Study of OLS, Ridge, and Lasso}},
  year         = {{2026}},
}