Generalization in Linear Regression: A Simulation Study of OLS, Ridge, and Lasso

Enbom, David

Enbom, David ^LU (2026) STAH11 20252
Department of Statistics

Abstract: Linear regression is a standard tool for numerical prediction, but there are several estimators to choose from, and this choice can greatly affect prediction accuracy. This thesis empirically explores how the out-of-sample predictive performance of ordinary least squares (OLS), ridge regression (Ridge), and the lasso (Lasso) is affected by varying key model conditions, given a baseline setting. For full control over all mechanisms at play, all experiments are implemented and simulated in R. The data-generating process is linear-Gaussian, where training sample size, data dimensionality, sparsity, signal-to-noise ratio, and predictor collinearity are the relevant factors of study. Out-of-sample predictive performance is evaluated using mean squared prediction error (MSPE) on independently generated test sets and summarized as mean MSPE across replications. The experiments show that there is no universally best method for prediction, as every method was outperformed by another in at least one experiment. Ridge near-universally had better predictive performance than OLS, and outperformed Lasso in dense regimes. Lasso clearly outperformed both other methods in highly sparse regimes. The performance of OLS was competitive with the regularized methods in overdetermined and stable regimes, but even modest deviations from that state could strongly favor one of the regularized methods instead. The Moore-Penrose pseudoinverse solution to OLS was also considered in overparameterized settings, where it was competitive. In regimes where OLS clearly lagged behind, the most salient decider of whether Ridge or Lasso would be the best-performing method was whether strong sparsity was present or not. The thesis gives controlled method comparisons across relevant factors with tabular and graphical presentations of the results. The exclusive focus of the thesis is prediction; model interpretability and in-sample fit were not considered.
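Illustration (not from the thesis): the kind of experiment the abstract describes can be sketched in a few lines. The thesis itself was implemented in R; the Python/NumPy sketch below is only a rough analogue under stated assumptions. The function names, the AR(1)-style correlation structure rho^|i-j|, the unit-valued nonzero coefficients, the definition of signal-to-noise ratio as Var(x'beta) / noise variance, and the fixed penalty values (no cross-validation) are all hypothetical choices made for this example and are not taken from the thesis.

```python
import numpy as np

def make_model(p, k, snr, rho):
    """Hypothetical linear-Gaussian setup: k nonzero coefficients out of p."""
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed collinearity structure
    beta = np.zeros(p)
    beta[:k] = 1.0                                     # assumed sparse signal
    noise_sd = np.sqrt(beta @ cov @ beta / snr)        # SNR = signal var / noise var
    return cov, beta, noise_sd

def draw(n, cov, beta, noise_sd, rng):
    """Draw n observations (X, y) from the model; no intercept (zero-mean data)."""
    X = rng.multivariate_normal(np.zeros(len(beta)), cov, size=n)
    y = X @ beta + noise_sd * rng.standard_normal(n)
    return X, y

def ols(X, y):
    # Moore-Penrose pseudoinverse: ordinary OLS when X has full column rank,
    # minimum-norm least-squares solution when p > n.
    return np.linalg.pinv(X) @ y

def ridge(X, y, lam):
    # Minimizes ||y - Xb||^2 + lam * ||b||^2 (closed form).
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1.
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                                       # residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                        # drop coordinate j from the fit
            b[j] = soft_threshold(X[:, j] @ r / n, lam) / col_sq[j]
            r -= X[:, j] * b[j]                        # add updated coordinate back
    return b

def mean_mspe(n=50, p=20, k=3, snr=2.0, rho=0.3, reps=20, n_test=1000,
              ridge_lam=5.0, lasso_lam=0.1, seed=0):
    """Mean test-set MSPE per method across independent replications."""
    rng = np.random.default_rng(seed)
    cov, beta, noise_sd = make_model(p, k, snr, rho)
    fits = {
        "ols": ols,
        "ridge": lambda X, y: ridge(X, y, ridge_lam),  # fixed penalty (assumption)
        "lasso": lambda X, y: lasso_cd(X, y, lasso_lam),
    }
    totals = dict.fromkeys(fits, 0.0)
    for _ in range(reps):
        X, y = draw(n, cov, beta, noise_sd, rng)
        X_test, y_test = draw(n_test, cov, beta, noise_sd, rng)
        for name, fit in fits.items():
            b = fit(X, y)
            totals[name] += np.mean((y_test - X_test @ b) ** 2) / reps
    return totals

if __name__ == "__main__":
    print(mean_mspe())
```

Varying one argument of `mean_mspe` at a time (for example `n`, `p`, `k`, `snr`, or `rho`) while holding the others at a baseline mirrors the one-factor-at-a-time design the abstract describes. A fuller study would tune the penalties, for instance by cross-validation, rather than fixing them.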

Please use this url to cite or link to this publication: https://lup.lub.lu.se/student-papers/record/9225258

author

Enbom, David ^LU

supervisor

Ivan Hejny ^LU

organization

Department of Statistics

course

STAH11 20252

year

2026

type

M2 - Bachelor Degree

subject

Mathematics and Statistics

keywords

linear regression, generalization, prediction, ordinary least squares, ridge regression, lasso, regularization, simulation study, sample size, dimensionality, sparsity, signal-to-noise ratio, collinearity, mean squared prediction error, statistical learning

language

English

id

9225258

date added to LUP

2026-04-20 14:20:31

date last changed

2026-04-20 14:24:01

@misc{9225258,
  abstract     = {{Linear regression is a standard tool for numerical prediction, but there are several estimators to choose from, and this choice can greatly affect prediction accuracy. This thesis empirically explores how the out-of-sample predictive performance of ordinary least squares (OLS), ridge regression (Ridge), and the lasso (Lasso) is affected by varying key model conditions, given a baseline setting. For full control over all mechanisms at play, all experiments are implemented and simulated in R. The data-generating process is linear-Gaussian, where training sample size, data dimensionality, sparsity, signal-to-noise ratio, and predictor collinearity are the relevant factors of study. Out-of-sample predictive performance is evaluated using mean squared prediction error (MSPE) on independently generated test sets and summarized as mean MSPE across replications. The experiments show that there is no universally best method for prediction, as every method was outperformed by another in at least one experiment. Ridge near-universally had better predictive performance than OLS, and outperformed Lasso in dense regimes. Lasso clearly outperformed both other methods in highly sparse regimes. The performance of OLS was competitive with the regularized methods in overdetermined and stable regimes, but even modest deviations from that state could strongly favor one of the regularized methods instead. The Moore-Penrose pseudoinverse solution to OLS was also considered in overparameterized settings, where it was competitive. In regimes where OLS clearly lagged behind, the most salient decider of whether Ridge or Lasso would be the best-performing method was whether strong sparsity was present or not. The thesis gives controlled method comparisons across relevant factors with tabular and graphical presentations of the results. The exclusive focus of the thesis is prediction; model interpretability and in-sample fit were not considered.}},
  author       = {{Enbom, David}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Generalization in Linear Regression: A Simulation Study of OLS, Ridge, and Lasso}},
  year         = {{2026}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
