LUP Student Papers

LUND UNIVERSITY LIBRARIES

Does it matter how many we include? A random data simulation study examining how number of explanatory variables affect model quality in linear regression models describing quantitative data.

Svensson, Sofie LU (2025) STAH11 20242
Department of Statistics
Abstract
This study examined how the number of explanatory variables included in a linear regression model affects model quality. The maximum number of variables was set to 20 and only models describing quantitative data were studied. The effect on model quality was studied through comparisons of models with different numbers of explanatory variables. The models – and the data sets the models were derived from – were created in R. In total, 1 000 data sets were produced, each including 1 000 observations for one dependent variable and 20 explanatory variables. The dependent variable was produced as an unknown combination of an unknown number of the explanatory variables. Random noise was also added to each data set. Best subset selection was used to produce 20 000 models based on the data sets: 1 000 of the models had one explanatory variable, 1 000 had two explanatory variables and so on. Of these, 9 069 models were used in the study, after models which did not meet the model assumptions were excluded. The models were compared using number of statistically significant variables, adjusted R², predicted R², AIC, BIC and Mallows's Cp as measurements of model quality. It was found that for small linear models (1–10 variables), a higher number of variables clearly correlated with better model quality. As the number of variables increased, this correlation became less clear. The image of a “ceiling” was used to describe that somewhere within the range of 15–19 variables, this correlation ceased to exist almost entirely. For all measurements used in the study, the correlation between higher number of variables and better model quality plateaued somewhere within this range. The findings could have implications for multivariate linear regression analysis, as they show that a higher number of explanatory variables does not always mean better model quality.
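The comparison described in the abstract can be illustrated with a minimal sketch. This is not the author's code: the study was carried out in R with best subset selection on 1 000 data sets, whereas the sketch below uses standard-library Python, a single small data set, and nested models (the first k predictors) instead of best subset selection. The sample size, the coefficients, and the function names `ols_rss` and `criteria` are hypothetical choices made for illustration. AIC and BIC are written in one common Gaussian form, up to an additive constant, and predicted R² is omitted.

```python
import math
import random


def ols_rss(X, y):
    """Fit y ~ intercept + X by least squares; return the residual sum of squares."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def criteria(rss, tss, n, k, sigma2_full):
    """Quality measures for a model with k predictors plus an intercept."""
    p = k + 1  # number of estimated coefficients
    return {
        "adj_r2": 1 - (rss / (n - p)) / (tss / (n - 1)),
        "aic": n * math.log(rss / n) + 2 * p,            # up to a constant
        "bic": n * math.log(rss / n) + p * math.log(n),  # up to a constant
        "cp": rss / sigma2_full - n + 2 * p,             # Mallows's Cp
    }


# Hypothetical data: 6 candidate predictors, only the first 3 affect y.
random.seed(1)
n, K = 200, 6
X = [[random.gauss(0, 1) for _ in range(K)] for _ in range(n)]
y = [2.0 * r[0] - 1.5 * r[1] + 1.0 * r[2] + random.gauss(0, 1) for r in X]

y_bar = sum(y) / n
tss = sum((yi - y_bar) ** 2 for yi in y)

# Nested models using the first k predictors (a simplification of best subset).
rss_by_k = {k: ols_rss([r[:k] for r in X], y) for k in range(1, K + 1)}
sigma2_full = rss_by_k[K] / (n - K - 1)  # error variance from the full model
results = {k: criteria(rss_by_k[k], tss, n, k, sigma2_full) for k in rss_by_k}

for k, c in results.items():
    print(k, {name: round(value, 3) for name, value in c.items()})
```

Under these assumptions, one would expect the measures to improve sharply up to three predictors and then level off or worsen, since the remaining predictors carry no signal. That is a small-scale analogue of the “ceiling” described above, not a reproduction of the study's results.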
author
Svensson, Sofie LU
supervisor
organization
Department of Statistics
course
STAH11 20242
year
2025
type
M2 - Bachelor Degree
subject
keywords
correlation, data fabrication, explanatory variables, linear regression analysis, model quality, multiple regression analysis, random data simulation, statistical programming
language
English
id
9181248
date added to LUP
2025-02-11 10:55:45
date last changed
2025-02-11 10:55:58
@misc{9181248,
  abstract     = {{This study examined how the number of explanatory variables included in a linear regression model affects model quality. The maximum number of variables was set to 20 and only models describing quantitative data were studied. The effect on model quality was studied through comparisons of models with different numbers of explanatory variables. The models – and the data sets the models were derived from – were created in R. 1 000 data sets were produced, each including 1 000 observations for one dependent variable and 20 explanatory variables. The dependent variable was produced as an unknown combination of an unknown number of the explanatory variables. Random noise was also added to each data set. Best subset selection was used to produce 20 000 models based on the data sets. 1 000 of the models had one explanatory variable, 1 000 had two explanatory variables and so on. 9 069 of the models were used in the study, after models which did not meet the model assumptions were excluded. The models were compared using number of statistically significant variables, adjusted R2, predicted R2, AIC, BIC and Mallow’s Cp as measurements of model quality. It was found that for small linear models (1–10 variables), a higher number of variables clearly correlated with better model quality. As the number of variables increased, this correlation became less unambiguous. The image of a “ceiling” was used to describe that somewhere within the range of 15–19 variables, this correlation ceased to exist almost entirely. For all measurements used in the study, the correlation between higher number of variables and better model quality plateaued somewhere within this range. The findings could have implications for multivariate linear regression analysis, as they show that a higher number of explanatory variables does not always mean better model quality.}},
  author       = {{Svensson, Sofie}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Does it matter how many we include? A random data simulation study examining how number of explanatory variables affect model quality in linear regression models describing quantitative data.}},
  year         = {{2025}},
}