Does it matter how many we include? A random data simulation study examining how number of explanatory variables affect model quality in linear regression models describing quantitative data.

Svensson, Sofie

Svensson, Sofie (LU) (2025) STAH11 20242
Department of Statistics

Abstract: This study examined how the number of explanatory variables included in a linear regression model affects model quality. The maximum number of variables was set to 20, and only models describing quantitative data were studied. The effect on model quality was studied by comparing models with different numbers of explanatory variables. The models, and the data sets they were derived from, were created in R. In total, 1 000 data sets were produced, each containing 1 000 observations of one dependent variable and 20 explanatory variables. The dependent variable was produced as an unknown combination of an unknown number of the explanatory variables. Random noise was also added to each data set. Best subset selection was used to produce 20 000 models based on the data sets: 1 000 models with one explanatory variable, 1 000 with two explanatory variables, and so on. After models that did not meet the model assumptions were excluded, 9 069 models were used in the study. The models were compared using the number of statistically significant variables, adjusted R², predicted R², AIC, BIC and Mallows's Cp as measures of model quality.
It was found that for small linear models (1–10 variables), a higher number of variables clearly correlated with better model quality. As the number of variables increased, this correlation became less clear. The image of a “ceiling” was used to describe that, somewhere within the range of 15–19 variables, this correlation ceased to exist almost entirely. For all measures used in the study, the correlation between a higher number of variables and better model quality plateaued somewhere within this range. The findings could have implications for multiple linear regression analysis, as they show that a higher number of explanatory variables does not always mean better model quality.
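The procedure described in the abstract (simulate a data set, run best subset selection for each model size, compare quality measures) can be illustrated with a minimal sketch. This is not the author's code: the study was carried out in R, while the sketch below uses Python with NumPy. The function names (`fit_rss`, `simulate`, `best_subsets`), the coefficient range, the noise level and the reduced problem size (8 predictors, 200 observations) are all assumptions made for illustration. Only adjusted R², AIC, BIC and Mallows's Cp are computed; AIC and BIC are given up to an additive constant under a Gaussian error model.

```python
import itertools
import numpy as np


def fit_rss(X, y):
    """Fit OLS with an intercept and return the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def simulate(n=200, p=8, seed=0):
    """One data set: y is a random combination of a random subset of X, plus noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    k_true = int(rng.integers(1, p + 1))        # how many predictors drive y
    active = rng.choice(p, size=k_true, replace=False)
    coefs = rng.uniform(0.5, 2.0, size=k_true)  # hypothetical coefficient range
    y = X[:, active] @ coefs + rng.normal(size=n)
    return X, y


def best_subsets(X, y):
    """For each size k, find the lowest-RSS subset and report quality measures."""
    n, p = X.shape
    tss = float(((y - y.mean()) ** 2).sum())
    sigma2_full = fit_rss(X, y) / (n - p - 1)   # error variance from the full model
    rows = []
    for k in range(1, p + 1):
        # Exhaustive search over all subsets of size k.
        rss, cols = min(
            (fit_rss(X[:, list(c)], y), c)
            for c in itertools.combinations(range(p), k)
        )
        rows.append({
            "k": k,
            "vars": cols,
            "rss": rss,
            "adj_r2": 1 - (rss / (n - k - 1)) / (tss / (n - 1)),
            "aic": n * np.log(rss / n) + 2 * (k + 1),
            "bic": n * np.log(rss / n) + np.log(n) * (k + 1),
            "cp": rss / sigma2_full - n + 2 * (k + 1),
        })
    return rows


if __name__ == "__main__":
    X, y = simulate()
    for r in best_subsets(X, y):
        print(r["k"], round(r["adj_r2"], 4), round(r["aic"], 1),
              round(r["bic"], 1), round(r["cp"], 2))
```

Exhaustive search visits every one of the 2^p - 1 non-empty subsets, which is why the sketch uses 8 predictors rather than the 20 used in the study. Repeating `simulate` with different seeds and collecting the rows per model size would give a small-scale analogue of the comparison the abstract describes.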

Please use this URL to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9181248

author

Svensson, Sofie (LU)

supervisor

N/A N/A

organization

Department of Statistics

course

STAH11 20242

year

2025

type

M2 - Bachelor Degree

subject

Mathematics and Statistics

keywords

correlation, data fabrication, explanatory variables, linear regression analysis, model quality, multiple regression analysis, random data simulation, statistical programming

language

English

id

9181248

date added to LUP

2025-02-11 10:55:45

date last changed

2025-02-11 10:55:58

@misc{9181248,
  abstract     = {{This study examined how the number of explanatory variables included in a linear regression model affects model quality. The maximum number of variables was set to 20 and only models describing quantitative data were studied. The effect on model quality was studied through comparisons of models with different numbers of explanatory variables. The models – and the data sets the models were derived from – were created in R. 1 000 data sets were produced, each including 1 000 observations for one dependent variable and 20 explanatory variables. The dependent variable was produced as an unknown combination of an unknown number of the explanatory variables. Random noise was also added to each data set. Best subset selection was used to produce 20 000 models based on the data sets. 1 000 of the models had one explanatory variable, 1 000 had two explanatory variables and so on. 9 069 of the models were used in the study, after models which did not meet the model assumptions were excluded. The models were compared using number of statistically significant variables, adjusted R2, predicted R2, AIC, BIC and Mallow’s Cp as measurements of model quality. It was found that for small linear models (1–10 variables), a higher number of variables clearly correlated with better model quality. As the number of variables increased, this correlation became less unambiguous. The image of a “ceiling” was used to describe that somewhere within the range of 15–19 variables, this correlation ceased to exist almost entirely. For all measurements used in the study, the correlation between higher number of variables and better model quality plateaued somewhere within this range. The findings could have implications for multivariate linear regression analysis, as they show that a higher number of explanatory variables does not always mean better model quality.}},
  author       = {{Svensson, Sofie}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Does it matter how many we include? A random data simulation study examining how number of explanatory variables affect model quality in linear regression models describing quantitative data.}},
  year         = {{2025}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES