On Valuation of Observations in Linear Regression Models

Jönsson, Mattias

Jönsson, Mattias (LU) (2020) STAH11 20192
Department of Statistics

Abstract: In the machine learning field, data collection is increasingly commercialised, even with monetary rewards offered to people and organisations for providing input data for models. Even when data collection is not associated with direct costs for the researcher, there are many cases where indirect, or circumstantial, costs are associated with it.

An established concept in game theory is the Shapley value, which has seen considerable success in statistics and machine learning in recent years, for example as a technique for estimating variable importance. Researchers have now proposed using Shapley values to quantify the worth, or value, of an observation in a model as well (Data Shapley values). However, little effort has so far been spent on properly evaluating these in an Ordinary Least Squares (OLS) setting, especially since there is already a well-established way of quantifying an observation's influence (Cook's Distance), with which Data Shapley values should be reasonably well aligned.

Hence, this thesis sets out to explore the use of Data Shapley in linear regression models, with the aim of determining whether it is a valuable concept for a researcher using OLS models. The thesis approaches the topic through the following specific questions:
* What is a suitable set of parameters for estimating Data Shapley values for linear regression models?
* How well do Data Shapley values and Cook's Distance values agree on the valuation of an observation?
* Is it also possible to use Data Shapley values to detect outliers in linear regression models?

Data Shapley is studied in some detail using four different datasets and models, with Data Shapley values estimated using three different metrics and four different configurations of the estimation algorithm. The results are compared with Cook's Distance for evaluation.

The main conclusion from this research is that Data Shapley is a serious contender to Cook's Distance in capturing the worth of an observation. It performs better than, or at least as well as, Cook's Distance in capturing low-value observations, and it performs significantly better than Cook's Distance in capturing good observations.
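The comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the synthetic data, the function names (`subset_value`, `data_shapley_mc`, `cooks_distance`), the choice of negative test-set mean squared error as the value metric, the convention that an empty training set predicts zero, and the plain (untruncated) permutation-sampling estimator are all assumptions made for illustration. Cook's Distance is computed from the standard closed form D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2.

```python
import numpy as np

def subset_value(idx, X, y, X_te, y_te):
    """Value of a training subset: negative test MSE of the OLS fit (assumed metric)."""
    if len(idx) == 0:
        beta = np.zeros(X.shape[1])  # assumption: empty training set predicts zero
    else:
        # lstsq returns the minimum-norm solution when the subset has fewer rows than columns
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return -np.mean((y_te - X_te @ beta) ** 2)

def data_shapley_mc(X, y, X_te, y_te, n_perm=200, seed=0):
    """Monte Carlo estimate of Data Shapley values by sampling random permutations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    phi = np.zeros(n)
    for _ in range(n_perm):
        perm = rng.permutation(n)
        prev = subset_value(perm[:0], X, y, X_te, y_te)
        for k, i in enumerate(perm):
            cur = subset_value(perm[: k + 1], X, y, X_te, y_te)
            phi[i] += cur - prev  # marginal contribution of observation i
            prev = cur
    return phi / n_perm

def cooks_distance(X, y):
    """Cook's Distance for every observation; X must already contain the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix H = X X^+
    h = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    s2 = resid @ resid / (n - p)
    return resid**2 / (p * s2) * h / (1.0 - h) ** 2

# Synthetic example (hypothetical data): a simple linear model with one corrupted response.
rng = np.random.default_rng(1)
n, n_te = 30, 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[0] += 8.0  # corrupt the first observation
X_te = np.column_stack([np.ones(n_te), rng.normal(size=n_te)])
y_te = X_te @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n_te)

phi = data_shapley_mc(X, y, X_te, y_te, n_perm=50)
cook = cooks_distance(X, y)
print("lowest Data Shapley value at index", int(np.argmin(phi)))
print("highest Cook's Distance at index", int(np.argmax(cook)))
```

In this sketch a low (negative) Data Shapley value marks an observation that tends to hurt test performance, while a high Cook's Distance marks an observation with a large influence on the fitted coefficients. The two measures therefore rank observations on different scales, and any comparison has to be made on ranks or on flagged subsets rather than on raw values.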

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9003861

author: Jönsson, Mattias (LU)
supervisor: Björn Holmquist (LU)
organization: Department of Statistics
course: STAH11 20192
year: 2020
type: M2 - Bachelor Degree
subject: Mathematics and Statistics
language: English
id: 9003861
date added to LUP: 2020-02-24 08:21:58
date last changed: 2020-02-24 08:21:58

@misc{9003861,
  author       = {{Jönsson, Mattias}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{On Valuation of Observations in Linear Regression Models}},
  year         = {{2020}},
}
