
LUP Student Papers

LUND UNIVERSITY LIBRARIES

On Valuation of Observations in Linear Regression Models

Jönsson, Mattias LU (2020) STAH11 20192
Department of Statistics
Abstract
In the machine learning field, data collection is increasingly commercialised, with people and organisations even receiving monetary rewards for providing input data for models. Even when data collection carries no direct cost for the researcher, there are many cases where indirect, or circumstantial, costs are associated with it.

Shapley values are an established concept in game theory that has seen considerable success in statistics and machine learning in recent years, for example as a technique for estimating variable importance. Researchers have now proposed using Shapley values to quantify the worth, or value, of an observation in a model as well (Data Shapley values). However, little effort has previously been spent on properly evaluating these in an Ordinary Least Squares (OLS) setting, especially given that there is already a well-established way of quantifying an observation's influence (Cook's Distance), with which they should be reasonably well aligned.

Hence, this thesis sets out to explore the use of Data Shapley in linear regression models, with the aim of investigating whether it is a valuable concept for a researcher using OLS models. The thesis approaches the topic by answering the following specific questions:
* What is a suitable set of parameters for estimating Data Shapley values for linear regression models?
* How well do Data Shapley values and Cook's Distance values agree on the valuation of an observation?
* Is it possible to use Data Shapley values to detect outliers in linear regression models as well?

Data Shapley is studied in some detail using four different datasets and models, with Data Shapley values estimated using three different metrics and four different configurations of the estimation algorithm. The results are compared with Cook's Distance for evaluation.

The main conclusion of this research is that Data Shapley is a serious contender to Cook's Distance in capturing the worth of an observation. It performs better than, or at least as well as, Cook's Distance in capturing low-value observations, and it also performs significantly better than Cook's Distance in capturing good observations.
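
To make the two measures compared in the abstract concrete, the following is a minimal sketch and not the thesis's own implementation. It computes Cook's Distance from the standard closed-form expression and estimates Data Shapley values by averaging marginal contributions over random permutations of the training data. The function names, the choice of negative validation MSE as the performance metric, the score assigned to the empty training set, and the number of permutations are all assumptions made for illustration; the abstract does not state which metrics or configurations were used.

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's Distance for every row of an OLS fit.

    X is the design matrix and is assumed to already contain an intercept column.
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    # Leverages h_ii = x_i^T (X^T X)^{-1} x_i
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 / (p * s2) * h / (1.0 - h) ** 2


def neg_mse(X_train, y_train, X_val, y_val):
    """Assumed performance metric: negative validation MSE of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return -np.mean((y_val - X_val @ beta) ** 2)


def data_shapley_mc(X, y, X_val, y_val, n_perm=200, seed=0):
    """Permutation-sampling (Monte Carlo) estimate of Data Shapley values."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    values = np.zeros(n)
    # Assumption: the empty training set is scored as an all-zero coefficient vector.
    empty_score = -np.mean(y_val**2)
    for _ in range(n_perm):
        perm = rng.permutation(n)
        prev = empty_score
        for k, i in enumerate(perm, start=1):
            idx = perm[:k]
            score = neg_mse(X[idx], y[idx], X_val, y_val)
            values[i] += score - prev  # marginal contribution of observation i
            prev = score
    return values / n_perm


if __name__ == "__main__":
    # Hypothetical synthetic data with one injected outlier at index 0.
    rng = np.random.default_rng(42)
    x = rng.normal(size=40)
    X = np.column_stack([np.ones(40), x])
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=40)
    y[0] += 30.0
    x_val = rng.normal(size=40)
    X_val = np.column_stack([np.ones(40), x_val])
    y_val = 1.0 + 2.0 * x_val + rng.normal(scale=0.5, size=40)

    d = cooks_distance(X, y)
    phi = data_shapley_mc(X, y, X_val, y_val, n_perm=100)
    print("largest Cook's Distance at index:", int(np.argmax(d)))
    print("lowest Data Shapley value at index:", int(np.argmin(phi)))
```

Under these assumptions, a high Cook's Distance marks an observation that strongly changes the fitted model, whereas a low (often negative) Data Shapley value marks an observation whose inclusion tends to hurt validation performance. Cook's Distance is non-negative and has no corresponding notion of a "good" observation, which is one way to read the abstract's remark that Data Shapley also captures high-value observations.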
author: Jönsson, Mattias LU
course: STAH11 20192
year: 2020
type: M2 - Bachelor Degree
language: English
id: 9003861
date added to LUP: 2020-02-24 08:21:58
date last changed: 2020-02-24 08:21:58
@misc{9003861,
  abstract     = {{In the machine learning field, data collection is increasingly commercialised, with people and organisations even receiving monetary rewards for providing input data for models. Even when data collection carries no direct cost for the researcher, there are many cases where indirect, or circumstantial, costs are associated with it.

Shapley values are an established concept in game theory that has seen considerable success in statistics and machine learning in recent years, for example as a technique for estimating variable importance. Researchers have now proposed using Shapley values to quantify the worth, or value, of an observation in a model as well (Data Shapley values). However, little effort has previously been spent on properly evaluating these in an Ordinary Least Squares (OLS) setting, especially given that there is already a well-established way of quantifying an observation's influence (Cook's Distance), with which they should be reasonably well aligned.

Hence, this thesis sets out to explore the use of Data Shapley in linear regression models, with the aim of investigating whether it is a valuable concept for a researcher using OLS models. The thesis approaches the topic by answering the following specific questions:
* What is a suitable set of parameters for estimating Data Shapley values for linear regression models?
* How well do Data Shapley values and Cook's Distance values agree on the valuation of an observation?
* Is it possible to use Data Shapley values to detect outliers in linear regression models as well?

Data Shapley is studied in some detail using four different datasets and models, with Data Shapley values estimated using three different metrics and four different configurations of the estimation algorithm. The results are compared with Cook's Distance for evaluation.

The main conclusion of this research is that Data Shapley is a serious contender to Cook's Distance in capturing the worth of an observation. It performs better than, or at least as well as, Cook's Distance in capturing low-value observations, and it also performs significantly better than Cook's Distance in capturing good observations.}},
  author       = {{Jönsson, Mattias}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{On Valuation of Observations in Linear Regression Models}},
  year         = {{2020}},
}