
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Exploring Factors Influencing On-Base Percentage in Modern Baseball

Gergis, Heidar LU (2024) In Bachelor's Thesis in Mathematical Sciences FMSL01 20241
Mathematical Statistics
Abstract
This report uses highly detailed data from Statcast from the 2015–2023 seasons to explore and identify the key factors influencing on-base percentage in the American top league, Major League Baseball. On-base percentage is simply the percentage of times a player gets on base, and can be achieved in multiple ways. It is an important metric in baseball, as without getting on base, you cannot score. Using machine learning techniques, we aim to identify these factors and develop models with predictive power.
This highly detailed data, with tens of features and more than 1 million rows of tracked events, was then used to develop and implement different machine learning models, such as logistic regression and XGBoost. These models were trained and then evaluated with metrics that account for the class imbalance in the data, such as the F2 score and the area under the precision-recall curve (AUC-PR). To aid interpretation of the models, SHAP (SHapley Additive exPlanations) values were used.
Our results show that the XGBoost models significantly outperform the logistic regression model in terms of both F2 score and AUC-PR, achieving scores of 90.50% and 95.82%, respectively, compared with 77.83% and 68.42% for the logistic regression model. We also find that the XGBoost model can be reduced considerably through feature selection: a model using fewer than a third of the variables achieves near-identical scores (89.60% and 94.90%, respectively).
Feature importance and SHAP analysis showed that hit location, launch angle, ball-strike count difference, and whether contact was made were the most influential factors.
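For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of how an F2 score and a step-wise area under the precision-recall curve (often called average precision) can be computed. It is not taken from the thesis: the function names `fbeta_score` and `average_precision` and the toy inputs are assumptions made for illustration, and the thesis may compute AUC-PR with a different interpolation.

```python
def fbeta_score(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts.

    beta=2 (the F2 score) weights recall more heavily than precision.
    Hypothetical helper for illustration only.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Averages the precision observed at the rank of each positive example.
    Tied scores are ordered arbitrarily here, which is a simplification.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


if __name__ == "__main__":
    # Toy counts: 8 true positives, 2 false positives, 4 false negatives.
    print(fbeta_score(8, 2, 4))
    # Toy labels and scores: positives ranked first and third.
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Because the F2 score penalizes false negatives more than false positives, it suits settings where missing a positive case is the costlier error. Both metrics ignore true negatives, which is why they are commonly preferred over accuracy on imbalanced data.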
author: Gergis, Heidar LU
course: FMSL01 20241
year: 2024
type: M2 - Bachelor Degree
publication/series: Bachelor's Thesis in Mathematical Sciences
report number: LUTFMS-4014-2024
ISSN: 1654-6229
other publication id: 2024:K20
language: English
id: 9175631
date added to LUP: 2024-09-30 10:16:25
date last changed: 2024-09-30 10:16:25
@misc{9175631,
  abstract     = {{This report uses highly detailed data from Statcast from the 2015–2023 seasons to explore and identify the key factors influencing on-base percentage in the American top league, Major League Baseball. On-base percentage is simply the percentage of times a player gets on base, and can be achieved in multiple ways. It is an important metric in baseball, as without getting on base, you cannot score. Using machine learning techniques, we aim to identify these factors and develop models with predictive power.
 This highly detailed data, with tens of features and more than 1 million rows of tracked events, was then used to develop and implement different machine learning models, such as logistic regression and XGBoost. These models were trained and then tested for performance that took into account the imbalance in the data, such as the F2 score and area under the precision-recall curve (AUC-PR). To aid in the interpretation of the models, SHAP (SHapley Additive exPlanations) values were used to provide insight.
 Our results show that the XGBoost models significantly outperform the logistic regression model in terms of both F2 score and AUC-PR, achieving high scores of 90.50% and 95.82%, respectively. This can be contrasted with the respective 77.83% and 68.42% for the logistic regression model. We also find that the XGBoost model can be greatly reduced with a feature-selected model, with less than a third of the variables achieving near-identical scores (89.60% and 94.90%, respectively).
 Feature importance and SHAP analysis showed that factors such as hit location, launch angle, ball-strike count difference, and whether contact was made were the most important and influential factors.}},
  author       = {{Gergis, Heidar}},
  issn         = {{1654-6229}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Bachelor's Thesis in Mathematical Sciences}},
  title        = {{Exploring Factors Influencing On-Base Percentage in Modern Baseball}},
  year         = {{2024}},
}