
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Emails, Algorithms, and Bookkeeping

Croneborg, Claes LU and Karlsson, Viktor LU (2024) STAN40 20241
Department of Statistics
Abstract
This report investigates the multi-label classification of expense-related documents into nine distinct categories. Various machine learning methods, including Naive Bayes, Logistic Regression, Support Vector Machines, Linear Discriminant Analysis, k-Nearest Neighbors, Decision Trees, Random Forest, Gradient Boosting, and Recurrent Neural Networks, were employed and evaluated. The k-Nearest Neighbors model exhibited the lowest overall performance, likely due to its sensitivity to individual training observations and the impact of outliers in a limited dataset. Intriguingly, k-Nearest Neighbors performed worse with the Term Frequency-Inverse Document Frequency (TF-IDF) representation than with Bag-of-Words (BoW), underscoring the importance of feature representation for k-Nearest Neighbors models.
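To make the representation comparison concrete, here is a minimal, hypothetical sketch in scikit-learn that scores k-Nearest Neighbors under both BoW and TF-IDF. The toy corpus, labels, and parameter choices are placeholder assumptions, not the thesis data or setup.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder expense-style snippets; the thesis corpus is not public.
docs = [
    "taxi receipt airport transfer", "hotel invoice two nights",
    "restaurant lunch receipt", "train ticket return trip",
    "hotel booking confirmation invoice", "taxi fare city centre",
    "dinner restaurant receipt", "flight ticket invoice",
]
labels = ["travel", "lodging", "meals", "travel",
          "lodging", "travel", "meals", "travel"]

for name, vectorizer in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    X = vectorizer.fit_transform(docs)
    knn = KNeighborsClassifier(n_neighbors=3)
    # 2-fold CV keeps the toy example runnable; the thesis used a larger setup.
    score = cross_val_score(knn, X, labels, cv=2).mean()
    print(f"{name}: mean CV accuracy = {score:.2f}")
```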

Among the tested models, multinomial Naive Bayes emerged as one of the top performers, despite initially being considered only a benchmark. The linear models showed similar performance, outperforming k-Nearest Neighbors. Among the tree-based models, Random Forest and Gradient Boosting outperformed the simpler Decision Trees, a success attributable to their use of multiple learners. Recurrent Neural Networks, especially those with pre-trained weights from Skip-grams, underperformed compared to the classical classifiers, most likely due to the small dataset available.
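For readers unfamiliar with the pre-trained-embedding setup, the following sketch shows one common way to load Skip-gram vectors (trained with gensim) into a frozen Keras embedding layer feeding a small recurrent classifier. The tiny corpus, 50-dimensional vectors, and layer sizes are illustrative assumptions, not the authors' architecture.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

# Tiny illustrative corpus; the thesis trained Skip-grams on its own documents.
sentences = [["taxi", "receipt"], ["hotel", "invoice"], ["train", "ticket"]]
w2v = Word2Vec(sentences, vector_size=50, sg=1, min_count=1)  # sg=1 selects Skip-gram

# Build an embedding matrix aligned with an integer word index (0 = padding).
word_index = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
emb = np.zeros((len(word_index) + 1, 50))
for word, idx in word_index.items():
    emb[idx] = w2v.wv[word]

model = keras.Sequential([
    keras.layers.Embedding(
        input_dim=emb.shape[0], output_dim=50,
        embeddings_initializer=keras.initializers.Constant(emb),
        trainable=False,  # freeze the pre-trained Skip-gram weights
    ),
    keras.layers.LSTM(32),                        # illustrative size
    keras.layers.Dense(9, activation="softmax"),  # nine document categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```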

Classical models were trained on BoW and TF-IDF representations, using a vocabulary consisting of only the top 19 most discriminative features (words) per category, chosen via cross-validation, along with eight feature-engineered variables. Two recurrent neural network models were employed, one with pre-trained word embeddings from Skip-grams and one without. Surprisingly, BoW-based models performed similarly to or better than TF-IDF-based models, contrary to the expectation that TF-IDF would place more emphasis on important words. The imbalanced dataset posed challenges, particularly for one category, which included a diverse range of documents with patterns too indistinct for the models to learn. The Support Vector Machine model trained on the TF-IDF representation demonstrated the best performance, achieving a weighted average accuracy of 86%.
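A hedged sketch of the vocabulary-selection-plus-SVM pipeline follows. Chi-squared scoring stands in for the authors' cross-validated choice of the top-19 words per category, and the corpus and labels are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.svm import LinearSVC

def top_words_per_class(texts, y, k):
    """Keep the k highest-scoring words per category (one-vs-rest chi-squared)."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    terms = np.array(vec.get_feature_names_out())
    keep = set()
    for label in set(y):
        scores, _ = chi2(X, [int(c == label) for c in y])
        keep.update(terms[np.argsort(scores)[-k:]])
    return sorted(keep)

texts = ["taxi receipt airport", "hotel invoice night", "lunch receipt cafe",
         "train ticket return", "hotel booking invoice", "dinner receipt bar"]
y = ["travel", "lodging", "meals", "travel", "lodging", "meals"]

vocab = top_words_per_class(texts, y, k=2)   # the thesis used k = 19
X = TfidfVectorizer(vocabulary=vocab).fit_transform(texts)
clf = LinearSVC().fit(X, y)                  # the best-performing model class
```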

Additionally, the authors propose a framework for reducing the vocabulary size that significantly lowers the computational cost while maintaining high accuracy.
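The reduction framework can be pictured as a scan over vocabulary sizes: shrink to the top-k terms per category, cross-validate, and keep the smallest k whose accuracy stays near the full-vocabulary score. A self-contained toy version, again with chi-squared as a stand-in selection criterion (selection is done on the full data here for brevity; a proper run would select within each CV fold):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

texts = ["taxi receipt airport", "hotel invoice night", "lunch receipt cafe",
         "train ticket return", "hotel booking invoice", "dinner receipt bar",
         "taxi fare city", "flight ticket invoice"]
y = ["travel", "lodging", "meals", "travel",
     "lodging", "meals", "travel", "travel"]

def cv_accuracy_for_k(k):
    """Cross-validated accuracy after shrinking to the top-k words per category."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    terms = np.array(vec.get_feature_names_out())
    keep = set()
    for label in set(y):
        scores, _ = chi2(X, [int(c == label) for c in y])
        keep.update(terms[np.argsort(scores)[-k:]])
    Xk = TfidfVectorizer(vocabulary=sorted(keep)).fit_transform(texts)
    return cross_val_score(LinearSVC(), Xk, y, cv=2).mean()

for k in (1, 2, 4):  # the thesis settled on the top 19 words per category
    print(f"k={k}: mean CV accuracy = {cv_accuracy_for_k(k):.2f}")
```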
author
Croneborg, Claes LU and Karlsson, Viktor LU
alternative title
Classifying the Inbox
course
STAN40 20241
year
2024
type
H1 - Master's Degree (One Year)
keywords
Multi-label text classification, Naive Bayes, Linear Discriminant Analysis, Logistic Regression, Support Vector Machines, k-Nearest Neighbour, Decision Tree, Random Forest, Gradient Boosting, Recurrent Neural Network, Word Embeddings
language
English
id
9160235
date added to LUP
2024-06-17 14:14:45
date last changed
2024-06-17 14:14:45
@misc{9160235,
  abstract     = {{This report investigates the multi-label classification of expense-related documents into nine distinct categories. Various machine learning methods, including Naive Bayes, Logistic Regression, Support Vector Machines, Linear Discriminant Analysis, k-Nearest Neighbors, Decision Trees, Random Forest, Gradient Boosting, and Recurrent Neural Networks, were employed and evaluated. The k-Nearest Neighbors model exhibited the lowest overall performance, likely due to its sensitivity to individual training observations and the impact of outliers in a limited dataset. Intriguingly, k-Nearest Neighbors performed worse with the Term Frequency-Inverse Document Frequency (TF-IDF) representation than with Bag-of-Words (BoW), underscoring the importance of feature representation for k-Nearest Neighbors models.

Among the tested models, multinomial Naive Bayes emerged as one of the top performers, despite initially being considered only a benchmark. The linear models showed similar performance, outperforming k-Nearest Neighbors. Among the tree-based models, Random Forest and Gradient Boosting outperformed the simpler Decision Trees, a success attributable to their use of multiple learners. Recurrent Neural Networks, especially those with pre-trained weights from Skip-grams, underperformed compared to the classical classifiers, most likely due to the small dataset available.

Classical models were trained on BoW and TF-IDF representations, using a vocabulary consisting of only the top 19 most discriminative features (words) per category, chosen via cross-validation, along with eight feature-engineered variables. Two recurrent neural network models were employed, one with pre-trained word embeddings from Skip-grams and one without. Surprisingly, BoW-based models performed similarly to or better than TF-IDF-based models, contrary to the expectation that TF-IDF would place more emphasis on important words. The imbalanced dataset posed challenges, particularly for one category, which included a diverse range of documents with patterns too indistinct for the models to learn. The Support Vector Machine model trained on the TF-IDF representation demonstrated the best performance, achieving a weighted average accuracy of 86%.

Additionally, the authors propose a framework for reducing the vocabulary size that significantly lowers the computational cost while maintaining high accuracy.}},
  author       = {{Croneborg, Claes and Karlsson, Viktor}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Emails, Algorithms, and Bookkeeping}},
  year         = {{2024}},
}