
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Emails, Algorithms, and Bookkeeping

Croneborg, Claes LU and Karlsson, Viktor LU (2024) STAN40 20241
Department of Statistics
Abstract
This report investigates the multi-label classification of expense-related documents into nine distinct categories. Various machine learning methods, including Naive Bayes, Logistic Regression, Support Vector Machines, Linear Discriminant Analysis, k-Nearest Neighbors, Decision Trees, Random Forest, Gradient Boosting, and Recurrent Neural Networks, were employed and evaluated. The k-Nearest Neighbors model exhibited the lowest overall performance, likely due to its sensitivity to individual training observations and the impact of outliers in a limited dataset. Intriguingly, k-Nearest Neighbors performed worse with the Term Frequency-Inverse Document Frequency (TF-IDF) representation than with Bag-of-Words (BoW), underscoring the importance of feature representation for k-Nearest Neighbors models.
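To make the representation comparison concrete, here is a minimal, hypothetical sketch in scikit-learn that scores k-Nearest Neighbors under both BoW and TF-IDF. The toy corpus, labels, and parameter choices are placeholder assumptions, not the thesis data or setup.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder expense-style snippets; the thesis corpus is not public.
docs = [
    "taxi receipt airport transfer", "hotel invoice two nights",
    "restaurant lunch receipt", "train ticket return trip",
    "hotel booking confirmation invoice", "taxi fare city centre",
    "dinner restaurant receipt", "flight ticket invoice",
]
labels = ["travel", "lodging", "meals", "travel",
          "lodging", "travel", "meals", "travel"]

for name, vectorizer in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    X = vectorizer.fit_transform(docs)
    knn = KNeighborsClassifier(n_neighbors=3)
    # 2-fold CV keeps the toy example runnable; the thesis used a larger setup.
    score = cross_val_score(knn, X, labels, cv=2).mean()
    print(f"{name}: mean CV accuracy = {score:.2f}")
```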

Among the tested models, multinomial Naive Bayes emerged as one of the top performers, despite initially being considered only a benchmark. The linear models showed similar performance, outperforming k-Nearest Neighbors. Among the tree-based models, Random Forest and Gradient Boosting outperformed the simpler Decision Trees, a success attributable to their use of multiple learners. Recurrent Neural Networks, especially those with pre-trained weights from Skip-grams, underperformed compared to the classical classifiers, most likely due to the small dataset available.
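For readers unfamiliar with the pre-trained-embedding setup, the following sketch shows one common way to load Skip-gram vectors (trained with gensim) into a frozen Keras embedding layer feeding a small recurrent classifier. The tiny corpus, 50-dimensional vectors, and layer sizes are illustrative assumptions, not the authors' architecture.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

# Tiny illustrative corpus; the thesis trained Skip-grams on its own documents.
sentences = [["taxi", "receipt"], ["hotel", "invoice"], ["train", "ticket"]]
w2v = Word2Vec(sentences, vector_size=50, sg=1, min_count=1)  # sg=1 selects Skip-gram

# Build an embedding matrix aligned with an integer word index (0 = padding).
word_index = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
emb = np.zeros((len(word_index) + 1, 50))
for word, idx in word_index.items():
    emb[idx] = w2v.wv[word]

model = keras.Sequential([
    keras.layers.Embedding(
        input_dim=emb.shape[0], output_dim=50,
        embeddings_initializer=keras.initializers.Constant(emb),
        trainable=False,  # freeze the pre-trained Skip-gram weights
    ),
    keras.layers.LSTM(32),                        # illustrative size
    keras.layers.Dense(9, activation="softmax"),  # nine document categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```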

Classical models were trained on BoW and TF-IDF representations, using a vocabulary consisting of only the top 19 most discriminative features (words) per category, chosen via cross-validation, along with eight feature-engineered variables. Two recurrent neural network models were employed, one with pre-trained word embeddings from Skip-grams and one without. Surprisingly, BoW-based models performed similarly to or better than TF-IDF-based models, contrary to the expectation that TF-IDF would place more emphasis on important words. The imbalanced dataset posed challenges, particularly for one category, which included a diverse range of documents with patterns too indistinct for the models to learn. The Support Vector Machine model trained on the TF-IDF representation demonstrated the best performance, achieving a weighted average accuracy of 86%.
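A hedged sketch of the vocabulary-selection-plus-SVM pipeline follows. Chi-squared scoring stands in for the authors' cross-validated choice of the top-19 words per category, and the corpus and labels are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.svm import LinearSVC

def top_words_per_class(texts, y, k):
    """Keep the k highest-scoring words per category (one-vs-rest chi-squared)."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    terms = np.array(vec.get_feature_names_out())
    keep = set()
    for label in set(y):
        scores, _ = chi2(X, [int(c == label) for c in y])
        keep.update(terms[np.argsort(scores)[-k:]])
    return sorted(keep)

texts = ["taxi receipt airport", "hotel invoice night", "lunch receipt cafe",
         "train ticket return", "hotel booking invoice", "dinner receipt bar"]
y = ["travel", "lodging", "meals", "travel", "lodging", "meals"]

vocab = top_words_per_class(texts, y, k=2)   # the thesis used k = 19
X = TfidfVectorizer(vocabulary=vocab).fit_transform(texts)
clf = LinearSVC().fit(X, y)                  # the best-performing model class
```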

Additionally, the authors propose a framework for reducing the vocabulary size that significantly lowers the computational cost while maintaining high accuracy.
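The reduction framework can be pictured as a scan over vocabulary sizes: shrink to the top-k terms per category, cross-validate, and keep the smallest k whose accuracy stays near the full-vocabulary score. A self-contained toy version, again with chi-squared as a stand-in selection criterion (selection is done on the full data here for brevity; a proper run would select within each CV fold):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

texts = ["taxi receipt airport", "hotel invoice night", "lunch receipt cafe",
         "train ticket return", "hotel booking invoice", "dinner receipt bar",
         "taxi fare city", "flight ticket invoice"]
y = ["travel", "lodging", "meals", "travel",
     "lodging", "meals", "travel", "travel"]

def cv_accuracy_for_k(k):
    """Cross-validated accuracy after shrinking to the top-k words per category."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    terms = np.array(vec.get_feature_names_out())
    keep = set()
    for label in set(y):
        scores, _ = chi2(X, [int(c == label) for c in y])
        keep.update(terms[np.argsort(scores)[-k:]])
    Xk = TfidfVectorizer(vocabulary=sorted(keep)).fit_transform(texts)
    return cross_val_score(LinearSVC(), Xk, y, cv=2).mean()

for k in (1, 2, 4):  # the thesis settled on the top 19 words per category
    print(f"k={k}: mean CV accuracy = {cv_accuracy_for_k(k):.2f}")
```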
author
Croneborg, Claes LU and Karlsson, Viktor LU
alternative title
Classifying the Inbox
course
STAN40 20241
year
2024
type
H1 - Master's Degree (One Year)
keywords
Multi-label text classification, Naive Bayes, Linear Discriminant Analysis, Logistic Regression, Support Vector Machines, k-Nearest Neighbour, Decision Tree, Random Forest, Gradient Boosting, Recurrent Neural Network, Word Embeddings
language
English
id
9160235
date added to LUP
2024-06-17 14:14:45
date last changed
2024-06-17 14:14:45
@misc{9160235,
  abstract     = {{This report investigates the multi-label classification of expense-related documents into nine distinct categories. Various machine learning methods, including Naive Bayes, Logistic Regression, Support Vector Machines, Linear Discriminant Analysis, k-Nearest Neighbors, Decision Trees, Random Forest, Gradient Boosting, and Recurrent Neural Networks, were employed and evaluated. The k-Nearest Neighbors model exhibited the lowest overall performance, likely due to its sensitivity to individual training observations and the impact of outliers in a limited dataset. Intriguingly, k-Nearest Neighbors performed worse with the Term Frequency-Inverse Document Frequency (TF-IDF) representation than with Bag-of-Words (BoW), underscoring the importance of feature representation for k-Nearest Neighbors models.

Among the tested models, multinomial Naive Bayes emerged as one of the top performers, despite initially being considered only a benchmark. The linear models showed similar performance, outperforming k-Nearest Neighbors. Among the tree-based models, Random Forest and Gradient Boosting outperformed the simpler Decision Trees, a success attributable to their use of multiple learners. Recurrent Neural Networks, especially those with pre-trained weights from Skip-grams, underperformed compared to the classical classifiers, most likely due to the small dataset available.

Classical models were trained on BoW and TF-IDF representations, using a vocabulary consisting of only the top 19 most discriminative features (words) per category, chosen via cross-validation, along with eight feature-engineered variables. Two recurrent neural network models were employed, one with pre-trained word embeddings from Skip-grams and one without. Surprisingly, BoW-based models performed similarly to or better than TF-IDF-based models, contrary to the expectation that TF-IDF would place more emphasis on important words. The imbalanced dataset posed challenges, particularly for one category, which included a diverse range of documents with patterns too indistinct for the models to learn. The Support Vector Machine model trained on the TF-IDF representation demonstrated the best performance, achieving a weighted average accuracy of 86%.

Additionally, the authors propose a framework for reducing the vocabulary size that significantly lowers the computational cost while maintaining high accuracy.}},
  author       = {{Croneborg, Claes and Karlsson, Viktor}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Emails, Algorithms, and Bookkeeping}},
  year         = {{2024}},
}