Emails, Algorithms, and Bookkeeping
(2024) STAN40 20241
Department of Statistics
- Abstract
- This report investigates the multi-label classification problem of assigning expense-related documents to nine distinct categories. Various machine learning methods, including Naive Bayes, Logistic Regression, Support Vector Machines, Linear Discriminant Analysis, k-Nearest Neighbors, Decision Trees, Random Forest, Gradient Boosting, and Recurrent Neural Networks, were employed and evaluated. The k-Nearest Neighbors model exhibited the lowest overall performance, likely due to its sensitivity to individual training observations and the impact of outliers in a limited dataset. Intriguingly, k-Nearest Neighbors performed worse with the Term Frequency-Inverse Document Frequency (TF-IDF) representation than with Bag-of-Words (BoW), emphasizing the importance of feature representation in k-Nearest Neighbors models.
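As a minimal illustration of the BoW-versus-TF-IDF contrast described above, the scikit-learn sketch below cross-validates the same k-Nearest Neighbors classifier on both representations. The toy corpus and category labels are invented stand-ins; the thesis's actual data and preprocessing are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented expense snippets; stand-ins for the nine real categories.
docs = [
    "invoice for office supplies and printer paper",
    "invoice for a desk lamp and monitor stand",
    "taxi receipt from the airport transfer",
    "train ticket for regional travel",
    "hotel booking confirmation for two nights",
    "hotel invoice for the conference stay",
    "monthly software subscription renewal",
    "annual software licence invoice",
]
labels = ["supplies", "supplies", "travel", "travel",
          "lodging", "lodging", "software", "software"]

# Same k-NN classifier, two feature representations.
for name, vectorizer in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    pipe = make_pipeline(vectorizer, KNeighborsClassifier(n_neighbors=3))
    scores = cross_val_score(pipe, docs, labels, cv=2)
    print(f"{name}: mean CV accuracy {scores.mean():.2f}")
```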
Among the tested models, multinomial Naive Bayes emerged as one of the top performers, despite initially being considered only a benchmark. Linear models showed similar performance, outperforming k-Nearest Neighbors. Among the tree-based models, Random Forest and Gradient Boosting outperformed the simpler Decision Trees, a result attributable to their use of multiple learners. Recurrent Neural Networks, especially those with pre-trained weights from Skip-grams, underperformed compared to the classical classifiers, most likely due to the small dataset available.
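The Skip-gram-initialized network mentioned above could be set up along the following lines. This is a sketch under assumptions, not the authors' implementation: it uses gensim's Word2Vec with sg=1 for the Skip-gram embeddings and a small GRU classifier in PyTorch, with placeholder tokenized sentences.

```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Placeholder tokenized corpus; the thesis's email data is not public.
sentences = [
    ["invoice", "office", "supplies"],
    ["taxi", "receipt", "airport"],
    ["hotel", "booking", "confirmation"],
]

# sg=1 selects the Skip-gram architecture (rather than CBOW).
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)
vocab = w2v.wv.index_to_key                    # word list, index-aligned
weights = torch.tensor(w2v.wv.vectors)         # (vocab_size, 50) embedding matrix

class RNNClassifier(nn.Module):
    def __init__(self, pretrained, n_classes=9):
        super().__init__()
        # freeze=False lets training fine-tune the pre-trained vectors.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.rnn = nn.GRU(pretrained.size(1), 64, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        _, h = self.rnn(self.embed(token_ids))  # h: (1, batch, 64)
        return self.out(h.squeeze(0))           # logits over the nine categories

model = RNNClassifier(weights)
ids = torch.tensor([[vocab.index(w) for w in sentences[0]]])
print(model(ids).shape)                        # torch.Size([1, 9])
```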
Classical models were trained on BoW and TF-IDF representations, using a vocabulary consisting of only the top-19 most discriminative features (words) for each category, chosen using cross-validation, along with eight feature-engineered variables. Two recurrent neural network models were employed, one with pre-trained word embeddings from Skip-grams and one without. Surprisingly, BoW-based models performed similarly to or better than TF-IDF-based models, contrary to the expectation that TF-IDF would place more emphasis on important words. The imbalanced dataset posed challenges, particularly for one category, which included a diverse range of documents with patterns too indistinct for the models to learn. A Support Vector Machine model on the TF-IDF representation demonstrated the best performance, achieving a weighted average accuracy of 86%.
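The best-performing configuration reported above, a Support Vector Machine on TF-IDF features, corresponds to a pipeline of roughly the following shape. The corpus is again invented, and LinearSVC is only an assumption about the SVM variant; the 86% figure is a result on the thesis's own data, not something this toy example reproduces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training snippets; three stand-in categories.
train_docs = [
    "invoice office chair purchase",
    "flight booking reference number",
    "lunch receipt project meeting",
    "invoice laptop docking station",
    "train ticket to client site",
    "dinner receipt conference guests",
]
train_labels = ["supplies", "travel", "meals",
                "supplies", "travel", "meals"]

# TF-IDF features feeding a linear SVM.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)
print(clf.predict(["train ticket booking for a site visit"]))  # likely ['travel']
```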
Additionally, the authors present a suggested framework for shrinking the vocabulary that significantly reduces the computational cost while still maintaining high accuracy.
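The record does not spell out the selection procedure behind this framework, so the sketch below is only one plausible reading: it scores each word against each category with a one-vs-rest chi-squared test and keeps the top k words per category, mirroring the top-19-words-per-category vocabulary the abstract describes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

def reduced_vocabulary(docs, labels, k=19):
    """Keep the k highest-scoring words per category (one-vs-rest chi-squared)."""
    vec = CountVectorizer()
    counts = vec.fit_transform(docs)
    words = np.array(vec.get_feature_names_out())
    keep = set()
    for cat in set(labels):
        is_cat = np.array([lab == cat for lab in labels])
        scores, _ = chi2(counts, is_cat)        # word-vs-category association
        keep.update(words[np.argsort(scores)[::-1][:k]])
    return sorted(keep)

# Tiny demo with two invented categories.
docs = ["invoice office supplies", "taxi receipt airport",
        "invoice printer paper", "train ticket to the airport"]
labels = ["supplies", "travel", "supplies", "travel"]
print(reduced_vocabulary(docs, labels, k=2))
```

The reduced word list can then be handed back to CountVectorizer or TfidfVectorizer via their vocabulary argument, so every downstream model trains on the smaller feature set.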
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9160235
- author
- Croneborg, Claes and Karlsson, Viktor
- supervisor
- organization
- Department of Statistics
- alternative title
- Classifying the Inbox
- course
- STAN40 20241
- year
- 2024
- type
- H1 - Master's Degree (One Year)
- subject
- keywords
- Multi-label text classification, Naive Bayes, Linear Discriminant Analysis, Logistic Regression, Support Vector Machines, k-Nearest Neighbour, Decision Tree, Random Forest, Gradient Boosting, Recurrent Neural Network, Word Embeddings
- language
- English
- id
- 9160235
- date added to LUP
- 2024-06-17 14:14:45
- date last changed
- 2024-06-17 14:14:45
@misc{9160235,
  abstract = {{This report investigates the multi-label classification problem of assigning expense-related documents to nine distinct categories. Various machine learning methods, including Naive Bayes, Logistic Regression, Support Vector Machines, Linear Discriminant Analysis, k-Nearest Neighbors, Decision Trees, Random Forest, Gradient Boosting, and Recurrent Neural Networks, were employed and evaluated. The k-Nearest Neighbors model exhibited the lowest overall performance, likely due to its sensitivity to individual training observations and the impact of outliers in a limited dataset. Intriguingly, k-Nearest Neighbors performed worse with the Term Frequency-Inverse Document Frequency (TF-IDF) representation than with Bag-of-Words (BoW), emphasizing the importance of feature representation in k-Nearest Neighbors models. Among the tested models, multinomial Naive Bayes emerged as one of the top performers, despite initially being considered only a benchmark. Linear models showed similar performance, outperforming k-Nearest Neighbors. Among the tree-based models, Random Forest and Gradient Boosting outperformed the simpler Decision Trees, a result attributable to their use of multiple learners. Recurrent Neural Networks, especially those with pre-trained weights from Skip-grams, underperformed compared to the classical classifiers, most likely due to the small dataset available. Classical models were trained on BoW and TF-IDF representations, using a vocabulary consisting of only the top-19 most discriminative features (words) for each category, chosen using cross-validation, along with eight feature-engineered variables. Two recurrent neural network models were employed, one with pre-trained word embeddings from Skip-grams and one without. Surprisingly, BoW-based models performed similarly to or better than TF-IDF-based models, contrary to the expectation that TF-IDF would place more emphasis on important words. The imbalanced dataset posed challenges, particularly for one category, which included a diverse range of documents with patterns too indistinct for the models to learn. A Support Vector Machine model on the TF-IDF representation demonstrated the best performance, achieving a weighted average accuracy of 86%. Additionally, the authors present a suggested framework for shrinking the vocabulary that significantly reduces the computational cost while still maintaining high accuracy.}},
  author = {{Croneborg, Claes and Karlsson, Viktor}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Emails, Algorithms, and Bookkeeping}},
  year = {{2024}},
}