Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents

Jansson, Johannes LU and Miller, Victor (2018) In Master's Theses in Mathematical Sciences FMAM05 20181
Mathematics (Faculty of Engineering)
Abstract
Anyone trying to maintain a set of text documents in an information retrieval system will run into problems keeping it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to implement this into an existing information retrieval system with current programming libraries, and in what practical ways this can be useful.

The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine trained with minimal user data provides another. The two clusterings are then evaluated and compared using a set of metrics. The project takes a practical approach to the problem, focusing on what can be implemented with existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and for finding similar documents is implemented as well.
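
The pipeline described above can be illustrated with a minimal scikit-learn sketch. This is not the thesis code: the toy corpus, the labels, the choice of `LinearSVC`, the adjusted Rand index as the comparison metric, and the helper names `run_pipeline` and `most_similar` are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; LABELS stands in for the human ground truth.
DOCS = [
    "invoice payment due bank transfer",
    "bank invoice overdue payment reminder",
    "payment received invoice closed bank",
    "invoice total payment bank account",
    "server outage network restart router",
    "network router firmware server update",
    "server network latency router logs",
    "router reboot fixed server network",
]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]


def run_pipeline(docs, labels, n_clusters=2, labeled_idx=(0, 4)):
    """Return the TF-IDF matrix, K-Means labels, and SVM predictions."""
    # Represent every document as a TF-IDF vector.
    tfidf = TfidfVectorizer().fit_transform(docs)

    # Unsupervised clustering: no labels are used.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(tfidf)

    # Supervised alternative: train on only a few labeled documents
    # (an assumed reading of "minimal user data"), then label the rest.
    idx = list(labeled_idx)
    svm = LinearSVC().fit(tfidf[idx], [labels[i] for i in idx])
    svm_pred = svm.predict(tfidf)

    return tfidf, km.labels_, svm_pred


def most_similar(tfidf, doc_idx, top_n=3):
    """Indices of the top_n documents closest to doc_idx by cosine similarity."""
    sims = cosine_similarity(tfidf[doc_idx], tfidf).ravel()
    sims[doc_idx] = -1.0  # exclude the query document itself
    return [int(i) for i in sims.argsort()[::-1][:top_n]]


if __name__ == "__main__":
    tfidf, km_labels, svm_pred = run_pipeline(DOCS, LABELS)
    # Adjusted Rand index: 1.0 means the clustering matches the ground truth.
    print("K-Means ARI:", adjusted_rand_score(LABELS, km_labels))
    print("SVM ARI:", adjusted_rand_score(LABELS, svm_pred))
    print("Most similar to doc 0:", most_similar(tfidf, 0))
```

On a real corpus the two topic groups would not be this cleanly separated, so the number of clusters, the amount of labeled data, and the evaluation metrics would all need tuning to the dataset.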

The supervised SVM greatly outperforms the unsupervised K-Means at replicating the given ground truth, but both models are useful in their own right. With a relatively basic understanding of machine learning, any company could set up a similar system. Getting the most out of it and tailoring it to the dataset does, however, take deeper mathematical knowledge and fine-tuning.
author: Jansson, Johannes LU and Miller, Victor
alternative title: En studie i maskininlärningsbasereade klustringsmetoder för att efterlikna mänsklig kategorisering av dokument
course: FMAM05 20181
year: 2018
type: H2 - Master's Degree (Two Years)
keywords: machine learning, k-means, support vector machine, svm, tf-idf, clustering, document, documents, pdf, information retrieval, scikit-learn
publication/series: Master's Theses in Mathematical Sciences
report number: LUTFMA-3343-2018
ISSN: 1404-6342
other publication id: 2018:E12
language: English
id: 8938389
date added to LUP: 2018-06-07 17:24:56
date last changed: 2018-10-11 16:19:49
@misc{8938389,
  abstract     = {Anyone trying to maintain a set of text documents in an information retrieval system will run into problems keeping it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to implement this into an existing information retrieval system with current programming libraries, and in what practical ways this can be useful.
 
The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine is trained with minimal user data to provide another clustering. These two are then evaluated and compared using a set of metrics. This project takes a practical approach to the problem, focusing on what can be implemented using existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and calculating similar documents is implemented as well.
 
The supervised method SVM greatly surpasses the unsupervised method K-Means in being able to replicate the given ground truth, but both models are in themselves useful. With a relatively simple understanding of machine learning, any company could set up a similar system. It does, however, take some deeper mathematical knowledge and fine-tuning to get the most out of it and tailor it to the dataset.},
  author       = {Jansson, Johannes and Miller, Victor},
  issn         = {1404-6342},
  keyword      = {machine learning,k-means,support vector machine,svm,tf-idf,clustering,document,documents,pdf,information retrieval,scikit-learn},
  language     = {eng},
  note         = {Student Paper},
  series       = {Master's Theses in Mathematical Sciences},
  title        = {Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents},
  year         = {2018},
}