Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents
(2018) In Master's Theses in Mathematical Sciences FMAM05 20181 Mathematics (Faculty of Engineering)
- Abstract
- Anyone trying to maintain a set of text documents in an information retrieval system will run into problems keeping it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to implement this in an existing information retrieval system with current programming libraries, and in what practical ways this can be useful.
The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine is trained with minimal user data to provide another clustering. These two are then evaluated and compared using a set of metrics. This project takes a practical approach to the problem, focusing on what can be implemented using existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and calculating similar documents is implemented as well.
The supervised method SVM greatly surpasses the unsupervised method K-Means in being able to replicate the given ground truth, but both models are in themselves useful. With a relatively simple understanding of machine learning, any company could set up a similar system. It does, however, take some deeper mathematical knowledge and fine-tuning to get the most out of it and tailor it to the dataset.
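The pipeline described in the abstract (TF-IDF features, a K-Means clustering, and an SVM trained on a few labelled documents, both scored against a human-made ground truth) can be sketched with scikit-learn, which the record lists as a keyword. This is a minimal illustration and not the thesis code: the toy corpus, the labels, the choice of `LinearSVC`, and the adjusted Rand index as the comparison metric are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with human-assigned categories (not from the thesis).
docs = [
    "invoice payment due bank transfer",
    "payment invoice overdue reminder bank",
    "bank transfer invoice receipt payment",
    "football match goal team league",
    "team league football season goal",
    "goal match team football coach",
]
labels = [0, 0, 0, 1, 1, 1]  # assumed ground truth

# Represent every document as a TF-IDF vector.
X = TfidfVectorizer().fit_transform(docs)

# Unsupervised: K-Means groups the documents without seeing any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

# Supervised: an SVM is trained on a small labelled subset
# (one document per category) and predicts the remaining documents.
train_idx = [0, 3]
test_idx = [1, 2, 4, 5]
svm = LinearSVC().fit(X[train_idx], [labels[i] for i in train_idx])
svm_labels = svm.predict(X[test_idx])

# Compare each result with the ground truth. The adjusted Rand index
# ignores how cluster ids are numbered, so it suits both outputs.
kmeans_ari = adjusted_rand_score(labels, kmeans_labels)
svm_ari = adjusted_rand_score([labels[i] for i in test_idx], svm_labels)
print(f"K-Means ARI: {kmeans_ari:.2f}")
print(f"SVM ARI:     {svm_ari:.2f}")
```

On a real corpus the number of clusters, the SVM settings, and the evaluation metrics would need tuning to the dataset, which is the fine-tuning effort the abstract refers to.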
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/8938389
- author
- Jansson, Johannes LU and Miller, Victor
- supervisor
- Karl Åström LU
- organization
- alternative title
- En studie i maskininlärningsbaserade klustringsmetoder för att efterlikna mänsklig kategorisering av dokument
- course
- FMAM05 20181
- year
- 2018
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- machine learning, k-means, support vector machine, svm, tf-idf, clustering, document, documents, pdf, information retrieval, scikit-learn
- publication/series
- Master's Theses in Mathematical Sciences
- report number
- LUTFMA-3343-2018
- ISSN
- 1404-6342
- other publication id
- 2018:E12
- language
- English
- id
- 8938389
- date added to LUP
- 2018-06-07 17:24:56
- date last changed
- 2018-10-11 16:19:49
@misc{8938389,
  abstract = {{Anyone trying to maintain a set of text documents in an information retrieval system will run into problems keeping it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to implement this in an existing information retrieval system with current programming libraries, and in what practical ways this can be useful. The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine is trained with minimal user data to provide another clustering. These two are then evaluated and compared using a set of metrics. This project takes a practical approach to the problem, focusing on what can be implemented using existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and calculating similar documents is implemented as well. The supervised method SVM greatly surpasses the unsupervised method K-Means in being able to replicate the given ground truth, but both models are in themselves useful. With a relatively simple understanding of machine learning, any company could set up a similar system. It does, however, take some deeper mathematical knowledge and fine-tuning to get the most out of it and tailor it to the dataset.}},
  author = {{Jansson, Johannes and Miller, Victor}},
  issn = {{1404-6342}},
  language = {{eng}},
  note = {{Student Paper}},
  series = {{Master's Theses in Mathematical Sciences}},
  title = {{Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents}},
  year = {{2018}},
}