Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents

Jansson, Johannes; Miller, Victor

Jansson, Johannes (LU) and Miller, Victor (2018) In Master's Theses in Mathematical Sciences FMAM05 20181
Mathematics (Faculty of Engineering)

Abstract: Anyone trying to maintain a set of text documents in an information retrieval system will run into problems keeping it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to implement this into an existing information retrieval system with current programming libraries, and in what practical ways this can be useful.

The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine is trained with minimal user data to provide another clustering. These two are then evaluated and compared using a set of metrics. This project takes a practical approach to the problem, focusing on what can be implemented using existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and for finding similar documents is implemented as well.

The supervised method, SVM, greatly surpasses the unsupervised method, K-Means, in replicating the given ground truth, but both models are useful in themselves. With a relatively simple understanding of machine learning, any company could set up a similar system. It does, however, take some deeper mathematical knowledge and fine-tuning to get the most out of it and tailor it to the dataset.
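
The pipeline described in the abstract (TF-IDF representation, one K-Means clustering, one SVM trained on a small amount of labelled data, and a comparison of both against a human grouping) could be sketched roughly as follows. This is an illustrative sketch only, not the thesis's code: scikit-learn is named in the record's keywords, but the toy corpus, the choice of two labelled training documents, the linear kernel, the adjusted Rand index as comparison metric, and cosine similarity for finding similar documents are all assumptions made here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

# Hypothetical toy corpus; human_labels stands in for a human-made grouping.
docs = [
    "invoice payment overdue reminder",
    "payment received invoice closed",
    "invoice total amount payment due",
    "server outage network restart",
    "network latency server reboot",
    "server disk failure network alert",
]
human_labels = [0, 0, 0, 1, 1, 1]

# Represent every document as a TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Unsupervised clustering: K-Means with one cluster per human category.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

# Supervised alternative: train an SVM on only two labelled documents
# (assumed stand-in for "minimal user data"), then label the whole corpus.
train_idx = [0, 3]
svm = SVC(kernel="linear")
svm.fit(X[train_idx], [human_labels[i] for i in train_idx])
svm_labels = svm.predict(X)

# Compare each result with the human grouping (assumed metric: adjusted Rand index).
kmeans_ari = adjusted_rand_score(human_labels, kmeans_labels)
svm_ari = adjusted_rand_score(human_labels, svm_labels)

# Pairwise cosine similarity between documents, usable for "similar document" lookups.
similarity = cosine_similarity(X)

print("K-Means ARI:", kmeans_ari)
print("SVM ARI:", svm_ari)
print("Similarity of doc 0 to all docs:", similarity[0])
```

On a real corpus the number of clusters, the amount of labelled data, and the evaluation metrics would all need to be chosen to fit the dataset, which is the kind of tuning the abstract refers to.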

Please use this URL to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8938389

author

Jansson, Johannes (LU) and Miller, Victor

supervisor

Karl Åström (LU)

organization

Mathematics (Faculty of Engineering)

alternative title

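En studie i maskininlärningsbaserade klustringsmetoder för att efterlikna mänsklig kategorisering av dokument [A study of machine-learning-based clustering methods for imitating human categorization of documents]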

course

FMAM05 20181

year

2018

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

machine learning, k-means, support vector machine, svm, tf-idf, clustering, document, documents, pdf, information retrieval, scikit-learn

publication/series

Master's Theses in Mathematical Sciences

report number

LUTFMA-3343-2018

ISSN

1404-6342

other publication id

2018:E12

language

English

id

8938389

date added to LUP

2018-06-07 17:24:56

date last changed

2018-10-11 16:19:49

@misc{8938389,
  abstract     = {{Anyone trying to maintain a set of text documents in an information retrieval system will run into problems keeping it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to implement this into an existing information retrieval system with current programming libraries, and in what practical ways this can be useful.
 
The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine is trained with minimal user data to provide another clustering. These two are then evaluated and compared using a set of metrics. This project takes a practical approach to the problem, focusing on what can be implemented using existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and for finding similar documents is implemented as well.
 
The supervised method SVM greatly surpasses the unsupervised method K-Means in being able to replicate the given ground truth, but both models are in themselves useful. With a relatively simple understanding of machine learning, any company could set up a similar system. It does, however, take some deeper mathematical knowledge and fine tuning to get the most out of it and tailor it to the dataset.}},
  author       = {{Jansson, Johannes and Miller, Victor}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents}},
  year         = {{2018}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES