Large scale cluster analysis with Hadoop and Mahout
(2015) EITM01 20132
Department of Electrical and Information Technology
- Abstract
- User-generated data is becoming more and more common. This data often expands into hundreds of millions, if not billions, of data points. It is in the interest of every company with these vast amounts of data to make sense of them in one way or another. In machine learning, cluster analysis has been one way of trying to categorize data without supervision. Mahout is a library which runs on top of the Hadoop framework and tries to make cluster analysis (as well as other machine learning algorithms) arbitrarily scalable. This thesis focuses on using Mahout to cluster a large data set to see if the clustering algorithms in Mahout will scale to several million documents and tens of millions of dimensions. I find that while it is theoretically possible, there are several practical limitations that influence both the ability to run cluster analysis on such data sets and the results.
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/5148140
- author
- Aronsson, Felix LU
- supervisor
- organization
- course
- EITM01 20132
- year
- 2015
- type
- H2 - Master's Degree (Two Years)
- subject
- report number
- LU/LTH-EIT 2015-431
- language
- English
- id
- 5148140
- date added to LUP
- 2015-03-17 15:30:34
- date last changed
- 2015-03-17 15:31:45
@misc{5148140,
  abstract = {{User generated data is getting more and more common. This data often expands in to hundreds of millions, if not billions, of data points. It is in the interest of every company with these vast amounts of data to make sense of them in one way or another. In machine learning, cluster analysis has been one way of trying to categorize data without supervision. Mahout is a library which runs on top of the Hadoop framework and tries to make cluster analysis (as well as other machine learning algorithms) arbitrarily scalable. This thesis focuses on using Mahout to cluster a large data set to see if the clustering algorithms in Mahout will scale to several millions of documents and tens of millions of dimensions. I find that while it is theoretically possible, there are several practical limitations that influence both the ability to run cluster analysis on such data sets, and also the results.}},
  author = {{Aronsson, Felix}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Large scale cluster analysis with Hadoop and Mahout}},
  year = {{2015}},
}