Large scale cluster analysis with Hadoop and Mahout

Aronsson, Felix LU (2015) EITM01 20132
Department of Electrical and Information Technology
Abstract
User-generated data is becoming increasingly common, and such data sets often grow to hundreds of millions, if not billions, of data points. Every company holding data at this scale has an interest in making sense of it. In machine learning, cluster analysis is one way of categorizing data without supervision. Mahout is a library that runs on top of the Hadoop framework and aims to make cluster analysis (as well as other machine learning algorithms) arbitrarily scalable. This thesis uses Mahout to cluster a large data set to see whether its clustering algorithms scale to several million documents and tens of millions of dimensions. I find that while this is theoretically possible, several practical limitations affect both the ability to run cluster analysis on such data sets and the results.
author: Aronsson, Felix LU
organization: Department of Electrical and Information Technology
course: EITM01 20132
year: 2015
type: H2 - Master's Degree (Two Years)
report number: LU/LTH-EIT 2015-431
language: English
id: 5148140
date added to LUP: 2015-03-17 15:30:34
date last changed: 2015-03-17 15:31:45
@misc{5148140,
  abstract     = {User-generated data is becoming increasingly common, and such data sets often grow to hundreds of millions, if not billions, of data points. Every company holding data at this scale has an interest in making sense of it. In machine learning, cluster analysis is one way of categorizing data without supervision. Mahout is a library that runs on top of the Hadoop framework and aims to make cluster analysis (as well as other machine learning algorithms) arbitrarily scalable. This thesis uses Mahout to cluster a large data set to see whether its clustering algorithms scale to several million documents and tens of millions of dimensions. I find that while this is theoretically possible, several practical limitations affect both the ability to run cluster analysis on such data sets and the results.},
  author       = {Aronsson, Felix},
  language     = {eng},
  note         = {Student Paper},
  title        = {Large scale cluster analysis with Hadoop and Mahout},
  year         = {2015},
}