Large scale cluster analysis with Hadoop and Mahout

Aronsson, Felix

Aronsson, Felix (2015) EITM01 20132
Department of Electrical and Information Technology

Abstract: User-generated data is becoming more and more common. This data often expands into hundreds of millions, if not billions, of data points. It is in the interest of every company with such vast amounts of data to make sense of it in one way or another. In machine learning, cluster analysis is one way of categorizing data without supervision. Mahout is a library that runs on top of the Hadoop framework and tries to make cluster analysis (as well as other machine learning algorithms) arbitrarily scalable. This thesis focuses on using Mahout to cluster a large data set to see whether the clustering algorithms in Mahout scale to several million documents and tens of millions of dimensions. I find that while it is theoretically possible, there are several practical limitations that affect both the ability to run cluster analysis on such data sets and the results.

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/5148140

author

Aronsson, Felix

supervisor

Yufei Pan

organization

Department of Electrical and Information Technology

course

EITM01 20132

year

2015

type

H2 - Master's Degree (Two Years)

subject

Technology and Engineering

report number

LU/LTH-EIT 2015-431

language

English

id

5148140

date added to LUP

2015-03-17 15:30:34

date last changed

2015-03-17 15:31:45

@misc{5148140,
  abstract     = {{User-generated data is becoming more and more common. This data often expands into hundreds of millions, if not billions, of data points. It is in the interest of every company with such vast amounts of data to make sense of it in one way or another. In machine learning, cluster analysis is one way of categorizing data without supervision. Mahout is a library that runs on top of the Hadoop framework and tries to make cluster analysis (as well as other machine learning algorithms) arbitrarily scalable. This thesis focuses on using Mahout to cluster a large data set to see whether the clustering algorithms in Mahout scale to several million documents and tens of millions of dimensions. I find that while it is theoretically possible, there are several practical limitations that affect both the ability to run cluster analysis on such data sets and the results.}},
  author       = {{Aronsson, Felix}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Large scale cluster analysis with Hadoop and Mahout}},
  year         = {{2015}},
}
