LUP Student Papers

LUND UNIVERSITY LIBRARIES

Distribuera mera - Spark och Hadoop utan Big Data ("Distribute more - Spark and Hadoop without Big Data")

Nihlgård, Oscar (2016)
Computer Science and Engineering (BSc)
Abstract
Distribution as a concept means that a task (for example, data storage or code execution) is
parallelized across multiple computers. It goes hand in hand with the concept of big data:
extreme amounts of data that can't be processed by a single computer. Because of this, the
most established tools for distributed parallelization are tools designed to handle big
data. This thesis explores whether two such tools, Spark (distributed code execution) and the Hadoop
Distributed File System (distributed data storage), are also suited to handling smaller
amounts of data. Distribution is a potentially cheap and scalable way of working even for
small amounts of data. The primary method of the report is performance tests. As a side track, an abstraction layer
that allows code to be executed either distributed or locally is implemented, using
Java streams as a local equivalent of Spark. With this abstraction layer, small tasks that are
only sometimes suited for distribution can choose the best alternative at run time. It is concluded that these tools can be useful even for small amounts of data, and even when
the execution time of a non-distributed solution is very short (under a minute).
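The abstraction layer described in the abstract — Java streams as a local stand-in for Spark, with the choice made at run time — could be sketched roughly as below. This is a hypothetical illustration, not the thesis's actual implementation: the class name, the size threshold, and the dispatch logic are all assumptions, and the Spark branch is shown only as a comment since it requires a configured cluster.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: one map() entry point that picks local (Java streams)
// or distributed (Spark) execution at run time. Names and the threshold are
// illustrative only.
public class DistributableMap {

    // Below this input size, distribution overhead likely outweighs the gain.
    static final int DISTRIBUTION_THRESHOLD = 1_000_000;

    static <T, R> List<R> map(List<T> input, Function<T, R> f) {
        if (input.size() < DISTRIBUTION_THRESHOLD) {
            // Local path: Java streams as the local equivalent of Spark.
            return input.stream().map(f).collect(Collectors.toList());
        }
        // Distributed path would hand the work to Spark, e.g.:
        //   JavaRDD<T> rdd = sparkContext.parallelize(input);
        //   return rdd.map(f::apply).collect();
        throw new UnsupportedOperationException("Spark cluster not configured");
    }

    public static void main(String[] args) {
        List<Integer> squares = map(List.of(1, 2, 3), x -> x * x);
        System.out.println(squares); // prints [1, 4, 9]
    }
}
```

Because both paths share one signature, callers never need to know which back end ran — which is the point of the run-time choice the abstract describes.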
author: Nihlgård, Oscar
organization:
year:
type: M2 - Bachelor Degree
subject:
keywords: big data, hadoop, spark, hdfs, distribuering
language: Swedish
id: 8880990
date added to LUP: 2016-06-14 04:07:23
date last changed: 2018-10-18 10:33:23
@misc{8880990,
  abstract     = {{Distribution as a concept means that a task (for example, data storage or code execution) is
parallelized across multiple computers. It goes hand in hand with the concept of big data:
extreme amounts of data that can't be processed by a single computer. Because of this, the
most established tools for distributed parallelization are tools designed to handle big
data. This thesis explores whether two such tools, Spark (distributed code execution) and the Hadoop
Distributed File System (distributed data storage), are also suited to handling smaller
amounts of data. Distribution is a potentially cheap and scalable way of working even for
small amounts of data. The primary method of the report is performance tests. As a side track, an abstraction layer
that allows code to be executed either distributed or locally is implemented, using
Java streams as a local equivalent of Spark. With this abstraction layer, small tasks that are
only sometimes suited for distribution can choose the best alternative at run time. It is concluded that these tools can be useful even for small amounts of data, and even when
the execution time of a non-distributed solution is very short (under a minute).}},
  author       = {{Nihlgård, Oscar}},
  language     = {{swe}},
  note         = {{Student Paper}},
  title        = {{Distribuera mera - Spark och Hadoop utan Big Data}},
  year         = {{2016}},
}