Distribuera mera - Spark och Hadoop utan Big Data
(2016) Computer Science and Engineering (BSc)
- Abstract
- Distribution as a concept means that a task (for example, data storage or code execution) is
parallelized across multiple computers. It goes hand in hand with the concept of big data:
extreme amounts of data that cannot be processed by a single computer. Because of this, the
most established tools for distributed parallelization are tools designed to handle big
data. This thesis explores whether two such tools, Spark (distributed code execution) and the Hadoop
Distributed File System (distributed data storage), are also suited to handling smaller
amounts of data. Distribution is a potentially cheap and scalable way of working even for
small amounts of data. The primary method of the report is performance tests. As a side track, an abstraction layer
that allows code to be executed either distributed or locally is implemented, using
Java streams as a local equivalent of Spark. With this abstraction layer, small tasks that are
only sometimes suited for distribution can choose the best alternative at run time. It is concluded that these tools can be useful even for small amounts of data, and even when
the execution time of a non-distributed solution is very short (under a minute).
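The abstraction layer the abstract describes can be sketched as a small interface with interchangeable backends. This is a hedged illustration, not code from the thesis: the interface name `Dataset`, the class names, and the size threshold are all hypothetical. Only the local backend (Java streams) is shown; a Spark backend would wrap a `JavaRDD` behind the same interface and is omitted because it needs a cluster and the Spark dependency.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical common interface: the caller programs against this,
// not against Spark or java.util.stream directly.
interface Dataset<T> {
    <R> Dataset<R> map(Function<T, R> f);
    List<T> collect();
}

// Local backend: delegates to Java streams, the thesis's stand-in for Spark.
final class LocalDataset<T> implements Dataset<T> {
    private final List<T> data;
    LocalDataset(List<T> data) { this.data = data; }

    public <R> Dataset<R> map(Function<T, R> f) {
        return new LocalDataset<>(data.stream().map(f).collect(Collectors.toList()));
    }

    public List<T> collect() { return data; }
}

public class Backends {
    // Run-time choice: below some (hypothetical) size threshold, stay local.
    // For large inputs a Spark-backed Dataset would be returned instead;
    // that branch is omitted here.
    static <T> Dataset<T> of(List<T> data, int distributionThreshold) {
        return new LocalDataset<>(data);
    }

    public static void main(String[] args) {
        Dataset<Integer> d = of(List.of(1, 2, 3), 1000);
        System.out.println(d.map(x -> x * 2).collect()); // prints [2, 4, 6]
    }
}
```

The design point is that the decision "distribute or not" is made once, at data-set construction, so task code written against `Dataset` runs unchanged on either backend.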
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/8880990
- author
- Nihlgård, Oscar
- organization
- year
- 2016
- type
- M2 - Bachelor Degree
- subject
- keywords
- big data, hadoop, spark, hdfs, distribuering
- language
- Swedish
- id
- 8880990
- date added to LUP
- 2016-06-14 04:07:23
- date last changed
- 2018-10-18 10:33:23
@misc{8880990,
  abstract = {{Distribution as a concept means that a task (for example, data storage or code execution) is parallelized across multiple computers. It goes hand in hand with the concept of big data: extreme amounts of data that cannot be processed by a single computer. Because of this, the most established tools for distributed parallelization are tools designed to handle big data. This thesis explores whether two such tools, Spark (distributed code execution) and the Hadoop Distributed File System (distributed data storage), are also suited to handling smaller amounts of data. Distribution is a potentially cheap and scalable way of working even for small amounts of data. The primary method of the report is performance tests. As a side track, an abstraction layer that allows code to be executed either distributed or locally is implemented, using Java streams as a local equivalent of Spark. With this abstraction layer, small tasks that are only sometimes suited for distribution can choose the best alternative at run time. It is concluded that these tools can be useful even for small amounts of data, and even when the execution time of a non-distributed solution is very short (under a minute).}},
  author   = {{Nihlgård, Oscar}},
  language = {{swe}},
  note     = {{Student Paper}},
  title    = {{Distribuera mera - Spark och Hadoop utan Big Data}},
  year     = {{2016}},
}