LUP Student Papers

LUND UNIVERSITY LIBRARIES

Distribuera mera - Spark och Hadoop utan Big Data ("Distribute more - Spark and Hadoop without Big Data")

Nihlgård, Oscar (2016)
Computer Science and Engineering (BSc)
Abstract
Distribution as a concept means that a task (for example, data storage or code execution) is
parallelized across multiple computers. It goes hand in hand with the concept of big data:
extreme amounts of data that can't be processed by a single computer. Because of this, the
most established tools for distributed parallelization are tools designed to handle big
data. This thesis explores whether two such tools, Spark (distributed code execution) and the Hadoop
Distributed File System (distributed data storage), are also suited to handling smaller
amounts of data. Distribution is a potentially cheap and scalable way of working even for
small amounts of data. The primary method of the report is performance tests. As a side track, an abstraction layer
that allows code to be executed either distributed or locally is implemented, using
Java streams as a local equivalent of Spark. With this abstraction layer, small tasks that are
only sometimes suited for distribution can choose the best alternative at run time. It is concluded that these tools can be useful even for small amounts of data, and even when
the execution time of a non-distributed solution is very short (under a minute).
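The abstraction layer described in the abstract — Java streams as a local stand-in for Spark, with the choice made at run time — could be sketched roughly as below. This is a hypothetical illustration, not the thesis's actual implementation: the class name, the size threshold, and the dispatch logic are all assumptions, and the Spark branch is shown only as a comment since it requires a configured cluster.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: one map() entry point that picks local (Java streams)
// or distributed (Spark) execution at run time. Names and the threshold are
// illustrative only.
public class DistributableMap {

    // Below this input size, distribution overhead likely outweighs the gain.
    static final int DISTRIBUTION_THRESHOLD = 1_000_000;

    static <T, R> List<R> map(List<T> input, Function<T, R> f) {
        if (input.size() < DISTRIBUTION_THRESHOLD) {
            // Local path: Java streams as the local equivalent of Spark.
            return input.stream().map(f).collect(Collectors.toList());
        }
        // Distributed path would hand the work to Spark, e.g.:
        //   JavaRDD<T> rdd = sparkContext.parallelize(input);
        //   return rdd.map(f::apply).collect();
        throw new UnsupportedOperationException("Spark cluster not configured");
    }

    public static void main(String[] args) {
        List<Integer> squares = map(List.of(1, 2, 3), x -> x * x);
        System.out.println(squares); // prints [1, 4, 9]
    }
}
```

Because both paths share one signature, callers never need to know which back end ran — which is the point of the run-time choice the abstract describes.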
author: Nihlgård, Oscar
organization:
year:
type: M2 - Bachelor Degree
subject:
keywords: big data, hadoop, spark, hdfs, distribuering
language: Swedish
id: 8880990
date added to LUP: 2016-06-14 04:07:23
date last changed: 2018-10-18 10:33:23
@misc{8880990,
  abstract     = {{Distribution as a concept means that a task (for example, data storage or code execution) is
parallelized across multiple computers. It goes hand in hand with the concept of big data:
extreme amounts of data that can't be processed by a single computer. Because of this, the
most established tools for distributed parallelization are tools designed to handle big
data. This thesis explores whether two such tools, Spark (distributed code execution) and the Hadoop
Distributed File System (distributed data storage), are also suited to handling smaller
amounts of data. Distribution is a potentially cheap and scalable way of working even for
small amounts of data. The primary method of the report is performance tests. As a side track, an abstraction layer
that allows code to be executed either distributed or locally is implemented, using
Java streams as a local equivalent of Spark. With this abstraction layer, small tasks that are
only sometimes suited for distribution can choose the best alternative at run time. It is concluded that these tools can be useful even for small amounts of data, and even when
the execution time of a non-distributed solution is very short (under a minute).}},
  author       = {{Nihlgård, Oscar}},
  language     = {{swe}},
  note         = {{Student Paper}},
  title        = {{Distribuera mera - Spark och Hadoop utan Big Data}},
  year         = {{2016}},
}