Exploring the data behind students’ published theses - Analyzing the pride of Lund University
(2020) STAH11 20192
Department of Statistics
- Abstract
- Lund University has for over ten years used a website called LUP Student Papers to publish theses from bachelor's and master's courses. The aim of this thesis is to use visualizations and data mining techniques to explore and show interesting aspects of the data. The wide variety of variables in the dataset can be used to gain insight into several interesting questions: Is the number of theses increasing each year? Are theses in English becoming more common? How many times has a thesis been downloaded on average?
The second purpose of this thesis is to use a Random Forest model to classify abstracts into three faculties: LUSEM, Social Science and Engineering. The aim is to see whether the three faculties can be easily classified, which would suggest that there is some noticeable difference in the text between the faculties. The abstracts had to be preprocessed with natural language processing techniques such as tokenization and stemming. The classification model achieved a relatively good accuracy of around 0.80, which suggests that the abstracts can be classified by faculty. Further research can focus on different models for text classification.
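The preprocessing named in the abstract (tokenization and stemming) can be illustrated with a minimal sketch. This is not the thesis author's pipeline: the function names and the suffix list are hypothetical, and the crude suffix stripper only stands in for an established algorithm such as Porter stemming.

```python
import re
from collections import Counter

# Hypothetical suffix list for a deliberately crude stemmer (an assumption,
# not taken from the thesis); longer suffixes are listed first.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_stems(text):
    """Count stemmed tokens; counts like these could feed a classifier."""
    return Counter(stem(t) for t in tokenize(text))
```

Calling `bag_of_stems` on each abstract would give one bag of stem counts per thesis, which is the kind of feature representation a model such as a Random Forest could be trained on.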
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9033084
- author
- Granberg, Per LU
- supervisor
- organization
- course
- STAH11 20192
- year
- 2020
- type
- M2 - Bachelor Degree
- subject
- keywords
- data visualization, Lund University, classification model, NLP
- language
- English
- id
- 9033084
- date added to LUP
- 2021-01-07 12:19:11
- date last changed
- 2021-01-07 12:19:11
@misc{9033084,
  abstract     = {{Lund University has for over ten years used a website called LUP Student Papers to publish theses from bachelor's and master's courses. The aim of this thesis is to use visualizations and data mining techniques to explore and show interesting aspects of the data. The wide variety of variables in the dataset can be used to gain insight into several interesting questions: Is the number of theses increasing each year? Are theses in English becoming more common? How many times has a thesis been downloaded on average? The second purpose of this thesis is to use a Random Forest model to classify abstracts into three faculties: LUSEM, Social Science and Engineering. The aim is to see whether the three faculties can be easily classified, which would suggest that there is some noticeable difference in the text between the faculties. The abstracts had to be preprocessed with natural language processing techniques such as tokenization and stemming. The classification model achieved a relatively good accuracy of around 0.80, which suggests that the abstracts can be classified by faculty. Further research can focus on different models for text classification.}},
  author       = {{Granberg, Per}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Exploring the data behind students’ published theses - Analyzing the pride of Lund University}},
  year         = {{2020}},
}