On the Effectiveness of Handcrafted and Learned Features in Automated Essay Scoring

Jakobsson, Edvin

Jakobsson, Edvin ^LU (2019) FYTM04 20191
Computational Biology and Biological Physics - Has been reorganised

Abstract: The task of Automated Essay Scoring (AES) has been active for more than half a century, starting with handcrafted statistical features used for linear regression, and is currently being improved by the latest advancements in machine learning and natural language processing. Most current research represents essays with some form of character or word embeddings rather than statistical features, enabling the models to analyze the text in full and automatically learn what to look for. Handcrafted features have possibly reached their maximum potential, and have been shown to be outperformed by more complex representations of textual data. However, the fundamental differences between handcrafted and learned features have not been properly documented, nor have their fundamental strengths and weaknesses been compared.
In this paper we compare two different kinds of models for automated essay scoring: a Multilayer Perceptron (MLP) using handcrafted features and a standard Convolutional Neural Network (CNN) using word embeddings. The models are trained and tested, and their strengths and weaknesses are discussed. We show that a simple CNN outperforms the MLP using handcrafted features, but that the MLP is a viable method for small tasks because of its easier implementation and shorter training time. We also provide some tips and suggestions for constructing a CNN for AES, and we discuss a potential downside of the quadratic weighted kappa score, which is sometimes suggested as a validation metric for AES systems.
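For readers unfamiliar with the metric mentioned in the abstract, the following is a minimal sketch of the quadratic weighted kappa in its commonly used form, not the thesis's own implementation. The function name, its parameters, and the choice to return 1.0 when the expected disagreement is zero are assumptions made for illustration.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two integer score lists, penalising large gaps quadratically.

    Hypothetical helper for illustration; ratings must lie in
    [min_rating, max_rating].
    """
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two possible ratings")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")

    total = len(rater_a)

    # Observed confusion matrix: observed[i][j] counts essays scored i by A and j by B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Assumption: both raters gave one identical constant score; treat as full agreement.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    human = [1, 2, 3, 4]
    model = [1, 2, 3, 3]
    print(quadratic_weighted_kappa(human, model, 1, 4))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the chance-expected term depends on the two score distributions, the same set of errors can yield different kappa values on differently distributed data.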
Popular Abstract (translated from Swedish): When people talk about the future, one concept comes up in almost every discussion: "artificial intelligence". Our fascination with the potentially self-thinking robots of the future is reflected in how much the topic is discussed across all professions right now, whether the subject is carbon dioxide emissions, self-driving cars, or automatic shopping lists. But how far has machine learning actually come today? How intelligent is today's artificial "intelligence"?
An important aspect of machine learning is getting a computer to understand human speech and writing. Many professions today would be made considerably easier if computers could help with monotonous tasks such as interpreting medical records or grading lab reports. Teachers spend countless hours marking exams and reports, time that could otherwise be spent on teaching and development. The task is monotonous and rarely very rewarding, but it is unavoidable, and it is extremely important that it is done correctly and fairly. Intelligent programs capable of reading a text and giving insightful comments and scores would be invaluable to teachers and could save society enormous resources.
The development of exam-grading programs began as early as the computerization of society, and much has changed in their structure and effectiveness since then. The earliest models made educated guesses about a text based on statistical values such as the length of the text or the number of spelling errors. Today's models are based on an entirely different kind of intelligence, reminiscent of how a child first learns to read. The programs are exposed to enormous amounts of text and learn for themselves what separates a half-finished shopping list from a literary masterpiece.
So how well do these new programs work? How well do they understand what a real person thinks or writes? And how useful are they in today's society?

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8991130

author

Jakobsson, Edvin ^LU

supervisor

Mattias Ohlsson ^LU

organization

Computational Biology and Biological Physics - Has been reorganised

course

FYTM04 20191

year

2019

type

H2 - Master's Degree (Two Years)

subject

Technology and Engineering

keywords

Natural Language Processing Machine Learning Automated Essay Scoring

report number

LU TP 19-35

language

English

id

8991130

date added to LUP

2019-07-29 16:18:01

date last changed

2019-07-29 16:18:01

@misc{8991130,
  abstract     = {{The task of Automated Essay Scoring (AES) has been active for more than half a century, starting with handcrafting statistical features used for linear regression, and currently being improved by the latest advancements in machine learning and natural language processing. Most current research uses some form of character or word embeddings to represent the essays rather than statistical features, enabling the models to analyze the text in full and automatically learn what to look for. Handcrafted features have possibly reached their maximum potential, and have been shown to be outperformed by more complex representations of textual data. However, the fundamental differences between handcrafted and learned features have not been properly documented, nor their fundamental strengths and weaknesses compared.
In this paper we compare two different kinds of models for automated essay scoring, a Multilayer Perceptron (MLP) using handcrafted features and a standard Convolutional Neural Network (CNN) using word embeddings. The models are trained and tested and their strengths and weaknesses are discussed. We show that a simple CNN outperforms the MLP using handcrafted features, but that the MLP is a viable method to use for small tasks because of the easier implementation and shorter training time. We also provide some tips and suggestions when constructing a CNN for AES, and we discuss a potential downside of the quadratic weighted kappa score that is sometimes a suggested validation metric for AES-systems.}},
  author       = {{Jakobsson, Edvin}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{On the Effectiveness of Handcrafted and Learned Features in Automated Essay Scoring}},
  year         = {{2019}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
