Comparing and combining two approaches to automated subject classification of text

Golub, Koraljka; Ardö, Anders; Mladenic, Dunja; Grobelnik, Marko

Golub, Koraljka; Ardö, Anders; Mladenic, Dunja and Grobelnik, Marko (2006). 10th European Conference, ECDL 2006, vol. 4172, pp. 467-470

Abstract: A machine-learning and a string-matching approach to automated subject classification of text were compared, as to their performance, advantages and downsides. The former approach was based on an SVM algorithm, while the latter comprised string-matching between a controlled vocabulary and words in the text to be classified. Data collection consisted of a subset from Compendex, classified into six different classes. It was shown that SVM on average outperforms the string-matching approach: our hypothesis that SVM yields better recall and string-matching better precision was confirmed only on one of the classes. The two approaches being complementary, we investigated different combinations of the two based on combining their vocabularies. The results have shown that the original approaches, i.e. machine-learning approach without using background knowledge from the controlled vocabulary, and string-matching approach based on controlled vocabulary, outperform approaches in which combinations of automatically and manually obtained terms were used. Reasons for these results need further investigation, including a larger data collection and combining the two using predictions.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/387253

author

Golub, Koraljka; Ardö, Anders; Mladenic, Dunja and Grobelnik, Marko

organization

Department of Electrical and Information Technology

publishing date

2006

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Electrical Engineering, Electronic Engineering, Information Engineering

host publication

Research and Advanced Technology for Digital Libraries. Proceedings / Lecture Notes in Computer Science

volume

4172

pages

467 - 470

publisher

Springer

conference name

10th European Conference, ECDL 2006

conference location

Alicante, Spain

conference dates

2006-09-17 - 2006-09-22

external identifiers

wos:000241101500045
scopus:33750236672

ISSN

0302-9743

DOI

10.1007/11863878_45

language

English

LU publication?

yes

id

2bb00c04-3a65-4f21-8708-615f60bdc107 (old id 387253)

alternative location

http://www.eit.lth.se/fileadmin/eit/home/hs.aar/Publ/ECDL2006.pdf

date added to LUP

2016-04-01 12:00:35

date last changed

2026-01-02 08:41:43

@inproceedings{2bb00c04-3a65-4f21-8708-615f60bdc107,
  abstract     = {{A machine-learning and a string-matching approach to automated subject classification of text were compared, as to their performance, advantages and downsides. The former approach was based on an SVM algorithm, while the latter comprised string-matching between a controlled vocabulary and words in the text to be classified. Data collection consisted of a subset from Compendex, classified into six different classes. It was shown that SVM on average outperforms the string-matching approach: our hypothesis that SVM yields better recall and string-matching better precision was confirmed only on one of the classes. The two approaches being complementary, we investigated different combinations of the two based on combining their vocabularies. The results have shown that the original approaches, i.e. machine-learning approach without using background knowledge from the controlled vocabulary, and string-matching approach based on controlled vocabulary, outperform approaches in which combinations of automatically and manually obtained terms were used. Reasons for these results need further investigation, including a larger data collection and combining the two using predictions.}},
  author       = {{Golub, Koraljka and Ardö, Anders and Mladenic, Dunja and Grobelnik, Marko}},
  booktitle    = {{Research and Advanced Technology for Digital Libraries. Proceedings / Lecture Notes in Computer Science}},
  issn         = {{0302-9743}},
  language     = {{eng}},
  pages        = {{467--470}},
  publisher    = {{Springer}},
  title        = {{Comparing and combining two approaches to automated subject classification of text}},
  url          = {{http://dx.doi.org/10.1007/11863878_45}},
  doi          = {{10.1007/11863878_45}},
  volume       = {{4172}},
  year         = {{2006}},
}
