Automatic classification using DDC on the Swedish union catalogue

Golub, Koraljka (LU); Hagelbäck, Johan and Ardö, Anders (LU) (2018) 18th European Networked Knowledge Organization Systems Workshop, NKOS 2018. In CEUR Workshop Proceedings 2200. p.4-16
Abstract

With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of two machine learning algorithms for Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, evaluation shows that Support Vector Machine with linear kernel outperforms Multinomial Naïve Bayes algorithm. Also, using keywords or combining titles and keywords gives better results than using only titles as input. The class imbalance where many DDC classes only have few records greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few records on which to train are available. Proposed future research involves an exploration of the intellectual effort put into creating the DDC to further improve the algorithm performance as commonly applied in string matching, and to test the best approach on new digital collections that do not have DDC assigned.
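
As a rough illustration of two ideas in the abstract, the sketch below derives the top three hierarchical DDC levels from a class number and keeps only the classes that have a minimum number of records. It is not the authors' code: the function names, the zero-padding convention for the upper levels, and the sample class numbers are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def ddc_levels(notation: str) -> Tuple[str, str, str]:
    """Split a DDC number into its main class, division, and section.

    Assumes the usual three-digit base number, optionally followed by
    a decimal extension (for example "839.738").
    """
    base = notation.strip().split(".")[0]
    if len(base) != 3 or not base.isdigit():
        raise ValueError(f"not a three-digit DDC base number: {notation!r}")
    # Upper levels are written zero-padded, e.g. 800 and 830 for 839.
    return base[0] + "00", base[:2] + "0", base


def frequent_classes(labels: Iterable[str], min_records: int) -> Set[str]:
    """Return the classes that occur at least min_records times."""
    counts = Counter(labels)
    return {label for label, n in counts.items() if n >= min_records}


# Hypothetical class numbers, not taken from LIBRIS.
sample = ["839.738", "839.7", "004.6", "839.5"]
sections = [ddc_levels(number)[2] for number in sample]
print(ddc_levels("839.738"))
print(frequent_classes(sections, min_records=3))
```

With `min_records=1000`, the same filter would correspond to the 1,000-records-per-class threshold mentioned in the abstract, although how the paper actually selected classes is not stated here.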

type
Chapter in Book/Report/Conference proceeding
publication status
published
subject
keywords
Automatic classification, Dewey Decimal Classification, LIBRIS, Machine learning, Multinomial Naïve Bayes, Subject access, Support Vector Machine
host publication
Proceedings of the 18th European Networked Knowledge Organization Systems (NKOS) Workshop co-located with the 22nd International Conference on Theory and Practice of Digital Libraries 2018 (TPDL 2018)
series title
CEUR Workshop Proceedings
volume
2200
pages
13 pages
publisher
CEUR
conference name
18th European Networked Knowledge Organization Systems Workshop, NKOS 2018
conference location
Porto, Portugal
conference dates
2018-09-13
external identifiers
  • scopus:85053933816
ISSN
1613-0073
language
English
LU publication?
yes
id
bcc9aee9-3f1d-4b6a-9e3a-df80aabe22a5
alternative location
http://ceur-ws.org/Vol-2200/paper1.pdf
date added to LUP
2018-10-26 07:44:18
date last changed
2019-02-20 11:33:30
@inproceedings{bcc9aee9-3f1d-4b6a-9e3a-df80aabe22a5,
  abstract     = {With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of two machine learning algorithms for Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, evaluation shows that Support Vector Machine with linear kernel outperforms Multinomial Naïve Bayes algorithm. Also, using keywords or combining titles and keywords gives better results than using only titles as input. The class imbalance where many DDC classes only have few records greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few records on which to train are available. Proposed future research involves an exploration of the intellectual effort put into creating the DDC to further improve the algorithm performance as commonly applied in string matching, and to test the best approach on new digital collections that do not have DDC assigned.},
  author       = {Golub, Koraljka and Hagelbäck, Johan and Ardö, Anders},
  booktitle    = {CEUR Workshop Proceedings},
  issn         = {1613-0073},
  keyword      = {Automatic classification,Dewey Decimal Classification,LIBRIS,Machine learning,Multinomial Naïve Bayes,Subject access,Support Vector Machine},
  language     = {eng},
  location     = {Porto, Portugal},
  pages        = {4--16},
  publisher    = {CEUR},
  title        = {Automatic classification using DDC on the Swedish union catalogue},
  volume       = {2200},
  year         = {2018},
}