Automatic classification using DDC on the Swedish union catalogue

Golub, Koraljka; Hagelbäck, Johan; Ardö, Anders

Golub, Koraljka; Hagelbäck, Johan and Ardö, Anders (2018) 18th European Networked Knowledge Organization Systems Workshop, NKOS 2018. In CEUR Workshop Proceedings 2200, p. 4-16

Abstract: With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of two machine learning algorithms for Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, evaluation shows that Support Vector Machine with linear kernel outperforms Multinomial Naïve Bayes algorithm. Also, using keywords or combining titles and keywords gives better results than using only titles as input. The class imbalance where many DDC classes only have few records greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few records on which to train are available. Proposed future research involves an exploration of the intellectual effort put into creating the DDC to further improve the algorithm performance as commonly applied in string matching, and to test the best approach on new digital collections that do not have DDC assigned.

Please use this URL to cite or link to this publication: https://lup.lub.lu.se/record/bcc9aee9-3f1d-4b6a-9e3a-df80aabe22a5

author

Golub, Koraljka; Hagelbäck, Johan and Ardö, Anders

organization

Department of Electrical and Information Technology

publishing date

2018

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Information Studies

keywords

Automatic classification, Dewey Decimal Classification, LIBRIS, Machine learning, Multinomial Naïve Bayes, Subject access, Support Vector Machine

host publication

Proceedings of the 18th European Networked Knowledge Organization Systems (NKOS) Workshop co-located with the 22nd International Conference on Theory and Practice of Digital Libraries 2018 (TPDL 2018)

series title

CEUR Workshop Proceedings

volume

2200

pages

13 pages

publisher

CEUR-WS

conference name

18th European Networked Knowledge Organization Systems Workshop, NKOS 2018

conference location

Porto, Portugal

conference dates

2018-09-13

external identifiers

scopus:85053933816

ISSN

1613-0073

language

English

LU publication?

yes

id

bcc9aee9-3f1d-4b6a-9e3a-df80aabe22a5

alternative location

http://ceur-ws.org/Vol-2200/paper1.pdf

date added to LUP

2018-10-26 07:44:18

date last changed

2025-10-14 09:27:18

@inproceedings{bcc9aee9-3f1d-4b6a-9e3a-df80aabe22a5,
  abstract     = {{With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of two machine learning algorithms for Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, evaluation shows that Support Vector Machine with linear kernel outperforms Multinomial Naïve Bayes algorithm. Also, using keywords or combining titles and keywords gives better results than using only titles as input. The class imbalance where many DDC classes only have few records greatly affects classification performance: 81.37\% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13\% when few records on which to train are available. Proposed future research involves an exploration of the intellectual effort put into creating the DDC to further improve the algorithm performance as commonly applied in string matching, and to test the best approach on new digital collections that do not have DDC assigned.}},
  author       = {{Golub, Koraljka and Hagelbäck, Johan and Ardö, Anders}},
  booktitle    = {{Proceedings of the 18th European Networked Knowledge Organization Systems (NKOS) Workshop co-located with the 22nd International Conference on Theory and Practice of Digital Libraries 2018 (TPDL 2018)}},
  issn         = {{1613-0073}},
  keywords     = {{Automatic classification; Dewey Decimal Classification; LIBRIS; Machine learning; Multinomial Naïve Bayes; Subject access; Support Vector Machine}},
  language     = {{eng}},
  pages        = {{4--16}},
  publisher    = {{CEUR-WS}},
  series       = {{CEUR Workshop Proceedings}},
  title        = {{Automatic classification using DDC on the Swedish union catalogue}},
  url          = {{http://ceur-ws.org/Vol-2200/paper1.pdf}},
  volume       = {{2200}},
  year         = {{2018}},
}
