Automatic classification using DDC on the Swedish union catalogue
(2018) 18th European Networked Knowledge Organization Systems Workshop, NKOS 2018 In CEUR Workshop Proceedings 2200. p.4-16
- Abstract
With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of two machine learning algorithms for Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, evaluation shows that Support Vector Machine with linear kernel outperforms Multinomial Naïve Bayes algorithm. Also, using keywords or combining titles and keywords gives better results than using only titles as input. The class imbalance where many DDC classes only have few records greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few records on which to train are available. Proposed future research involves an exploration of the intellectual effort put into creating the DDC to further improve the algorithm performance as commonly applied in string matching, and to test the best approach on new digital collections that do not have DDC assigned.
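The abstract mentions evaluating on the top three hierarchical levels of the DDC and on subsets with a minimum number of records per class. As a minimal sketch of what that preprocessing could look like, the Python below derives the three levels (main class, division, section) from a DDC number and filters out sparsely populated classes. The function names `ddc_levels` and `keep_frequent_classes` and the sample records are hypothetical illustrations; the record does not state how the authors implemented these steps.

```python
from collections import Counter


def ddc_levels(ddc: str) -> tuple[str, str, str]:
    """Return (main class, division, section) for a DDC number such as '025.431'.

    Assumes the standard notation: three digits before an optional decimal point.
    """
    digits = ddc.split(".")[0].strip()
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected three digits before the decimal point: {ddc!r}")
    # First digit -> main class, first two digits -> division, all three -> section.
    return digits[0] + "00", digits[:2] + "0", digits


def keep_frequent_classes(records, min_records):
    """Keep only (text, label) pairs whose label occurs at least min_records times."""
    counts = Counter(label for _, label in records)
    return [(text, label) for text, label in records if counts[label] >= min_records]


# Hypothetical sample records: (title, DDC number).
sample = [
    ("Library classification", "025.431"),
    ("Cataloguing rules", "025.32"),
    ("Swedish history", "948.5"),
]

# Reduce each DDC number to its three-digit section, then drop rare sections.
by_section = [(title, ddc_levels(ddc)[2]) for title, ddc in sample]
frequent = keep_frequent_classes(by_section, min_records=2)
```

With a real data set, `min_records` would be set to a threshold such as the 1,000 records per class mentioned in the abstract; the value 2 here only suits the toy sample.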
- author
- Golub, Koraljka LU ; Hagelbäck, Johan and Ardö, Anders LU
- organization
- publishing date
- 2018
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- subject
- keywords
- Automatic classification, Dewey Decimal Classification, LIBRIS, Machine learning, Multinomial Naïve Bayes, Subject access, Support Vector Machine
- host publication
- Proceedings of the 18th European Networked Knowledge Organization Systems (NKOS) Workshop co-located with the 22nd International Conference on Theory and Practice of Digital Libraries 2018 (TPDL 2018)
- series title
- CEUR Workshop Proceedings
- volume
- 2200
- pages
- 13 pages
- publisher
- CEUR-WS
- conference name
- 18th European Networked Knowledge Organization Systems Workshop, NKOS 2018
- conference location
- Porto, Portugal
- conference dates
- 2018-09-13
- external identifiers
- scopus:85053933816
- ISSN
- 1613-0073
- language
- English
- LU publication?
- yes
- id
- bcc9aee9-3f1d-4b6a-9e3a-df80aabe22a5
- alternative location
- http://ceur-ws.org/Vol-2200/paper1.pdf
- date added to LUP
- 2018-10-26 07:44:18
- date last changed
- 2022-04-25 18:30:53
@inproceedings{bcc9aee9-3f1d-4b6a-9e3a-df80aabe22a5,
  abstract = {{With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of two machine learning algorithms for Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, evaluation shows that Support Vector Machine with linear kernel outperforms Multinomial Naïve Bayes algorithm. Also, using keywords or combining titles and keywords gives better results than using only titles as input. The class imbalance where many DDC classes only have few records greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few records on which to train are available. Proposed future research involves an exploration of the intellectual effort put into creating the DDC to further improve the algorithm performance as commonly applied in string matching, and to test the best approach on new digital collections that do not have DDC assigned.}},
  author = {{Golub, Koraljka and Hagelbäck, Johan and Ardö, Anders}},
  booktitle = {{Proceedings of the 18th European Networked Knowledge Organization Systems (NKOS) Workshop co-located with the 22nd International Conference on Theory and Practice of Digital Libraries 2018 (TPDL 2018)}},
  issn = {{1613-0073}},
  keywords = {{Automatic classification; Dewey Decimal Classification; LIBRIS; Machine learning; Multinomial Naïve Bayes; Subject access; Support Vector Machine}},
  language = {{eng}},
  pages = {{4--16}},
  publisher = {{CEUR-WS}},
  series = {{CEUR Workshop Proceedings}},
  title = {{Automatic classification using DDC on the Swedish union catalogue}},
  url = {{http://ceur-ws.org/Vol-2200/paper1.pdf}},
  volume = {{2200}},
  year = {{2018}},
}