Automatic Classification of Swedish Metadata Using Dewey Decimal Classification : A Comparison of Approaches

Golub, Koraljka; Hagelbäck, Johan; Ardö, Anders

Golub, Koraljka ^LU ; Hagelbäck, Johan and Ardö, Anders ^LU (2020) In Journal of Data and Information Science 5(1). p.18-38

Abstract: With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC. State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records which had to be reduced to top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels). Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. 
Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine but came close, with the benefit of a smaller representation size. Impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available on which to train; these figures hold only for the top three hierarchical levels (803 instead of 14,413 classes). Having to reduce the number of hierarchical levels to the top three levels of DDC because of the lack of training data for all classes skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems. In conclusion, for operative information retrieval systems, applying purely automatic DDC does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using a string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. 
In order for quality information services to reach the objective of highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with quality of human decisions at the final stage should be the way for the future. The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.
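
The data reduction described in the abstract (cutting DDC classes to the top three hierarchical levels and keeping only classes with enough training records) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that the top three levels correspond to the first three digits of a DDC number, and the names `top_three_levels`, `select_trainable`, and `sample` are hypothetical, as are the toy records.

```python
from collections import Counter


def top_three_levels(ddc: str) -> str:
    """Truncate a DDC number to its first three digits.

    Assumption: the first three digits encode the top three
    hierarchical levels (main class, division, section).
    """
    integer_part = ddc.strip().split(".")[0]
    return integer_part[:3]


def select_trainable(records, min_per_class=1000):
    """Keep only records whose truncated class has enough examples.

    `records` is an iterable of (text, ddc_number) pairs. The default
    threshold of 1,000 mirrors the figure quoted in the abstract.
    """
    truncated = [(text, top_three_levels(ddc)) for text, ddc in records]
    counts = Counter(label for _, label in truncated)
    return [(text, label) for text, label in truncated
            if counts[label] >= min_per_class]


# Toy data (hypothetical, not taken from the study's data set).
sample = [
    ("Svensk historia", "948.5"),
    ("Nordens historia", "948.502"),
    ("Maskininlärning", "006.31"),
]

# With a threshold of 2, only the two records in class 948 survive.
trainable = select_trainable(sample, min_per_class=2)
```

Truncating before counting is what makes sparse deep classes usable: records spread thinly over many fine-grained numbers are pooled into one three-digit class, at the cost of the coarser labels the abstract warns about.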

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/25ca0527-90a9-4c7d-9cea-84704f019eaa

author

Golub, Koraljka ^LU ; Hagelbäck, Johan and Ardö, Anders ^LU

organization

Department of Electrical and Information Technology

publishing date

2020-04-22

type

Contribution to journal

publication status

published

subject

Natural Language Processing

keywords

1D convolutional neural network, Automatic classification, Dewey Decimal Classification, LIBRIS, Machine learning, Multinomial Naïve Bayes, Recurrent neural network, Simple linear network, Standard neural network, String matching, Support Vector Machine, Word embeddings

in

Journal of Data and Information Science

volume

5

issue

1

pages

21 pages

publisher

De Gruyter

external identifiers

scopus:85085118766

ISSN

2096-157X

DOI

10.2478/jdis-2020-0003

language

English

LU publication?

yes

id

25ca0527-90a9-4c7d-9cea-84704f019eaa

date added to LUP

2020-06-25 09:19:35

date last changed

2025-04-04 14:57:38

@article{25ca0527-90a9-4c7d-9cea-84704f019eaa,
  abstract     = {{<p>With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC. State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records which had to be reduced to top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels). Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine, but reach close results, with the benefit of a smaller representation size. Impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removed stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. 
The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often less than 100 per class) on which to train are available-and these hold only for top 3 hierarchical levels (803 instead of 14,413 classes). Having to reduce the number of hierarchical levels to top three levels of DDC because of the lack of training data for all classes, skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems. In conclusion, for operative information retrieval systems applying purely automatic DDC does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance accuracy of automatic classification which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with quality of human decisions at the final stage should be the way for the future. The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.</p>}},
  author       = {{Golub, Koraljka and Hagelbäck, Johan and Ardö, Anders}},
  issn         = {{2096-157X}},
  keywords     = {{1D convolutional neural network; Automatic classification; Dewey Decimal Classification; LIBRIS; Machine learning; Multinomial Naïve Bayes; Recurrent neural network; Simple linear network; Standard neural network; String matching; Support Vector Machine; Word embeddings}},
  language     = {{eng}},
  month        = {{04}},
  number       = {{1}},
  pages        = {{18--38}},
  publisher    = {{De Gruyter}},
  series       = {{Journal of Data and Information Science}},
  title        = {{Automatic Classification of Swedish Metadata Using Dewey Decimal Classification : A Comparison of Approaches}},
  url          = {{http://dx.doi.org/10.2478/jdis-2020-0003}},
  doi          = {{10.2478/jdis-2020-0003}},
  volume       = {{5}},
  year         = {{2020}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES
