Automated Subject Classification of Textual Documents in the Context of Web-Based Hierarchical Browsing

Golub, Koraljka

Golub, Koraljka (LU) (2007)

Abstract: With the exponential growth of the World Wide Web, automated subject classification has become a major research issue. Organizing web pages into a hierarchical structure for subject browsing has been gaining more recognition as an important tool in information-seeking processes.

The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if they are similar enough to the former. In the thesis, a string-matching algorithm based on a controlled vocabulary was explored. It does not require training documents, but instead reuses the intellectual work invested into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against the text of documents to be classified. Plain string-matching was enhanced in several ways, including term weighting with cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The final results were comparable to those of state-of-the-art machine-learning algorithms, especially for particular classes. Concerning web pages, it was indicated that all the structural information and metadata available in web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important.

In the context of browsing, the biggest difference between the three approaches to automated classification (machine learning, information retrieval, library science) is whether they use controlled vocabularies. It has been claimed that well-structured, high-quality classification schemes, such as those used predominantly in library science approaches, could serve as good browsing structures. In the thesis it was shown that the Dewey Decimal Classification and the Engineering Information classification scheme are suitable for the task. Moreover, a log analysis of a large web-based service using Dewey Decimal Classification demonstrated that browsing is used to a much larger degree than searching.

The final conclusion is that an appropriate controlled vocabulary, with a large number of entry vocabulary designating classes, could be utilised in automated classification. If the same controlled vocabulary has an appropriate hierarchical structure, it could at the same time provide a good browsing structure to the automatically classified collection of documents.
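
The string-matching idea described in the abstract (matching controlled-vocabulary terms against document text, weighting the matches, excluding certain terms, and applying a cut-off) can be illustrated with a minimal sketch. This is not the thesis implementation: the term list, class labels, weights, excluded terms, and cut-off value below are all hypothetical placeholders chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical controlled vocabulary: (term, class label, weight).
# None of these entries or weights are taken from the thesis.
VOCABULARY = [
    ("heat transfer", "CLASS_A", 3.0),
    ("thermodynamics", "CLASS_B", 3.0),
    ("heat", "CLASS_A", 1.0),
    ("bridge", "CLASS_C", 2.0),
]

# Hypothetical list of terms judged too ambiguous to be used for matching.
EXCLUDED_TERMS = {"heat"}


def classify(text, vocabulary=VOCABULARY, excluded=EXCLUDED_TERMS, cutoff=3.0):
    """Return (class label, score) pairs whose score reaches the cut-off."""
    normalized = text.lower()
    scores = defaultdict(float)
    for term, class_label, weight in vocabulary:
        if term in excluded:
            continue  # exclusion of certain terms
        # Whole-word, case-insensitive match of the vocabulary term.
        pattern = r"\b" + re.escape(term) + r"\b"
        occurrences = len(re.findall(pattern, normalized))
        # Term weighting: each occurrence adds the term's weight to its class.
        scores[class_label] += weight * occurrences
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Cut-off: keep only classes with enough accumulated evidence.
    return [(label, score) for label, score in ranked if score >= cutoff]


if __name__ == "__main__":
    sample = "Heat transfer and thermodynamics: heat transfer in bridge decks."
    print(classify(sample))
```

In this sketch a class is assigned only when the summed weights of its matched terms reach the cut-off, so a single low-weight match (here, "bridge") does not by itself trigger a classification.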

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/599083

author

Golub, Koraljka (LU)

supervisor

Anders Ardö (LU)

opponent

Professor Tudhope, Douglas, University of Glamorgan, UK

organization

Department of Electrical and Information Technology

publishing date

2007

type

Thesis

publication status

published

subject

Electrical Engineering, Electronic Engineering, Information Engineering

keywords

Dewey Decimal Classification, Engineering Information, hierarchical browsing, controlled vocabularies, thesauri, classification schemes, Automated classification, subject classification, Artificial intelligence

pages

280 pages

publisher

Electro and information technology

defense location

Room E:1406, E-building, Ole Römers väg 3, Lund University Faculty of Engineering

defense date

2007-11-01 10:15:00

ISBN

91-7167-042-4

language

English

LU publication?

yes

id

0cab375a-0784-491e-a7d9-a803c0ca5984 (old id 599083)

date added to LUP

2016-04-04 11:07:05

date last changed

2025-04-04 14:11:36

@phdthesis{0cab375a-0784-491e-a7d9-a803c0ca5984,
  abstract     = {{With the exponential growth of the World Wide Web, automated subject classification has become a major research issue. Organizing web pages into a hierarchical structure for subject browsing has been gaining more recognition as an important tool in information-seeking processes. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if they are similar enough to the former. In the thesis, a string-matching algorithm based on a controlled vocabulary was explored. It does not require training documents, but instead reuses the intellectual work invested into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against the text of documents to be classified. Plain string-matching was enhanced in several ways, including term weighting with cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The final results were comparable to those of state-of-the-art machine-learning algorithms, especially for particular classes. Concerning web pages, it was indicated that all the structural information and metadata available in web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important. In the context of browsing, the biggest difference between the three approaches to automated classification (machine learning, information retrieval, library science) is whether they use controlled vocabularies. It has been claimed that well-structured, high-quality classification schemes, such as those used predominantly in library science approaches, could serve as good browsing structures. In the thesis it was shown that the Dewey Decimal Classification and the Engineering Information classification scheme are suitable for the task. Moreover, a log analysis of a large web-based service using Dewey Decimal Classification demonstrated that browsing is used to a much larger degree than searching. The final conclusion is that an appropriate controlled vocabulary, with a large number of entry vocabulary designating classes, could be utilised in automated classification. If the same controlled vocabulary has an appropriate hierarchical structure, it could at the same time provide a good browsing structure to the automatically classified collection of documents.}},
  author       = {{Golub, Koraljka}},
  isbn         = {{91-7167-042-4}},
  keywords     = {{Dewey Decimal Classification; Engineering Information; hierarchical browsing; controlled vocabularies; thesauri; classification schemes; Automated classification; subject classification; Artificial intelligence}},
  language     = {{eng}},
  publisher    = {{Electro and information technology}},
  school       = {{Lund University}},
  title        = {{Automated Subject Classification of Textual Documents in the Context of Web-Based Hierarchical Browsing}},
  url          = {{https://lup.lub.lu.se/search/files/5698449/599084.pdf}},
  year         = {{2007}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES
