Automated subject classification of textual web pages, for browsing

Golub, Koraljka

Golub, Koraljka (2005)

Abstract: With the exponential growth of the World Wide Web, automated subject classification of Web pages has become a major research issue in information and computer sciences. Organizing Web pages into a hierarchical structure for subject browsing is gaining more recognition as an important tool in information-seeking processes.

In this thesis, different automated classification approaches, focusing on organizing textual Web pages into a browsable hierarchical structure, were critically examined and compared. Three major approaches to automated subject classification have been recognized, each coming from a different research community: machine learning, information retrieval and library science. While these approaches have common research aims and a number of methods and techniques, and as such could benefit from each other, it has been shown that authors belonging to the three communities do not communicate with authors from the other two communities to a large extent. The two biggest differences between the approaches are whether they employ a vector space model (machine learning and information retrieval), and whether they make use of controlled vocabularies such as, for example, classification schemes, thesauri, or ontologies (library science).

Certain special characteristics of Web pages (e.g. metadata and structural elements such as title, headings, main text) were investigated as to how they could be best used in automated classification. The study indicated that all the structural information and metadata available in Web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important.

It has been claimed that well-structured, high-quality controlled vocabularies could serve as good browsing structures. The degree and nature of subject browsing conducted by users of a large Web-based service (Renardus) was studied, using log analysis. The study showed that browsing is used to a much larger degree than searching, indicating the usefulness of browsing in such services and possibly implying the suitability of such a controlled vocabulary (Dewey Decimal Classification) for browsing.
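
As an illustration of the abstract's point about using all structural elements and metadata of a Web page together, the following minimal Python sketch merges term counts from several page elements into one weighted term vector. It is not the method used in the thesis: the field names, the weight values, the tokenizer, and the sample page are all assumptions made for this example, and the abstract itself reports that the exact way of combining the elements mattered little.

```python
import re
from collections import Counter

# Hypothetical weights per structural element; the source gives no values.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "metadata": 2.0, "main_text": 1.0}


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters as terms."""
    return re.findall(r"[a-z]+", text.lower())


def weighted_term_vector(page, weights=FIELD_WEIGHTS):
    """Sum term counts over all listed elements, scaled by each element's weight."""
    vector = Counter()
    for field, weight in weights.items():
        # Missing elements contribute nothing.
        for term in tokenize(page.get(field, "")):
            vector[term] += weight
    return vector


# Invented sample page, used only to exercise the function.
page = {
    "title": "Bridge design",
    "headings": "Steel bridge loads",
    "metadata": "civil engineering, bridge",
    "main_text": "This page discusses loads on a steel bridge.",
}
vector = weighted_term_vector(page)
print(vector.most_common(3))
```

A vector like this could then feed either a vector space classifier or a term match against a controlled vocabulary; which of the two is used is outside the scope of this sketch.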

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/530137

author

Golub, Koraljka

supervisor

Anders Ardö
Traugott Koch

organization

Department of Electrical and Information Technology

publishing date

2005

type

Thesis

publication status

published

subject

Electrical Engineering, Electronic Engineering, Information Engineering

keywords

automated classification, subject browsing, structural Web-page elements, Web page classification, document clustering, bibliographic coupling, text categorization, Subject classification

pages

115 pages

publisher

Digital Information Systems Group, Department of Information Technology, Lund University

ISBN

91-7167-034-3

language

English

LU publication?

yes

id

b5e14c2a-5908-4ad1-9521-d01a7011d55a (old id 530137)

date added to LUP

2016-04-04 11:01:08

date last changed

2025-04-04 14:08:43

@misc{b5e14c2a-5908-4ad1-9521-d01a7011d55a,
  abstract     = {{With the exponential growth of the World Wide Web, automated subject classification of Web pages has become a major research issue in information and computer sciences. Organizing Web pages into a hierarchical structure for subject browsing is gaining more recognition as an important tool in information-seeking processes.
In this thesis, different automated classification approaches, focusing on organizing textual Web pages into a browsable hierarchical structure, were critically examined and compared. Three major approaches to automated subject classification have been recognized, each coming from a different research community: machine learning, information retrieval and library science. While these approaches have common research aims and a number of methods and techniques, and as such could benefit from each other, it has been shown that authors belonging to the three communities do not communicate with authors from the other two communities to a large extent. The two biggest differences between the approaches are whether they employ a vector space model (machine learning and information retrieval), and whether they make use of controlled vocabularies such as, for example, classification schemes, thesauri, or ontologies (library science).
Certain special characteristics of Web pages (e.g. metadata and structural elements such as title, headings, main text) were investigated as to how they could be best used in automated classification. The study indicated that all the structural information and metadata available in Web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important.
It has been claimed that well-structured, high-quality controlled vocabularies could serve as good browsing structures. The degree and nature of subject browsing conducted by users of a large Web-based service (Renardus) was studied, using log analysis. The study showed that browsing is used to a much larger degree than searching, indicating the usefulness of browsing in such services and possibly implying the suitability of such a controlled vocabulary (Dewey Decimal Classification) for browsing.}},
  author       = {{Golub, Koraljka}},
  isbn         = {{91-7167-034-3}},
  keywords     = {{automated classification; subject browsing; structural Web-page elements; Web page classification; document clustering; bibliographic coupling; text categorization; Subject classification}},
  language     = {{eng}},
  note         = {{Licentiate Thesis}},
  publisher    = {{Digital Information Systems Group, Department of Information Technology, Lund University}},
  title        = {{Automated subject classification of textual web pages, for browsing}},
  url          = {{https://lup.lub.lu.se/search/files/5675430/624829.pdf}},
  year         = {{2005}},
}

Lund University Publications
