
Automated subject classification of textual web pages, for browsing

Golub, Koraljka (2005)
Abstract
With the exponential growth of the World Wide Web, automated subject classification of Web pages has become a major research issue in information and computer sciences. Organizing Web pages into a hierarchical structure for subject browsing is gaining more recognition as an important tool in information-seeking processes.

In this thesis, different automated classification approaches, focusing on organizing textual Web pages into a browsable hierarchical structure, were critically examined and compared. Three major approaches to automated subject classification have been recognized, each coming from a different research community: machine learning, information retrieval and library science. While these approaches have common research aims and a number of methods and techniques, and as such could benefit from each other, it has been shown that authors belonging to the three communities do not communicate with authors from the other two communities to a large extent. The two biggest differences between the approaches are whether they employ a vector space model (machine learning and information retrieval), and whether they make use of controlled vocabularies such as, for example, classification schemes, thesauri, or ontologies (library science).

Certain special characteristics of Web pages (e.g. metadata and structural elements such as title, headings, main text) were investigated as to how they could be best used in automated classification. The study indicated that all the structural information and metadata available in Web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important.

It has been claimed that well-structured, high-quality controlled vocabularies could serve as good browsing structures. The degree and nature of subject browsing conducted by users of a large Web-based service (Renardus) were studied using log analysis. The study showed that browsing is used to a much larger degree than searching, indicating the usefulness of browsing in such services and possibly implying the suitability of such a controlled vocabulary (Dewey Decimal Classification) for browsing.
author
supervisor
organization
publishing date
type
Thesis
publication status
published
subject
keywords
automated classification, subject browsing, structural Web-page elements, Web page classification, document clustering, bibliographic coupling, text categorization, Subject classification
pages
115 pages
publisher
Digital Information Systems Group, Department of Information Technology, Lund University
ISBN
91-7167-034-3
language
English
LU publication?
yes
id
b5e14c2a-5908-4ad1-9521-d01a7011d55a (old id 530137)
date added to LUP
2007-10-04 08:46:01
date last changed
2016-09-19 08:45:07
@misc{b5e14c2a-5908-4ad1-9521-d01a7011d55a,
  abstract     = {With the exponential growth of the World Wide Web, automated subject classification of Web pages has become a major research issue in information and computer sciences. Organizing Web pages into a hierarchical structure for subject browsing is gaining more recognition as an important tool in information-seeking processes.
In this thesis, different automated classification approaches, focusing on organizing textual Web pages into a browsable hierarchical structure, were critically examined and compared. Three major approaches to automated subject classification have been recognized, each coming from a different research community: machine learning, information retrieval and library science. While these approaches have common research aims and a number of methods and techniques, and as such could benefit from each other, it has been shown that authors belonging to the three communities do not communicate with authors from the other two communities to a large extent. The two biggest differences between the approaches are whether they employ a vector space model (machine learning and information retrieval), and whether they make use of controlled vocabularies such as, for example, classification schemes, thesauri, or ontologies (library science).
Certain special characteristics of Web pages (e.g. metadata and structural elements such as title, headings, main text) were investigated as to how they could be best used in automated classification. The study indicated that all the structural information and metadata available in Web pages should be used in order to achieve the best automated classification results; however, the exact way of combining them proved not to be very important.
It has been claimed that well-structured, high-quality controlled vocabularies could serve as good browsing structures. The degree and nature of subject browsing conducted by users of a large Web-based service (Renardus) were studied using log analysis. The study showed that browsing is used to a much larger degree than searching, indicating the usefulness of browsing in such services and possibly implying the suitability of such a controlled vocabulary (Dewey Decimal Classification) for browsing.},
  author       = {Golub, Koraljka},
  isbn         = {91-7167-034-3},
  keyword      = {automated classification,subject browsing,structural Web-page elements,Web page classification,document clustering,bibliographic coupling,text categorization,Subject classification},
  language     = {eng},
  pages        = {115},
  publisher    = {Digital Information Systems Group, Department of Information Technology, Lund University},
  title        = {Automated subject classification of textual web pages, for browsing},
  year         = {2005},
}