Importance of HTML structural elements and metadata in automated subject classification

Golub, Koraljka and Ardö, Anders (2005) 9th European Conference, ECDL 2005, vol. 3652, p. 368-378
Abstract
The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection that was used comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several different methods: (total and partial) precision and recall, semantic distance and multiple regression. It was shown that for best results all the elements have to be included in the classification process. The exact way of combining the significance indicators turned out not to be overly important: using the F1 measure, the best combination of significance indicators yielded no more than 3% higher performance results than the baseline.
author
Golub, Koraljka and Ardö, Anders
organization
publishing date
type
Chapter in Book/Report/Conference proceeding
publication status
published
subject
host publication
Research and advanced technology for digital libraries / Lecture Notes in Computer Science
volume
3652
pages
368 - 378
publisher
Springer
conference name
9th European Conference, ECDL 2005
conference location
Vienna, Austria
conference dates
2005-09-18 - 2005-09-23
external identifiers
  • other:doi:10.1007/3-540-45747-X
  • scopus:33645992424
ISSN
0302-9743
1611-3349
ISBN
3-540-28767-1
DOI
10.1007/3-540-45747-X
language
English
LU publication?
yes
id
638857e7-288d-46b0-a8b1-4cac8476b035 (old id 210418)
alternative location
http://www.eit.lth.se/fileadmin/eit/home/hs.aar/Publ/ECDL2005.pdf
date added to LUP
2016-04-01 11:39:51
date last changed
2021-08-18 01:54:10
@inproceedings{638857e7-288d-46b0-a8b1-4cac8476b035,
  abstract     = {The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection that was used comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several different methods: (total and partial) precision and recall, semantic distance and multiple regression. It was shown that for best results all the elements have to be included in the classification process. The exact way of combining the significance indicators turned out not to be overly important: using the F1 measure, the best combination of significance indicators yielded no more than 3% higher performance results than the baseline.},
  author       = {Golub, Koraljka and Ardö, Anders},
  booktitle    = {Research and advanced technology for digital libraries},
  series       = {Lecture Notes in Computer Science},
  isbn         = {3-540-28767-1},
  issn         = {0302-9743},
  language     = {eng},
  pages        = {368--378},
  publisher    = {Springer},
  title        = {Importance of HTML structural elements and metadata in automated subject classification},
  url          = {http://dx.doi.org/10.1007/3-540-45747-X},
  doi          = {10.1007/3-540-45747-X},
  volume       = {3652},
  year         = {2005},
}