The role of different thesauri terms and captions in automated subject classification

Golub, Koraljka (2006). 2006 IEEE/WIC/ACM International Conference on Web Intelligence. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, p. 961-965
Abstract
The paper aims to explore to what degree different types of terms in the Engineering Information (Ei) thesaurus and classification scheme influence automated subject classification performance. Preferred terms, their synonyms, broader, narrower, and related terms, and captions are examined in combination with a stemmer and a stop-word list. The algorithm comprises string-to-string matching between words in the documents to be classified and words in term lists derived from the Ei thesaurus and classification scheme. The data collection for evaluation consists of some 35,000 scientific paper abstracts from the Compendex database. A subset of the Ei thesaurus and classification scheme is used, comprising 92 classes at up to five hierarchical levels from general engineering. The results show that preferred terms perform best, whereas captions perform worst. Stemming improves performance in most cases, whereas the stop-word list does not have a significant impact.
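The matching approach described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the term lists, class labels (`class_A`, `class_B`), stop-word list, and the `crude_stem` suffix stripper are all hypothetical stand-ins, and scoring by raw match count is an assumption made only for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list used in the paper is not given here.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "for", "is", "are"}

# Hypothetical term lists keyed by invented class labels. In the paper these
# are derived from the Ei thesaurus and classification scheme.
TERM_LISTS = {
    "class_A": ["heat transfer", "thermal conductivity"],
    "class_B": ["fluid flow", "turbulence"],
}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, use_stemming=True, use_stop_words=True):
    """Lowercase, tokenize, and optionally drop stop words and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if use_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def count_phrase(phrase_tokens, doc_tokens):
    """Count occurrences of a token sequence inside the document tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return 0
    return sum(
        1
        for i in range(len(doc_tokens) - n + 1)
        if doc_tokens[i : i + n] == phrase_tokens
    )


def classify(text, term_lists=TERM_LISTS, **options):
    """Score each class by the number of term matches found in the text."""
    doc_tokens = normalize(text, **options)
    scores = Counter()
    for label, terms in term_lists.items():
        for term in terms:
            scores[label] += count_phrase(normalize(term, **options), doc_tokens)
    # Keep only classes with at least one match, best score first.
    return [(label, s) for label, s in scores.most_common() if s > 0]
```

Calling `classify(text, use_stemming=False)` or `classify(text, use_stop_words=False)` switches the corresponding preprocessing step off, which mirrors the kind of comparison the abstract reports.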
author
Golub, Koraljka
publishing date
2006
type
Chapter in Book/Report/Conference proceeding
publication status
published
keywords
thesauri term, automated subject classification, engineering information, string-to-string matching, document classification, compendex database, data collection
in
Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
pages
5 pages
publisher
IEEE - Institute of Electrical and Electronics Engineers Inc.
conference name
2006 IEEE/WIC/ACM International Conference on Web Intelligence
external identifiers
  • WOS:000245469500168
  • Scopus:42549163204
ISBN
0-7695-2747-7
DOI
10.1109/WI.2006.169
project
DISKA/DO:PING: ALVIS
language
English
LU publication?
yes
id
d0aaf5e3-2bc8-4a02-b519-75040f8d8788 (old id 617040)
date added to LUP
2007-11-24 09:58:43
date last changed
2016-10-13 04:49:02
@inproceedings{d0aaf5e3-2bc8-4a02-b519-75040f8d8788,
  abstract     = {The paper aims to explore to what degree different types of terms in the Engineering Information (Ei) thesaurus and classification scheme influence automated subject classification performance. Preferred terms, their synonyms, broader, narrower, and related terms, and captions are examined in combination with a stemmer and a stop-word list. The algorithm comprises string-to-string matching between words in the documents to be classified and words in term lists derived from the Ei thesaurus and classification scheme. The data collection for evaluation consists of some 35,000 scientific paper abstracts from the Compendex database. A subset of the Ei thesaurus and classification scheme is used, comprising 92 classes at up to five hierarchical levels from general engineering. The results show that preferred terms perform best, whereas captions perform worst. Stemming improves performance in most cases, whereas the stop-word list does not have a significant impact.},
  author       = {Golub, Koraljka},
  booktitle    = {Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence},
  doi          = {10.1109/WI.2006.169},
  isbn         = {0-7695-2747-7},
  keyword      = {thesauri term,automated subject classification,engineering information,string-to-string matching,document classification,compendex database,data collection},
  language     = {eng},
  pages        = {961--965},
  publisher    = {IEEE - Institute of Electrical and Electronics Engineers Inc.},
  title        = {The role of different thesauri terms and captions in automated subject classification},
  url          = {http://dx.doi.org/10.1109/WI.2006.169},
  year         = {2006},
}