Text categorization using predicate-argument structures
(2009) Nealt Proceedings Series 4. p. 142–149
- Abstract
- Most text categorization methods use the vector space model in combination with a representation of documents based on bags of words. As its name indicates, the bag-of-words model ignores possible structures in the text and only takes into account isolated, unrelated words. Although this limitation is widely acknowledged, most previous attempts to extend the bag-of-words model with more advanced approaches failed to produce conclusive improvements. We propose a novel method that extends the word-level representation to automatically extracted semantic and syntactic features. We investigated three extensions: word-sense information, subject–verb–object triples, and role-semantic predicate–argument tuples, all fitting within the vector space model. We computed their contribution to the categorization results on the Reuters corpus of newswires (RCV1). We show that these three extensions, either taken individually or in combination, result in statistically significant improvements of the microaverage F1 over a baseline using bags of words. We found that our best extended model, which uses a combination of syntactic and semantic features, reduces the error of the word-level baseline by up to 10 percent for the categories having more than 1,000 documents in the training corpus.
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/1668763
- author
- Persson, Jacob; Johansson, Richard (LU) and Nugues, Pierre (LU)
- organization
- publishing date
- 2009
- type
- Chapter in Book/Report/Conference proceeding
- publication status
- published
- subject
- host publication
- Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA 2009) / Nealt Proceedings Series
- volume
- 4
- pages
- 142–149
- external identifiers
-
- scopus:85121737646
- ISSN
- 1736-6305
- language
- English
- LU publication?
- yes
- id
- c8f432f7-a7c1-40a2-9c52-6f56b1e26c8c (old id 1668763)
- alternative location
- http://dspace.utlib.ee/dspace/bitstream/10062/9746/1/paper6.pdf
- date added to LUP
- 2016-04-04 09:07:57
- date last changed
- 2024-06-24 05:16:01
@inproceedings{c8f432f7-a7c1-40a2-9c52-6f56b1e26c8c,
  abstract  = {{Most text categorization methods use the vector space model in combination with a representation of documents based on bags of words. As its name indicates, the bag-of-words model ignores possible structures in the text and only takes into account isolated, unrelated words. Although this limitation is widely acknowledged, most previous attempts to extend the bag-of-words model with more advanced approaches failed to produce conclusive improvements. We propose a novel method that extends the word-level representation to automatically extracted semantic and syntactic features. We investigated three extensions: word-sense information, subject–verb–object triples, and role-semantic predicate–argument tuples, all fitting within the vector space model. We computed their contribution to the categorization results on the Reuters corpus of newswires (RCV1). We show that these three extensions, either taken individually or in combination, result in statistically significant improvements of the microaverage F1 over a baseline using bags of words. We found that our best extended model, which uses a combination of syntactic and semantic features, reduces the error of the word-level baseline by up to 10 percent for the categories having more than 1,000 documents in the training corpus.}},
  author    = {{Persson, Jacob and Johansson, Richard and Nugues, Pierre}},
  booktitle = {{Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA 2009) / Nealt Proceedings Series}},
  issn      = {{1736-6305}},
  language  = {{eng}},
  pages     = {{142--149}},
  title     = {{Text categorization using predicate-argument structures}},
  url       = {{http://dspace.utlib.ee/dspace/bitstream/10062/9746/1/paper6.pdf}},
  volume    = {{4}},
  year      = {{2009}},
}