Statistics for sentential co-occurrence

Willners, Caroline LU and Holtsberg, Anders (2001) In Working Papers, Lund University, Dept. of Linguistics 48.
Abstract
There is a growing trend in linguistics to use large corpora as a tool in the study of language. By investigating the different contexts a word occurs in, it is possible to gain insight into the meanings associated with the word. Concordances are commonly used as a tool in lexicography, but while the study of concordances is fruitful it is also tedious, so statistical methods are gaining ground in corpus linguistics. Several statistical measures have been introduced to measure the strength of association between two words, e.g. t-score (Barnbrook 1996:97-98), mutual information, MI (Charniak 1993; McEnery & Wilson 1996; Oakes 1998) and Berry-Rogghe’s z-score (1973). These measures are designed to measure the strength of association between words occurring at a close distance from each other, i.e. immediately next to each other or within a fixed window span. Research that uses the sentence as the linguistic unit of study has also been presented. For example, Justeson & Katz (1991, 1992) and Fellbaum (1995) have shown that antonymous concepts co-occur in the same sentence more often than chance predicts. A problem with using the sentence as the unit of study is that sentence length varies from sentence to sentence. This affects the statistical calculation: it is more likely to find two given words in a long sentence than in a short one, so the probability of finding two given words co-occurring in the same sentence is affected. We introduce an exact expression for the calculation of the expected number of sentential co-occurrences. The p-value is calculated assuming that the number of random co-occurrences follows a Poisson distribution. A formal proof justifying this approximation is provided in the appendix. Apart from the statistical methods that account for the variation in sentence length, a case study is presented as an application of the statistical method. The study replicates Justeson and Katz’s 1991 study, which shows that English antonyms co-occur sententially more frequently than chance predicts. The results of our study show that the variation in sentence length increases the chance that two given words co-occur. However, the main finding of Justeson & Katz is reinforced: antonyms co-occur significantly more often in the same sentence than expected by chance.
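The idea in the abstract can be illustrated with a small sketch. The abstract does not reproduce the paper's exact expression, so the model below is an assumption made for illustration only: the tokens of two words are placed uniformly at random, without replacement, among all token positions in the corpus, and the expected number of sentences containing both words is obtained by inclusion-exclusion. The function names `prob_no_hits`, `expected_cooccurrences` and `poisson_p_value` are hypothetical, and the p-value is the upper tail of a Poisson distribution, as the abstract describes.

```python
import math


def prob_no_hits(total_tokens, sentence_len, k):
    """Probability that none of k randomly placed tokens fall inside a
    sentence of the given length (positions drawn without replacement)."""
    outside = total_tokens - sentence_len
    if k > outside:
        return 0.0  # more tokens than positions outside the sentence
    p = 1.0
    for i in range(k):
        p *= (outside - i) / (total_tokens - i)
    return p


def expected_cooccurrences(sentence_lengths, count_x, count_y):
    """Expected number of sentences containing both words under the
    random-placement model (an assumption, not the paper's formula)."""
    total = sum(sentence_lengths)
    expected = 0.0
    for length in sentence_lengths:
        p_no_x = prob_no_hits(total, length, count_x)
        p_no_y = prob_no_hits(total, length, count_y)
        p_neither = prob_no_hits(total, length, count_x + count_y)
        # Inclusion-exclusion: P(at least one X and at least one Y).
        expected += 1.0 - p_no_x - p_no_y + p_neither
    return expected


def poisson_p_value(observed, mean):
    """P(K >= observed) for K ~ Poisson(mean)."""
    term = math.exp(-mean)  # P(K = 0)
    below = 0.0
    for k in range(observed):
        below += term
        term *= mean / (k + 1)  # P(K = k + 1) from P(K = k)
    return max(0.0, 1.0 - below)
```

In a toy corpus of four tokens with one token of each word, `expected_cooccurrences([2, 2], 1, 1)` gives 1/3, while `expected_cooccurrences([1, 3], 1, 1)` gives 0.5. Under this assumed model, uneven sentence lengths raise the chance of co-occurrence, which is the direction of the effect the abstract reports.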
author
Willners, Caroline and Holtsberg, Anders
publishing date
2001
type
Working Paper
publication status
published
in
Working Papers, Lund University, Dept. of Linguistics
volume
48
language
English
LU publication?
yes
id
5d64372f-4bc3-4d51-89c2-5713c9cf7ae1 (old id 528668)
alternative location
http://www.ling.lu.se/disseminations/pdf/48/Holtsberg_Willners.pdf
date added to LUP
2007-09-28 07:45:54
date last changed
2016-04-16 11:17:29
@misc{5d64372f-4bc3-4d51-89c2-5713c9cf7ae1,
  abstract     = {There is a growing trend in linguistics to use large corpora as a tool in the study of language. By investigating the different contexts a word occurs in, it is possible to gain insight into the meanings associated with the word. Concordances are commonly used as a tool in lexicography, but while the study of concordances is fruitful it is also tedious, so statistical methods are gaining ground in corpus linguistics. Several statistical measures have been introduced to measure the strength of association between two words, e.g. t-score (Barnbrook 1996:97-98), mutual information, MI (Charniak 1993; McEnery \& Wilson 1996; Oakes 1998) and Berry-Rogghe’s z-score (1973). These measures are designed to measure the strength of association between words occurring at a close distance from each other, i.e. immediately next to each other or within a fixed window span. Research that uses the sentence as the linguistic unit of study has also been presented. For example, Justeson \& Katz (1991, 1992) and Fellbaum (1995) have shown that antonymous concepts co-occur in the same sentence more often than chance predicts. A problem with using the sentence as the unit of study is that sentence length varies from sentence to sentence. This affects the statistical calculation: it is more likely to find two given words in a long sentence than in a short one, so the probability of finding two given words co-occurring in the same sentence is affected. We introduce an exact expression for the calculation of the expected number of sentential co-occurrences. The p-value is calculated assuming that the number of random co-occurrences follows a Poisson distribution. A formal proof justifying this approximation is provided in the appendix. Apart from the statistical methods that account for the variation in sentence length, a case study is presented as an application of the statistical method. The study replicates Justeson and Katz’s 1991 study, which shows that English antonyms co-occur sententially more frequently than chance predicts. The results of our study show that the variation in sentence length increases the chance that two given words co-occur. However, the main finding of Justeson \& Katz is reinforced: antonyms co-occur significantly more often in the same sentence than expected by chance.},
  author       = {Willners, Caroline and Holtsberg, Anders},
  language     = {eng},
  series       = {Working Papers, Lund University, Dept. of Linguistics},
  title        = {Statistics for sentential co-occurrence},
  volume       = {48},
  year         = {2001},
}