Statistics for sentential co-occurrence

Willners, Caroline; Holtsberg, Anders

Willners, Caroline and Holtsberg, Anders (2001). In Working Papers, Lund University, Dept. of Linguistics 48.

Abstract: There is a growing trend in linguistics to use large corpora as a tool in the study of language. By investigating the different contexts in which a word occurs, it is possible to gain insight into the meanings associated with the word. Concordances are commonly used as a tool in lexicography, but while the study of concordances is fruitful, it is also tedious, so statistical methods are gaining ground in corpus linguistics. Several statistical measures have been introduced to measure the strength of association between two words, e.g. t-score (Barnbrook 1996:97-98), mutual information, MI (Charniak 1993; McEnery & Wilson 1996; Oakes 1998) and Berry-Rogghe’s z-score (1973). These measures are designed to measure the strength of association between words occurring at a close distance from each other, i.e. immediately next to each other or within a fixed window span. Research that uses the sentence as a linguistic unit of study has also been presented. For example, Justeson & Katz (1991, 1992) and Fellbaum (1995) have shown that antonymous concepts co-occur in the same sentence more often than chance predicts. A problem with using the sentence as the unit of study is that sentence length varies. This affects the statistical calculation: two given words are more likely to be found in a long sentence than in a short one, so the probability of finding them co-occurring in the same sentence is affected. We introduce an exact expression for the calculation of the expected number of sentential co-occurrences. The p-value is calculated assuming that the number of random co-occurrences follows a Poisson distribution. A formal proof justifying this approximation is provided in the appendix. Apart from the statistical methods that account for the variation in sentence length, a case study is presented as an application of the statistical method. The study replicates Justeson and Katz’s 1991 study, which shows that English antonyms co-occur sententially more frequently than chance predicts. The results of our study show that the variation in sentence length increases the chance of co-occurrence of two given words. However, the main finding of Justeson & Katz is reinforced: antonyms co-occur significantly more often in the same sentence than expected by chance.
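The abstract does not reproduce the paper's exact expression, so the following is only a minimal sketch of the general idea, under an assumption of our own: the tokens of the two words are scattered at random over all token positions in the corpus, so the chance that a sentence contains both words follows from hypergeometric "no occurrence" probabilities combined by inclusion-exclusion. The function names `expected_cooccurrences` and `poisson_p_value` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import comb, exp


def expected_cooccurrences(sentence_lengths, n_x, n_y):
    """Expected number of sentences containing both word x and word y.

    Assumption (ours, not necessarily the paper's model): the n_x tokens
    of x and the n_y tokens of y occupy random positions among the N
    token slots of the corpus, where N is the sum of sentence lengths.
    """
    total = sum(sentence_lengths)  # N, corpus size in tokens
    if n_x + n_y > total:
        raise ValueError("word frequencies exceed corpus size")
    expected = 0.0
    # Sentences of equal length share the same probability, so group them.
    for length, count in Counter(sentence_lengths).items():
        denom = comb(total, length)
        p_no_x = comb(total - n_x, length) / denom
        p_no_y = comb(total - n_y, length) / denom
        p_neither = comb(total - n_x - n_y, length) / denom
        # Inclusion-exclusion: P(at least one x AND at least one y).
        expected += count * (1.0 - p_no_x - p_no_y + p_neither)
    return expected


def poisson_p_value(observed, lam):
    """Upper-tail probability P(X >= observed) for X ~ Poisson(lam)."""
    term = exp(-lam)  # P(X = 0)
    cdf_below = 0.0   # accumulates P(X < observed)
    for k in range(observed):
        cdf_below += term
        term *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf_below)


if __name__ == "__main__":
    # Toy corpus: sentence lengths in tokens (made-up illustration data).
    lengths = [5, 10, 20, 40, 25]
    lam = expected_cooccurrences(lengths, n_x=3, n_y=4)
    print("expected co-occurrences:", lam)
    print("p-value for 2 observed:", poisson_p_value(2, lam))
```

Grouping sentences by length keeps the computation cheap, because the probability depends only on a sentence's length and not on its position. In this sketch the longer sentences contribute most of the expected count, which is the sentence-length effect the abstract describes.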

Please use this URL to cite or link to this publication: https://lup.lub.lu.se/record/528668

author

Willners, Caroline and Holtsberg, Anders

organization

General Linguistics

publishing date

2001

type

Working paper/Preprint

publication status

published

subject

Comparative Language Studies and Linguistics

in

Working Papers, Lund University, Dept. of Linguistics

volume

48

language

English

LU publication?

yes

id

5d64372f-4bc3-4d51-89c2-5713c9cf7ae1 (old id 528668)

alternative location

http://www.ling.lu.se/disseminations/pdf/48/Holtsberg_Willners.pdf

date added to LUP

2016-04-04 13:30:10

date last changed

2025-04-04 14:55:37

@misc{5d64372f-4bc3-4d51-89c2-5713c9cf7ae1,
  abstract     = {{There is a growing trend in linguistics to use large corpora as a tool in the study of language. By investigating the different contexts in which a word occurs, it is possible to gain insight into the meanings associated with the word. Concordances are commonly used as a tool in lexicography, but while the study of concordances is fruitful, it is also tedious, so statistical methods are gaining ground in corpus linguistics. Several statistical measures have been introduced to measure the strength of association between two words, e.g. t-score (Barnbrook 1996:97-98), mutual information, MI (Charniak 1993; McEnery \& Wilson 1996; Oakes 1998) and Berry-Rogghe’s z-score (1973). These measures are designed to measure the strength of association between words occurring at a close distance from each other, i.e. immediately next to each other or within a fixed window span. Research that uses the sentence as a linguistic unit of study has also been presented. For example, Justeson \& Katz (1991, 1992) and Fellbaum (1995) have shown that antonymous concepts co-occur in the same sentence more often than chance predicts. A problem with using the sentence as the unit of study is that sentence length varies. This affects the statistical calculation: two given words are more likely to be found in a long sentence than in a short one, so the probability of finding them co-occurring in the same sentence is affected. We introduce an exact expression for the calculation of the expected number of sentential co-occurrences. The p-value is calculated assuming that the number of random co-occurrences follows a Poisson distribution. A formal proof justifying this approximation is provided in the appendix. Apart from the statistical methods that account for the variation in sentence length, a case study is presented as an application of the statistical method. The study replicates Justeson and Katz’s 1991 study, which shows that English antonyms co-occur sententially more frequently than chance predicts. The results of our study show that the variation in sentence length increases the chance of co-occurrence of two given words. However, the main finding of Justeson \& Katz is reinforced: antonyms co-occur significantly more often in the same sentence than expected by chance.}},
  author       = {{Willners, Caroline and Holtsberg, Anders}},
  language     = {{eng}},
  note         = {{Working Paper}},
  series       = {{Working Papers, Lund University, Dept. of Linguistics}},
  title        = {{Statistics for sentential co-occurrence}},
  url          = {{https://lup.lub.lu.se/search/files/6135794/624438.pdf}},
  volume       = {{48}},
  year         = {{2001}},
}

Lund University Publications
