Word length, sentence length and frequency: Zipf's law revisited.
(2004) In Studia Linguistica 58(1). p.37-52
Abstract
 This paper examines data from English, Swedish and German in order to find a theoretical distribution that describes the observed relation between word length and frequency. In Swedish and English, most word tokens consist of three letters only, while shorter or longer words occur less frequently. We found that the equation with the general form $f_{exp} = a \cdot L^{b} \cdot c^{L}$ (a variant of the so-called gamma distribution) approximates the observed frequencies reasonably well. This formula incorporates both the fact that the number of possible words increases with word length, and the fact that longer words tend to be avoided, presumably because they are uneconomic. To our knowledge this formula has not been proposed to describe word frequency data. We examined frequency distributions of word length in Swedish and English, and explored different variants of the equation by systematically varying the a, b and c parameters. Subsequently, we also applied the formula to the frequency distribution of sentence length in English, and found an almost perfect fit for a corpus consisting of different text genres. Moreover, the data showed that the formula can be used to distinguish between different kinds of text genres.
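 As a rough illustration of the general form quoted in the abstract, the sketch below evaluates a * L^b * c^L over a range of lengths. The parameter values are hypothetical and chosen only so that the curve peaks at three letters; they are not the fitted values from the paper, and the function names are our own. For b > 0 and 0 < c < 1, setting the derivative of ln f to zero (b/L + ln c = 0) gives a peak at L = -b / ln c.

```python
import math


def expected_frequency(length, a, b, c):
    """Evaluate the general form a * L**b * c**L for a given length L."""
    return a * length**b * c**length


def peak_length(b, c):
    """Real-valued L that maximizes the curve, assuming b > 0 and 0 < c < 1."""
    return -b / math.log(c)


# Hypothetical parameters for illustration only (not taken from the paper).
A, B, C = 1.0, 3.0, math.exp(-1.0)

# Expected frequency for lengths 1..10 and the integer length where it peaks.
curve = {L: expected_frequency(L, A, B, C) for L in range(1, 11)}
mode = max(curve, key=curve.get)

if __name__ == "__main__":
    for L, f in curve.items():
        print(f"L={L:2d}  f={f:.4f}")
    print("integer mode:", mode, " analytic peak:", peak_length(B, C))
```

 The L^b factor grows with length and the c^L factor (with c below 1) decays with it, which matches the two opposing tendencies the abstract describes.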
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/record/161894
 author
 Sigurd, Bengt ^{LU}; Eeg-Olofsson, Mats ^{LU} and van de Weijer, Joost ^{LU}
 organization
 publishing date
 2004
 type
 Contribution to journal
 publication status
 published
 subject
 keywords
 length frequency
 in
 Studia Linguistica
 volume
 58
 issue
 1
 pages
 37 - 52
 publisher
 Wiley-Blackwell
 external identifiers

 wos:000220654000003
 ISSN
 1467-9582
 DOI
 10.1111/j.0039-3193.2004.00109.x
 language
 English
 LU publication?
 yes
 id
 c4d5330ee4414776a7a7de0418859186 (old id 161894)
 date added to LUP
 2007-07-27 14:58:58
 date last changed
 2016-04-16 03:02:29
@article{c4d5330ee4414776a7a7de0418859186,
  abstract  = {This paper examines data from English, Swedish and German in order to find a theoretical distribution that describes the observed relation between word length and frequency. In Swedish and English, most word tokens consist of three letters only, while shorter or longer words occur less frequently. We found that the equation with the general form $f_{exp} = a \cdot L^{b} \cdot c^{L}$ (a variant of the so-called gamma distribution) approximates the observed frequencies reasonably well. This formula incorporates both the fact that the number of possible words increases with word length, and the fact that longer words tend to be avoided, presumably because they are uneconomic. To our knowledge this formula has not been proposed to describe word frequency data. We examined frequency distributions of word length in Swedish and English, and explored different variants of the equation by systematically varying the a, b and c parameters. Subsequently, we also applied the formula to the frequency distribution of sentence length in English, and found an almost perfect fit for a corpus consisting of different text genres. Moreover, the data showed that the formula can be used to distinguish between different kinds of text genres.},
  author    = {Sigurd, Bengt and Eeg-Olofsson, Mats and van de Weijer, Joost},
  issn      = {1467-9582},
  keyword   = {length frequency},
  language  = {eng},
  number    = {1},
  pages     = {37--52},
  publisher = {Wiley-Blackwell},
  series    = {Studia Linguistica},
  title     = {Word length, sentence length and frequency: Zipf's law revisited.},
  url       = {http://dx.doi.org/10.1111/j.0039-3193.2004.00109.x},
  volume    = {58},
  year      = {2004},
}