Word length, sentence length and frequency : Zipf's law revisited

Sigurd, Bengt; Eeg-Olofsson, Mats; van de Weijer, Joost

(2004) In Studia Linguistica 58(1). p.37-52

Abstract: This paper examines data from English, Swedish and German in order to find a theoretical distribution that describes the observed relation between word length and frequency. In Swedish and English, most word tokens consist of three letters only, while shorter or longer words occur less frequently. We found that the equation with the general form $f_{\mathrm{exp}} = a \cdot L^{b} \cdot c^{L}$ (a variant of the so-called gamma distribution) approximates the observed frequencies reasonably well. This formula incorporates both the fact that the number of possible words increases with word length, and the fact that longer words tend to be avoided, presumably because they are uneconomic. To our knowledge this formula has not been proposed to describe word frequency data. We examined frequency distributions of word length in Swedish and English, and explored different variants of the equation by systematically varying the a, b and c parameters. Subsequently, we also applied the formula to the frequency distribution of sentence length in English, and found an almost perfect fit for a corpus consisting of different text genres. Moreover, the data showed that the formula can be used to distinguish between different kinds of text genres.
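
As an illustration of the formula in the abstract, with b and L read as exponents (f_exp = a * L**b * c**L), the following minimal Python sketch evaluates the curve and locates its peak. The function names and the parameter values are assumptions chosen for illustration only; they are not fitted values from the paper. The values are picked so that the curve peaks at length 3, matching the abstract's observation that three-letter tokens are the most frequent.

```python
import math


def f_exp(L, a, b, c):
    """Expected frequency for length L under f_exp = a * L**b * c**L."""
    return a * (L ** b) * (c ** L)


def peak_length(b, c):
    """Continuous length that maximizes the curve.

    Setting d/dL [b*ln(L) + L*ln(c)] = 0 gives L = -b / ln(c).
    This is a maximum only when b > 0 and 0 < c < 1.
    """
    return -b / math.log(c)


# Hypothetical parameters (NOT taken from the paper): c = 0.5 and
# b = 3 * ln(2), so that the continuous peak falls at L = 3.
a, b, c = 1.0, 3 * math.log(2), 0.5

lengths = range(1, 11)
# Integer length with the highest expected frequency.
most_frequent = max(lengths, key=lambda L: f_exp(L, a, b, c))

for L in lengths:
    print(L, round(f_exp(L, a, b, c), 4))
print("peak at L =", peak_length(b, c), "; most frequent integer length:", most_frequent)
```

In this reading, the L**b factor grows with length (more possible words) while the c**L factor with c < 1 shrinks it (longer words are avoided), so the product rises and then falls around L = -b / ln(c).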

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/161894

author

Sigurd, Bengt; Eeg-Olofsson, Mats; van de Weijer, Joost

organization

General Linguistics

publishing date

2004

type

Contribution to journal

publication status

published

subject

Comparative Language Studies and Linguistics

keywords

length frequency

in

Studia Linguistica

volume

58

issue

1

pages

37 - 52

publisher

Wiley-Blackwell

external identifiers

wos:000220654000003
scopus:13244251170

ISSN

1467-9582

DOI

10.1111/j.0039-3193.2004.00109.x

language

English

LU publication?

yes

additional info

The information about affiliations in this record was updated in December 2015. The record was previously connected to the following departments: Linguistics and Phonetics (015010003)

id

c4d5330e-e441-4776-a7a7-de0418859186 (old id 161894)

date added to LUP

2016-04-01 15:30:50

date last changed

2025-10-14 13:26:00

@article{c4d5330e-e441-4776-a7a7-de0418859186,
  abstract     = {{This paper examines data from English, Swedish and German in order to find a theoretical distribution that describes the observed relation between word length and frequency. In Swedish and English, most word tokens consist of three letters only, while shorter or longer words occur less frequently. We found that the equation with the general form $f_{\mathrm{exp}} = a \cdot L^{b} \cdot c^{L}$ (a variant of the so-called gamma distribution) approximates the observed frequencies reasonably well. This formula incorporates both the fact that the number of possible words increases with word length, and the fact that longer words tend to be avoided, presumably because they are uneconomic. To our knowledge this formula has not been proposed to describe word frequency data. We examined frequency distributions of word length in Swedish and English, and explored different variants of the equation by systematically varying the a, b and c parameters. Subsequently, we also applied the formula to the frequency distribution of sentence length in English, and found an almost perfect fit for a corpus consisting of different text genres. Moreover, the data showed that the formula can be used to distinguish between different kinds of text genres.}},
  author       = {{Sigurd, Bengt and Eeg-Olofsson, Mats and van de Weijer, Joost}},
  issn         = {{1467-9582}},
  keywords     = {{length frequency}},
  language     = {{eng}},
  number       = {{1}},
  pages        = {{37--52}},
  publisher    = {{Wiley-Blackwell}},
  journal      = {{Studia Linguistica}},
  title        = {{Word length, sentence length and frequency : Zipf's law revisited}},
  url          = {{http://dx.doi.org/10.1111/j.0039-3193.2004.00109.x}},
  doi          = {{10.1111/j.0039-3193.2004.00109.x}},
  volume       = {{58}},
  year         = {{2004}},
}

Lund University Publications