Using Confidence Intervals to Determine Adequate Item Sample Sizes for Vocabulary Tests : An Essential but Overlooked Practice

Gyllstad, Henrik; McLean, Stuart; Stewart, Jeffrey

Gyllstad, Henrik; McLean, Stuart and Stewart, Jeffrey (2021). In Language Testing 38(4), pp. 558-579

Abstract: The last three decades have seen an increase of tests aimed at measuring an individual’s vocabulary level or size. The target words used in these tests are typically sampled from word frequency lists, which are in turn based on language corpora. Conventionally, test developers sample items from frequency bands of 1000 words; different tests employ different sampling ratios. Some have as few as 5 or 10 items representing the underlying population of words, whereas other tests feature a larger number of items, such as 24, 30, or 40. However, very rarely are the sampling size choices supported by clear empirical evidence. Here, using a bootstrapping approach, we illustrate the effect that a sample-size increase has on confidence intervals of individual learner vocabulary knowledge estimates, and on the inferences that can safely be made from test scores. We draw on a unique dataset consisting of adult L1 Japanese test takers’ performance on two English vocabulary test formats, each featuring 1000 words. Our analysis shows that there are few purposes and settings where as few as 5 to 10 sampled items from a 1000-word frequency band (1K) are sufficient. The use of 30 or more items per 1000-word frequency band and tests consisting of fewer bands is recommended.
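
To make the abstract's central idea concrete, the following is a minimal Python sketch of how the interval around a single learner's score can narrow as more items are sampled from a 1000-word band. It is not the authors' procedure or data: the learner (600 of 1000 words known), the function name `resampled_interval`, the number of draws, and the percentile-interval method are all assumptions made for illustration.

```python
import random


def resampled_interval(responses, n_items, n_draws=2000, level=0.95, seed=0):
    """Percentile interval for the proportion-correct estimate.

    Repeatedly draws n_items items (without replacement) from the full
    set of responses and records the proportion correct on each draw.
    """
    rng = random.Random(seed)
    estimates = sorted(
        sum(rng.sample(responses, n_items)) / n_items
        for _ in range(n_draws)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_draws - 1))]
    upper = estimates[int((1.0 - tail) * (n_draws - 1))]
    return lower, upper


# Hypothetical learner (made-up data): knows 600 of the 1000 words in a band.
responses = [1] * 600 + [0] * 400

for n_items in (5, 10, 30, 100):
    lo, hi = resampled_interval(responses, n_items)
    print(f"{n_items:>3} items: {lo:.2f} to {hi:.2f} (width {hi - lo:.2f})")
```

Under these assumptions, the printed widths give a rough sense of how much a score based on 5 or 10 items can vary compared with one based on 30 or more items.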

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/ee7facac-c682-4041-b4ec-0e5083f99fa1

author

Gyllstad, Henrik; McLean, Stuart and Stewart, Jeffrey

organization

publishing date

2021-10-01

type

Contribution to journal

publication status

published

subject

Comparative Language Studies and Linguistics

keywords

Assessment, bootstrapping, confidence intervals, statistics, testing, validity, vocabulary

in

Language Testing

volume

38

issue

4

pages

21 pages

publisher

SAGE Publications

external identifiers

scopus:85097990311

ISSN

1477-0946

DOI

10.1177/0265532220979562

language

English

LU publication?

yes

id

ee7facac-c682-4041-b4ec-0e5083f99fa1

date added to LUP

2020-10-26 12:55:34

date last changed

2025-10-31 22:05:49

@article{ee7facac-c682-4041-b4ec-0e5083f99fa1,
  abstract     = {{The last three decades have seen an increase of tests aimed at measuring an individual’s vocabulary level or size. The target words used in these tests are typically sampled from word frequency lists, which are in turn based on language corpora. Conventionally, test developers sample items from frequency bands of 1000 words; different tests employ different sampling ratios. Some have as few as 5 or 10 items representing the underlying population of words, whereas other tests feature a larger number of items, such as 24, 30, or 40. However, very rarely are the sampling size choices supported by clear empirical evidence. Here, using a bootstrapping approach, we illustrate the effect that a sample-size increase has on confidence intervals of individual learner vocabulary knowledge estimates, and on the inferences that can safely be made from test scores. We draw on a unique dataset consisting of adult L1 Japanese test takers’ performance on two English vocabulary test formats, each featuring 1000 words. Our analysis shows that there are few purposes and settings where as few as 5 to 10 sampled items from a 1000-word frequency band (1K) are sufficient. The use of 30 or more items per 1000-word frequency band and tests consisting of fewer bands is recommended.}},
  author       = {{Gyllstad, Henrik and McLean, Stuart and Stewart, Jeffrey}},
  issn         = {{1477-0946}},
  keywords     = {{Assessment; bootstrapping; confidence intervals; statistics; testing; validity; vocabulary}},
  language     = {{eng}},
  month        = {{10}},
  number       = {{4}},
  pages        = {{558--579}},
  publisher    = {{SAGE Publications}},
  journal      = {{Language Testing}},
  title        = {{Using Confidence Intervals to Determine Adequate Item Sample Sizes for Vocabulary Tests : An Essential but Overlooked Practice}},
  url          = {{http://dx.doi.org/10.1177/0265532220979562}},
  doi          = {{10.1177/0265532220979562}},
  volume       = {{38}},
  year         = {{2021}},
}