
Three Empirical Studies on the Agreement of Reviewers about the Quality of Software Engineering Experiments

Kitchenham, Barbara; Sjøberg, Dag; Dybå, Tore; Pfahl, Dietmar (LU); Brereton, Pearl; Budgen, David; Höst, Martin (LU); and Runeson, Per (LU) (2012). In Information and Software Technology 54(8), p. 804–819
Abstract
Context: During systematic literature reviews it is necessary to assess the quality of empirical papers. Current guidelines suggest that two researchers should independently apply a quality checklist and any disagreements must be resolved. However, there is little empirical evidence concerning the effectiveness of these guidelines. Aims: This paper investigates three techniques that can be used to improve the reliability (i.e., the consensus among reviewers) of quality assessments, specifically: the number of reviewers, the use of a set of evaluation criteria, and consultation among reviewers. We undertook a series of studies to investigate these factors. Method: Two studies involved four research papers and eight reviewers using a quality checklist with nine questions. The first study was based on individual assessments, the second on joint assessments with a period of inter-rater discussion. A third, more formal, randomised block experiment involved 48 reviewers assessing two of the papers used previously in teams of one, two, and three persons, to assess the impact of discussion among teams of different sizes, using the evaluations of the "teams" of one person as a control. Results: For the first two studies, the inter-rater reliability was poor for individual assessments, but better for joint evaluations. However, the results of the third study contradicted the results of study 2: inter-rater reliability was poor for all groups, but worse for teams of two or three than for individuals. Conclusions: When performing quality assessments for systematic literature reviews, we recommend using three independent reviewers and adopting the median assessment. A quality checklist seems useful, but it is difficult to ensure that the checklist is both appropriate and understood by reviewers. Furthermore, future experiments should ensure participants are given more time to understand the quality checklist and to evaluate the research papers.
author: Kitchenham, Barbara; Sjøberg, Dag; Dybå, Tore; Pfahl, Dietmar; Brereton, Pearl; Budgen, David; Höst, Martin; Runeson, Per
organization:
publishing date: 2012
type: Contribution to journal
publication status: published
subject:
in: Information and Software Technology
volume: 54
issue: 8
pages: 804–819
publisher: Elsevier
external identifiers:
  • wos:000305599200002
  • scopus:84861576818
ISSN: 0950-5849
DOI: 10.1016/j.infsof.2011.11.008
language: English
LU publication?: yes
id: 054dbaab-5ebb-4dc0-b532-11282a09f7f3 (old id 2518839)
date added to LUP: 2012-05-02 14:36:54
date last changed: 2017-05-14 03:48:50
@article{054dbaab-5ebb-4dc0-b532-11282a09f7f3,
  abstract     = {Context: During systematic literature reviews it is necessary to assess the quality of empirical papers. Current guidelines suggest that two researchers should independently apply a quality checklist and any disagreements must be resolved. However, there is little empirical evidence concerning the effectiveness of these guidelines. Aims: This paper investigates three techniques that can be used to improve the reliability (i.e., the consensus among reviewers) of quality assessments, specifically: the number of reviewers, the use of a set of evaluation criteria, and consultation among reviewers. We undertook a series of studies to investigate these factors. Method: Two studies involved four research papers and eight reviewers using a quality checklist with nine questions. The first study was based on individual assessments, the second on joint assessments with a period of inter-rater discussion. A third, more formal, randomised block experiment involved 48 reviewers assessing two of the papers used previously in teams of one, two, and three persons, to assess the impact of discussion among teams of different sizes, using the evaluations of the "teams" of one person as a control. Results: For the first two studies, the inter-rater reliability was poor for individual assessments, but better for joint evaluations. However, the results of the third study contradicted the results of study 2: inter-rater reliability was poor for all groups, but worse for teams of two or three than for individuals. Conclusions: When performing quality assessments for systematic literature reviews, we recommend using three independent reviewers and adopting the median assessment. A quality checklist seems useful, but it is difficult to ensure that the checklist is both appropriate and understood by reviewers. Furthermore, future experiments should ensure participants are given more time to understand the quality checklist and to evaluate the research papers.},
  author       = {Kitchenham, Barbara and Sjøberg, Dag and Dybå, Tore and Pfahl, Dietmar and Brereton, Pearl and Budgen, David and Höst, Martin and Runeson, Per},
  issn         = {0950-5849},
  language     = {eng},
  number       = {8},
  pages        = {804--819},
  publisher    = {Elsevier},
  series       = {Information and Software Technology},
  title        = {Three Empirical Studies on the Agreement of Reviewers about the Quality of Software Engineering Experiments},
  url          = {http://dx.doi.org/10.1016/j.infsof.2011.11.008},
  volume       = {54},
  year         = {2012},
}