Can we evaluate the quality of software engineering experiments?

Kitchenham, Barbara; Sjøberg, Dag I. K.; Brereton, O. Pearl; Dybå, Tore; Höst, Martin; Pfahl, Dietmar and Runeson, Per (2010) In ESEM '10: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 1-8
Abstract
Context: The authors wanted to assess whether the quality of published human-centric software engineering experiments was improving. This required a reliable means of assessing the quality of such experiments.

Aims: The aims of the study were to confirm the usability of a quality evaluation checklist, determine how many reviewers were needed per paper that reports an experiment, and specify an appropriate process for evaluating quality.

Method: With eight reviewers and four papers describing human-centric software engineering experiments, we used a quality checklist with nine questions. We conducted the study in two parts: the first was based on individual assessments and the second on collaborative evaluations.

Results: The inter-rater reliability was poor for individual assessments but much better for joint evaluations. Four reviewers working in two pairs with discussion were more reliable than eight reviewers with no discussion. The sum of the nine criteria was more reliable than individual questions or a simple overall assessment.

Conclusions: If quality evaluation is critical, more than two reviewers are required and a round of discussion is necessary. We advise using quality criteria and basing the final assessment on the sum of the aggregated criteria. The restricted number of papers used and the relatively extensive expertise of the reviewers limit our results. In addition, the results of the second part of the study could have been affected by removing a time restriction on the review as well as the consultation process.
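
The aggregation idea described in the abstract, summing nine per-question checklist ratings into one quality score per paper and then comparing reviewers on the summed scores rather than on individual questions, can be illustrated with a short sketch. The following Python sketch is not taken from the paper: the ratings, the reviewer labels, and the use of Spearman rank correlation as the agreement measure are illustrative assumptions only.

# Minimal sketch (not from the paper) of scoring papers by summing nine
# checklist answers and comparing two reviewers on the summed scores.
# All ratings and the agreement measure (Spearman rank correlation) are
# illustrative assumptions.

from statistics import mean

NUM_QUESTIONS = 9  # the checklist used in the study has nine questions

def total_score(ratings):
    """Sum the nine per-question ratings (1 = criterion met, 0 = not met)."""
    assert len(ratings) == NUM_QUESTIONS
    return sum(ratings)

def ranks(values):
    """Average ranks (1-based) for a list of scores; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank correlation between two reviewers' summed scores."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    norm = (sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return cov / norm

# Hypothetical data: two reviewers, four papers, nine yes/no answers each.
reviewer_1 = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1, 0, 0],
]
reviewer_2 = [
    [1, 1, 1, 1, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 0, 0, 0, 0],
]

totals_1 = [total_score(r) for r in reviewer_1]
totals_2 = [total_score(r) for r in reviewer_2]
print("Reviewer 1 totals:", totals_1)  # [7, 4, 8, 3]
print("Reviewer 2 totals:", totals_2)  # [7, 3, 9, 3]
print("Agreement on summed scores:", round(spearman(totals_1, totals_2), 2))
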
author
Kitchenham, Barbara; Sjøberg, Dag I. K.; Brereton, O. Pearl; Dybå, Tore; Höst, Martin; Pfahl, Dietmar and Runeson, Per
organization
publishing date
2010
type
Chapter in Book/Report/Conference proceeding
publication status
published
subject
host publication
ESEM '10: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement
pages
1 - 8
publisher
Association for Computing Machinery (ACM)
conference name
ESEM '10: 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement
conference location
Bolzano, Italy
conference dates
2010-09-16 - 2010-09-17
external identifiers
  • scopus:78149243977
ISBN
978-1-4503-0039-1
DOI
10.1145/1852786.1852789
language
English
LU publication?
yes
id
53991f99-a26e-423f-bbb5-4a903f0e5a26 (old id 1680334)
date added to LUP
2016-04-04 11:22:19
date last changed
2023-02-23 17:21:21
@inproceedings{53991f99-a26e-423f-bbb5-4a903f0e5a26,
  abstract     = {{Context: The authors wanted to assess whether the quality of published human-centric software engineering experiments was improving. This required a reliable means of assessing the quality of such experiments.
Aims: The aims of the study were to confirm the usability of a quality evaluation checklist, determine how many reviewers were needed per paper that reports an experiment, and specify an appropriate process for evaluating quality.
Method: With eight reviewers and four papers describing human-centric software engineering experiments, we used a quality checklist with nine questions. We conducted the study in two parts: the first was based on individual assessments and the second on collaborative evaluations.
Results: The inter-rater reliability was poor for individual assessments but much better for joint evaluations. Four reviewers working in two pairs with discussion were more reliable than eight reviewers with no discussion. The sum of the nine criteria was more reliable than individual questions or a simple overall assessment.
Conclusions: If quality evaluation is critical, more than two reviewers are required and a round of discussion is necessary. We advise using quality criteria and basing the final assessment on the sum of the aggregated criteria. The restricted number of papers used and the relatively extensive expertise of the reviewers limit our results. In addition, the results of the second part of the study could have been affected by removing a time restriction on the review as well as the consultation process.}},
  author       = {{Kitchenham, Barbara and Sjøberg, Dag I. K. and Brereton, O. Pearl and Dybå, Tore and Höst, Martin and Pfahl, Dietmar and Runeson, Per}},
  booktitle    = {{ESEM '10: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement}},
  isbn         = {{978-1-4503-0039-1}},
  language     = {{eng}},
  pages        = {{1--8}},
  publisher    = {{Association for Computing Machinery (ACM)}},
  title        = {{Can we evaluate the quality of software engineering experiments?}},
  url          = {{https://lup.lub.lu.se/search/files/7477441/ESEM2010_5_Kitchenham_Sjoberg_Brereton_Budgen_Dyba_Host_Pfahl_Runeson.pdf}},
  doi          = {{10.1145/1852786.1852789}},
  year         = {{2010}},
}