Lund University Publications

A Replicated Study on Duplicate Detection: Using Apache Lucene to Search Among Android Defects

Borg, Markus LU; Runeson, Per LU; Johansson, Jens and Mäntylä, Mika (2014) 8th International Symposium on Empirical Software Engineering and Measurement
Abstract
Context: Duplicate detection is a fundamental part of issue management. Systems able to predict whether a new defect report will be closed as a duplicate may decrease costs by limiting rework and collecting related pieces of information. Goal: Our work explores using Apache Lucene for large-scale duplicate detection based on textual content. Also, we evaluate the previous claim that results are improved if the title is weighted as more important than the description. Method: We conduct a conceptual replication of a well-cited study conducted at Sony Ericsson, using Lucene for searching in the public Android defect repository. In line with the original study, we explore how varying the weighting of the title and the description affects the accuracy. Results: We show that Lucene obtains the best results when the defect report title is weighted three times higher than the description, a bigger difference than has been previously acknowledged. Conclusions: Our work shows the potential of using Lucene as a scalable solution for duplicate detection.
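The idea of weighting the title more heavily than the description can be illustrated with a minimal sketch. This is not the authors' implementation and it does not use Lucene or Lucene's scoring formula; it is a toy ranking based on plain term-frequency cosine similarity per field. The function names, the report dictionary layout, and the default weights (3.0 for the title, 1.0 for the description, mirroring the ratio reported in the abstract) are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_score(query, report, title_weight=3.0, description_weight=1.0):
    """Weighted sum of per-field similarities (hypothetical toy scoring,
    not Lucene's): the title similarity counts title_weight times,
    the description similarity counts description_weight times."""
    title_sim = cosine(Counter(tokenize(query["title"])),
                       Counter(tokenize(report["title"])))
    desc_sim = cosine(Counter(tokenize(query["description"])),
                      Counter(tokenize(report["description"])))
    return title_weight * title_sim + description_weight * desc_sim


def rank_candidates(query, reports, **weights):
    """Return the reports sorted from most to least similar to the query."""
    return sorted(reports,
                  key=lambda r: weighted_score(query, r, **weights),
                  reverse=True)
```

With this kind of scoring, a stored report whose title matches a new report outranks one that matches only in the description, unless the weights are changed. Varying `title_weight` and `description_weight` over a grid and measuring how often known duplicates appear near the top of the ranking is one way to explore the weighting question the abstract describes.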
type
Chapter in Book/Report/Conference proceeding
publication status
published
keywords
software evolution, issue management, information retrieval, replication
host publication
[Host publication title missing]
pages
4 pages
conference name
8th International Symposium on Empirical Software Engineering and Measurement
conference location
Turin, Italy
conference dates
2014-09-18
external identifiers
  • scopus:84907853931
DOI
10.1145/2652524.2652556
project
Embedded Applications Software Engineering
language
English
LU publication?
yes
id
ad808def-3b14-48e4-a0b7-cf22dca7f570 (old id 4647034)
date added to LUP
2016-04-04 13:24:40
date last changed
2023-01-06 02:17:03
@inproceedings{ad808def-3b14-48e4-a0b7-cf22dca7f570,
  abstract     = {{Context: Duplicate detection is a fundamental part of issue management. Systems able to predict whether a new defect report will be closed as a duplicate, may decrease costs by limiting rework and collecting related pieces of information. Goal: Our work explores using Apache Lucene for large-scale duplicate detection based on textual content. Also, we evaluate the previous claim that results are improved if the title is weighted as more important than the description. Method: We conduct a conceptual replication of a well-cited study conducted at Sony Ericsson, using Lucene for searching in the public Android defect repository. In line with the original study, we explore how varying the weighting of the title and the description affects the accuracy. Results: We show that Lucene obtains the best results when the defect report title is weighted three times higher than the description, a bigger difference than has been previously acknowledged. Conclusions: Our work shows the potential of using Lucene as a scalable solution for duplicate detection.}},
  author       = {{Borg, Markus and Runeson, Per and Johansson, Jens and Mäntylä, Mika}},
  booktitle    = {{[Host publication title missing]}},
  keywords     = {{software evolution; issue management; information retrieval; replication}},
  language     = {{eng}},
  title        = {{A Replicated Study on Duplicate Detection: Using Apache Lucene to Search Among Android Defects}},
  url          = {{https://lup.lub.lu.se/search/files/6113208/4647035.pdf}},
  doi          = {{10.1145/2652524.2652556}},
  year         = {{2014}},
}