SEDGE: Symbolic example data generation for dataflow programs

Li, Kaituo; Reichenbach, Christoph; Smaragdakis, Yannis; Diao, Yanlei; Csallner, Christoph

(2013) 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013, p. 235-245

Abstract: Exhaustive, automatic testing of dataflow (esp. mapreduce) programs has emerged as an important challenge. Past work demonstrated effective ways to generate small example data sets that exercise operators in the Pig platform, used to generate Hadoop map-reduce programs. Although such prior techniques attempt to cover all cases of operator use, in practice they often fail. Our SEDGE system addresses these completeness problems: for every dataflow operator, we produce data aiming to cover all cases that arise in the dataflow program (e.g., both passing and failing a filter). SEDGE relies on transforming the program into symbolic constraints, and solving the constraints using a symbolic reasoning engine (a powerful SMT solver), while using input data as concrete aids in the solution process. The approach resembles dynamic-symbolic (a.k.a. 'concolic') execution in a conventional programming language, adapted to the unique features of the dataflow domain. In third-party benchmarks, SEDGE achieves higher coverage than past techniques for 5 out of 20 PigMix benchmarks and 7 out of 11 SDSS benchmarks (with equal coverage for the rest of the benchmarks).
We also show that our targeting of the high-level dataflow language pays off: for complex programs, state-of-the-art dynamic-symbolic execution at the level of the generated map-reduce code (instead of the original dataflow program) requires many more test cases or achieves much lower coverage than our approach.
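As a rough illustration of the idea in the abstract (covering both the passing and the failing case of a filter, with existing input data used as a concrete aid), the following Python sketch may help. It is not SEDGE's implementation: the `Filter` class, the `solve` and `example_rows` helpers, and the field names are all hypothetical, and a trivial hand-written search over integer values stands in for the SMT solver the paper uses.

```python
from dataclasses import dataclass
import operator

# Hypothetical, simplified model: a filter compares one integer field to a constant.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}
NEGATE = {"<": ">=", "<=": ">", ">": "<=", ">=": "<", "==": "!=", "!=": "=="}


@dataclass(frozen=True)
class Filter:
    field: str
    op: str
    const: int

    def holds(self, row):
        """Evaluate the filter condition on a concrete row."""
        return OPS[self.op](row[self.field], self.const)

    def negated(self):
        """Constraint describing rows that the filter rejects."""
        return Filter(self.field, NEGATE[self.op], self.const)


def solve(constraint, seed_row):
    """Return a row satisfying `constraint`, reusing the concrete seed row if possible.

    Stand-in for a real solver: for an integer comparison against a constant c,
    one of c-1, c, c+1 always satisfies the constraint.
    """
    if constraint.holds(seed_row):
        return dict(seed_row)
    c = constraint.const
    for candidate in (c - 1, c, c + 1):
        row = dict(seed_row, **{constraint.field: candidate})
        if constraint.holds(row):
            return row
    raise ValueError("no satisfying value found")


def example_rows(flt, seed_row):
    """Produce one row that passes the filter and one that fails it."""
    return {"pass": solve(flt, seed_row), "fail": solve(flt.negated(), seed_row)}


# Usage: a filter `age > 18` and one concrete input row as the seed.
rows = example_rows(Filter("age", ">", 18), {"name": "a", "age": 30})
print(rows)
```

Here the seed row already passes the filter and is reused unchanged, while the failing row is obtained by solving the negated condition and changing only the `age` field. A real system would have to handle multiple operators, joins, and richer data types, which is where a general-purpose constraint solver becomes necessary.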

Please use this URL to cite or link to this publication: https://lup.lub.lu.se/record/90e0c5d7-053d-46f3-bb75-8f8fd266687c

author

Li, Kaituo; Reichenbach, Christoph (LU); Smaragdakis, Yannis; Diao, Yanlei and Csallner, Christoph

publishing date

2013-12-01

type

Chapter in Book/Report/Conference proceeding

publication status

published

subject

Computer Sciences

keywords

data flow analysis, program testing, programming languages, reasoning about programs, specification languages

host publication

2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings

article number

6693083

pages

11 pages

conference name

2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013

conference location

Palo Alto, CA, United States

conference dates

2013-11-11 - 2013-11-15

external identifiers

scopus:84893566097

ISBN

9781479902156

DOI

10.1109/ASE.2013.6693083

language

English

LU publication?

no

id

90e0c5d7-053d-46f3-bb75-8f8fd266687c

date added to LUP

2019-03-29 19:42:53

date last changed

2025-10-14 10:20:16

@inproceedings{90e0c5d7-053d-46f3-bb75-8f8fd266687c,
  abstract     = {{Exhaustive, automatic testing of dataflow (esp. mapreduce) programs has emerged as an important challenge. Past work demonstrated effective ways to generate small example data sets that exercise operators in the Pig platform, used to generate Hadoop map-reduce programs. Although such prior techniques attempt to cover all cases of operator use, in practice they often fail. Our SEDGE system addresses these completeness problems: for every dataflow operator, we produce data aiming to cover all cases that arise in the dataflow program (e.g., both passing and failing a filter). SEDGE relies on transforming the program into symbolic constraints, and solving the constraints using a symbolic reasoning engine (a powerful SMT solver), while using input data as concrete aids in the solution process. The approach resembles dynamic-symbolic (a.k.a. 'concolic') execution in a conventional programming language, adapted to the unique features of the dataflow domain. In third-party benchmarks, SEDGE achieves higher coverage than past techniques for 5 out of 20 PigMix benchmarks and 7 out of 11 SDSS benchmarks (with equal coverage for the rest of the benchmarks). We also show that our targeting of the high-level dataflow language pays off: for complex programs, state-of-the-art dynamic-symbolic execution at the level of the generated map-reduce code (instead of the original dataflow program) requires many more test cases or achieves much lower coverage than our approach.}},
  author       = {{Li, Kaituo and Reichenbach, Christoph and Smaragdakis, Yannis and Diao, Yanlei and Csallner, Christoph}},
  booktitle    = {{2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings}},
  isbn         = {{9781479902156}},
  keywords     = {{data flow analysis; program testing; programming languages; reasoning about programs; specification languages}},
  language     = {{eng}},
  month        = {{12}},
  pages        = {{235--245}},
  title        = {{SEDGE: Symbolic example data generation for dataflow programs}},
  url          = {{http://dx.doi.org/10.1109/ASE.2013.6693083}},
  doi          = {{10.1109/ASE.2013.6693083}},
  year         = {{2013}},
}
