Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair

Created by W.Langdon from gp-bibliography.bib Revision:1.4202

  author =       "Edward K. Smith and Earl T. Barr and 
                 Claire {Le Goues} and Yuriy Brun",
  title =        "Is the Cure Worse Than the Disease? Overfitting in
                 Automated Program Repair",
  booktitle =    "10th Joint Meeting of the European Software
                 Engineering Conference and the ACM SIGSOFT Symposium on
                 the Foundations of Software Engineering (ESEC/FSE
                 2015)",
  year =         "2015",
  editor =       "Mark Harman and Patrick Heymans",
  pages =        "532--543",
  address =      "Bergamo, Italy",
  month =        aug # " 30 - " # sep # " 4",
  publisher =    "ACM",
  keywords =     "genetic algorithms, genetic programming, SBSE,
                 GenProg, TrpAutoRepair (RSRepair), IntroClass,
                 automated program repair, empirical evaluation,
                 independent evaluation, Klee",
  isbn13 =       "978-1-4503-3675-8",
  DOI =          "10.1145/2786805.2786825",
  acmid =        "2786825",
  size =         "12 pages",
  abstract =     "Automated program repair has shown promise for
                 reducing the significant manual effort debugging
                 requires. This paper addresses a deficit of earlier
                 evaluations of automated repair techniques caused by
                 repairing programs and evaluating generated patches'
                 correctness using the same set of tests. Since tests
                 are an imperfect metric of program correctness,
                 evaluations of this type do not discriminate between
                 correct patches and patches that overfit the available
                 tests and break untested but desired functionality.
                 This paper evaluates two well-studied repair tools,
                 GenProg and TrpAutoRepair, on a publicly available
                 benchmark of bugs, each with a human-written patch. By
                 evaluating patches using tests independent from those
                 used during repair, we find that the tools are unlikely
                 to improve the proportion of independent tests passed,
                 and that the quality of the patches is proportional to
                 the coverage of the test suite used during repair. For
                 programs that pass most tests, the tools are as likely
                 to break tests as to fix them. However, novice
                 developers also overfit, and automated repair performs
                 no worse than these developers. In addition to
                 overfitting, we measure the effects of test suite coverage,
                 test suite provenance, and starting program quality, as
                 well as the difference in quality between
                 novice-developer-written and tool-generated patches
                 when quality is assessed with a test suite independent
                 from the one used for patch generation.",
  notes =        "University of Massachusetts at Amherst, USA;
                 University College London, UK; Carnegie Mellon
                 University, USA; University of Massachusetts, USA.

                 Tries n-version programming with majority vote. The
                 white-box tests differ from the test suite used to
                 train the GP; this second set was automatically
                 generated using Klee. 200 students at UC Davis.

                 Cliff's Delta is described as measuring an effect
                 size, but Wikipedia's definition (2015 Sep 21)
                 suggests it is just another non-parametric
                 statistical significance test.

                 p541 'Synthesis techniques...not suitable for legacy

Genetic Programming entries for Edward K Smith Earl Barr Claire Le Goues Yuriy Brun