Active Learning of Regular Expressions for Entity Extraction

Created by W. Langdon from gp-bibliography.bib Revision: 1.3872

@Article{Bartoli:2017:ieeeTC,
  author =       "A. Bartoli and A. {De Lorenzo} and E. Medvet and 
                 F. Tarlao",
  journal =      "IEEE Transactions on Cybernetics",
  title =        "Active Learning of Regular Expressions for Entity
                 Extraction",
  year =         "2017",
  abstract =     "We consider the automatic synthesis of an entity
                 extractor, in the form of a regular expression, from
                 examples of the desired extractions in an unstructured
                 text stream. This is a long-standing problem for which
                 many different approaches have been proposed, which all
                 require the preliminary construction of a large dataset
                 fully annotated by the user. In this paper, we propose
                 an active learning approach aimed at minimizing the
                 user annotation effort: the user annotates only one
                 desired extraction and then merely answers extraction
                 queries generated by the system. During the learning
                 process, the system digs into the input text for
                 selecting the most appropriate extraction query to be
                 submitted to the user in order to improve the current
                 extractor. We construct candidate solutions with
                 genetic programming (GP) and select queries with a form
                 of querying-by-committee, i.e., based on a measure of
                 disagreement within the best candidate solutions. All
                 the components of our system are carefully tailored to
                 the peculiarities of active learning with GP and of
                 entity extraction from unstructured text. We evaluate
                 our proposal in depth, on a number of challenging
                 datasets and based on a realistic estimate of the user
                 effort involved in answering each single query. The
                 results demonstrate high accuracy with significant
                 savings in terms of computational effort, annotated
                 characters, and execution time over a state-of-the-art
                 baseline.",
  keywords =     "genetic algorithms, genetic programming",
  DOI =          "10.1109/TCYB.2017.2680466",
  ISSN =         "2168-2267",
  notes =        "Also known as \cite{7886274}",
}

Genetic Programming entries for Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, Fabiano Tarlao
