Machine Learning Techniques for Document Processing and Web Security

Created by W.Langdon from gp-bibliography.bib Revision:1.4549

  author =       "Enrico Sorio",
  title =        "Machine Learning Techniques for Document Processing
                 and Web Security",
  school =       "Universita Degli Studi di Trieste",
  year =         "2013?",
  type =         "ING-INF/05",
  address =      "Italy",
  month =        "13 " # mar # "?",
  keywords =     "genetic algorithms, genetic programming, machine
                 learning, document understanding, NLP, web security,
                 cloud computing",
  nbn =          "urn:nbn:it:units-9913",
  URL =          "",
  size =         "133 pages",
  abstract =     "The task of extracting structured information from
                 documents that are unstructured or whose structure is
                 unknown is of utmost importance in many application
                 domains, e.g., office automation, knowledge management,
                 machine-to-machine interactions. In practice, this
                 information extraction task can be automated only to a
                 very limited extent or subject to strong assumptions
                 and constraints on the execution environment.

                 In this thesis work I will present several novel
                 applications of machine learning techniques aimed at
                 extending the scope and opportunities for automation of
                 information extraction from documents of different
                 types, ranging from printed invoices to structured XML
                 documents, to potentially malicious documents exposed
                 on the web.

                 The main results of this thesis consist in the design,
                 development and experimental evaluation of a system for
                 information extraction from printed documents. My
                 approach is designed for scenarios in which the set of
                 possible document layouts is unknown and may evolve
                 over time. The system uses the layout information to
                 define layout-specific extraction rules that can be
                 used to extract information from a document. As far as
                 I know, this is the first information extraction system
                 that is able to detect if the document under analysis
                 has an unseen layout and hence needs new extraction
                 rules. In such a case, it uses a probability-based
                 machine learning algorithm in order to build those
                 extraction rules using just the document under
                 analysis. Another novel contribution of my system is
                 that it continuously exploits the feedback from human
                 operators in order to improve its extraction ability.

                 I investigate a method for the automatic detection and
                 correction of OCR errors. The algorithm uses
                 domain knowledge about possible misrecognitions of
                 characters and about the type of the extracted
                 information to propose and validate corrections.

                 I propose a system for the automatic generation of
                 regular expressions for text-extraction tasks. The
                 system is based on genetic programming and uses a set
                 of user-provided labelled examples to drive the
                 evolutionary search for a regular expression suitable
                 for the specified task.

                 As regards information extraction from structured
                 documents, I present an approach, based on genetic
                 programming, for schema synthesis starting from a set
                 of XML sample documents. The tool takes as input one or
                 more XML documents and automatically produces a schema,
                 in DTD language, which describes the structure of the
                 input documents.

                 Finally, I move to web security. I attempt to
                 assess the ability of Italian public administrations to
                 be in full control of their respective web sites.
                 Moreover, I developed a technique for the detection of
                 certain types of fraudulent intrusions that are
                 becoming of practical interest on a large scale.",
  notes =        "Supervisor/Tutor: Medvet, Eric and Bartoli, Alberto",
