Automated retrieval and extraction of training course information from unstructured web pages

Created by W.Langdon from gp-bibliography.bib Revision:1.4333

  author =       "Daniela Xhemali",
  title =        "Automated retrieval and extraction of training course
                 information from unstructured web pages",
  school =       "Loughborough University",
  year =         "2010",
  type =         "Engineering Doctorate",
  address =      "Leicestershire, LE11 3TU, UK",
  month =        "9 " # jul,
  keywords =     "genetic algorithms, genetic programming, Web page,
                 Information Retrieval, Information Extraction, Web
                 Classifier, Naive Bayes Classifiers, Regular
  URL =          "",
  size =         "236 pages",
  abstract =     "Web Information Extraction (WIE) is the discipline
                 dealing with the discovery, processing and extraction
                 of specific pieces of information from semi-structured
                 or unstructured web pages. The World Wide Web comprises
                 billions of web pages and there is much need for
                 systems that will locate, extract and integrate the
                 acquired knowledge into organisations' practices. There
                 are some commercial, automated web extraction software
                 packages; however, their success comes from heavily
                 involving their users in the process of finding the
                 relevant web pages, preparing the system to recognise
                 items of interest on these pages and manually dealing
                 with the evaluation and storage of the extracted
                 results. This research has explored WIE, specifically
                 with regard to the automation of the extraction and
                 validation of online training information. The work
                 also includes research and development in the area of
                 automated Web Information Retrieval (WIR), more
                 specifically in Web Searching (or Crawling) and Web
                 Classification. Different technologies were considered;
                 however, after much consideration, Naive Bayes Networks
                 were chosen as the most suitable for the development of
                 the classification system. The extraction part of the
                 system used Genetic Programming (GP) for the generation
                 of web extraction solutions. Specifically, GP was used
                 to evolve Regular Expressions, which were then used to
                 extract specific training course information from the
                 web, such as course names, prices, dates and locations.
                 The experimental results indicate that all three
                 aspects of this research perform very well, with the
                 Web Crawler outperforming existing crawling systems,
                 the Web Classifier performing with an accuracy of over
                 95 percent and a precision of over 98 percent, and the
                 Web Extractor achieving an accuracy of over 94 percent
                 for the extraction of course titles and an accuracy of
                 just under 67 percent for the extraction of other course
                 attributes such as dates, prices and locations.
                 Furthermore, the overall work is of great significance
                 to the sponsoring company, as it simplifies and
                 improves the existing time-consuming, labour-intensive
                 and error-prone manual techniques, as will be discussed
                 in this thesis. The prototype developed in this
                 research works in the background and requires very
                 little, often no, human assistance.",
  notes =        "Sorting programs p218-219. Daniela Birdsall",
