A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections

  author =       "Yuxin Chen",
  title =        "A Novel Hybrid Focused Crawling Algorithm to Build
                 Domain-Specific Collections",
  school =       "Virginia Polytechnic Institute and State University",
  year =         "2007",
  address =      "Blacksburg, Virginia, USA",
  month =        feb # " 5",
  keywords =     "genetic algorithms, genetic programming, digital
                 libraries, focused crawler, classification",
  URL =          "http://scholar.lib.vt.edu/theses/available/etd-02162007-005107/",
  URL =          "http://scholar.lib.vt.edu/theses/available/etd-02162007-005107/unrestricted/YuxinDissertation_etd_final1.pdf",
  URN =          "etd-02162007-005107",
  size =         "85 pages",
  abstract =     "The Web, containing a large amount of useful
                 information and resources, is expanding rapidly.
                 Collecting domain-specific documents/information from
                 the Web is one of the most important methods to build
                 digital libraries for the scientific community. Focused
                 Crawlers can selectively retrieve Web documents
                 relevant to a specific domain to build collections for
                 domain-specific search engines or digital libraries.
                 Traditional focused crawlers, which normally adopt the
                 simple Vector Space Model and local Web search
                 algorithms, typically find relevant Web pages with
                 low precision. Recall is also often low, since they
                 explore a limited sub-graph of the Web that surrounds
                 the starting URL set, and will ignore relevant pages
                 outside this sub-graph. In this work, we investigated
                 how to apply an inductive machine learning algorithm
                 and a meta-search technique to the traditional focused
                 crawling process to overcome the above-mentioned
                 problems and to improve performance. We proposed a
                 novel hybrid focused crawling framework based on
                 Genetic Programming (GP) and meta-search. We showed
                 that our novel hybrid framework can be applied to
                 traditional focused crawlers to accurately find more
                 relevant Web documents for the use of digital libraries
                 and domain-specific search engines. The framework is
                 validated through experiments performed on test
                 documents from the Open Directory Project. Our studies
                 have shown that improvement can be achieved relative to
                 the traditional focused crawler if genetic programming
                 and meta-search methods are introduced into the focused
                 crawling process.",

