Dealing with Data Sparsity in Drug Named Entity Recognition

Created by W.Langdon from gp-bibliography.bib Revision:1.4340

  author =       "Dimitrios Piliouras and Ioannis Korkontzelos and 
                 Andrew Dowsey and Sophia Ananiadou",
  title =        "Dealing with Data Sparsity in Drug Named Entity
                 Recognition",
  booktitle =    "IEEE International Conference on Healthcare
                 Informatics (ICHI 2013)",
  year =         "2013",
  month =        sep,
  pages =        "14--21",
  keywords =     "genetic algorithms, genetic programming, artificial
                 intelligence, drugs, medical computing, natural
                 language processing, pattern classification, BioNLP
                 tasks, automatic annotations, biomedical natural
                 language processing tasks, data sparsity, drug named
                 entity recognition, drug-NER, gold-standard data,
                 manual annotations, voting system, Data models,
                 Dictionaries, Drugs, Proteins, Training, Training data",
  DOI =          "doi:10.1109/ICHI.2013.9",
  abstract =     "Drug Named Entity Recognition (drug-NER) is a critical
                 step for complex Biomedical Natural Language Processing
                 (BioNLP) tasks such as the extraction of
                 pharmaco-genomic, pharmaco-dynamic and pharmaco-kinetic
                 parameters. Large quantities of high quality training
                 data are almost always a prerequisite for employing
                 supervised machine-learning (ML) techniques to achieve
                 high classification performance. However, the human
                 labour needed to produce and maintain such resources is
                 a detrimental limitation. In this study, we attempt to
                 improve the performance of drug NER without relying
                 exclusively on manual annotations. Instead, we use
                 either a small gold-standard corpus (120 abstracts) or
                 no corpus at all. In our approach, we use a voting
                 system to combine a number of heterogeneous models to
                 enhance performance. Moreover, 11 regular-expressions
                 that capture common drug suffixes were evolved via
                 genetic-programming. We evaluate our approach against
                 state-of-the-art recognisers trained on manual
                 annotations, automatic annotations and a mixture of
                 both. Aggregate classifiers are shown to improve
                 performance, achieving a maximum F-score of 95\%.
                 In addition, combined models trained on mixed data are
                 shown to achieve comparable performance to models
                 trained exclusively on gold-standard data.",
  notes =        "DrugBank, PK PharmacoKinetic corpus, 360 articles,
                 maximum-entropy maxent.sf openNLP, 8 features per
                 token, ANN perceptron. p17 Silver data='annotated by
                 direct string matching dictionary entries', AcroMine
                 negative, p18 GP Evolving string-similarity patterns (i.e.
                 regular expressions) USAN stem grouping restricting to
                 _last_ 4, 5, or 6 characters gives 'major positive
                 effect'. 200*(pop=10000,generations=80). anti-bloat
                 (max tree depth=10 no space character in terminal set)
                 p19 best-evolved GP tree (fig and Table 3). p19
                 'gold-standard data not' needed for drug-NER. 2013 and
                 still says 'data sparsity is pervasive...' p20 data not
                 split into training and holdout sets. Also known as
