Dealing with Data Sparsity in Drug Named Entity Recognition

Created by W.Langdon from gp-bibliography.bib Revision:1.3872

@InProceedings{Piliouras:2013:ICHI,
  author =       "Dimitrios Piliouras and Ioannis Korkontzelos and 
                 Andrew Dowsey and Sophia Ananiadou",
  title =        "Dealing with Data Sparsity in Drug Named Entity
                 Recognition",
  booktitle =    "IEEE International Conference on Healthcare
                 Informatics (ICHI 2013)",
  year =         "2013",
  month =        sep,
  pages =        "14--21",
  keywords =     "genetic algorithms, genetic programming, artificial
                 intelligence, drugs, medical computing, natural
                 language processing, pattern classification, BioNLP
                 tasks, automatic annotations, biomedical natural
                 language processing tasks, data sparsity, drug named
                 entity recognition, drug-NER, gold-standard data,
                 manual annotations, voting system, Data models,
                 Dictionaries, Drugs, Proteins, Training, Training data,
                 data-sparsity",
  DOI =          "10.1109/ICHI.2013.9",
  abstract =     "Drug Named Entity Recognition (drug-NER) is a critical
                 step for complex Biomedical Natural Language Processing
                 (BioNLP) tasks such as the extraction of
                 pharmaco-genomic, pharmaco-dynamic and pharmaco-kinetic
                 parameters. Large quantities of high quality training
                 data are almost always a prerequisite for employing
                 supervised machine-learning (ML) techniques to achieve
                 high classification performance. However, the human
                 labour needed to produce and maintain such resources is
                 a detrimental limitation. In this study, we attempt to
                 improve the performance of drug NER without relying
                 exclusively on manual annotations. Instead, we use
                 either a small gold-standard corpus (120 abstracts) or
                 no corpus at all. In our approach, we use a voting
                 system to combine a number of heterogeneous models to
                 enhance performance. Moreover, 11 regular-expressions
                 that capture common drug suffixes were evolved via
                 genetic-programming. We evaluate our approach against
                 state-of-the-art recognisers trained on manual
                 annotations, automatic annotations and a mixture of
                 both. Aggregate classifiers are shown to improve
                 performance, achieving a maximum F-score of 95\%.
                 In addition, combined models trained on mixed data are
                 shown to achieve comparable performance to models
                 trained exclusively on gold-standard data.",
  notes =        "DrugBank, PK PharmacoKinetic corpus, 360 articles,
                 maximum-entropy maxent.sf openNLP, 8 features per
                 token, ANN perceptron. p17 Silver data='annotated by
                 direct string matching dictionary entries', AcroMine
                 negative, p18 GP Evolving string-similarity patterns (ie
                 regular expressions) USAN stem grouping restricting to
                 _last_ 4, 5 or 6 characters gives 'major positive
                 effect'. 200*(pop=10000,generations=80). anti-bloat
                 (max tree depth=10 no space character in terminal set)
                 p19 best-evolved GP tree (fig and Table 3). p19
                 'gold-standard data not' needed for drug-NER. 2013 and
                 still says 'data sparsity is pervasive...' p20 data not
                 split into training and holdout sets. Also known as
                 \cite{6680456}",
}
