New Techniques for Disambiguation in Natural Language and Their Application to Biological Text

Filip Ginter, Jorma Boberg, Jouni Järvinen, Tapio Salakoski, New Techniques for Disambiguation in Natural Language and Their Application to Biological Text. Journal of Machine Learning Research 5, 605–621, 2004.

Abstract:

We study the problems of disambiguation in natural language, focusing on the problem of gene vs. protein name disambiguation in biological text and also considering the problem of context-sensitive spelling error correction. We introduce a new family of classifiers based on ordering and weighting the feature vectors obtained from word counts and word co-occurrence in the text, and inspect several concrete classifiers from this family. We obtain the most accurate prediction when weighting by positions of the words in the context. On the gene/protein name disambiguation problem, this classifier outperforms both the Naive Bayes and SNoW baseline classifiers. We also study the effect of the smoothing techniques with the Naive Bayes classifier, the collocation features, and the context length on the classification accuracy and show that correct setting of the context length is important and also problem-dependent.

BibTeX entry:

@ARTICLE{jGiBoJaSa04a,
  title = {New Techniques for Disambiguation in Natural Language and Their Application to Biological Text},
  author = {Ginter, Filip and Boberg, Jorma and Järvinen, Jouni and Salakoski, Tapio},
  journal = {Journal of Machine Learning Research},
  volume = {5},
  pages = {605--621},
  year = {2004},
  keywords = {biological text, gene vs. protein name disambiguation, textual data mining, word sense disambiguation, context-sensitive spelling error correction},
}

Belongs to TUCS Research Unit(s): Turku BioNLP Group

Publication Forum rating of this publication: level 3
