BioInfer: A Corpus for Information Extraction in the Biomedical Domain

Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, Tapio Salakoski, BioInfer: A Corpus for Information Extraction in the Biomedical Domain. BMC Bioinformatics 8(50), 2007.



Lately, there has been a great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods requires annotated domain corpora.


We present BioInfer (Bio Information Extraction Resource), a new
public resource providing an annotated corpus of biomedical English.
We describe an annotation scheme capturing named entities and their
relationships along with a dependency analysis of sentence syntax.
We further present ontologies defining the types of entities and
relationships annotated in the corpus. Currently, the corpus contains
1100 sentences from abstracts of biomedical research articles
annotated for relationships, named entities, as well as syntactic
dependencies. Supporting software is provided with the corpus. The
corpus is unique in the domain in combining these annotation types
for a single set of sentences, and in the level of detail of the
relationship annotation.


We introduce a corpus targeted at protein, gene, and RNA
relationships which serves as a resource for the development of
information extraction systems and their components such as parsers
and domain analyzers. The corpus will be maintained and further
developed with a current version being available at
http://www.it.utu.fi/BioInfer.

BibTeX entry:

  title = {BioInfer: A Corpus for Information Extraction in the Biomedical Domain},
  author = {Pyysalo, Sampo and Ginter, Filip and Heimonen, Juho and Björne, Jari and Boberg, Jorma and Järvinen, Jouni and Salakoski, Tapio},
  journal = {BMC Bioinformatics},
  volume = {8},
  number = {50},
  year = {2007},

Belongs to TUCS Research Unit(s): Turku BioNLP Group

Publication Forum rating of this publication: level 2
