Comparative analysis of five protein-protein interaction corpora

Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter, Tapio Salakoski, Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics 9 (supplement 3)(6), 2008.



Growing interest in the application of natural language processing
methods to biomedical text has led to an increasing number of corpora
and methods targeting protein-protein interaction (PPI) extraction.
However, there is no general consensus regarding PPI annotation, and
consequently resources are largely incompatible and methods are
difficult to evaluate.


We present the first comparative evaluation of the diverse PPI corpora,
performing quantitative evaluation using two separate information
extraction methods as well as detailed statistical and qualitative
analyses of their properties. For the evaluation, we unify the corpus
PPI annotations to a shared level of information, consisting of
undirected, untyped binary interactions of non-static types with no
identification of the words specifying the interaction, no negations,
and no interaction certainty.
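The unification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the conversion software from the study: the `RichInteraction` class, its field names, and the `unify` function are hypothetical, and the choice to discard negated and static annotations (rather than merely drop the flag) is one possible reading of the description.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class RichInteraction:
    """Hypothetical corpus-specific annotation; all field names are assumptions."""
    agent: str
    target: str
    interaction_type: Optional[str] = None  # e.g. "bind"; dropped on unification
    trigger_word: Optional[str] = None      # word specifying the interaction; dropped
    negated: bool = False
    certainty: Optional[str] = None         # dropped on unification
    is_static: bool = False                 # static relation, e.g. part-of


def unify(annotations: Iterable[RichInteraction]) -> Set[FrozenSet[str]]:
    """Reduce rich annotations to undirected, untyped binary interactions."""
    unified: Set[FrozenSet[str]] = set()
    for ann in annotations:
        # Assumption: static and negated annotations are excluded entirely.
        if ann.is_static or ann.negated:
            continue
        # A frozenset ignores argument order, so (A, B) and (B, A) coincide.
        unified.add(frozenset((ann.agent, ann.target)))
    return unified
```

Representing each pair as a `frozenset` makes the interaction undirected by construction, and omitting every other field removes type, trigger word, and certainty information.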

We find that the F-score performance of a state-of-the-art PPI
extraction method varies by 19 percentage points on average, and in some
cases by over 30 percentage points, between the different evaluated corpora.
The differences stemming from the choice of corpus can thus be
substantially larger than differences between the performance of PPI
extraction methods, which suggests definite limits on the ability to
compare methods evaluated on different resources. We analyse a number of
potential sources of these differences and identify factors explaining
approximately half of the variance. We further suggest ways in which the
difficulty of the PPI extraction tasks codified by different corpora can
be determined to advance comparability. Our analysis also identifies
points of agreement and disagreement in PPI corpus annotation that are
rarely explicitly stated by the authors of the corpora.
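As a rough illustration of how a per-corpus F-score spread might be summarised, the following Python sketch computes the balanced F-score from precision and recall, then the mean and maximum absolute pairwise difference across corpora. The abstract does not state how its average difference is computed, so the pairwise reading used here is an assumption, and the function names `f_score` and `pairwise_spread` are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Tuple


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pairwise_spread(scores: Dict[str, float]) -> Tuple[float, float]:
    """Mean and maximum absolute difference over all pairs of corpora.

    `scores` maps a corpus name to its F-score in percent; at least two
    corpora are required. The pairwise definition is an assumption.
    """
    diffs = [abs(a - b) for a, b in combinations(scores.values(), 2)]
    return mean(diffs), max(diffs)
```

Calling `pairwise_spread` on a mapping of corpus names to F-scores returns the two summary figures in the same units as the input.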


Our comparative analysis uncovers key similarities and differences
between the diverse PPI corpora, thus taking an important step towards
standardization. In the course of this study we have also made a major
practical contribution by converting the corpora into a shared format.
The conversion software is freely available at

BibTeX entry:

  title = {Comparative analysis of five protein-protein interaction corpora},
  author = {Pyysalo, Sampo and Airola, Antti and Heimonen, Juho and Björne, Jari and Ginter, Filip and Salakoski, Tapio},
  journal = {BMC Bioinformatics},
  volume = {9 (supplement 3)},
  number = {6},
  year = {2008},

Belongs to TUCS Research Unit(s): Turku BioNLP Group

Publication Forum rating of this publication: level 2

