Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank

Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen, Tapio Salakoski, Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank. In: Marco Passarotti, Adam Przepiórkowski, Sabina Raynaud, Frank van Eynde (Eds.), Proceedings of the eighth international workshop on Treebanks and Linguistic Theories, 95-105, EduCatt, 2009.

Abstract:

In this work, we present the first results obtained during the annotation of
a general Finnish treebank in the Stanford Dependency scheme. We find
that the scheme is a suitable syntax representation for Finnish, with only minor
modifications needed. The treebank is based on text from the Finnish
Wikipedia, ensuring its free distribution and broad topical variance. To assess
the suitability of Wikipedia text as the basis of a treebank, we analyze its
grammaticality and find the quality of the language surprisingly high, with
97.2% of the sentences judged as grammatical. The treebank currently consists
of 60 fully annotated articles and is freely available.

BibTeX entry:

@INPROCEEDINGS{inpHaGiLaViSa09a,
  title = {Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank},
  booktitle = {Proceedings of the eighth international workshop on Treebanks and Linguistic Theories},
  author = {Haverinen, Katri and Ginter, Filip and Laippala, Veronika and Viljanen, Timo and Salakoski, Tapio},
  editor = {Passarotti, Marco and Przepiórkowski, Adam and Raynaud, Sabina and van Eynde, Frank},
  publisher = {EduCatt},
  pages = {95--105},
  year = {2009},
  keywords = {Finnish, treebank, syntax, dependency parsing},
}

Belongs to TUCS Research Unit(s): Turku BioNLP Group
