Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish

Jenna Kanerva, Juhani Luotolahti, Veronika Laippala, Filip Ginter. Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish. In: Andrius Utka, Gintarė Grigonytė, Jurgita Kapočiūtė-Dzikienė, Jurgita Vaičenonienė (Eds.), Proceedings of the Sixth International Conference Baltic HLT 2014, 184–191, IOS Press, 2014.

Abstract:

In this paper, we report on the development of a large-scale Finnish Internet parsebank, currently consisting of 1.5 billion tokens in 116 million sentences. The data is fully morphologically and syntactically analyzed and it has been used to extract flat and syntactic n-gram collections, as well as verb-argument and noun-argument n-grams. Additionally, distributional vector space representations of the words are induced using the word2vec method. All n-gram collections as well as the vector space models are made available under an open license.

BibTeX entry:

@INPROCEEDINGS{inpKaLuLaGi14a,
  title = {Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish},
  booktitle = {Proceedings of the Sixth International Conference Baltic HLT 2014},
  author = {Kanerva, Jenna and Luotolahti, Juhani and Laippala, Veronika and Ginter, Filip},
  editor = {Utka, Andrius and Grigonytė, Gintarė and Kapočiūtė-Dzikienė, Jurgita and Vaičenonienė, Jurgita},
  publisher = {IOS Press},
  pages = {184--191},
  year = {2014},
}

Belongs to TUCS Research Unit(s): Turku BioNLP Group
