A Prototype-matching System for Scientific Abstract Collection Semantic Clustering

Antonina Kloptchenko, Barbro Back, Ari Visa, Jarmo Toivonen, Hannu Vanharanta, A Prototype-matching System for Scientific Abstract Collection Semantic Clustering. TUCS Technical Reports 465, Turku Centre for Computer Science, 2002.

Abstract:

The growth of digitally available text information has created a need for effective text processing tools. Document clustering aims to solve some text processing problems, such as text categorization, topic discovery, text browsing and searching, retrieval by content, and organizing retrieval results on the Web. We have used a retrieval-by-content method built on prototype-matching clustering of a scientific text collection, in our case the abstracts from The Hawaii International Conference on System Science 2001. Our aim is to retrieve documents from a conference paper collection according to similarities in their contents and semantic structures. Our prototype-matching information retrieval method consists of document pre-processing, “smart” document encoding on different syntactic levels, clustering document histograms using a vector quantization algorithm, and matching those histograms for every document against a prototype. In the report, we position our methods among the existing document clustering methods, explain the motivation behind the clustering of scientific conference papers, and give an example of using our prototype tool for information retrieval by content on the scientific abstract collection. The method offers a promising alternative for the task of information retrieval by content from scientific text collections.
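
The four stages named in the abstract (pre-processing, encoding, vector quantization into histograms, and matching against a prototype) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It works on the word level only, whereas the report encodes on several syntactic levels, and the encoding scheme, the 1-D Lloyd iteration used as the vector quantizer, the Euclidean distance, and all function names and parameters are assumptions chosen for illustration.

```python
import re
from math import sqrt


def tokenize(text):
    """Pre-processing: lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def encode_word(word, base=27, max_len=6):
    """Map a word to a number in [0, 1) from its first letters.

    Hypothetical encoding chosen for this sketch, not the report's scheme.
    """
    code = 0.0
    for i, ch in enumerate(word[:max_len]):
        code += (ord(ch) - ord("a") + 1) / base ** (i + 1)
    return code


def nearest(centroids, value):
    """Index of the codebook entry closest to the value."""
    return min(range(len(centroids)), key=lambda j: abs(centroids[j] - value))


def train_codebook(values, k, iterations=20):
    """1-D Lloyd (k-means) iteration, a simple form of vector quantization.

    Assumes `values` is non-empty.
    """
    ordered = sorted(values)
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        cells = [[] for _ in range(k)]
        for v in values:
            cells[nearest(centroids, v)].append(v)
        # Keep the old centroid when a cell received no values.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(cells)]
    return centroids


def histogram(text, centroids):
    """Normalised histogram of a document's word codes over the codebook."""
    counts = [0] * len(centroids)
    for word in tokenize(text):
        counts[nearest(centroids, encode_word(word))] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def distance(h1, h2):
    """Euclidean distance between two histograms of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def rank_by_prototype(prototype, documents, k=4):
    """Return (distance, document index) pairs, closest document first."""
    codes = [encode_word(w) for d in documents for w in tokenize(d)]
    centroids = train_codebook(codes, k)
    proto_hist = histogram(prototype, centroids)
    return sorted((distance(proto_hist, histogram(d, centroids)), i)
                  for i, d in enumerate(documents))
```

Calling `rank_by_prototype(abstracts[0], abstracts)` on a list of abstract strings returns the collection ordered by histogram distance to the first abstract, which plays the role of the prototype. Histograms of single word codes capture very little meaning, so this only shows the shape of the pipeline.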

Files:

Full publication in PDF-format

BibTeX entry:

@TECHREPORT{tKlBaViToVa02a,
  title = {A Prototype-matching System for Scientific Abstract Collection Semantic Clustering},
  author = {Kloptchenko, Antonina and Back, Barbro and Visa, Ari and Toivonen, Jarmo and Vanharanta, Hannu},
  number = {465},
  series = {TUCS Technical Reports},
  publisher = {Turku Centre for Computer Science},
  year = {2002},
  keywords = {text clustering, information retrieval by content, scientific text collection},
  ISBN = {952-12-1019-2},
}

Belongs to TUCS Research Unit(s): Data Mining and Knowledge Management Laboratory
