Corpora resources for Ukrainian: monolingual and parallel corpora

Corpus Linguistics
Monday 1 February 2016, 16:00 - 17:00
Michael Sadler SR (1.19)

Dmitri Sitchinava, Russian Language Institute of the Russian Academy of Sciences / Higher School of Economics, Moscow.

Ukrainian has many online resources, but it is still under-represented in the world of linguistic corpora, as it lacks a comprehensive corpus comparable to a national corpus. However, several teams are working on separate projects, including the corpus of the Computer Linguistic Library (which covers prose, poetry, and folklore). There are also parallel corpora pairing Ukrainian with Russian, Polish, and Bulgarian, which allow for detailed comparison with closely related languages. Some morphological analyzers for Ukrainian are being used for corpus markup; in practice, however, their use reveals morphological and orthographic variability that goes beyond the Soviet-era standards. This variability needs to be described adequately in corpora.