The majority of research activity in the field of computer and corpus linguistics and human language technology in Croatia is supported by the Ministry of Science, Education and Sport through projects linked with human language technologies and the Ministry of Culture through cultural heritage digitalisation projects.
In this decade the Croatian Language and Linguistics Institute and the University of Zagreb’s Faculty of the Humanities and Social Sciences have emerged as the leading national centres for the development of human language technologies and of our freely accessible digital language treasury of dictionaries, grammar books and orthographic handbooks.
Innovative human language technologies (HLT) are mediators that will allow Croatian citizens to participate in the principal social and economic trends of the European and global knowledge societies. In mid-2013, hopefully as of July 1st, the Croatian language will become the 24th official language of the European Union. At present some twenty European languages that, like Croatian, are used by fewer than 10 million people face the danger of digital extinction. The causes are their unfavourable on-line presence and the poor level of development of language resources, that is, collections of linguistic texts stored in electronic form, and of language tools, that is, applications that make use of existing digital resources. Great opportunities are opening for us in regional markets that remain unexploited because of linguistic obstacles, so the challenges of HLT should be made a national priority, as transport infrastructure is, for example. We ardently desire, in spite of the economic crisis, to leave the ranks of those citizens of the European Union who will find themselves both socially and economically short-changed for no reason other than that they speak only their native tongue. Multilingual HLT has become a channel for instantaneous, simple and inexpensive communication and interaction, bypassing the language barrier with free translation services such as Google Translate.
In this decade the Croatian Language and Linguistics Institute has emerged as the leading national centre for the development of our human language technologies and of our freely accessible digital language treasury of dictionaries, grammar books and orthographic handbooks. It shares this leading position with the computer linguistics experts at the University of Zagreb’s Faculty of the Humanities and Social Sciences, known for their pioneering work in introducing human language technology innovations for the Croatian language forty-two years ago, at the same time that Ralph Gorin of Stanford University developed the first computer language checker, the English Spell Check. At the Institute for Linguistics of the University of Zagreb’s Faculty of the Humanities and Social Sciences, Željko Bujas compiled the first Croatian language computer corpus. This educational institution would dominate computer linguistics over the following several decades, including the computer processing of the works of old Croatian writers, undertaken in 1980. The compiling of a one-million-entry corpus of Standard Croatian was launched in 1976 under the leadership of academician Milan Moguš. The compilation of the Croatian National Corpus, which currently covers 101.3 million tokens, was launched in 1998 under the leadership of Marko Tadić DSc, who has gone on to become the leading expert in computer and corpus linguistics in Croatia. The largest Croatian language corpus at present, hrWaC, was compiled at the same faculty in 2010 and contains 1.3 billion tokens collected from the .hr Internet domain.
The start of the 21st century saw the digitalisation of old Croatian monolingual and multilingual dictionaries at the same faculty under the leadership of Damir Boras DSc, together with the popular Croatian language Internet portal, available at the Croatian Old Dictionary portal. In 2004 the Croatian Language and Linguistics Institute for its part launched the compilation of a comprehensive corpus under the title of the Croatian Language Repository, which includes texts from the 11th century to the contemporary era. The repository is divided into three chief corpora (Old Croatian, Middle Croatian and Modern Croatian). For the first two, the key problems of a diachronic corpus are addressed, which in the case of Croatian means transliteration from three different scripts (Glagolitic, Cyrillic and Latin) and the resolution of non-standard orthographic solutions and individual variations in the use of the various scripts, explains the Institute’s director Željko Jozić DSc. The Institute also maintains a practical on-line language counselling service.
The majority of research activity in the field of computer and corpus linguistics and human language technology in Croatia is supported by the Ministry of Science, Education and Sport through projects linked with human language technologies and by the Ministry of Culture through cultural heritage digitalisation projects. This contrasts with commercial HLT markets such as those in the USA and some Asian countries.
Five years ago some vital projects related to the development of Croatian language resources were launched at the University of Zagreb’s Faculty of the Humanities and Social Sciences using the same budget sources. These include, among other programmes, computer-aided linguistic models and language technologies for Croatian, which compile and maintain an entire range of language resources and tools such as the Croatian National Corpus, the Croatian-English Parallel Corpus, the Croatian Morphological Lexicon, the Croatian Dependency Treebank and others. These programmes include the digitalisation of collected language data and directly increase the number of accessible language resources for the Croatian language.
New opportunities and HLTs open an Internet chapter in the preservation of the Croatian language among the multi-million strong polyglot Croatian Diaspora from Alaska to Tierra del Fuego, and from the south of Africa to Australia and New Zealand. Network access to Croatian language resources is a source of hope that the number of Croatian language speakers among Croatians living abroad will grow.
Text by: Vesna Kukavica