Project

Project title: Text Embeddings – Serbian Language Applications

Acronym: TESLA

Subprogram: Artificial intelligence

Participating Scientific and Research Organizations (SROs) and their acronyms:

University of Belgrade, Faculty of Mining and Geology (FMGUB)
University of Belgrade, Faculty of Philology (FPUB)

Principal Investigator (PI): Ranka Stanković

Abstract:

Background of the research problem. Recent advances in Natural Language Processing (NLP) have produced pre-trained language models such as the GPT (Generative Pre-trained Transformer) series, BERT (Bidirectional Encoder Representations from Transformers), and their derivatives. These models are based on Deep Learning (DL) and context-aware text embeddings, and benchmarks indicate that they outperform conventional models.

Methods. The DL models and tools developed in the project will build on experience with GPT, BERT, and multilingual BERT, as well as on the project team members' experience in NLP, large corpora, and a range of specific NLP problems and tasks. The work will also quantify the diversity of language phenomena in corpora.

Novelty. The TESLA project aims to develop open-source, DL-based, pre-trained language models built on text embeddings and tailored to Serbian, and to fine-tune them for specific NLP tasks. These models will address the specific features of the Serbian language and are therefore expected to outperform multilingual versions of existing language models when applied to Serbian texts.

Impact. The DL models and tools developed in this project will help secure the digital inclusion of the Serbian language. All of them will be open source and available for use in numerous applications in academia, industry, and services, such as generating text summaries, paraphrasing, discovering lexical relations, and creating Serbian-speaking chatbots.

Expected results. The major project result will be a set of pre-trained DL-based language models that are novel for Serbian and would represent a breakthrough in Serbian NLP. Task-specific and domain-specific NLP tools for Serbian, such as named entity recognition, relation extraction, and text generation tools and chatbots, will be produced to demonstrate the usefulness and performance of these models. The tools will be publicly available on a new web portal, with user-friendly explanatory visualizations of, for example, language patterns and phenomena detected in texts.