Objectives

The wider objective of the project is to develop pre-trained, general-purpose machine learning language models for Serbian using text embeddings, and to build additional infrastructure and supporting tools that will secure the digital inclusion of the Serbian language. To achieve this goal, the following specific objectives are established:

O1. to build static and contextualized (dynamic) word embeddings for Serbian.

O2. to train models for Serbian to enable multi-layered annotation (morphosyntax, named entity recognition and linking, relation extraction for knowledge graph building, sentiment analysis) relying on text embeddings.

O3. to build a set of language models for text generation, question-answering chatbots, and summarization.

O4. to develop a Serbian language tools portal and its task-specific tools.

The specific objectives map directly to work packages (WPs): O1↔WP1 and WP2, O2–O4↔WP3. WP4 is aimed at reaching the academic and business communities, and WP5 ensures effective management of activities, quality and risk control, and legal and societal aspects.

SMART

S: The proposed project is specific since it prepares, processes and explores language resources for Serbian in a new way, providing novel and valuable material for further development of NLP models and tools. With open and user-friendly access supported by open training materials, it gives academia and industry the opportunity to use the results of TESLA as building blocks for new, versatile applications.

M: All results (annotated datasets, lexical data, language models) are measurable: they are expressed as the size of a text collection, the number of sentences or tokens annotated, the number of entries in a lexicon, the number of question-and-answer pairs, and the number of models trained. This facilitates tracking of progress on the project site. The milestones are carefully determined, taking into account the time needed to complete each activity.

A: Part of the TESLA team was deeply involved in the production and exploitation of several Serbian corpora and of tagging, NER and word-embedding models, and has experience in the proposed technologies. The wider objective and the specific objectives O1–O4 are therefore achievable and realistic.

R: TESLA will be of great relevance to the scientific and wider community. The scientific community can reuse the annotated datasets to train other models and use the models as building blocks for complex solutions, while industry will be able to apply the project results in commercial solutions. The wider community will benefit mostly from the developed web portal and its end-to-end solution.

T: The activities defined in TESLA are scheduled so that the established milestones are met on time.