
We Trained the Ukrainian Language Model


YouScan is a social media monitoring company that stays at the cutting edge of technology to provide the best AI-based solutions for our customers. We wanted to make our own contribution to Open Source, so we released a Ukrainian language model trained by our data science team.

Vitalii Radchenko

Data Scientist

10 July 2020

We recently published articles about our sentiment classification and aspect classification, in which we described the results we achieved thanks to Open Source tools and modern technology.

Continuing to explore Open Source, we trained a Ukrainian language model on 30 GB of data from multiple sources (social media, news, and Wikipedia) and aimed to make it applicable to a variety of tasks.

Pretraining details

  • Ukrainian RoBERTa was trained with the code provided in the HuggingFace tutorial (see the sketch after this list)

  • The currently released model follows the roberta-base-cased architecture (12 layers, 768 hidden units, 12 attention heads, 125M parameters)

  • The model was trained on 4×V100 GPUs for 85 hours

  • You can find the training configuration on our GitHub page
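
For reference, below is a minimal sketch of what the tutorial-based pretraining setup looks like. The file paths, vocabulary size, and hyperparameters here are illustrative assumptions, not our exact configuration (that one is on our GitHub page):

import os
from tokenizers import ByteLevelBPETokenizer
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1. Train a byte-level BPE tokenizer on the raw Ukrainian corpus
#    ("ukr_corpus.txt" is a placeholder path).
os.makedirs("ukr-roberta-base", exist_ok=True)
bpe = ByteLevelBPETokenizer()
bpe.train(files=["ukr_corpus.txt"], vocab_size=52_000, min_frequency=2,
          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
bpe.save_model("ukr-roberta-base")

# 2. Build a roberta-base-sized model: 12 layers, 768 hidden units, 12 heads.
config = RobertaConfig(vocab_size=52_000, max_position_embeddings=514,
                       num_hidden_layers=12, hidden_size=768,
                       num_attention_heads=12)
model = RobertaForMaskedLM(config=config)

# 3. Train with the masked-language-modeling objective.
tokenizer = RobertaTokenizerFast.from_pretrained("ukr-roberta-base")
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="ukr_corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ukr-roberta-base",
                           num_train_epochs=1,
                           per_device_train_batch_size=64,
                           save_steps=10_000),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()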

Evaluation

ukr-roberta-base was tested on internal YouScan tasks. Compared to mBERT, it achieved an average F-score increase of 2 percent on multiclass and multilabel classification tasks.

Usage

You can use this model for masked language modeling or fine-tune it for any other task. We provide all model weights and the vocabulary, but the most convenient way to use the model is via the HuggingFace Transformers library, as demonstrated in the example below:

from transformers import pipeline, RobertaForMaskedLM, RobertaTokenizer

model = RobertaForMaskedLM.from_pretrained("ukr-roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("ukr-roberta-base")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
fill_mask("Тарас Шевченко – великий український <mask>.")

# [{'sequence': '<s> Тарас Шевченко – великий український поет.</s>',
# 'score': 0.48607954382896423,
# 'token': 11426},
# {'sequence': '<s> Тарас Шевченко – великий український письменник.</s>',
# 'score': 0.23330871760845184,
# 'token': 10121},
# {'sequence': '<s> Тарас Шевченко – великий український художник.</s>',
# 'score': 0.06583040952682495,
# 'token': 12836},
# {'sequence': '<s> Тарас Шевченко – великий український князь.</s>',
# 'score': 0.021497823297977448,
# 'token': 17247},
# {'sequence': '<s> Тарас Шевченко – великий український народ.</s>',
# 'score': 0.020411811769008636,
# 'token': 1021}]

Moreover, you can test it directly on our HuggingFace model page.


We hope that our experience will help other NLP researchers and practitioners build expertise in Ukrainian-language NLP. If you have any questions or need any help, feel free to contact us.
