LilowLa – Resources

Corpora

Corpora for different languages of the Iberian Peninsula

A collection of parallel and/or monolingual corpora from different languages of the Iberian Peninsula. The languages included are:

Corpora for Valencian (dialect of Catalan)

Valencian corpora crawled from several Valencian government-owned websites, the Diari Oficial de la Generalitat Valenciana and the Boletín Oficial de la UMH. The repository is divided into directories according to the origin of each corpus, and each directory indicates whether its corpus is parallel or monolingual.

Corpora for some Mayan languages

This repository contains a number of parallel corpora between several Mayan languages and Spanish.

Software

Tune ‘n’ distill

This repository contains a pipeline to tune the pre-trained mBART50 NMT model for low-resource language pairs, and then distill the resulting system into lightweight, more sustainable models. The pipeline allows training lightweight models for translation between English and a specific low-resource language, even if mBART50 was not pre-trained on that language.

tan-maya

Scripts and code for training bilingual and multilingual NMT models of Mayan languages.

MaTiLDA

This repository contains the code needed to run multi-task learning data augmentation (MaTiLDA), a method for data augmentation in neural machine translation.

jw_crawler

A web crawler using Selenium and Firefox that retrieves parallel text from the official Jehovah’s Witnesses website, jw.org, using the sitemap file to visit all available URLs in the specified languages.
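
The sitemap-driven step described above can be illustrated with a minimal sketch. This is not the code from jw_crawler: the function name `urls_for_languages`, the sample sitemap, and the `example.org` URLs are hypothetical, and the sketch assumes that each URL carries its language code as the first path segment. It only shows how a standard sitemap file could be filtered by language before the pages are handed to a browser-based crawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_for_languages(sitemap_xml, languages):
    """Group sitemap URLs by language code.

    Assumption (hypothetical): the language code is the first
    path segment of each URL, e.g. https://example.org/en/...
    """
    root = ET.fromstring(sitemap_xml)
    grouped = {lang: [] for lang in sorted(languages)}
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        segments = [s for s in urlparse(url).path.split("/") if s]
        if segments and segments[0] in grouped:
            grouped[segments[0]].append(url)
    return grouped


# Hypothetical sitemap used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/en/library/</loc></url>
  <url><loc>https://example.org/es/biblioteca/</loc></url>
  <url><loc>https://example.org/fr/bibliotheque/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for lang, urls in urls_for_languages(SAMPLE, {"en", "es"}).items():
        print(lang, urls)
```

In a real crawl, the sitemap would be downloaded from the target site and each selected URL would then be opened in the browser to extract its text; both steps are left out here.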

Idiomata Cognitor

A multilingual language classifier for several Romance languages. The repository documents the languages the classifier can recognize and the training details, and provides scripts for using the classifier or training a new one.

Translation models

Coming soon.