Corpora

PILAR: Corpora for different languages of the Iberian Peninsula

A collection of parallel and monolingual corpora for several languages of the Iberian Peninsula. The languages included are:

Balearic: Monolingual and parallel (with Spanish) corpus crawled from the Bolletí Oficial de les Illes Balears.
Aranese: Monolingual corpus extracted from the Antòni Nogués collection and Aranese-Catalan parallel corpus crawled from the Diari Oficial de la Generalitat de Catalunya.
Aragonese: Monolingual corpus crawled from different websites.

FLORES+: The PILAR corpus also contains the FLORES+ dev and devtest datasets used in the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain, organized by Universitat d’Alacant and Universitat Oberta de Catalunya. It covers the following languages: Aragonese, Aranese, Asturian and Valencian.

Corpora for Valencian (dialect of Catalan)

Valencian corpora crawled from several websites owned by the Valencian government, the Diari Oficial de la Generalitat Valenciana and the Boletín Oficial de la UMH. The repository is divided into directories according to the origin of the corpora. Each directory indicates whether the corpus is parallel or monolingual.

MayanV: Corpora for some Mayan languages

This repository contains a number of parallel corpora between several Mayan languages and Spanish.

Software

Tune ‘n’ distill

This repository contains a pipeline to tune the pre-trained mBART50 NMT model to low-resource language pairs, and then distill the resulting system to obtain lightweight and more sustainable models. The pipeline allows training lightweight models for translation between English and a specific low-resource language, even if mBART50 has not been pre-trained on that language.

tan-maya

Scripts and code for training bilingual and multilingual NMT models of Mayan languages.

MaTiLDA

This repository contains the code needed to run multi-task learning data augmentation (MaTiLDA), a method for data augmentation in neural machine translation.

jw_crawler

Web crawler based on Selenium and Firefox that retrieves parallel text from the official Jehovah’s Witnesses website, jw.org. It uses the sitemap file to visit all available URLs in the specified languages.
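
The sitemap-driven selection of URLs described above can be illustrated with a minimal sketch. It is not the code from the jw_crawler repository: it uses only the Python standard library rather than Selenium, and it assumes (hypothetically) that the language code is the first path segment of each URL. The function name urls_by_language, the example.org URLs and the language codes are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap fragment; these are not real jw.org URLs.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/es/biblioteca/</loc></url>
  <url><loc>https://example.org/yua/libreria/</loc></url>
  <url><loc>https://example.org/en/library/</loc></url>
</urlset>
"""


def urls_by_language(sitemap_xml, languages):
    """Group sitemap URLs by language code.

    Assumption: the language code is the first path segment of the URL.
    URLs in languages that were not requested are ignored.
    """
    root = ET.fromstring(sitemap_xml.strip())
    selected = {lang: [] for lang in languages}
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        segments = [s for s in urlparse(url).path.split("/") if s]
        if segments and segments[0] in selected:
            selected[segments[0]].append(url)
    return selected


if __name__ == "__main__":
    for lang, urls in urls_by_language(EXAMPLE_SITEMAP, ["es", "yua"]).items():
        print(lang, urls)
```

A real crawler would then load each selected URL in the browser and extract its text; that step is omitted here.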

Idiomata Cognitor

A multilingual language classifier for several Romance languages. The repository contains an explanation of the languages the classifier is able to recognize, the training details and scripts for using the classifier or training a new one.

URL2lang

Python tool that infers whether a URL links to a document in a certain language. Without accessing the content of the linked document, it returns either the most likely language or the probability that the URL links to a document in a given language.

Parallel URLs Classifier

Python tool that infers whether a pair of URLs links to parallel documents (i.e., documents with the same content but written in different languages) without accessing their content. It returns either a textual label (positive/negative) or the probability that the URL pair links to parallel documents.
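
To make the task concrete, here is a rule-based baseline for the same decision. It is not how Parallel URLs Classifier works and does not use its API: the tool outputs probabilities, whereas this sketch simply removes language-code tokens from both URLs and checks whether what remains is identical. The set LANG_CODES and the function names normalise and looks_parallel are hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical, deliberately small set of language codes to ignore.
LANG_CODES = {"en", "es", "ca", "gl", "fr"}


def normalise(url):
    """Split a URL into lowercase tokens and drop language-code tokens."""
    parsed = urlparse(url)
    text = parsed.netloc + parsed.path + "?" + parsed.query
    tokens = re.split(r"[/._\-?=&]+", text)
    return tuple(t.lower() for t in tokens if t and t.lower() not in LANG_CODES)


def looks_parallel(url_a, url_b):
    """Return True if two distinct URLs differ only in language-code tokens."""
    return url_a != url_b and normalise(url_a) == normalise(url_b)


if __name__ == "__main__":
    print(looks_parallel(
        "https://example.org/en/news/2024/report.html",
        "https://example.org/es/news/2024/report.html",
    ))
```

This baseline fails as soon as the path itself is translated (for example /en/news/ versus /es/noticias/), which is one reason a trained classifier is a better fit for the task than fixed rules.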

Translation models

Many-to-many translation models presented at the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. These models are capable of translating between several languages of the Iberian Peninsula (Spanish ↔ Asturian, Spanish ↔ Aragonese, Spanish ↔ Aranese, Spanish ↔ Galician, Spanish ↔ Catalan, Spanish ↔ Valencian, Catalan ↔ Aranese).
