Resources

Software

Software developed by the group in recent years:

  • Smart segmentation: Morphological segmentation using Apertium resources. Useful as a pre-processing step before using BPE for training neural machine translation systems. Funded by the EU through the GoURMET project (grant agreement id 825299). Download it from GitHub.
  • LinguaCrawl: Crawler implemented in Python 3 that crawls a number of top-level domains and downloads any text documents in the languages specified by the user. Funded by the EU through the GoURMET project (grant agreement id 825299). Download it from GitHub.
  • Apertium: Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs. For more information, access the project’s web page, or download it from GitHub.
  • Bitextor: Bitextor is an application created to generate translation memories using multilingual websites as a corpus source. It downloads an entire website and applies a set of heuristics (based mainly on HTML tag structure and text block length) to find bitexts. Download it from GitHub.
  • TagAligner: Parallel text aligner designed to generate translation memories (TMX files) from two files tagged with any kind of XML-based tags. The application uses the tag structure and the text block length to perform the alignment. Download it from SourceForge.
  • DocTrans: Free/open-source software implementing a method based on SMT techniques to retrieve documents that are plausible translations of a given source text. The method provides the terms to use in a query to retrieve the translation of the source document provided as input. It relies on the free/open-source SMT system Moses and was last tested with revision 2281. Download it from Google Code.
  • bitext2tmx: bitext2tmx is a program to align and segment corresponding translated sentences, contained in two plain text files, and generate a translation memory (TMX format) from them for use in computer-aided/assisted translation (CAT) applications. Download it from SourceForge.
  • Orthoepikon: Orthoepikon is a set of open-source tools to turn XML pronunciation dictionaries and rule files into fast finite-state processors that make simple pronunciation annotations to plain, HTML or RTF texts so that they can be read aloud correctly by learners of a language. Download it from SourceForge. 
  • Authority: This open-source tool assists authority control in bibliographic catalogues when external features (such as the citations found in scientific articles) are not available for the disambiguation of creator names. This tool is based on similarity measures between the variants of author names combined with a parser which interprets the dates and periods associated with the creator. Download it from GitHub.
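
Several of the tools above (Bitextor, TagAligner) are described as aligning texts by tag structure and text block length. The following Python sketch illustrates that general idea only; it is not the code of either tool. The function names (tokenize, cost, align), the cost values and the gap penalty are all assumptions chosen for illustration. It turns each tagged document into a sequence of tags and text blocks, then aligns the two sequences with a standard edit-distance dynamic programme in which identical tags match for free and text blocks match more cheaply the closer their lengths are.

```python
import re


def tokenize(markup):
    """Split markup into ("tag", name) and ("text", content) items."""
    items = []
    for piece in re.split(r"(<[^>]+>)", markup):
        piece = piece.strip()
        if not piece:
            continue
        if piece.startswith("<"):
            parts = piece[1:-1].strip().split()
            items.append(("tag", parts[0].lower() if parts else ""))
        else:
            items.append(("text", piece))
    return items


def cost(a, b):
    """Substitution cost between two items (illustrative values)."""
    if a[0] != b[0]:
        return float("inf")  # never pair a tag with a text block
    if a[0] == "tag":
        return 0.0 if a[1] == b[1] else 1.0
    # Text blocks: 0 when lengths are equal, approaching 1 as they diverge.
    la, lb = len(a[1]), len(b[1])
    return 1.0 - min(la, lb) / max(la, lb)


def align(src, tgt, gap=1.0):
    """Align two item sequences; return the paired text blocks."""
    n, m = len(src), len(tgt)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap
    for j in range(1, m + 1):
        dist[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost(src[i - 1], tgt[j - 1]),
                dist[i - 1][j] + gap,  # item only in source
                dist[i][j - 1] + gap,  # item only in target
            )
    # Walk back through the table to recover the aligned text blocks.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + cost(src[i - 1], tgt[j - 1]):
            if src[i - 1][0] == "text":
                pairs.append((src[i - 1][1], tgt[j - 1][1]))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


english = tokenize("<p>Hello world</p><p>Second paragraph here</p>")
spanish = tokenize("<p>Hola mundo</p><p>Segundo párrafo aquí</p>")
for source_text, target_text in align(english, spanish):
    print(source_text, "|||", target_text)
```

The paired text blocks returned by align could then be written out as translation units, for example in a TMX file; that step is omitted here.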