First published: 25/02/2022

Andoni Azpeitia Zaldua: "Datuen Ustiapena Itzulpen Automatikorako".

Supervisors: Eneko Aguirre Bengoa / María Aranzazu del Pozo Echezarreta.

2022-03-04, 11:00, Ada Lovelace room.


Data-driven machine translation, first in the form of statistical machine translation (SMT) and later of neural machine translation (NMT), has become the dominant approach in recent years. These types of systems are trained on parallel corpora (collections of the same text written in two different languages). The main advantage of data-driven machine translation is its ability to extract knowledge from data automatically; by the same token, its capacity to generalise is limited by the examples observed in the training corpus.

The main goal of this thesis is to improve the quality of parallel corpora by working on three aspects: increasing corpus size, adapting corpora to the target domain and filtering noisy data. To this end, research in four fields is presented: document alignment, sentence alignment, data selection and parallel sentence filtering. Because all the work was carried out in the context of real projects, the portability of the methods explored has been an objective throughout the thesis, in addition to quality improvement.

In the document alignment research line, a novel document similarity metric has been proposed. In addition to being effective, this metric requires no model training and is language-independent. For sentence alignment, the similarity method developed for document alignment has been adapted to take into account that sentences contain less information than documents. For data selection, relative term frequencies have been explored to select valuable bitexts from larger corpora, also achieving high portability and competitive results. Finally, parallel sentence filtering has been treated as a particular case of sentence alignment, exploiting the similarity between sentences to filter out harmful data.
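
As a rough illustration of how relative term frequencies could be used for data selection, the Python sketch below ranks the sentence pairs of a general pool by how much more frequent their source-side words are in a small in-domain sample than in the pool itself. It is not the method developed in the thesis, whose details are not given here; the function names, the whitespace tokenisation, the log-ratio score and the smoothing floor are all assumptions made for the example.

```python
import math
from collections import Counter


def relative_frequencies(sentences):
    """Relative frequency of each lower-cased, whitespace-separated token."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def domain_score(sentence, in_domain_rf, general_rf, floor=1e-6):
    """Average log-ratio of in-domain to general relative frequency.

    `floor` is an assumed smoothing value for tokens unseen in a corpus.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return float("-inf")
    total = sum(
        math.log(in_domain_rf.get(tok, floor) / general_rf.get(tok, floor))
        for tok in tokens
    )
    return total / len(tokens)


def select_in_domain(pairs, in_domain_sentences, top_k):
    """Keep the top_k (source, target) pairs whose source side looks most in-domain."""
    in_domain_rf = relative_frequencies(in_domain_sentences)
    general_rf = relative_frequencies(src for src, _ in pairs)
    ranked = sorted(
        pairs,
        key=lambda pair: domain_score(pair[0], in_domain_rf, general_rf),
        reverse=True,
    )
    return ranked[:top_k]
```

A scheme of this kind needs only word counts, so it requires no trained model, which is in line with the portability goal stated above. In practice one would probably score both sides of each pair and use a proper tokeniser.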

To test the usefulness of the proposed methods, a wide variety of evaluations have been carried out against other state-of-the-art systems using freely licensed corpora, improving on the state of the art in many cases. For sentence alignment, the best results were obtained in an international shared task for two consecutive years. Finally, a dataset of almost 600,000 parallel sentences in Basque and Spanish in the news domain has been created using the developed sentence alignment methods and shared with the community.
