Get to know the UPV/EHU Faculty of Informatics
The Faculty is the reference centre for training and technical/scientific knowledge in computer science and artificial intelligence.
First published: 2025/03/21
Author: Elena Zotova Romanova
Title: Multilingual Information Extraction in Clinical Texts Using Deep Learning Approaches
Supervisors: German Rigau / Montserrat Cuadros
Date: 24 March 2025
Time: 11:00
Venue: Gipuzkoako Ingeniaritza Eskola (Donostia)
Abstract:
"Healthcare practice and biomedical research generate large volumes of digitised, unstructured data in multiple languages, which remain underutilised despite their potential to enhance healthcare delivery, support trainee education, and advance biomedical research. Transforming this data into structured, actionable information requires Natural Language Processing (NLP) techniques. Within NLP, this task is called Information Extraction (IE). This thesis is part of the growth area of biomedical NLP. It addresses key challenges in biomedical information extraction, focusing on entity recognition, entity linking and the interoperability of clinical terminologies. It makes three primary contributions: (i) the development of a method for clinical identifiers mapping and annotated data augmentation, (ii) the design and evaluation of biomedical entity linking systems with semantic textual similarity methods, and (iii) the exploration of generative approaches for biomedical entity linking. Throughout, state-of-the-art deep learning techniques are used.
First, the thesis presents ClinIDMap, a prototype tool for clinical ID mapping which integrates multiple biomedical knowledge bases (e.g., ICD-10, SNOMED CT, UMLS) and connects them with general-purpose lexical resources (Wikidata and WordNet). The tool facilitates corpus annotation and data augmentation. Experiments demonstrate that corpus annotations transferred between terminologies retain high model performance, underscoring the method's utility for overcoming data scarcity. Second, the thesis explores methods for biomedical entity linking (BioEL) in non-English languages, particularly Spanish. By leveraging semantic textual similarity methods and supervised ranking via cross-encoders, the entity-linking models achieve higher performance compared to symbolic methods. The proposed methods are validated through participation in shared tasks, where the systems achieved top rankings. Third, the thesis studies the topic of generative models for biomedical entity linking, employing encoder-decoder and decoder-only architectures. These systems generate entity descriptions in knowledge bases, which makes linking them to the KBs a text-to-text problem. Experiments reveal that context incorporation and data augmentation improve models' capacity to generalise. However, challenges remain in handling unseen data and stabilising performance in zero-shot settings."