Defended theses

Theses defended in the current programme

Hizkuntza-ulermenari ekarpenak: N-gramen arteko atentzio eta lerrokatzeak antzekotasun eta inferentzia interpretagarrirako.

Supervisors:: AGIRRE BENGOA, ENEKO;; MARITXALAR ANGLADA, MONTSERRAT
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2018
Abstract:: Natural Language Processing makes it possible to improve intelligent systems in education, considerably lightening the workload of students and teachers. In this thesis we study sentence-level language understanding and, through new proposals, increase the language understanding of intelligent systems, giving them the ability to interpret the user's sentences more precisely. The ability to interpret sentences in a fine-grained way is what makes it possible to generate feedback automatically. To develop the thesis, we delved into language understanding by analysing the features and systems related to semantic similarity and logical inference. In particular, we show that sentences can be modelled better by structuring the words within a sentence into groups and aligning them. To that end, we implemented a state-of-the-art neural network system that aligns single words and adapted it to align arbitrary n-grams. Although word alignment has long been known, this thesis presents, for the first time, proposals for aligning arbitrary n-grams by means of an attention mechanism. In addition, we created a new layer, iSTS, to identify similarities and differences between sentences precisely, to increase the interpretability of sentences, and to give students precise feedback. With this layer, which combines semantic similarity and logical inference, we aligned chunks and demonstrated, in two evaluation scenarios in an educational context, that we are able to give students precise feedback. Several systems and datasets have been released along with this thesis so that the scientific community can continue this research in the future.

Sentimenduen analisi automatikorantz: oinarrizko baliabideen sorkuntza eta hizkuntza maila ezberdinetako balentzia-aldatzaileen identifikazioa/Towards the automatic analysis of sentiments in Basque: the creation of basic resources and the identification of valence shifters in different language levels.

Supervisors:: GOJENOLA GALLETEBEITIA, KOLDOBIKA;; IRUSKIETA QUINTIAN, MIKEL
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2019
Abstract:: This thesis takes the first steps in sentiment analysis for Basque from the perspective of applied linguistics. The thesis project had two main goals. First, we created basic resources for sentiment analysis in Basque. Specifically, we developed the Basque Opinion Corpus (Euskarazko Iritzi Corpusa), a Basque sentiment lexicon called Sentitegi, and a document-level sentiment classifier. The corpus contains 240 opinion texts from six domains. Its discourse information is annotated following the RST approach, and the semantic orientation of the opinion texts is annotated as well. The sentiment lexicon comprises 1,237 words, and its entries carry a sentiment valence between -5 and +5. We followed a specific translation methodology to build the lexicon. The document-level sentiment classifier is based on this lexicon and also includes a number of additional rules. Second, we carried out a theoretical study of sentiment analysis. For lexicon-based sentiment classification, knowing the sentiment valence of words is not enough, because texts contain phenomena that affect the sentiment valence of those words. These are called contextual valence shifters, and we studied how they appear in Basque. We studied one type of valence shifter at each level of grammar: in phonology, expressive palatalization; in morphology, morphemes; in syntax, negation markers; and in discourse, discourse relations and the central unit. The results show that valence shifters strengthen or weaken the sentiment valence of words or phrases. Depending on the intensity of the weakening, the sign of the sentiment valence may change, turning positive into negative or vice versa. Finally, in some cases the valence shifter has no effect.

Application of singing synthesis techniques to bertsolaritza

Supervisor:: NAVAS CORDON, EVA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2020
Abstract:: This thesis focuses on the development of a new bertsolaritza singing voice synthesis system based on original recordings of live bertsolaritza sessions. The challenge of this work is not only the implementation of a singing voice synthesis system. The recorded corpus of bertsolaritza contains the transcriptions of improvised verses, but the audio files contain multiple elements that are not singing voice. As most of the recordings are live sessions, the voice of a speaker, audience applause and noise are part of the database. In addition, the musical labeling of the singing voice is not included in the database. With a database of these properties, the aim of this work is to create methods to clean, segment and label the audio in the bertsolaritza database and to analyze the possibility of using it to create models for bertsolaritza singing voice synthesis. We have developed methods to automatically obtain the singing voice segments in the recordings, creating new speech and singing voice classification algorithms. The segmentation of bertso utterances and phonemes has been performed in a multi-singer database. The proposed segmentation algorithms are able to align material from unseen bertsolaris in the future. After that, we analyzed the musical properties of the bertsolaritza art and compared the theoretical melodies in the database with their actual interpretation. We defined automatic systems to musically label the bertsolaritza singing voice, generating a fully labeled bertsolaritza database. Musical labeling included vibrato, and we analyzed how each bertsolari uses it. We evaluated all automatic labeling systems in the process. After creating a labeled database of bertso recordings, we generated singing voice synthesis systems using HMMs and DNNs. We included F0 normalization, tempo adaptation and vibrato prediction techniques in these systems.
We defined methods to automatically adapt music scores to the pitch range of each bertsolari. We evaluated synthesis models created for different bertsolaris both subjectively and objectively, obtaining good results. The contributions of this thesis are related to bertsolaritza and singing voice synthesis. We added new information levels to the bertsolaritza corpus with the segmentation of singing voice, the alignment of utterances and phonemes and the subsequent musical labeling. These labeling methods need no manual supervision, so the tools we created can be used to enlarge the labeled database in the future. We created a multi-singer singing voice database that is considerably bigger than any state-of-the-art singing voice database. Finally, we defined systems to synthesize bertsolaritza singing voice using different singers and technologies, obtaining positive results.

Hitzen arteko antzekotasuna: ezagutza-baseetan oinarritutako tekniken ekarpenak

Supervisors:: AGIRRE BENGOA, ENEKO;; SOROA ECHAVE, AITOR
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2018
Abstract:: Semantic word representations produced by computational models are key to many natural language processing tasks, and word similarity is used to evaluate the quality of those representations. The similarity task belongs to natural language processing, within lexical semantics, and has the following steps: first, the similarity between words is computed from their representations; then, that similarity is compared with human similarity judgements. The closer the results of the computational model are to human judgements, the better the quality of the word representations. In this work we also address the more general case of similarity, namely relatedness. Word representation methods are either based on text corpora or based on knowledge bases. The first family includes many models, but in this work we used those based on neural networks. These methods infer word meanings from word-context co-occurrences in text and encode them in a dense vector space. From the second family, we used methods that treat knowledge bases as graphs, exploiting their structural information in full. On the one hand, dense representations extracted from text corpora have been very successful in many tasks, but similarity and relatedness relations are mixed together in them. On the other hand, computing representations from knowledge bases is computationally expensive, but similarity and relatedness relations are explicit in knowledge bases. The aim of this thesis is to improve results on the similarity task by means of techniques for obtaining better semantic word representations. Our main hypothesis is that the information in text corpora and in knowledge bases is different and complementary.
In our view, combining these two sources will improve similarity results between word representations and therefore yield better representations. We also extended this hypothesis to cross-lingual relations, exploring cross-lingual similarity and relatedness as well. Indeed, the two resources capture different nuances of similarity and relatedness and, when combined, model them better. In this thesis we confirmed these hypotheses, and our three contributions are: (1) encoding the structural information of knowledge bases in a corpus by means of a random-walk method and computing semantic word representations from that corpus; (2) proposing several methods and hybrid representations for combining semantic information from text and from knowledge bases; (3) applying all of the above to cross-lingual relations. We evaluated every method and combination on the similarity task, comparing the results with those of equivalent state-of-the-art methods. Our proposals match or surpass the state of the art on the similarity task, and we conclude that our hypotheses hold.

Medidas de distancia entre lenguas basadas en corpus/Medidas de distância entre línguas baseadas em corpus.

Supervisors:: ALEGRIA LOINAZ, IÑAKI;; GAMALLO OTERO, PABLO
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2020
Abstract:: The aim of this thesis is to propose and validate a corpus-based methodology for automatically quantifying the distance between languages and language varieties. To that end, we started from techniques used and tested in language identification, looking for those that are most robust and can quantify how close a text is to a language model. As a secondary objective, we also investigated the role of orthography as a factor of divergence and convergence between languages. The chosen method is unsupervised and can be applied to compute the distance between languages, between historical periods of a language, or between language varieties.

Adverse drug reaction extraction on electronic health records written in Spanish

Supervisors:: CASILLAS RUBIO, ARANTZA;; PEREZ RAMIREZ, ALICIA
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2019
Abstract:: This work focuses on the automatic extraction of Adverse Drug Reactions (ADRs) from Electronic Health Records (EHRs), that is, extracting a response to a medicine which is noxious and unintended and which occurs at doses normally used. From a Natural Language Processing (NLP) perspective, this was approached as a relation extraction task in which the drug is the causative agent of a disease, sign or symptom, that is, the adverse reaction. ADR extraction from EHRs involves major challenges. First, ADRs are rare events: relations between drugs and diseases found in an EHR are seldom ADRs (they are often unrelated or, instead, related as treatment). This implies inference from samples with a skewed class distribution. Second, EHRs are written by experts, often under time pressure, who combine rich medical jargon with colloquial expressions (not always grammatical), and it is not infrequent to find misspellings and both standard and non-standard abbreviations. All this leads to high lexical variability. We explored several ADR detection algorithms and representations to characterize the ADR candidates. In addition, we assessed the tolerance of the ADR detection model to external noise, such as the incorrect detection of the medical entities involved in ADR extraction, i.e. drugs and diseases. We took the first steps in ADR extraction in Spanish using a corpus of real EHRs.

Speech recognition based strategies for on-line Computer Assisted Language Learning (CALL) systems in Basque/Hizketa-ezagutzan oinarritutako estrategiak, euskarazko online OBHI (Ordenagailu Bidezko Hizkuntza Ikaskuntza) sistemetarako.

Supervisors:: HERNAEZ RIOJA, INMACULADA CONCEPCION;; NAVAS CORDON, EVA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2019
Abstract:: This thesis studies two implementations of automatic speech recognition for Basque in Computer Assisted Language Learning (CALL; OBHI in Basque) systems: Computer Assisted Pronunciation Training (OBEL, Ordenagailu Bidezko Ebakera Lanketa) and Spoken Grammar Practice (AGP, Ahozko Gramatika Praktika). In the classic OBEL system, the user is asked to read a sentence aloud and receives a score for each phoneme in return. For AGP, we propose the Word-by-Word Sentence Verification technique (HHEE, Hitzez Hitzeko Esaldi Egiaztapena), a system that verifies exercises as they are being solved. Both systems rely on utterance verification techniques, such as the Goodness of Pronunciation (GOP) score. Implementing these systems requires training acoustic models, for which we used the Basque Speecon-like database, the only publicly available database for Basque. To obtain good acoustic models, the database had to be adapted by creating a dictionary with alternatives, and staged training was also tried. This yielded a phone error rate (PER) of 12.21%. The first system was tested under laboratory conditions and achieved competitive results. However, the goal for the OBEL and AGP systems in this thesis is to deploy them in a client/server architecture so that students can access them from anywhere. To enable this, HTML5 specifications were used to send audio to the server as it is recorded, and a new online Cepstral Mean and Variance Normalisation (CMVN) technique was proposed to reduce the effect of channel differences in the audio signals recorded by users. That technique is based on a method introduced in this thesis, Multi Normalization Scoring (MNS), and a new online Voice Activity Detector (VAD) based on the same method was also proposed.
Finally, different parameters were evaluated using neural networks, and the GOP score was found to be the most effective.

Predicate Matrix: an interoperable lexical knowledge base for predicates

Supervisors:: LAPARRA MARTIN, EGOITZ;; RIGAU CLARAMUNT, GERMAN
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2023
Abstract:: The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, including FrameNet, VerbNet, PropBank and WordNet. The Predicate Matrix provides an extensive and robust lexicon that improves interoperability among these semantic resources. Its creation is based on the integration of Semlink and new mappings obtained with automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis across multiple languages.

Multilingual sentiment analysis in social media.

Supervisors:: AGERRI GASCON, RODRIGO;; RIGAU CLARAMUNT, GERMAN
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2019
Abstract:: This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing analysis only for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system covering Basque, English, French and Spanish. The thesis addresses the following challenges in building such a system: (1) analysing methods for creating sentiment lexicons suitable for less-resourced languages; (2) analysing social media, specifically Twitter, where tweets pose several challenges for understanding and extracting opinions, and where language identification and microtext normalization are addressed; (3) researching the state of the art in polarity classification and developing a supervised classifier that is tested against well-known social media benchmarks; (4) developing a social media monitor capable of analysing sentiment with respect to specific events, products or organizations.

Aldaera linguistikoen normalizazioa inferentzia fonologikoa eta morfologikoa erabiliz

Supervisors:: ALEGRIA LOINAZ, IÑAKI;; MARITXALAR ANGLADA, MONTSERRAT
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2016
Abstract:: This thesis belongs to the field of language analysis and processing and was carried out within the research line on non-standard texts; its main topic is the normalization of non-standard Basque texts. Compared with standard texts, non-standard texts have distinctive features at the lexical, morphological and phonological levels, and processing them is a new challenge. In general, such texts cannot be processed in the usual way, because most Natural Language Processing (NLP) tools were developed for texts written in standard language, and their performance drops sharply on non-standard texts. Interest in processing such texts has nevertheless grown considerably in recent years: digital libraries, digital humanities, computational sociolinguistics, opinion analysis, and so on. Once non-standard texts are normalized, NLP tools can be applied to them, so it is essential to carry out that process as effectively as possible. This thesis proposes machine learning methods for the normalization task on non-standard Basque texts. The results of these methods are also compared with those obtained in other studies, in order to assess the suitability of the methods. Spanish and Slovene corpora were used for that comparison, in collaboration with other researchers.

Euskal telebistaren sorrera, garapena eta funtzioa euskararen normalizazioaren testuinguruan

Supervisors:: ELORDUY URQUIZA, MIREN AGURTZANE;; ZABALA UNZALU, MIREN IGONE
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2019
Abstract:: This thesis analyses the history of Euskal Telebista (ETB) from 1982 to 2018 from the perspective of the normalization of Basque. Euskal Telebista was established in 1982, in a context in which Basque society had the chance to attain, to some degree, the self-government it longed for. The project was carried out by the Basque Government, using the powers set out in the Statute of Gernika. The new medium was to fulfil three functions in the service of the community: offering citizens a means of information and political participation; complementing the education system; and promoting and spreading the Basque language and Basque culture. Two factors have chiefly conditioned that task of promoting and spreading Basque. The first is the standardization of Basque. ETB adopted euskara batua (unified Basque) as its language model from the outset, but standardization was in its early stages and, moreover, had been conceived for literature. As a result, ETB had to face several challenges, including developing Basque for television communication, adapting unified Basque to spoken use, and training communicators to work in that new unified Basque. The second factor is the emergence of digital technology, which brought us a globalized world on the threshold of the 21st century. In that world, minority-language media face difficult challenges, such as competition from new platforms, channels and actors; new language attitudes among speakers and communicators; and new ways of consuming media. The core of this thesis is to analyse how ETB has approached and responded to all these challenges.

Aditza+izena unitate fraseologikoak gaztelaniatik euskarara: azterketa eta tratamendu konputazionala.

Supervisors:: ADURIZ AGIRRE, ITZIAR;; LABAKA INTXAUSPE, GORKA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2019
Abstract:: Phraseological Units (PUs) are idiomatic word combinations that are specific to each language. For Natural Language Processing (NLP) tools to produce high-quality results they must handle such units properly, but doing so poses several difficulties, among them the impossibility of word-for-word translation. In this thesis we carried out a linguistic analysis of verb+noun PUs to help address two important problems they raise in NLP: automatically identifying PUs in corpora, and automatically translating them between Spanish and Basque. We used the information drawn from the linguistic analysis for both tasks and obtained very good results in each. The thesis also makes two contributions to the creation of language resources. The first is Konbitzul, a database made available online that gathers the PUs studied, their equivalents and the linguistic information about them. The second is the annotation of Basque verbal PUs in a corpus following the guidelines created by the European PARSEME project; that corpus has also been made public, together with corpora in 19 other languages prepared under the same guidelines.

Euskarazko denbora-egituren azterketa eta corpusaren sorrera/Analysis of Basque temporal constructions and creation of a corpus.

Supervisors:: ARANZABE URRUZOLA, MARIA JESUS;; DIAZ DE ILARRAZA SANCHEZ, MARIA ARANZAZU
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2018
Abstract:: This research takes the first steps in the processing of temporal information in Basque. To do so, we drew on work done for other languages and on a linguistic analysis of Basque temporal constructions. Using that information, we identified the most significant linguistic features for the automatic treatment of Basque temporal constructions and created the EusTimeML mark-up language to encode them. We also created the EusTimeBank corpus, in which temporal information is manually annotated following EusTimeML. Besides using the corpus to study phenomena in Basque, we used it to develop and evaluate automatic tools. Specifically, we developed the EusHeidelTime tool for identifying and normalizing temporal expressions, and created the KroniXa system for automatically building timelines. We took steps to integrate these tools into Basque processing pipelines and to combine them with other tools, so that temporal information can also be used in the automatic understanding and processing of Basque.

Datuen Ustiapena Itzulpen Automatikorako

Supervisors:: AGIRRE BENGOA, ENEKO;; POZO ECHEZARRETA, MARIA ARANZAZU, DEL
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2022
Abstract:: Data-driven machine translation is the paradigm that has prevailed in recent years. These systems are fed with data in a training process. Their main advantage is that they automatically extract the knowledge needed to produce new translations, but their ability to generalize that knowledge is limited by the examples in the training corpus. The main goal of this thesis is to improve corpus quality along three dimensions: increasing corpus size, adapting corpus data to the domain, and filtering noisy datasets. To that end, it presents research along four lines. First, in document alignment, documents in two different languages are aligned. In a second step, sentence alignment, parallel sentences are identified in the document pairs from the previous step. To adapt the corpus to the domain, data selection is used to find additional in-domain data in larger out-of-domain corpora. Finally, parallel sentence filtering discards translations that are harmful for training.

Itzulpen automatiko gainbegiratu gabea

Supervisors:: AGIRRE BENGOA, ENEKO;; LABAKA INTXAUSPE, GORKA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2020
Abstract:: Modern machine translation relies on strong supervision in the form of parallel corpora. Such a requirement greatly departs from the way in which humans acquire language, and poses a major practical problem for low-resource language pairs. In this thesis, we develop a new paradigm that removes the dependency on parallel data altogether, relying on nothing but monolingual corpora to train unsupervised machine translation systems. For that purpose, our approach first aligns separately trained word representations in different languages based on their structural similarity, and uses them to initialize either a neural or a statistical machine translation system, which is further trained through iterative backtranslation. While previous attempts at learning machine translation systems from monolingual corpora had strong limitations, our work, along with other contemporaneous developments, is the first to report positive results in standard, large-scale settings, establishing the foundations of unsupervised machine translation and opening exciting opportunities for future research.

Oesophageal speech: enrichment and evaluations

Supervisors:: HERNAEZ RIOJA, INMACULADA CONCEPCION;; NAVAS CORDON, EVA
Distinctions:: International Thesis
Grade:: Outstanding
Year:: 2021
Abstract:: After a laryngectomy (i.e. removal of the larynx) a patient can no longer speak with a healthy laryngeal voice. Therefore, they need to adopt alternative methods of speaking such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces several undesired artefacts and an abnormal fundamental frequency. This makes oesophageal speech difficult to process compared to healthy speech, in terms of both auditory processing and signal processing. The aim of this thesis is to find solutions to make oesophageal speech signals easier to process, and to evaluate these solutions by exploring a wide range of evaluation metrics. First, some preliminary studies were performed to compare oesophageal speech and healthy speech. This revealed significantly lower intelligibility and higher listening effort for oesophageal speech compared to healthy speech. Intelligibility scores were comparable for familiar and non-familiar listeners of oesophageal speech. However, listeners familiar with oesophageal speech reported less effort compared to non-familiar listeners. In another experiment, oesophageal speech was reported to require more listening effort than healthy speech even though its intelligibility was comparable to healthy speech. On investigating neural correlates of listening effort (i.e. alpha power) using electroencephalography, a higher alpha power was observed for oesophageal speech compared to healthy speech, indicating higher listening effort. Additionally, participants with poorer cognitive abilities (i.e. working memory capacity) showed higher alpha power. Next, using several algorithms (preexisting as well as novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less effortful.
The novel approach consisted of a deep neural network based voice conversion system where the source was oesophageal speech and the target was synthetic speech matched in duration with the source oesophageal speech. This helped in eliminating the source-target alignment process, which is particularly prone to errors for disordered speech such as oesophageal speech. Both speaker dependent and speaker independent versions of this system were implemented. The outputs of the speaker dependent system had better short-term objective intelligibility scores, automatic speech recognition performance and listener preference scores compared to unprocessed oesophageal speech. The speaker independent system showed improvement in short-term objective intelligibility scores but not in automatic speech recognition performance. Some other signal transformations were also performed to enhance oesophageal speech. These included removal of undesired artefacts and methods to improve fundamental frequency. Out of these methods, only removal of undesired silences had some success (a 1.44 percentage point improvement in automatic speech recognition performance), and only for low intelligibility oesophageal speech. Lastly, the outputs of these transformations were evaluated and compared with previous systems using an ensemble of evaluation metrics such as short-term objective intelligibility, automatic speech recognition, subjective listening tests and neural measures obtained using electroencephalography. Results reveal that the proposed neural network based system outperformed previous systems in improving the objective intelligibility and automatic speech recognition performance of oesophageal speech. In the case of subjective evaluations, the results were mixed: some positive improvement in preference scores and no improvement in speech intelligibility and listening effort scores.
Overall, the results demonstrate several possibilities and new paths to enrich oesophageal speech using modern machine learning algorithms. The outcomes would be beneficial to the disordered speech community.

Contributions to Information Extraction for Spanish Written Biomedical Text

Supervisors:: CUADROS OLLER, MONTSERRAT;; RIGAU CLARAMUNT, GERMAN
Distinctions:: Cum Laude
Grade:: Outstanding Cum Laude
Year:: 2023
Abstract:: Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in improving healthcare experiences, supporting trainee education, or enabling biomedical research, for example. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification. Specifically, we study the different approaches and their transferability in two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data or external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems, and does not exhibit a considerable deviation from other approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue and scope detection, and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field.

Txosten klinikoak euskararen eta gazteleraren artean itzultzen laguntzeko corpusaren bilketa eta itzultzaile automatikoaren garapena / Corpus compilation and development of a machine translation system for translating clinical reports between Basque and Spanish

Supervisors:: LABAKA INTXAUSPE, GORKA;; ORONOZ ANCHORDOQUI, MAITE
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2021
Abstract:: This thesis describes the machine translation systems developed to help translate clinical reports between Basque and Spanish. With the aim of encouraging clinical reports to be written in Basque, the development of the Basque-to-Spanish system was prioritised. Our approach is data-driven, and to that end we compiled corpora that could be useful for translating clinical reports between Basque and Spanish. Because the clinical domain has a rich terminology, terminology was also taken into account when compiling the corpora. Several systems were developed over the course of the thesis, most of them Neural Machine Translation systems. In addition, Statistical Machine Translation and Rule-Based Machine Translation systems were also used for back-translation. Besides measuring the quality of the developed systems, we measured the lexical diversity of the corpora created through back-translation, and applied data selection when developing some of the systems. To place the proposed advances in an international context, the proposed methods were also tested for translating from German to English, and between English and Spanish.

Adimen Artifizialeko metodoak gizarte ikerkuntzarako: analisi demografikoa, jarreren detekzioa eta joera politikoen identifikazioa

Supervisors:: AGERRI GASCON, RODRIGO
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2024
Abstract:: This thesis examines the interplay between social research and artificial intelligence (AI), investigating how AI technologies can be used to propose innovative research methodologies for the social sciences. The research delves into the capabilities of AI, in particular the use of machine learning and natural language processing to analyse large datasets, identify patterns and predict attributes, tasks that would be especially difficult to carry out with traditional methods. To this end, methodologies have been developed to characterise users automatically from text and interaction data on social networks. The methodologies address data extraction and user representation, with the aim of producing predictions that generalise and are more accurate. To demonstrate the usefulness of these methods, case studies with practical applications were carried out on three tasks: demographic attribute identification, stance detection and political leaning inference.

Laburpen-gaitasunaren garapena eta eskolako laburpen-testuen prozesamendua

Supervisors:: DIAZ DE ILARRAZA SANCHEZ, MARIA ARANZAZU;; IRUSKIETA QUINTIAN, MIKEL
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2022
Abstract:: In this thesis we address the development of summarisation competence through the processing of school summary texts. We had two main goals: i) to analyse the state of summarisation competence, for which we established the theoretical foundations of summarisation and described summary texts; and ii) to put forward a proposal for working on and assessing summarisation at school using educational and language technologies. To achieve these goals, we used Natural Language Processing techniques (particularly discourse-based ones), approaching them from a didactic perspective. With Compress-eus, the tool we created to collect a corpus of Basque summaries, we compiled the LabEus corpus, made up of 1758 summaries written by primary-school and university students. The students produced both extractive and abstractive summaries. From the LabEus corpus we created the EskoLab corpus with 80 summaries and, to understand what happens during the process of writing a summary, we defined research questions and carried out annotation work. We then designed and created resources and procedures for evaluating summaries: i) an algorithm for building meta-summaries, ii) criteria and a rubric for writing and evaluating summaries, and iii) two versions of automatic feedback on the hierarchy of the summary, based on the HIMAM and GOM methods. Finally, using the resources created, we ran three workshops on summarisation in Basque and English. Two of them aimed to develop summarisation competence by internalising the criteria needed to write a summary. The third aimed to introduce different discourse-based techniques for working on summaries and to reflect on them.

Leveraging Feedback in Conversational Question Answering Systems

Supervisors:: AGIRRE BENGOA, ENEKO;; AZCUNE GALPARSORO, GORKA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2023
Abstract:: The goal of this thesis is to exploit the interaction that systems have with humans after deployment, using human feedback as a learning and adaptation signal for the systems. We focus on the domain shift that conversational systems undergo when they are deployed. To this end, we study the case of explicit binary feedback, since it is the easiest feedback signal for humans to provide. To improve systems after deployment, we first built DoQA, a dataset of question-answering conversations. The dataset contains 2,437 dialogues collected through crowdsourcing. Compared to previous work, DoQA reflects real information needs, and the conversations in the dataset are more natural and coherent. After creating the dataset, we designed an algorithm called feedback-weighted learning (FWL), which can improve a previously trained supervised system using binary feedback alone. Finally, we analyse the limits of this algorithm when the feedback received is noisy, and adapt FWL to cope with the noisy scenario. The negative results we obtain in this case show how challenging it is to model noisy feedback from users, which remains an open research question.

Generic semantics-based task-oriented dialogue system framework for human-machine interaction in industrial scenarios

Supervisors:: FERNANDEZ GONZALEZ, IZASKUN;; SOROA ECHAVE, AITOR
Distinctions:: Cum Laude; Industrial Thesis
Grade:: Outstanding Cum Laude
Year:: 2022
Abstract:: In Industry 5.0, workers and their well-being are central to the production process. In this context, task-oriented dialogue systems allow operators to delegate the simplest tasks to industrial systems while they work on more complex ones. Moreover, being able to interact naturally with these systems reduces the cognitive load of using them and fosters user acceptance. However, most existing solutions do not allow natural communication, and current techniques for building such systems need large amounts of training data, which are scarce in this kind of scenario. As a result, task-oriented dialogue systems in industrial settings are highly specific, which limits how easily they can be modified or reused in other scenarios, tasks that entail considerable time and cost. Given these challenges, this thesis combines Semantic Web Technologies with Natural Language Processing techniques to develop KIDE4I, a semantic task-oriented dialogue system for industrial environments that enables natural communication between humans and industrial systems. The modules of KIDE4I are designed to be generic so that they can be easily adapted to new use cases. The modular ontology TODO is the core of KIDE4I; it models the domain and the dialogue process and stores the generated traces. KIDE4I has been implemented and adapted for four industrial use cases, showing that the adaptation process is not complex and benefits from the use of resources.

Multilingual Information Extraction in Clinical Texts Using Deep Learning Approaches.

Supervisors:: CUADROS OLLER, MONTSERRAT;; RIGAU CLARAMUNT, GERMAN
Abstract:: Healthcare practice and biomedical research generate large volumes of digitized, unstructured data in multiple languages, which remain underutilized despite their potential to enhance healthcare delivery, support trainee education, and advance biomedical research. Transforming this data into structured, actionable information requires Natural Language Processing (NLP) techniques. Within NLP, this task is referred to as Information Extraction (IE). This thesis is part of the growing area of biomedical NLP and addresses key challenges in biomedical information extraction, focusing on entity recognition, entity linking and the interoperability of clinical terminologies. It makes three primary contributions: (i) the development of a method for clinical identifier mapping and data augmentation, (ii) the design and evaluation of biomedical entity linking systems with semantic textual similarity methods, and (iii) the exploration of generative approaches for biomedical entity linking. Throughout, state-of-the-art deep learning techniques are used. First, the thesis presents ClinIDMap, a prototype tool for clinical ID mapping which integrates multiple biomedical knowledge bases (e.g., ICD-10, SNOMED CT, UMLS) and connects them with general-purpose ontologies (Wikidata and WordNet). The tool facilitates corpus annotation and data augmentation. Experiments demonstrate that corpus annotations transferred between terminologies retain high model performance, underscoring the method's utility for overcoming data scarcity. Second, the thesis explores methods for biomedical entity linking (BioEL) in non-English languages, particularly Spanish. By leveraging semantic textual similarity methods and supervised ranking via cross-encoders, the entity-linking models achieve higher performance than symbolic methods. The proposed methods are validated through participation in shared tasks, where the systems achieved top rankings.
Third, the thesis studies generative models for biomedical entity linking, employing encoder-decoder and decoder-only architectures. These systems generate entity descriptions in knowledge bases (KBs), which makes linking them to the KBs a text-to-text problem. Experiments reveal that context incorporation and data augmentation improve the models' capacity to generalize. However, challenges remain in handling unseen data and stabilizing performance in zero-shot settings.

Extreme multi-label deep neural classification of Spanish health records according to the International Classification of Diseases

Supervisors:: CASILLAS RUBIO, ARANTZA;; PEREZ RAMIREZ, ALICIA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2022
Abstract:: This work deals with clinical text mining, a field of Natural Language Processing applied to the biomedical domain. The goal is to automate the task of medical coding. Electronic health records (EHRs) are documents containing clinical information about a patient's health. The diagnoses and medical procedures recorded in the electronic health record are coded according to the International Classification of Diseases (ICD). Indeed, the ICD is the basis for identifying international health statistics and the standard for reporting diseases and health conditions. From a machine learning perspective, the goal is to solve an extreme multi-label text classification problem, since each health record is assigned multiple ICD codes from a set of more than 70,000 diagnostic terms. A significant amount of resources is devoted to medical coding, a laborious task that is currently performed manually. EHRs are lengthy narratives, and medical coders review the records written by physicians and assign the corresponding ICD codes. The texts are technical, since physicians use specialised medical jargon, yet they are also rich in abbreviations, acronyms and spelling errors, because physicians document the records during actual clinical practice. To address the automatic classification of health records, we investigate and develop a set of deep learning text classification techniques.

Personalized Speech Synthesis Using Deep Learning.

Supervisors:: HERNAEZ RIOJA, INMACULADA CONCEPCION;; NAVAS CORDON, EVA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2024
Abstract:: This thesis explores personalized speech synthesis, focusing on speaker adaptation to recreate voices for individuals who have lost their ability to speak. Recent advances in Deep Learning have enabled the creation of high-quality synthetic voices that sound natural and personalized. By leveraging large voice banks like Aholab's and using models such as Tacotron 2 and FastSpeech 2, the research demonstrates effective voice adaptation, even with limited data. The work also tackles dialect adaptation, particularly in Austrian German, using dialect embeddings to capture phonetic and prosodic nuances. Practical applications include a Basque TTS API and the creation of a personalized voice for an ALS patient, showcasing the real-world impact of these techniques.

Contributions to Document-Level Neural Machine Translation

Supervisors:: ETCHEGOYEN, THIERRY;; LABAKA INTXAUSPE, GORKA
Abstract:: Neural machine translation (NMT) performs well at the sentence level but faces challenges on document-level phenomena, leading to inconsistencies between sentences. This thesis addresses these limitations by developing specific translation resources and methods, focusing on low-resource languages, Basque in particular. The research covers four areas: improving sentence-level models, creating corpora for document-level translation, designing context-aware modelling variants, and analysing their strengths and weaknesses. Primary contributions include the first Basque-Spanish and Basque-French dataset for context-aware translation, innovative data augmentation techniques, and novel modelling approaches. Additionally, this thesis provides a comprehensive analysis of contextual NMT, addressing factors such as length, complexity and syntactic functions of the context, and the impact of context on gender bias. The findings suggest that the developed datasets and methods significantly enhance the translation quality of intersentential linguistic phenomena.

Integrating Outside Knowledge and Spatial Reasoning in Vision-and-language Models

Supervisors:: AGIRRE BENGOA, ENEKO;; AZCUNE GALPARSORO, GORKA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2024
Abstract:: The fields of natural language processing (NLP) and computer vision (CV) have grown considerably in recent times. This momentum is due to the growth in computing power and in the amount of available data, as well as to an ever-growing research community. Progress has also been made at the intersection of NLP and CV, especially in tasks that require grounding between the text and vision modalities, such as visual question answering and text-conditioned image generation. This paves the way for more sophisticated systems and applications in many domains. Nevertheless, these systems still have weaknesses with no easy fix. The goal of this thesis is to study two weaknesses of current vision-language models (VLMs): the integration of world knowledge and spatial reasoning. The thesis can be divided into two main parts, one for each weakness we address. In the first part, we generate captions from images to make better use of the world knowledge implicitly encoded in language models. In the second, we focus on generating synthetic data from object annotations to support spatial reasoning both in language models and in text-to-image generators.

Cross-lingual Transfer for Low-Resource Natural Language Processing/Transferencia crosslingüe para el Procesamiento del Lenguaje Natural con pocos recursos.

Supervisors:: AGERRI GASCON, RODRIGO;; RIGAU CLARAMUNT, GERMAN
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2025
Abstract:: Natural Language Processing (NLP) has seen remarkable advances in recent years, particularly with the emergence of Large Language Models that have achieved unprecedented performance across many tasks. However, these developments have mainly benefited a small number of high-resource languages such as English. The majority of languages still face significant challenges due to the scarcity of training data and computational resources. To address this issue, this thesis focuses on cross-lingual transfer learning, a research area aimed at leveraging data and models from high-resource languages to improve NLP performance for low-resource languages. Specifically, we focus on Sequence Labeling tasks such as Named Entity Recognition, Opinion Target Extraction, and Argument Mining. The research is structured around three main objectives: (1) advancing data-based cross-lingual transfer learning methods through improved translation and annotation projection techniques, (2) developing enhanced model-based transfer learning approaches utilizing state-of-the-art multilingual models, and (3) applying these methods to real-world problems while creating open-source resources that facilitate future research in low-resource NLP. More specifically, this thesis presents a new method to improve data-based transfer with T-Projection, a state-of-the-art annotation projection method that leverages text-to-text multilingual models and machine translation systems. T-Projection outperforms previous annotation projection methods by a wide margin. For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings using text-to-text models. Finally, we develop Medical mT5, the first multilingual text-to-text medical model, demonstrating the practical impact of our research on real-world applications.

Ikasketa-adibide urriko Informazio-Erauzketa

Supervisors:: AGIRRE BENGOA, ENEKO;; LOPEZ DE LACALLE LECUONA, OIER
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2024
Abstract:: The field of Information Extraction (IE) studies how to teach a machine to identify and categorise the information that appears in text. This task, which is not easy even for humans, has been driven in recent years by advances in machine learning, with models trained on annotated data. However, annotating large corpora is laborious and expensive, especially in low-resource settings. The goal of this thesis is to study and develop IE methods for low-resource settings. Specifically, it exploits the generalisation capabilities of language models, above all their ability to transfer what has been learned from high-resource sources to low-resource settings. The thesis is divided into two main parts. In the first part, a zero-shot and few-shot system capable of performing information extraction was developed using encoder language models. In the second part, we move to larger language models and analyse some of the limitations of the previous method.

Improving Fidelity and Table Representation in Table Understanding and Table-to-Text Generation

Supervisors:: AGIRRE BENGOA, ENEKO
Distinctions:: International Thesis
Abstract:: Natural Language Processing (NLP) faces particular challenges in Table Understanding (TU), especially when it comes to generating accurate text from a table. This thesis presents novel techniques to improve the fidelity of Table-to-Text generation. It improves table interpretation and text generation by means of automatically generated Logical Forms (LF) and Vision Language Models (VLM). The research developed methods to improve the fidelity of table-to-text generation, achieving a 67% increase in fidelity over baseline models. PixT3, a new Table-to-Text model presented in this thesis, outperforms other current models. In addition, the largest multimodal dataset of its kind was created, with 2.5 million examples and 1.1 million original table images across 11 tasks. This work advances the research field of Table Understanding by proposing more reliable, more scalable and visually grounded methods for interpreting tabular data.

Towards general attribute controllability in NLP models.

Supervisors:: AGIRRE BENGOA, ENEKO;; ARTETXE ZURUTUZA, MIKEL
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2024
Abstract:: The goal of this thesis is to achieve attribute controllability in Natural Language Processing (NLP) systems. In the usual NLP paradigm, the behaviour of a system is determined solely by the training data and the learning objective, and beyond changing these there is no mechanism for controlling the attributes of the system's outputs. In this thesis we study and develop techniques for controlling the attributes of the outputs of NLP systems outside this paradigm. In the first part of the thesis, we develop unsupervised techniques for attribute controllability in three different types of system, each with a different application: i) control of word-embedding alignment, applied to the creation of bilingual word embeddings; ii) control of the information content of an encoder, applied to paraphrase generation; and iii) control of the metre and rhyme of a language model, applied to poetry generation. In the second part, we develop a general technique for adapting a language model, which makes it possible to control the behaviour of any language model at low computational cost.

EMG-Based silent speech interfaces. Insights into the Challenge of Predicting Speech from Articulatory Muscle Activity.

Supervisors:: HERNAEZ RIOJA, INMACULADA CONCEPCION;; NAVAS CORDON, EVA
Distinctions:: Cum Laude; International Thesis
Grade:: Outstanding Cum Laude
Year:: 2024
Abstract:: This thesis was carried out as part of the ReSSInt project, which aims to restore speech for Spanish-speaking people who have been deprived of the ability to speak. The main goal is to develop a silent speech interface (SSI) based on non-acoustic biosignals using state-of-the-art technology. It allows users to communicate without making any sound because a computer model interprets the mouth movements or brain signals related to the intended speech. The interface functions thanks to a large database of different types of information. This thesis is primarily focused on the data collection and research challenges associated with an SSI based on mouth movements while speaking. To capture information from mouth movements, a technique called electromyography (EMG) is used, which measures muscle activity. The main contribution of this thesis is the development of a Spanish EMG-speech database and its collection and validation procedure, as well as analyses of the effect of speaker variability, phone confusion, and speech mode. The results can be used to develop and improve the final EMG-based SSI for Spanish alaryngeal speakers.

Generic Framework for the Multidimensional Processing and Analysis of Social Media Content: A Proxemic Approach.

Supervisors:: AGERRI GASCON, RODRIGO;; SALLABERRY, CHRISTIAN
Distinctions:: Cum Laude; International Thesis; Cotutelle Thesis
Grade:: Outstanding Cum Laude
Year:: 2024
Abstract:: In recent decades, significant growth and diversification in sources of User-Generated Content (UGC) have been observed. Social media emerges as one of the primary sources of UGC, offering numerous advantages over traditional data sources, such as affordability, vastness, and diversity across various domains of application (for example, tourism, health, public policies). However, the highly unstructured nature of social media posts introduces several challenges. The language diversity and specificity of social media posts, characterized by features such as brevity, frequent grammatical errors, and the use of special characters, combined with the substantial volume and noisy nature of the data, make analyzing social media data a complex endeavour. This thesis introduces a novel multilingual framework, the APs Framework, designed to streamline the processing and analysis of social media data. This framework is generic in two aspects: it can be applied across various social media platforms and is adaptable to different application domains. The genericity of the application domain is supported by semantic representations of domain knowledge (for example, through thesauri or ontologies). The APs Framework aims to provide domain-independent insights from social media to non-computer scientists, such as stakeholders in various domains (for example, tourism offices in the tourism domain), thereby enhancing their analytical capabilities. The APs Framework is structured into four phases: Collect, Transform, Analyze, and Valorize. In the Collect phase, a generic and iterative methodology for constructing thematic datasets from social media is proposed. This approach seeks to mitigate the challenges of creating accurate and representative datasets amidst the voluminous and noisy nature of social media. The objective is to shift from ad hoc extraction techniques, prevalent in existing studies, to a more systematic, semi-automatic process.
This methodology incorporates human feedback at various stages and utilizes both content-based and metadata-based filtering techniques, alongside semantic domain descriptions, to offer a standardized and reusable method for thematic dataset building from social media. The methodology was evaluated both qualitatively and quantitatively through the development of an X/Twitter dataset focused on tourism in the Basque Country region. The Transform phase tackles the challenge of converting multilingual, unstructured text data into structured knowledge within a given application domain. It concentrates on three pivotal knowledge extraction tasks: (1) Sentiment Analysis, (2) Named Entity Recognition (NER) for Locations, and (3) Fine-grained Thematic Concept Extraction. Given the scarcity of multilingual training resources in the tourism domain, the process of manually generating a novel annotated training dataset for this domain is detailed. Subsequently, the thesis explores optimal strategies for the multilingual analysis of social media content in tourism, comparing rule-based and deep learning-based approaches (including fine-tuning and prompting-based few-shot learning with various language models). This exploration aims to identify the minimal number of annotated examples necessary for achieving competitive results across these tasks, leveraging various training techniques and language models. This phase addresses the challenge of minimizing manual annotation efforts without compromising the results' quality, considering the time-consuming and expensive nature of manual data annotation. In the Analyze phase, we hypothesize that adapting the theory of proxemics, traditionally applied in physical contexts, to social media could offer a novel approach to crafting meaningful, domain-adaptable indicators for various end-users. The theory is formally redefined, leading to the development of a modular and extensible proxemic data model.
This model is capable of representing social media entities and their interactions in a domain-independent manner. Leveraging this model, ProxMetrics, a toolkit and formula for generating adaptable indicators from social media, is introduced. These indicators, conceptualized as proxemic similarity measures, span multidimensional social media entities, including users, groups, places, themes, and temporal periods. They are highly customizable, allowing for the adjustment of the five proxemic dimensions (Distance, Identity, Location, Movement and Orientation) to address various domain requirements. The toolkit and models underwent qualitative evaluations in collaboration with a local tourism office to model and address various local touristic requirements. Finally, the Valorize phase addresses the challenge of presenting social media indicators and analyses to non-computer scientist users, such as domain stakeholders, in an accessible and domain-independent manner. To this end, TextBI, a multimodal generic dashboard, is proposed. This tool is designed to display multidimensional annotations and indicators over volumes of multilingual social media data, focusing on four core dimensions: spatial, temporal, thematic, and personal, while also accommodating additional enrichment data, such as sentiment and engagement. The dashboard offers various visualization modes, including frequency, movement, association, and proxemics, combining features from Business Intelligence (interactivity, combined filtering, synchronization of visuals), Geographical Information Systems (spatial view at multiple granularities), and Linguistic Information Visualization tools (text-based analyses). Unlike most existing dashboards, it is generic enough to operate across different domains, provided the data adheres to the specified data model.
The effectiveness of this dashboard was validated in the tourism domain through evaluations conducted by tourism offices, assessing its applicability and relevance. The framework's twofold genericity (application domain and data source) is demonstrated through the application of each phase in another domain of application: local public policies, leveraging data from municipality review platforms.

More information in ADDI (Archivo Digital Docencia Investigación)