Subject

Corpus Linguistics

General details of the subject

Mode: Face-to-face degree course
Language: English

Description and contextualization of the subject

This course is an introduction to corpus linguistics. We will start with a brief introduction to textual corpora, including linguistic annotation and representation schemas. We will then address the extraction of relevant information from corpora, such as collocations and keywords, using statistical and distributional techniques. Finally, we will learn the XML markup language. Throughout the module we will introduce several corpora in various languages (English, Spanish, Basque, etc.).
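As a rough illustration of the statistical techniques mentioned above, the following sketch ranks adjacent word pairs by pointwise mutual information (PMI), one common association measure for finding collocations. It is not taken from the course materials: the function name `pmi_bigrams`, the `min_count` threshold, and the toy text are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
    Pairs seen fewer than `min_count` times are skipped, because PMI
    is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest PMI first: the strongest candidate collocations
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy text invented for illustration; a real corpus would be far larger
text = ("new york is a big city . she moved to new york . "
        "the city is big . he likes new york")
ranked = pmi_bigrams(text.split())
for pair, score in ranked:
    print(pair, round(score, 2))
```

With such a small sample only one pair passes the frequency threshold. On a real corpus the threshold and the choice of association measure (PMI, log-likelihood, t-score) would need to be tuned to the data.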

Teaching staff

Name	Institution	Category	Doctor	Teaching profile	Area	E-mail
SOROA ECHAVE, AITOR	University of the Basque Country	Associate Professor (Profesorado Agregado)	Doctor	Bilingual	Computer Science and Artificial Intelligence	a.soroa@ehu.eus

Competencies

Name	Weight
Ability to use the existing large-scale linguistic resources for different languages	40.0 %
Skill in handling and adapting the most relevant symbolic methods for research in language technology	20.0 %
Ability to manage and design systems based on standard languages for marking up linguistic information (for example, XML and TEI)	40.0 %

Study types

Type	Face-to-face hours	Non face-to-face hours	Total hours
Lecture-based	10	15	25
Applied computer-based groups	20	30	50

Learning outcomes of the subject

In this course, students will learn the principles of corpus linguistics and linguistic annotation, including markup languages such as XML. By the end of the course, they will be able to extract relevant information from textual corpora using statistical analysis.

Syllabus

1. Introduction to Corpus Linguistics

2. Corpus characteristics and types

- Corpus examples

3. Corpus annotation

- Common annotation tags and levels of analysis

4. Linguistic representation

- The XML markup language

- Standards for linguistic representation (TEI, NAF, AWA)
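To give a flavour of XML-based linguistic representation, the sketch below parses a small annotated sentence and selects all nouns with an XPath expression, using Python's standard `xml.etree.ElementTree` module (which supports a limited XPath subset). The element and attribute names (`text`, `s`, `w`, `lemma`, `pos`) are invented for this example and do not follow the TEI, NAF, or AWA schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation format: one <w> element per token,
# with lemma and part-of-speech stored as attributes.
xml_text = """
<text lang="en">
  <s id="s1">
    <w lemma="the" pos="DET">The</w>
    <w lemma="corpus" pos="NOUN">corpora</w>
    <w lemma="contain" pos="VERB">contain</w>
    <w lemma="annotation" pos="NOUN">annotations</w>
  </s>
</text>
""".strip()

root = ET.fromstring(xml_text)

# XPath: every <w> descendant whose pos attribute equals 'NOUN'
nouns = [(w.text, w.get("lemma")) for w in root.findall(".//w[@pos='NOUN']")]
print(nouns)  # [('corpora', 'corpus'), ('annotations', 'annotation')]
```

Keeping the surface form as element text and the analysis as attributes is only one possible design. Stand-off formats store the annotation layers separately from the text and link them by identifiers.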

Laboratories on:

- Unix tools

- Word frequencies and Zipf's law

- Collocations

- Keyword extraction

- XML and XPath
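As an idea of what the word-frequency laboratory might involve, the following sketch builds a rank-frequency table from raw text. Zipf's law states that a word's frequency is roughly inversely proportional to its frequency rank, so in a large corpus the product of rank and frequency stays roughly constant across the most frequent words. The function name `rank_frequency`, the tokenisation rule, and the sample sentence are assumptions made for this example, not course material.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    # Simplistic tokenisation: runs of Unicode letters, lowercased.
    # This also keeps accented letters such as those in Spanish.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Toy sample invented for illustration
sample = "the cat sat on the mat and the dog sat on the rug"
table = rank_frequency(sample)

# Under Zipf's law, rank * frequency is roughly constant in large corpora
for rank, word, freq in table[:3]:
    print(rank, word, freq, rank * freq)
```

A sample this small cannot show the Zipfian pattern. Running the same function over a corpus of several thousand words or more, and plotting rank against frequency on logarithmic axes, should give an approximately straight line.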

Bibliography

Basic bibliography

Aarts, J. and Meijs, W. (eds.) (1986) Corpus Linguistics II. Amsterdam: Rodopi.

Aijmer, K. and Altenberg, B. (eds.) (1991) English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.

Anthony, L. (2013) "A critical look at software tools in corpus linguistics", Linguistic Research, Volume 30, Issue 2, pp. 141-161.

Baker, P. (2010) Sociolinguistics and Corpus Linguistics. Edinburgh University Press, Edinburgh.

Garside, R., Leech, G. and McEnery, T. (1997) Corpus Annotation. Longman, Harlow.

Jurafsky, D. and Martin, J.H. (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall.

Lawler, J. and Aristar Dry, H. (1998) Using Computers in Linguistics: A Practical Guide. Routledge.

Leech, G. and Fallon, R. (1992) "Computer Corpora - What Do They Tell Us About Culture?", ICAME Journal, 29-50.

McEnery, T. and Hardie, A. (2012) Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, Cambridge.

Text Encoding and Interchange, TEI P5 (2016) Chicago and Oxford: Text Encoding Initiative.

Master in Language Analysis and Processing