Statistical Measures in Corpus Linguistics: Frequency, Dispersion, Association, and Keyness

Instructors: Stefan Th. Gries

Corpus linguistics has long been connected to (i) cognitive/usage-based theory, (ii) both observational and experimental psycholinguistic work, and (iii) more applied areas. Since corpus linguistics is ultimately a distributional discipline, these connections often take the form of quantitative measures, of which frequencies of (co-)occurrence, dispersion, association, and keyness are among the most widely used. These notions are often employed to operationalize cognitive notions such as entrenchment, commonness, contingency, and aboutness, and dozens of specific statistical measures have been promoted in the literature.

In this course, we will first briefly revisit the most widely used corpus-linguistic measures and then discuss a new approach towards this cluster of notions and issues, one that tries to improve on the last few decades of work in three ways. Improvement 1 will be to **unify the statistical approaches** towards dispersion, association, and keyness by using only a single information-theoretic statistic for each of them. Improvement 2 will be to discuss the degree to which existing measures are so strongly correlated with frequency that they measure little else, and to discuss a way to **'remove frequency from existing measures' to arrive at cleaner, more valid measures**. Improvement 3 will be to realize that 40 years of looking for one measure to quantify X may have been mistaken and that we need to **measure and report multiple dimensions of information at the same time**.

The course will pursue these goals and exemplify them in small case studies using the programming language R on several corpora. Prior knowledge of R is not required to follow the conceptual logic but will help with the programming-related parts of the class.

Keywords: Communicative Efficiency, Computational Linguistics, Information Theory, Quantitative Methods, Statistics, Psycholinguistics, R, Usage-Based Linguistics, Corpus Linguistics

When/Where:
Mondays and Thursdays, July 7-July 21, 10:30am - 11:50am
Terms:
Term 1 (July 7 - 22)
Days:
Mondays and Thursdays

Instructors

Stefan Th. Gries

UC Santa Barbara & JLU Giessen

Stefan Th. Gries is Professor of Linguistics at UC Santa Barbara and Chair of English Linguistics (Corpus Linguistics with a focus on quantitative methods, 25%) in the Department of English at the JLU Giessen. He is a quantitative corpus linguist at the intersection of corpus linguistics, usage-based/cognitive linguistics, and a bit of computational and psycholinguistics. He has worked on topics such as blend formation, grammatical variation, the syntax-lexis interface, semantics (polysemy, antonymy, near synonymy, and legal interpretation), learner corpus and varieties research, corpus-linguistic methodology, and the development and application of statistical methods in linguistics.

