Published on Mon Jan 25 2016

A Kernel Independence Test for Geographical Language Variation

Dong Nguyen, Jacob Eisenstein

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. We present a new method for measuring geographic language variation. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.

Abstract

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real datasets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a dataset of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.
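To make the idea of a kernel test statistic computed from pairs of individual geotagged observations more concrete, here is a minimal sketch. It assumes the Hilbert-Schmidt Independence Criterion (HSIC) as the RKHS-based statistic, a Gaussian kernel on coordinates, a delta kernel on a categorical linguistic variable, and a permutation test for significance. The abstract does not specify these choices, so the function names, kernels, bandwidth, and synthetic data below are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def rbf_gram(points, bandwidth):
    """Gaussian (RBF) Gram matrix over all pairs of rows in `points`."""
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def delta_gram(labels):
    """Delta kernel for categorical data: 1 if two labels match, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def hsic(K, L):
    """Biased empirical HSIC estimate: trace(K H L H) / n^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2


def hsic_permutation_test(K, L, n_permutations=200, seed=0):
    """Return (observed HSIC, permutation p-value).

    Shuffling the linguistic observations breaks any link to location,
    which gives a sample from the null distribution of the statistic.
    """
    rng = np.random.default_rng(seed)
    observed = hsic(K, L)
    n = K.shape[0]
    at_least_as_large = 0
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        if hsic(K, L[np.ix_(perm, perm)]) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic illustration (not real data): 100 points in the unit square,
# where a hypothetical binary variant is used only in the eastern half.
rng = np.random.default_rng(42)
coords = rng.uniform(0.0, 1.0, size=(100, 2))
variant = (coords[:, 0] > 0.5).astype(int)

K_geo = rbf_gram(coords, bandwidth=0.25)  # bandwidth chosen arbitrarily
L_ling = delta_gram(variant)
statistic, p_value = hsic_permutation_test(K_geo, L_ling)
print(f"HSIC = {statistic:.4f}, permutation p-value = {p_value:.4f}")
```

Because the statistic is built from Gram matrices over individual observations, no aggregation into geographical bins is needed, and swapping `delta_gram` for a kernel on frequencies would let the same test handle other data types.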

Wed Feb 22 2017
NLP
Dialectometric analysis of language variation in Twitter
Microblogging platforms such as Twitter have given rise to a deluge of textual data that can be used for the analysis of informal communication between millions of individuals. In this work, we propose an information-theoretic approach to geographic language variation using a corpus based on Twitter.
Thu Oct 22 2015
Machine Learning
Freshman or Fresher? Quantifying the Geographic Variation of Internet Language
We present a new computational technique to detect and analyze statistically significant geographic variation in language. Our method is the first such approach to explicitly account for random variation due to chance while detecting regional variation in word meaning.
Sat Apr 03 2021
NLP
Measuring Linguistic Diversity During COVID-19
Sun Jun 07 2015
NLP
Confounds and Consequences in Geotagged Twitter Data
Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile.
Sun Aug 01 2021
NLP
Geolocation differences of language use in urban areas
The explosion in the availability of natural language data has given rise to a host of applications such as sentiment analysis and opinion mining. Opportunities for tracking spatial variations in language use have largely been overlooked. Here we explore the use of Twitter data with precise geolocation information to resolve spatial variations.
Sat Jul 26 2014
Machine Learning
Crowdsourcing Dialect Characterization through Twitter
We perform a large-scale analysis of diatopic language variation using microblogging datasets. We find that the Spanish language is split into two superdialects. The latter can be further clustered into smaller regional varieties.