The Twitter of Babel: mapping world languages through microblogging platforms

PLoS One. 2013 Apr 18;8(4):e61981. doi: 10.1371/journal.pone.0061981. Print 2013.

Abstract

Large-scale analysis and statistics of socio-technical systems that just a few short years ago would have required considerable economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data "proxies" of social life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that the available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allow us to investigate different indicators such as the linguistic homogeneity of different countries, the seasonal patterns of tourism within countries and the geographical distribution of different languages in multilingual regions. This work highlights the potential of geolocalized studies of open data sources to improve current analyses and develop indicators for major social phenomena in specific communities.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Blogging*
  • Data Mining / methods*
  • Humans
  • Internet
  • Language*
  • Linguistics / methods*
  • Social Media*

Grant support

The authors acknowledge support from the National Science Foundation ICES award CCF-1101743. For the analysis of data outside of the United States of America, the authors acknowledge the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC00285. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the United States Government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.