How Search Engine Data Enhance the Understanding of Determinants of Suicide in India and Inform Prevention: Observational Study

J Med Internet Res. 2019 Jan 4;21(1):e10179. doi: 10.2196/10179.

Abstract

Background: India is home to 20% of the world's suicide deaths. Although statistics regarding suicide in India are distressingly high, data and cultural issues likely contribute to a widespread underreporting of the problem. Social stigma and only recent decriminalization of suicide are among the factors hampering official agencies' collection and reporting of suicide rates.

Objective: As the product of a data collaborative, this paper leverages private-sector search engine data to gain a fuller, more accurate picture of the suicide issue among young people in India. By combining official statistics on suicide with data generated through search queries, this paper seeks to: (1) add a layer of information that more accurately represents the magnitude of the problem, (2) determine whether search query data can serve as an effective proxy for factors contributing to suicide that are not represented in traditional datasets, and (3) consider how data collaboratives built on search query data could inform future suicide prevention efforts in India and beyond.

Methods: We combined official statistics on demographic information with data generated through search queries from Bing to gain insight into suicide rates per state in India as reported by the National Crime Records Bureau of India. We extracted English-language queries on "suicide," "depression," "hanging," "pesticide," and "poison." We also collected state-level demographic data for India, including urbanization, growth rate, sex ratio, internet penetration, and population. We modeled the suicide rate per state as a linear function of the queries on each of the 5 topics, treated as independent variables. A second model was built by adding the demographic information as further independent variables.
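
As a rough illustration of the two models described in the Methods, the following minimal sketch fits an ordinary least-squares model of a per-state suicide rate on query fractions, and then on query fractions plus demographic variables. Everything in it is an assumption made for illustration: the function name `fit_linear_model`, the use of plain least squares with an intercept, and the randomly generated placeholder arrays, which stand in for the real Bing and state-level data and carry no empirical meaning.

```python
import numpy as np


def fit_linear_model(features, target):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    # Prepend a column of ones so the model includes an intercept term.
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return coef, 1.0 - ss_res / ss_tot


# Synthetic placeholder data (NOT the study's data): 30 hypothetical states.
rng = np.random.default_rng(0)
n_states = 30
query_fractions = rng.random((n_states, 5))  # one column per query topic
demographics = rng.random((n_states, 5))     # urbanization, growth rate, etc.
suicide_rate = rng.random(n_states) * 20.0   # placeholder rates per state

# First model: query fractions only.
_, r2_queries = fit_linear_model(query_fractions, suicide_rate)

# Second model: query fractions plus demographic variables.
combined = np.column_stack([query_fractions, demographics])
_, r2_combined = fit_linear_model(combined, suicide_rate)

print(f"R^2, queries only:           {r2_queries:.3f}")
print(f"R^2, queries + demographics: {r2_combined:.3f}")
```

Because the second model nests the first, its in-sample R² can never be lower, so a comparison between feature sets is more informative when each set is fitted on its own, as the Results do for query versus demographic data.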

Results: The first model, which predicts suicide rates from the fraction of queries in each of the 5 topics, as well as from the fraction of queries on all suicide methods, yields a fit (R²) of about 0.5. The fit increases significantly with the removal of 3 outliers and improves slightly further when 5 outliers are removed. For the second model, which uses both query and demographic data, demographic data model suicide rates better than query data in all categories if no outliers are removed. However, when 3 outliers are removed, query data about pesticides or poisons improve the model over using demographic data.
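
The abstract does not state how outliers were identified, so the sketch below shows only one plausible approach, assumed here for illustration: repeatedly refit the linear model and drop the state with the largest absolute residual. The helper names `r_squared` and `drop_outliers` are hypothetical, and the data are a deterministic synthetic example with three artificially shifted points, not the study's data.

```python
import numpy as np


def r_squared(features, target):
    """Fit least squares with an intercept; return (R^2, residuals)."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - float(resid @ resid) / float(ss_tot), resid


def drop_outliers(features, target, n_remove):
    """Greedy rejection: refit, drop the largest |residual|, repeat."""
    keep = np.arange(len(target))
    for _ in range(n_remove):
        _, resid = r_squared(features[keep], target[keep])
        worst = int(np.argmax(np.abs(resid)))  # position within `keep`
        keep = np.delete(keep, worst)
    return keep


# Synthetic example (NOT the study's data): a clean linear trend with a
# small deterministic wiggle, plus three artificially shifted "states".
x = np.linspace(0.0, 1.0, 30)
y = 5.0 + 4.0 * x + 0.1 * np.sin(np.arange(30))
y[[10, 15, 20]] += 15.0

r2_before, _ = r_squared(x, y)
kept = drop_outliers(x, y, n_remove=3)
r2_after, _ = r_squared(x[kept], y[kept])

print("removed indices:", sorted(set(range(30)) - set(kept.tolist())))
print("R^2 improved after removal:", r2_after > r2_before)
```

The indices removed by such a procedure are themselves informative: they mark the units whose rates relate to the predictors differently from the rest, which is the role the Conclusions assign to outlier rejection.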

Conclusions: In this work, we used search data and demographics to model suicide rates. Used this way, search data serve as a proxy for unmeasured (hidden) factors corresponding to suicide rates. Moreover, our procedure for outlier rejection singles out states where the suicide rates have substantially different correlations with demographic factors and query rates.

Keywords: India; internet data; mobile phone; suicide.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adolescent
  • Adult
  • Data Collection
  • Humans
  • India
  • Search Engine / statistics & numerical data*
  • Suicide Prevention*
  • Young Adult