Real-Time Forecasting of the COVID-19 Outbreak in Chinese Provinces: Machine Learning Approach Using Novel Digital Data and Estimates From Mechanistic Models

J Med Internet Res. 2020 Aug 17;22(8):e20285. doi: 10.2196/20285.


Background: The inherent difficulty of identifying and monitoring emerging outbreaks caused by novel pathogens can lead to their rapid spread; if left unchecked, such outbreaks may become major global public health threats. The ongoing coronavirus disease (COVID-19) outbreak, which has infected over 2,300,000 individuals and caused over 150,000 deaths, is one such catastrophic event.

Objective: We present a timely and novel methodology that combines disease estimates from mechanistic models and digital traces, via interpretable machine learning methodologies, to reliably forecast COVID-19 activity in Chinese provinces in real time.

Methods: Our method uses the following as inputs: (a) official health reports, (b) COVID-19-related internet search activity, (c) news media activity, and (d) daily forecasts of COVID-19 activity from a metapopulation mechanistic model. Our machine learning methodology uses a clustering technique that enables the exploitation of geospatial synchronicities of COVID-19 activity across Chinese provinces and a data augmentation technique to deal with the small number of historical disease observations characteristic of emerging outbreaks.
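The abstract does not specify which clustering or augmentation techniques were used, so the following is only a minimal sketch of the general idea under stated assumptions. It groups a target province with the provinces whose case curves are most correlated with it (one possible reading of "geospatial synchronicities") and then enlarges the small training set by resampling rows with added noise. The function names, the correlation-based grouping, the noise model, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no synchronicity signal
    return cov / math.sqrt(var_x * var_y)


def most_synchronous(series_by_province, target, k=2):
    """Return the k provinces whose series correlate most with `target`."""
    scores = {
        name: pearson(series_by_province[target], values)
        for name, values in series_by_province.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def augment(rows, n_copies=3, noise_scale=0.05, seed=0):
    """Bootstrap-resample training rows and add multiplicative Gaussian noise.

    Assumption: this is a generic augmentation scheme chosen for
    illustration, not the technique described in the paper.
    """
    rng = random.Random(seed)
    augmented = list(rows)
    for _ in range(n_copies * len(rows)):
        base = rng.choice(rows)
        augmented.append([v * (1 + rng.gauss(0, noise_scale)) for v in base])
    return augmented


# Toy daily new-case counts (made-up numbers, for illustration only).
cases = {
    "A": [1, 3, 8, 15, 22, 18, 10],
    "B": [2, 5, 12, 25, 35, 30, 16],
    "C": [20, 18, 15, 11, 8, 4, 2],
    "D": [0, 2, 6, 13, 20, 17, 9],
}
neighbors = most_synchronous(cases, "A", k=2)
train_rows = [cases[name] for name in ["A"] + neighbors]
print(neighbors, len(augment(train_rows)))
```

In this toy setup, pooling observations from synchronous provinces and perturbing them gives a downstream regression model more rows to fit than a single province's short history would.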

Results: Our model is able to produce stable and accurate forecasts 2 days ahead of the current time and outperforms a collection of baseline models in 27 out of 32 Chinese provinces.
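A per-province comparison of this kind can be tallied as in the hedged sketch below. The abstract does not name the baseline models or the error metric, so the persistence baseline, the use of root-mean-square error, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )


def persistence_forecast(series, horizon=2):
    """Naive baseline: the value `horizon` days ahead equals today's value."""
    return series[:-horizon]  # forecasts aligned with series[horizon:]


def count_wins(observed, model_forecasts, horizon=2):
    """Count provinces where the model's RMSE beats the persistence baseline."""
    wins = 0
    for province, series in observed.items():
        truth = series[horizon:]
        baseline_error = rmse(persistence_forecast(series, horizon), truth)
        model_error = rmse(model_forecasts[province], truth)
        wins += model_error < baseline_error
    return wins


# Toy numbers for illustration only; not data from the study.
observed = {"A": [10, 14, 19, 25, 30, 33], "B": [5, 5, 6, 6, 5, 5]}
model_forecasts = {"A": [18, 24, 31, 34], "B": [8, 3, 8, 2]}
print(count_wins(observed, model_forecasts), "of", len(observed))
```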

Conclusions: Our methodology could be easily extended to other geographies currently affected by COVID-19 to aid decision makers with monitoring and possibly prevention.

Keywords: COVID-19; coronavirus; digital data; digital epidemiology; emerging outbreak; forecasting; hybrid model; hybrid simulation; machine learning; machine learning in public health; mechanistic model; modeling; modeling disease outbreaks; precision public health; simulation.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • COVID-19
  • China / epidemiology
  • Coronavirus Infections / epidemiology*
  • Coronavirus Infections / transmission*
  • Data Analysis*
  • Disease Outbreaks
  • Forecasting / methods*
  • Humans
  • Internet
  • Machine Learning*
  • Mass Media
  • Models, Biological*
  • Models, Statistical
  • Pandemics
  • Pneumonia, Viral / epidemiology*
  • Pneumonia, Viral / transmission*
  • Public Health / methods