PLoS One, 11 (11), e0165736, eCollection

Discovering Periodic Patterns in Historical News

Fabon Dzogang et al. PLoS One.

Abstract

We address the problem of observing periodic changes in the behaviour of a large population, by analysing the daily contents of newspapers published in the United States and United Kingdom from 1836 to 1922. This is done by analysing the daily time series of the relative frequency of the 25K most frequent words for each country, resulting in the study of 50K time series for 31,755 days. Behaviours that are found to be strongly periodic include seasonal activities, such as hunting and harvesting. A strong connection with natural cycles is found, with a pronounced presence of fruits, vegetables, flowers and game. Periodicities dictated by religious or civil calendars are also detected and show a different wave-form than those provoked by weather. States that can be revealed include the presence of infectious disease, with clear annual peaks for fever, pneumonia and diarrhoea. Overall, 2% of the words are found to be strongly periodic, and the period most frequently found is 365 days. Comparisons between UK and US, and between modern and historical news, reveal how the fundamental cycles of life are shaped by the seasons, but also how this effect has been reduced in modern times.
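
As a rough illustration of the time series described above, the relative frequency of a word on a given day can be taken as its count divided by the total number of word tokens published that day. The sketch below assumes that definition and a hypothetical input format (a mapping from a date string to a list of tokens); the function name and the toy data are invented for illustration and do not reproduce the authors' pipeline.

```python
from collections import Counter


def relative_frequency_series(daily_tokens, word):
    """Return [(day, relative frequency of `word`)] sorted by day.

    `daily_tokens` is a hypothetical input: {day string: list of word tokens}.
    Days with no tokens are given a frequency of 0.0 (an assumption).
    """
    series = []
    for day in sorted(daily_tokens):
        tokens = daily_tokens[day]
        counts = Counter(tokens)
        freq = counts[word] / len(tokens) if tokens else 0.0
        series.append((day, freq))
    return series


# Toy data, invented for illustration only.
toy = {
    "1890-12-24": ["christmas", "london", "christmas", "market"],
    "1890-12-25": ["christmas", "london"],
    "1890-06-01": ["london", "harvest", "market", "london"],
}
christmas_series = relative_frequency_series(toy, "christmas")
```

Repeating this for each of the most frequent words would give one daily series per word, which is the kind of input a periodicity analysis needs.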

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. An example of two different word time series as extracted from historical newspapers in the United Kingdom.
(a) a time series of a strongly periodic word (Christmas), and (b) a time series which is not periodic (London). Both time series in this example are shortened to the 10 years between 1890 and 1900 for illustrative purposes.
Fig 2. Periodicities of words in (a) the UK corpus and (b) the US corpus, with at least 20% of their variance explained by a single Fourier component.
Each circle represents a single word. Its position around the figure indicates the peak time of usage, and its distance from the centre indicates the period, from 30 days (innermost) to four years (outermost). The size and colour of each circle indicate the variance explained, with larger, darker circles denoting words for which a single Fourier component explains more of the variance. This shows that the general methodology we developed for the detection of periodic words is capable of finding any periodicity between 30 days and 5 years, with a rigorous statistical test. Some words do indeed have a two-year cycle (US political elections) and others have a shorter cycle (fashion-related words). This makes the finding that most of the periodic words have a 12-month cycle all the more remarkable. Similarly, there seems to be a non-uniform distribution in the phases, with most words peaking in the summer months.
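
The quantities plotted in Fig 2 (period, peak time, and variance explained by a single Fourier component) can be sketched with a plain discrete Fourier transform. The code below is a minimal illustration under stated assumptions: the function name `dominant_component` is hypothetical, the 20% threshold is taken from the caption, and the statistical test mentioned in the caption is not reproduced. The toy series has 48 samples with a 12-sample cycle, whereas the paper works with daily series.

```python
import cmath
import math


def dominant_component(series):
    """Return (period, peak_offset, variance_explained) for the strongest
    non-constant Fourier component of an equally spaced real series.
    Returns None if the series has zero variance."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    total = sum(x * x for x in centred)
    if total == 0:
        return None
    best = None
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        # Parseval: frequency k and its mirror n-k carry equal energy for a
        # real series; the Nyquist bin (2k == n) has no mirror.
        weight = 1 if 2 * k == n else 2
        share = weight * abs(coeff) ** 2 / (n * total)
        if best is None or share > best[2]:
            period = n / k
            # The component is proportional to cos(2*pi*k*t/n + arg(coeff)),
            # so it peaks where that argument is zero.
            peak = (-cmath.phase(coeff) * period / (2 * math.pi)) % period
            best = (period, peak, share)
    return best


# Toy series: a pure cosine with period 12 that peaks at t = 3.
toy = [math.cos(2 * math.pi * (t - 3) / 12) for t in range(48)]
period, peak, share = dominant_component(toy)
strongly_periodic = share >= 0.20  # threshold quoted in the Fig 2 caption
```

For a daily series, `period` would be in days and `peak` would give the offset within the cycle at which the word is most frequent, which corresponds to the angular position in the figure.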
Fig 3. Strongly periodic words in the UK corpus with a seasonal component are shown categorised into one of 12 topical categories.
Each circle denotes a word in the corpus, with the labels shown around the outside. Label font size and circle size indicate the variance explained by the first component. Position around the figure indicates the phase of the period, corresponding to the day of the year where the word is most present. Words are grouped by colour indicating their category.
Fig 4. Strongly periodic words in the US corpus with a seasonal component are shown categorised into one of 12 topical categories.
Each circle denotes a word in the corpus, with the labels shown around the outside. Label font size and circle size indicate the variance explained by the first component. Position around the figure indicates the phase of the period, corresponding to the day of the year where the word is most present. Words are grouped by colour indicating their category.
Fig 5. Variance explained averaged over all seasonally periodic words, and total number of seasonally periodic words peaking per day, in (a) the UK corpus and (b) the US corpus, compared with the maximum daily temperature and photoperiod in each geographical region.
Each series is standardized; the first two series were first smoothed using a 15-day centred moving average. The grey area represents the 99% confidence interval for the temperature series. The average variance explained per day and the number of seasonal words peaking per day correlate more closely with temperature than with photoperiod, with most words peaking at the warmest and coldest times of the year. This suggests that activities are driven by temperature rather than by day length.
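
The preprocessing named in this caption (a 15-day centred moving average followed by standardization) and the comparison with temperature can be sketched as follows. This is an illustration only: the function names are hypothetical, the wrap-around at the ends of the year is an assumption made because the series is indexed by day of the year, Pearson correlation is used as one plausible way to compare series, and all data are synthetic.

```python
import math


def centred_moving_average(values, window=15):
    """Centred moving average over an odd window.
    Assumption: the series is circular (day of year), so indices wrap."""
    n = len(values)
    half = window // 2
    return [
        sum(values[(i + d) % n] for d in range(-half, half + 1)) / window
        for i in range(n)
    ]


def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in values]


def pearson(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = standardise(xs), standardise(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Synthetic data, invented for illustration: both series peak near day 200,
# and the word counts carry an extra weekly wobble that smoothing removes.
days = range(365)
temperature = [10 + 8 * math.cos(2 * math.pi * (d - 200) / 365) for d in days]
peaks_per_day = [
    5 + 4 * math.cos(2 * math.pi * (d - 200) / 365) + (d % 7) for d in days
]
smoothed = centred_moving_average(peaks_per_day, window=15)
r = pearson(smoothed, temperature)
```

Running the same comparison against a photoperiod series would show which of the two environmental signals tracks the word counts more closely.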
Fig 6. Examples of sinusoidal and complex waveforms in the time domain, and their Fourier spectrums from the UK corpus.
(a) shows the most complex periodic waveform (advent) in the time domain, while (b) shows the corresponding Fourier spectrum where we can see several components are activated. (c) shows the most sinusoidal periodic waveform (lamb) in the time domain, with (d) showing the corresponding Fourier spectrum where we can see that only one component is strongly activated, explaining nearly 80% of the variance in the time domain.

Grant support

A European Research Council Advanced Grant 339365 "ThinkBIG" granted to NC supported NC, TLW and FD. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.