Whole genome sequencing of Streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline

PeerJ. 2016 Sep 14:4:e2477. doi: 10.7717/peerj.2477. eCollection 2016.


Streptococcus pneumoniae typically express one of 92 serologically distinct capsule polysaccharide (cps) types (serotypes). Some of these serotypes are closely related to each other; using the commercially available typing antisera, these are assigned to common serogroups containing types that show cross-reactivity. In this serotyping scheme, factor antisera are used to allocate serotypes within a serogroup, based on patterns of reactions. This serotyping method is technically demanding, requires considerable experience and the reading of the results can be subjective. This study describes the analysis of the S. pneumoniae capsular operon genetic sequence to determine serotype distinguishing features and the development, evaluation and verification of an automated whole genome sequence (WGS)-based serotyping bioinformatics tool, PneumoCaT (Pneumococcal Capsule Typing). Initially, WGS data from 871 S. pneumoniae isolates were mapped to reference cps locus sequences for the 92 serotypes. Thirty-two of 92 serotypes could be unambiguously identified based on sequence similarities within the cps operon. The remaining 60 were allocated to one of 20 'genogroups' that broadly correspond to the immunologically defined serogroups. By comparing the cps reference sequences for each genogroup, unique molecular differences were determined for serotypes within 18 of the 20 genogroups and verified using the set of 871 isolates. This information was used to design a decision-tree style algorithm within the PneumoCaT bioinformatics tool to predict to serotype level for 89/94 (92 + 2 molecular types/subtypes) from WGS data and to serogroup level for serogroups 24 and 32, which currently comprise 2.1% of UK referred, invasive isolates submitted to the National Reference Laboratory (NRL), Public Health England (June 2014-July 2015). PneumoCaT was evaluated with an internal validation set of 2065 UK isolates covering 72/92 serotypes, including 19 non-typeable isolates and an external validation set of 2964 isolates from Thailand (n = 2,531), USA (n = 181) and Iceland (n = 252). PneumoCaT was able to predict serotype in 99.1% of the typeable UK isolates and in 99.0% of the non-UK isolates. Concordance was evaluated in UK isolates where further investigation was possible; in 91.5% of the cases the predicted capsular type was concordant with the serologically derived serotype. Following retesting, concordance increased to 99.3% and in most resolved cases (97.8%; 135/138) discordance was shown to be caused by errors in original serotyping. Replicate testing demonstrated that PneumoCaT gave 100% reproducibility of the predicted serotype result. In summary, we have developed a WGS-based serotyping method that can predict capsular type to serotype level for 89/94 serotypes and to serogroup level for the remaining four. This approach could be integrated into routine typing workflows in reference laboratories, reducing the need for phenotypic immunological testing.

Keywords: Bacterial; Bioinformatics; Genomics; Pneumococcus; Serotyping; WGS.

Grants and funding

The authors received no funding for this work.