Critical assessment of coiled-coil predictions based on protein structure data

Dominic Simm; Klas Hatje; Stephan Waack; Martin Kollmar

doi:10.1038/s41598-021-91886-w

Sci Rep. 2021 Jun 14;11(1):12439. doi: 10.1038/s41598-021-91886-w.

Authors

Dominic Simm¹ ², Klas Hatje¹ ³, Stephan Waack², Martin Kollmar⁴ ⁵

Affiliations

¹ Group Systems Biology of Motor Proteins, Department of NMR-Based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany.
² Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Göttingen, Germany.
³ Roche Pharmaceutical Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland.
⁴ Group Systems Biology of Motor Proteins, Department of NMR-Based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany. mako@nmr.mpibpc.mpg.de.
⁵ Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Göttingen, Germany. mako@nmr.mpibpc.mpg.de.

Abstract

Coiled-coil regions were among the first protein motifs described structurally and theoretically. The simplicity of the motif promises that coiled-coil regions can be detected with reasonable accuracy and precision in any protein sequence. Here, we re-evaluated the most commonly used coiled-coil prediction tools with respect to the most comprehensive reference data set available, the entire Protein Data Bank, down to each amino acid and its secondary structure. Apart from the 30-fold difference between the minimum and maximum number of coiled coils predicted, the tools vary strongly in where they predict coiled-coil regions. Accordingly, there is a high number of false predictions and of missed true coiled-coil regions. The evaluation of the binary classification metrics in comparison with naïve coin-flip models, together with the calculation of the Matthews correlation coefficient, the most reliable performance metric for imbalanced data sets, suggests that the tested tools' performance is close to random. This implies that the tools' predictions have only limited informative value. Coiled-coil predictions are often used to interpret biochemical data and are part of in-silico functional genome annotation. Our results indicate that these predictions should be treated very cautiously and need to be supported and validated by experimental evidence.
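
The comparison described in the abstract (binary classification metrics, a coin-flip baseline, and the Matthews correlation coefficient on imbalanced data) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration only: the per-residue labels are synthetic, the 5% share of coiled-coil residues is arbitrary, and the function names are hypothetical. It does not reproduce the authors' pipeline or their results.

```python
import math
import random


def confusion_counts(truth, predicted):
    """Count TP, TN, FP, FN for two equally long lists of booleans."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of residues labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 by convention if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Synthetic, imbalanced per-residue labels (assumption: ~5% positives).
rng = random.Random(0)
n_residues = 10_000
truth = [rng.random() < 0.05 for _ in range(n_residues)]

# Three hypothetical "predictors" to compare against the truth.
coin_flip = [rng.random() < 0.05 for _ in range(n_residues)]  # random guess
all_negative = [False] * n_residues                           # never predicts a coiled coil
perfect = list(truth)                                         # ideal tool

for name, pred in [("coin flip", coin_flip),
                   ("all negative", all_negative),
                   ("perfect", perfect)]:
    counts = confusion_counts(truth, pred)
    print(f"{name:12s} accuracy={accuracy(*counts):.3f} MCC={mcc(*counts):.3f}")
```

On such imbalanced labels, accuracy stays high even for a predictor that never reports a coiled coil, while its MCC is zero. This is why a metric like the MCC, rather than accuracy alone, is useful for judging whether a tool performs better than chance.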

MeSH terms

Amino Acid Motifs*
Amino Acid Sequence
Databases, Protein / statistics & numerical data
Models, Molecular*
Protein Structure, Secondary*
Software

Associated data

figshare/10.6084/m9.figshare.9994706