Sci Rep. 2018 May 8;8(1):7204.
doi: 10.1038/s41598-018-25355-2.

Consensus Bayesian assessment of protein molecular mass from solution X-ray scattering data

Nelly R Hajizadeh et al. Sci Rep. 2018.

Abstract

Molecular mass (MM) is one of the key structural parameters obtained by small-angle X-ray scattering (SAXS) of proteins in solution and is used to assess the sample quality and oligomeric composition and to guide subsequent structural modelling. Concentration-dependent assessment of MM relies on a number of extra quantities (partial specific volume, calibrated intensity, accurate solute concentration) and often yields limited accuracy. Concentration-independent methods forgo these requirements, being based on the relationship between structural parameters, scattering invariants and particle volume obtained directly from the data. Using a comparative analysis on 165,982 unique scattering profiles calculated from high-resolution protein structures, the performance of multiple concentration-independent MM determination methods was assessed. A Bayesian inference approach was developed that affords an accuracy above that of the individual methods and reports MM estimates together with a credibility interval. This Bayesian approach can be used in combination with concentration-dependent MM methods to further validate the MM of proteins in solution, or as a reliable stand-alone tool in instances where an accurate concentration estimate is not available.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
MM determination methods perform differently on different proteins. Four CRYSOL-simulated SAXS profiles (log of relative intensity against s) of proteins with different shapes; the profiles are offset for clarity. These cases illustrate the variation in the MM estimates of the various methods. Here, each of MMQp (P), Vc (V), MoW (M) and Size&Shape (S) provides, at least once, the MM estimate with the smallest (yellow) and with the largest (dark blue) relative error. However, the estimate provided by the Bayesian inference is consistently the best.
Figure 2
Overview of the method of Bayesian inference. (a) Scatter plot of the actual MM (from CRYSOL) vs. the estimated MM (in this case, from MMQp). Given the evidence of an MMQp estimate equal to 50 kDa, a distribution is created by extracting the actual MMs (from CRYSOL) of all cases where MMQp = 50 kDa, shown as the red points; the corresponding distribution is shown in the inset. (b) Example of the Bayesian inference method for a randomly chosen protein, here PDB ID: 214l. The probability distributions of the molecular weights for each of the methods (MMQp: blue; Vc: red; MoW: yellow; Size&Shape: purple) are combined through the Bayesian calculation (green distribution). The most probable MM coincides with the actual MM (black line).
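The combination step in panel (b) can be sketched as a product of per-method distributions over a shared set of MM bins, followed by renormalization. This is a minimal illustration, not the authors' implementation: it assumes the methods can be treated as independent and that the prior is flat, and the function name `combine_distributions` and all numbers are hypothetical.

```python
def combine_distributions(method_dists, prior=None):
    """Multiply per-method distributions over the same MM bins and renormalize."""
    n_bins = len(method_dists[0])
    if prior is None:
        prior = [1.0 / n_bins] * n_bins  # flat prior (assumption)
    combined = list(prior)
    for dist in method_dists:
        if len(dist) != n_bins:
            raise ValueError("all distributions must share the same bins")
        combined = [c * p for c, p in zip(combined, dist)]
    total = sum(combined)
    if total == 0.0:
        raise ValueError("distributions have no overlapping support")
    return [c / total for c in combined]


# Toy example: four hypothetical methods, four hypothetical bins (kDa).
bins_kda = [30, 40, 50, 60]
dists = [
    [0.1, 0.3, 0.5, 0.1],
    [0.2, 0.2, 0.4, 0.2],
    [0.1, 0.2, 0.6, 0.1],
    [0.3, 0.3, 0.3, 0.1],
]
posterior = combine_distributions(dists)
best_index = max(range(len(posterior)), key=posterior.__getitem__)
most_probable_mm = bins_kda[best_index]
```

In this toy case the combined distribution is sharper than any single input, which mirrors the narrowing of the green curve described in the caption.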
Figure 3
Binning procedure. (a) The distribution of molecular weights of the whole PDB, with very few small and large proteins. (b) The same dataset as in (a), but now log-normalized, with a peak at an MM of 40 kDa. (c) A visualization of the bins used in this study, populated with ~220,000 PDB entries. The bin widths follow the distribution of atomic weights in the PDB (i.e. the distribution in b), so they vary normally on a log scale. In the middle (around 40 kDa) the bins are very narrow. The upper-end and lower-end tails of the distribution (corresponding to the very large and very small proteins) are linearly binned to achieve a better resolution. MMs below 700 Da and above 1.30 MDa are assigned to the first and last bin, respectively.
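The assignment of an MM to a bin, including the clamping of out-of-range values to the first and last bin, might look like the sketch below. Only the 700 Da and 1.30 MDa limits come from the caption; the interior bin edges and the function name `assign_bin` are hypothetical placeholders, not the bins used in the study.

```python
from bisect import bisect_right

# Hypothetical bin edges in Da; only the outer limits are taken from the caption.
EDGES_DA = [700, 10_000, 30_000, 40_000, 50_000, 100_000, 1_300_000]


def assign_bin(mm_da, edges=EDGES_DA):
    """Return the 0-based bin index for mm_da, clamping outliers to the end bins."""
    index = bisect_right(edges, mm_da) - 1
    last_bin = len(edges) - 2  # n edges define n - 1 bins
    return max(0, min(index, last_bin))


too_small = assign_bin(500)        # below 700 Da, goes to the first bin
mid_range = assign_bin(45_000)     # falls between 40 and 50 kDa
too_large = assign_bin(2_000_000)  # above 1.30 MDa, goes to the last bin
```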
Figure 4
Qualitative overview of accuracy for ideal data. Dataset of ideal data with no simulated noise; the dataset size is 16,563. The MMs are expressed in terms of the value of the bin (Fig. 3) into which the MM falls. Top: scatter plot of the estimated MM vs. the actual MM. Bottom: same data as in the top panel, plotted as distributions of the relative error between the actual and estimated MM. The median and the median absolute deviation (mad) are shown above each distribution.
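The two summary statistics shown above each distribution can be computed as in the following sketch. The sign convention of the relative error, (estimated - actual) / actual, is an assumption, and the function names and numbers are hypothetical.

```python
from statistics import median


def relative_errors(estimated, actual):
    """Signed relative error per case; the sign convention is an assumption."""
    return [(e - a) / a for e, a in zip(estimated, actual)]


def median_and_mad(values):
    """Median and median absolute deviation (mad) of a list of numbers."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, mad


# Toy data in kDa, not taken from the paper.
actual_mm = [10.0, 20.0, 40.0, 50.0]
estimated_mm = [11.0, 18.0, 40.0, 60.0]
errors = relative_errors(estimated_mm, actual_mm)
med, mad = median_and_mad(errors)
```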
Figure 5
ROC-like curves for simulated random noise with different SNRs. ROC-like curves of relative error against normalized frequency. The x-axis is log-scaled to better discern the performance. (a) Ideal data, (b) SNR = 32, (c) SNR = 11, (d) SNR = 4, (e) SNR = 2 and (f) SNR = 1. Methods with higher accuracy lie closest to the top left.
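One way to read such a curve is as the fraction of cases whose absolute relative error does not exceed a given threshold. The sketch below computes that quantity; it assumes the relative error is on the x-axis and the normalized frequency on the y-axis, and the function name `roc_like_curve` and the data are hypothetical.

```python
def roc_like_curve(abs_rel_errors, thresholds):
    """Fraction of cases with absolute relative error <= each threshold."""
    n_cases = len(abs_rel_errors)
    return [
        sum(1 for err in abs_rel_errors if err <= t) / n_cases
        for t in thresholds
    ]


# Toy data, not taken from the paper.
abs_errors = [0.01, 0.05, 0.2, 0.5]
thresholds = [0.01, 0.1, 1.0]
curve = roc_like_curve(abs_errors, thresholds)
```

A method whose curve rises earlier reaches a given fraction of cases at a smaller relative error, which is why curves nearer the top left indicate higher accuracy.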
Figure 6
ROC-like curves for different levels of simulated systematic noise. ROC-like curves of relative error against normalized frequency for three different levels of under- and over-subtraction. The x-axis is log-scaled to better discern the performance. Additional levels of over- and under-subtraction were investigated (data not shown). Low, medium and high refer to factors of 0.1, 0.4 and 0.9, respectively (see Methods).
Figure 7
Performance of the methods for different protein shapes. Heatmap assessing the performance of each method against the protein shape, as determined by the protein classifier algorithm DATCLASS. The color represents the fraction of the cases in which each method yielded the most accurate MM, as determined by the smallest relative error. The figure comprises the results from all noise levels: a total of 6 noise levels, each containing 16,563 unique profiles, amounting to 99,378 profiles.
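The quantity encoded by the color, the fraction of cases in which a method gives the smallest relative error, can be sketched for a single shape class as follows. The function name, the tie-breaking rule (first method listed wins) and the numbers are hypothetical.

```python
from collections import Counter


def best_method_fractions(rel_errors_by_method):
    """Fraction of cases in which each method has the smallest |relative error|."""
    methods = list(rel_errors_by_method)
    n_cases = len(rel_errors_by_method[methods[0]])
    wins = Counter()
    for i in range(n_cases):
        # On a tie, min() keeps the first method in dictionary order.
        best = min(methods, key=lambda m: abs(rel_errors_by_method[m][i]))
        wins[best] += 1
    return {m: wins[m] / n_cases for m in methods}


# Toy relative errors for four cases of one shape class, not taken from the paper.
toy_errors = {
    "MMQp": [0.05, -0.2, 0.1, 0.3],
    "Vc": [0.1, 0.1, -0.05, 0.4],
    "MoW": [0.2, 0.3, 0.2, 0.1],
}
fractions = best_method_fractions(toy_errors)
```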
Figure 8
ROC-like curves for experimental data from SASBDB. ROC curves of relative error against normalized frequency for experimental data from all published SASBDB entries, 375 datasets in total. The x-axis is log-scaled to better discern the performance. The actual MM is taken to be the user-submitted experimental MM. As a control, the actual MM is plotted against the MM from the user-submitted sequence. Right: counting NaNs as bad estimates and normalizing by the total number of cases. Left: ignoring NaNs and normalizing by the total number of cases minus the number of NaNs.
Figure 9
Credibility interval from Bayesian inference. Scatter plot of DatBayes MM against actual MM for ideal data. Both axes are log-scaled. The bars indicate the width of the probability distribution containing 90% of the probability mass. Note the larger bars for very small and very large proteins, a result of the limited training data in these ranges of MMs.
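An interval containing 90% of the probability mass can be read off a discrete distribution as in the sketch below. It uses an equal-tailed construction (5% cut from each side), which is an assumption; the caption does not state how the interval is built. The function name and numbers are hypothetical.

```python
def credibility_interval(bin_values, probs, mass=0.90):
    """Equal-tailed interval holding roughly `mass` of a discrete distribution."""
    tail = (1.0 - mass) / 2.0
    cumulative = 0.0
    lower = None
    upper = None
    for value, p in zip(bin_values, probs):
        cumulative += p
        if lower is None and cumulative >= tail:
            lower = value
        if cumulative >= 1.0 - tail:
            upper = value
            break
    if upper is None:
        upper = bin_values[-1]  # guard against floating-point round-off
    return lower, upper


# Toy distribution over bins in kDa, not taken from the paper.
bins_kda = [30, 40, 50, 60, 70]
probs = [0.02, 0.08, 0.80, 0.08, 0.02]
interval = credibility_interval(bins_kda, probs)
```

A broader input distribution yields a wider interval, consistent with the larger bars the caption notes for very small and very large proteins.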
