The reliability of rankings of protein structure modeling methods is assessed. The assessment is based on the parametric Student's t test and the nonparametric Wilcoxon signed-rank test of the statistical significance of the difference between paired samples. The approach is applied to the ranking of the comparative modeling methods tested at the fourth meeting on Critical Assessment of Techniques for Protein Structure Prediction (CASP). It is shown that the 14 CASP4 test sequences may not be sufficient to reliably distinguish between the top eight methods, given the model quality differences and their standard deviations. We suggest that CASP needs to be supplemented by an assessment of protein structure prediction methods that is automated, continuous in time, based on several criteria applied to a large number of models, and with a quantitative statistical reliability assigned to each characterization.
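As an illustration of the paired comparison described above, the following minimal sketch computes the paired Student's t statistic and an exact two-sided Wilcoxon signed-rank p-value for two methods scored on the same targets. It is not the assessment code used in the study: the function names `paired_t_statistic` and `wilcoxon_signed_rank`, the handling of zero differences and ties, and the example scores are all assumptions made for illustration.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for the paired Student's t test."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


def wilcoxon_signed_rank(a, b):
    """Return (W_plus, exact two-sided p) for the Wilcoxon signed-rank test.

    Assumptions of this sketch: zero differences are dropped, tied
    absolute differences receive average ranks, and the null
    distribution is enumerated exactly (practical for small samples).
    """
    d = [x - y for x, y in zip(a, b) if x != y]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    # Store doubled ranks so that average ranks for ties stay integers.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of ranks i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, x in zip(ranks2, d) if x > 0)
    total2 = sum(ranks2)
    # Under the null hypothesis each rank is positive with probability 1/2;
    # counts[s] is the number of sign assignments with doubled rank sum s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    lower = min(w_plus2, total2 - w_plus2)
    p = min(1.0, 2 * sum(counts[: lower + 1]) / 2 ** n)
    return w_plus2 / 2, p


# Hypothetical quality scores of two methods on the same five targets
# (illustrative values only, not CASP4 data).
method_a = [2, 4, 6, 8, 10]
method_b = [1, 2, 3, 4, 5]
t, dof = paired_t_statistic(method_a, method_b)
w_plus, p_value = wilcoxon_signed_rank(method_a, method_b)
print(f"t = {t:.3f} (dof = {dof}), W+ = {w_plus}, exact p = {p_value}")
```

With only five targets, even a method that is better on every target reaches an exact two-sided Wilcoxon p-value of only 0.0625, which shows how a small test set limits the ability to separate methods.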