Implications of avoiding overlap between training and testing data sets when evaluating genomic predictions of genetic merit

J Dairy Sci. 2010 Jul;93(7):3320-30. doi: 10.3168/jds.2009-2845.

Abstract

The aim of this study was to evaluate and quantify the importance of avoiding overlap between training and testing subsets of data when evaluating the effectiveness of predictions of genetic merit based on genetic markers. Genomic selection holds great potential for increasing the accuracy of selection in young bulls and is likely to lead quickly to more widespread use of these young bulls with a shorter generation interval and faster genetic improvement. Practical implementations of genomic selection in dairy cattle commonly involve results of national genetic evaluations being used as the dependent variable to evaluate the predictive ability of genetic markers. Selection index theory was used to demonstrate how ignoring correlations among errors of prediction between animals in training and testing sets could result in overestimates of accuracy of genomic predictions. Correlations among errors of prediction occur when estimates of genetic merit of training animals used in prediction are taken from the same genetic evaluation as the estimates used for validation animals. Selection index theory was used to show a substantial degree of error correlation when animals used for testing genomic predictions are progeny of training animals, when heritability is low, and when the number of recorded progeny for both training and testing animals is low. Even when training involves a dependent variable that is not influenced by the progeny records of testing animals (i.e., historic proofs), error correlations can still result from records of relatives of training animals contributing to both the historic proofs and the predictions of genetic merit of testing animals. A simple simulation was used to show how an error correlation could result in spurious confirmation of predictive ability that was overestimated in the training population because of ascertainment bias.
Development of a method of testing genomic selection predictions that allows unbiased testing when training and testing variables are estimated breeding values from the same genetic evaluation would simplify training and testing of genomic predictions. In the meantime, a 4-step approach for separating records used for training from those used for testing after correction of fixed effects is suggested when use of progeny averages of adjusted records (e.g., daughter yield deviations) would result in inefficient use of the information available in the data.
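
The effect of an error correlation on apparent validation accuracy can be illustrated with a small simulation. The sketch below is not the simulation used in the study; it is a minimal illustration under simplifying assumptions that are not taken from the abstract: each testing animal is the progeny of one training sire, an estimated breeding value (EBV) is modeled as the true breeding value plus an independent error, the prediction for a testing animal is half the sire's EBV, and the function names (`pearson`, `simulate`) and parameters (`error_corr`, `error_sd`) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(error_corr, n=20000, error_sd=1.0, seed=1):
    """Return (apparent accuracy, true accuracy) for sire-progeny pairs.

    error_corr: assumed correlation between the EBV errors of a training
    sire and its testing progeny, as arises when both EBVs come from the
    same genetic evaluation. All settings here are illustrative assumptions.
    """
    rng = random.Random(seed)
    preds, test_ebvs, true_bvs = [], [], []
    for _ in range(n):
        # True breeding values: progeny inherits half of the sire's value;
        # the rest (dam plus Mendelian sampling) has variance 0.75.
        g_sire = rng.gauss(0.0, 1.0)
        g_prog = 0.5 * g_sire + rng.gauss(0.0, math.sqrt(0.75))

        # EBV errors of sire and progeny, correlated by error_corr.
        e_sire = rng.gauss(0.0, error_sd)
        e_prog = (error_corr * e_sire
                  + math.sqrt(1.0 - error_corr ** 2) * rng.gauss(0.0, error_sd))

        ebv_sire = g_sire + e_sire   # training variable
        ebv_prog = g_prog + e_prog   # testing (validation) variable

        preds.append(0.5 * ebv_sire)  # simplified prediction for the progeny
        test_ebvs.append(ebv_prog)
        true_bvs.append(g_prog)

    apparent = pearson(preds, test_ebvs)  # what a validation study observes
    true_acc = pearson(preds, true_bvs)   # correlation with true merit
    return apparent, true_acc


if __name__ == "__main__":
    for r in (0.0, 0.5):
        apparent, true_acc = simulate(r)
        print(f"error_corr={r}: apparent={apparent:.3f}, true={true_acc:.3f}")
```

Under these assumptions, the correlation between the prediction and the testing EBV rises as `error_corr` increases, while the correlation between the prediction and the true breeding value does not change. The apparent gain therefore reflects shared error rather than better prediction.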

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Breeding / methods*
  • Cattle / genetics*
  • Computer Simulation
  • Databases, Factual
  • Female
  • Genome*
  • Male
  • Statistics as Topic / methods
  • Statistics as Topic / standards*
  • Teaching / methods
  • Teaching / standards*