Dimension reduction and variable selection for genomic selection: application to predicting milk yield in Holsteins

J Anim Breed Genet. 2011 Aug;128(4):247-57. doi: 10.1111/j.1439-0388.2011.00917.x. Epub 2011 Mar 28.


Genome-assisted prediction of genetic merit of individuals for a quantitative trait requires building statistical models that can handle data sets consisting of a massive number of markers and many fewer observations. Numerous regression models have been proposed in which marker effects are treated as random variables. Alternatively, multivariate dimension reduction techniques [such as principal component regression (PCR) and partial least-squares regression (PLS)] model a small number of latent components which are linear combinations of original variables, thereby reducing dimensionality. Further, marker selection has drawn increasing attention in genomic selection. This study evaluated two dimension reduction methods, namely, supervised PCR and sparse PLS, for predicting genomic breeding values (BV) of dairy bulls for milk yield using single-nucleotide polymorphisms (SNPs). These two methods perform variable selection in addition to reducing dimensionality. Supervised PCR preselects SNPs based on the strength of association of each SNP with the phenotype. Sparse PLS promotes sparsity by imposing some penalty on the coefficients of linear combinations of original SNP variables. Two types of supervised PCR (I and II) were examined. Method I was based on single-SNP analyses, whereas method II was based on multiple-SNP analyses. Supervised PCR II was clearly better than supervised PCR I in predictive ability when evaluated on SNP subsets of various sizes, and sparse PLS was in between. Supervised PCR II and sparse PLS attained similar predictive correlations when the size of the SNP subset was below 1000. Supervised PCR II with 300 and 500 SNPs achieved correlations of 0.54 and 0.59, respectively, corresponding to 80 and 87% of the correlation (0.68) obtained with all 32 518 SNPs in a PCR model. The predictive correlation of supervised PCR II reached a plateau of 0.68 when the number of SNPs increased to 3500. Our results demonstrate the potential of combining dimension reduction and variable selection for accurate and cost-effective prediction of genomic BV.
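
The two-step idea behind supervised PCR (screen SNPs by association with the phenotype, then regress on a few principal components of the retained SNPs) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which association statistic was used, so the absolute marginal correlation from single-SNP analysis is assumed here, loosely in the spirit of method I. The function names `supervised_pcr_fit` and `supervised_pcr_predict`, the parameter choices, and the simulated genotypes are all hypothetical and unrelated to the Holstein data.

```python
import numpy as np


def supervised_pcr_fit(X, y, n_snps, n_components):
    """Screen SNPs one at a time, then fit PCR on the retained subset.

    Assumption: association strength is the absolute marginal correlation
    between each SNP column and the phenotype.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # Step 1: single-SNP screening by absolute correlation with the phenotype.
    x_norm = np.sqrt((Xc ** 2).sum(axis=0))
    x_norm[x_norm == 0] = np.inf  # monomorphic SNPs get a score of 0
    score = np.abs(Xc.T @ yc) / (x_norm * np.sqrt((yc ** 2).sum()))
    keep = np.argsort(score)[::-1][:n_snps]

    # Step 2: principal components of the retained SNPs via SVD.
    _, _, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    V = Vt[:n_components].T   # loadings, shape (n_snps, n_components)
    Z = Xc[:, keep] @ V       # component scores

    # Step 3: least-squares regression of the phenotype on the scores,
    # mapped back to per-SNP coefficients.
    gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    beta = V @ gamma
    return {"keep": keep, "x_mean": x_mean[keep], "y_mean": y_mean, "beta": beta}


def supervised_pcr_predict(model, X):
    """Predict phenotypes for new genotypes using a fitted model."""
    Xs = X[:, model["keep"]] - model["x_mean"]
    return model["y_mean"] + Xs @ model["beta"]


# Simulated example (hypothetical data): genotypes coded 0/1/2,
# 10 causal SNPs out of 500, 200 training and 100 testing individuals.
rng = np.random.default_rng(0)
X_all = rng.integers(0, 3, size=(300, 500)).astype(float)
effects = np.zeros(500)
effects[:10] = 1.0
y_all = X_all @ effects + rng.normal(size=300)

model = supervised_pcr_fit(X_all[:200], y_all[:200], n_snps=30, n_components=15)
pred = supervised_pcr_predict(model, X_all[200:])
# Predictive correlation in the testing set, the criterion used in the abstract.
print("predictive correlation:", np.corrcoef(pred, y_all[200:])[0, 1])
```

Screening and component extraction both use the training individuals only, so the correlation in the held-out set is not inflated by the selection step. A multiple-SNP screen, as in method II, would replace step 1 with scores taken from a model that fits all SNPs jointly.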

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Animals
  • Breeding / statistics & numerical data*
  • Cattle
  • Cost-Benefit Analysis / statistics & numerical data
  • Dairying*
  • Genomics
  • Lactation
  • Least-Squares Analysis
  • Male
  • Milk / metabolism*
  • Polymorphism, Single Nucleotide
  • Principal Component Analysis
  • Quantitative Trait, Heritable
  • Regression Analysis
  • Selection, Genetic*