Comparison of three statistical approaches for feature selection for fine-scale genetic population assignment in four pig breeds

Trop Anim Health Prod. 2021 Jul 10;53(3):395. doi: 10.1007/s11250-021-02824-x.


Background: Assigning animals to their corresponding breeds using breed-informative single-nucleotide polymorphisms (SNPs) is required in many fields, for instance in the traceability and authentication of meat and other livestock products. SNP information for several pig breeds is now accessible thanks to the availability of dense SNP chips, which cover a large number of molecular markers distributed across the entire genome. To identify the pig breed from a sample of industrial meat, one must analyze a large panel of genetic markers, the size of which depends on the SNP chip used. Analyzing such large datasets is labor-intensive. This motivates the creation of less dense chips of breed-informative markers based on a reduced number of SNPs, so that analyzing the genotyping data from these reduced chips requires less time and effort.

Aim: The objective of this study is to find the most informative SNPs for discriminating among four pig breeds, namely Duroc, Landrace, Large White, and Pietrain.

Method: The Illumina Porcine 60 k SNP chip was used to genotype SNPs distributed across each individual's genome. First, we applied three different statistical approaches for feature selection: (i) principal component analysis (PCA), (ii) least absolute shrinkage and selection operator (LASSO), and (iii) random forest (RF). Each approach identified its own set of SNPs. Then, we combined the results of the three methods by building a final panel containing only the SNPs that appear in all three sets.
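The combination step described in the Method can be pictured as a plain set intersection. The following is a minimal sketch, not the authors' code: the function name `consensus_panel` and the SNP identifiers are hypothetical placeholders, and the three input lists merely stand in for the panels produced by PCA, LASSO, and random forest.

```python
def consensus_panel(*panels):
    """Return the sorted SNP identifiers present in every input panel."""
    if not panels:
        return []
    # Convert each panel to a set, then keep only the shared identifiers.
    sets = [set(panel) for panel in panels]
    return sorted(set.intersection(*sets))


# Hypothetical SNP identifiers, used only for illustration.
pca_panel = ["snp_01", "snp_02", "snp_03", "snp_04", "snp_05"]
lasso_panel = ["snp_02", "snp_05"]
rf_panel = ["snp_02", "snp_03", "snp_05", "snp_09"]

final_panel = consensus_panel(pca_panel, lasso_panel, rf_panel)
print(final_panel)
```

Because an intersection can never be larger than its smallest input, the size of the final panel is bounded by the most restrictive of the three methods.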

Results: Applied separately, each method produced a panel of its most discriminating SNPs. PCA, LASSO, and random forest with the Boruta algorithm highlighted 28,816, 50, and 286 SNPs, respectively. The number of SNPs selected by PCA is high compared to Boruta and LASSO because PCA chooses variables while preserving as much information about the data as possible. A drawback of LASSO regression is that, among a group of correlated variables, it tends to select only one variable and ignore the others regardless of their importance. In contrast to LASSO, the Boruta algorithm considers the interdependence between SNPs and selects informative variables even if they are correlated and have the same effect. The three panels shared 23 SNPs; the distribution of the individuals according to these SNPs showed the individuals of each breed grouped in well-defined clusters without any overlap.
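The LASSO behavior noted in the Results, keeping one variable from a correlated group and dropping the rest, can be reproduced on a toy example. The sketch below is an illustrative assumption, not the study's pipeline: it implements a small coordinate-descent LASSO in plain Python, and the data, the feature names `snp_a` and `snp_b`, and the penalty value `alpha` are all hypothetical. The two features are exact copies of each other, the extreme case of correlation.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha (the L1 proximal step)."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=50):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = 0.0
            for i in range(n):
                pred = sum(w[k] * X[i][k] for k in range(p))
                rho += X[i][j] * (y[i] - pred + w[j] * X[i][j])
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Hypothetical toy data: two identical features, response y = 2 * x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[v, v] for v in x]          # columns: snp_a, snp_b (exact copies)
y = [2.0 * v for v in x]
names = ["snp_a", "snp_b"]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [name for name, wj in zip(names, weights) if wj != 0.0]
print(weights, selected)
```

In this toy setting the first feature visited absorbs the whole effect and the duplicate receives a zero coefficient, even though both carry the same information. Which copy is kept depends on the update order, which is one way to see why a LASSO panel alone can miss equally informative SNPs.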

Conclusions: The biological pathways represented by the 23 breed-informative SNPs resulting from the combination of PCA, LASSO, and Boruta should be explored in further analyses. The results of our study are promising for applying this method to other livestock species.

Keywords: Boruta; Least absolute shrinkage and selection operator; Pig breeds; Principal component analysis; Random forest; Single-nucleotide polymorphism.

MeSH terms

  • Animals
  • Genetic Markers
  • Genetics, Population*
  • Genotype
  • Oligonucleotide Array Sequence Analysis / veterinary
  • Polymorphism, Single Nucleotide*
  • Swine / genetics

