Background: In candidate-gene association studies of single nucleotide polymorphisms (SNPs), multilocus analyses are frequently high-dimensional when haplotypes or haplotype pairs (diplotypes) and differing modes of expression are considered. Although candidate genes are often selected for their biological involvement in a given pathway, little is usually known about the functionality of individual SNPs that could guide association studies. Investigators therefore face the challenge of exploring multiple SNP models to elucidate which variants, independently or in combination, might be associated with a disease of interest. A data mining module, hapConstructor (freely available in the Genie software), performs systematic construction and association testing of multilocus genotype data in a Monte Carlo framework. Our objective was to assess its utility in guiding statistical analyses of haplotypes within a candidate region (or combined genotypes across candidate genes) beyond that offered by a standard logistic regression approach.
Methods: We applied the hapConstructor method to a multilocus investigation of three candidate genes involved in production of the pro-inflammatory cytokine IL6 (IKBKB, IL6, and NFKB1; 16 SNPs total), which are hypothesized to operate together to alter colorectal cancer risk. Data came from two U.S. multicenter studies, one of colon cancer (1,556 cases and 1,956 matched controls) and one of rectal cancer (754 cases and 959 matched controls).
Results: hapConstructor enabled us to identify important associations, which were then analyzed further in logistic regression models to adjust simultaneously for confounders. The most significant finding (nominal P = 0.0004; false discovery rate q = 0.037) was a combined genotype association across IKBKB SNP rs5029748 (1 or 2 variant alleles), IL6 rs1800797 (1 or 2 variant alleles), and NFKB1 rs4648110 (2 variant alleles), which conferred an ~80% decreased risk of colon cancer.
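To make the combined-genotype test concrete, the following minimal Python sketch codes a carrier indicator across three loci (1 or 2 variant alleles at the first two, 2 variant alleles at the third, mirroring the pattern reported above) and estimates a Monte Carlo permutation P value for its odds ratio. It is an illustration only and not hapConstructor's implementation: the function names, the continuity correction, and the label-shuffling scheme are assumptions, and it ignores case-control matching and confounder adjustment.

```python
import math
import random


def combined_indicator(g1, g2, g3):
    """Return 1 if a subject carries >=1 variant allele at locus 1,
    >=1 at locus 2, and exactly 2 at locus 3 (genotypes coded 0/1/2)."""
    return int(g1 >= 1 and g2 >= 1 and g3 == 2)


def odds_ratio(exposed, is_case):
    """Odds ratio from a 2x2 table, with a 0.5 continuity correction
    (an assumption made here so that empty cells do not divide by zero)."""
    a = sum(1 for e, c in zip(exposed, is_case) if e and c)          # exposed cases
    b = sum(1 for e, c in zip(exposed, is_case) if e and not c)      # exposed controls
    c_ = sum(1 for e, c in zip(exposed, is_case) if not e and c)     # unexposed cases
    d = sum(1 for e, c in zip(exposed, is_case) if not e and not c)  # unexposed controls
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c_ + 0.5))


def permutation_p_value(exposed, is_case, n_perm=1000, seed=0):
    """Two-sided Monte Carlo P value: shuffle case/control labels and count
    how often |log OR| is at least as extreme as the observed value."""
    rng = random.Random(seed)
    observed = abs(math.log(odds_ratio(exposed, is_case)))
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if abs(math.log(odds_ratio(exposed, labels))) >= observed:
            hits += 1
    # Add-one estimator keeps the estimated P value strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

In use, each subject's three genotypes would be passed through `combined_indicator` to build the `exposed` list, which is then tested against case status; an odds ratio below 1 corresponds to decreased risk among carriers of the combined genotype.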
Conclusions: The strengths of hapConstructor were systematic identification of multiple loci within and across genes important in colorectal cancer (CRC) risk, false discovery rate assessment, and efficient guidance of subsequent logistic regression analyses.
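For readers unfamiliar with false discovery rate q values, the sketch below computes them from a list of nominal P values using the standard Benjamini-Hochberg step-up procedure. This is a generic stand-in chosen for illustration; the abstract does not state how hapConstructor derives its q values, so the function name and the choice of procedure are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q values in the same order as the input."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

A test whose q value falls below a chosen threshold (for example 0.05) is then reported as significant after accounting for the number of models explored.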