Front Genet. 2012 Sep 27:3:190.
doi: 10.3389/fgene.2012.00190. eCollection 2012.

Statistical properties of multivariate distance matrix regression for high-dimensional data analysis

Matthew A Zapala et al. Front Genet.

Abstract

Multivariate distance matrix regression (MDMR) analysis is a statistical technique that allows researchers to relate P variables to an additional M factors collected on N individuals, where P ≫ N. The technique can be applied to a number of research settings involving high-dimensional data types such as DNA sequence data, gene expression microarray data, and imaging data. MDMR analysis involves computing the distance between all pairs of individuals with respect to P variables of interest and constructing an N × N matrix whose elements reflect these distances. Permutation tests can be used to test linear hypotheses that consider whether or not the M additional factors collected on the individuals can explain variation in the observed distances between and among the N individuals as reflected in the matrix. Despite its appeal and utility, properties of the statistics used in MDMR analysis have not been explored in detail. In this paper we consider the level accuracy and power of MDMR analysis assuming different distance measures and analysis settings. We also describe the utility of MDMR analysis in assessing hypotheses about the appropriate number of clusters arising from a cluster analysis.

Keywords: distance matrix; multivariate analysis; regression analysis; simulation.
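
As a minimal sketch of the procedure the abstract describes (pairwise distances, an N × N matrix, and a permutation test), the following Python code computes a pseudo-F statistic for a single categorical factor together with its permutation p-value. It is not the authors' implementation. It assumes squared Euclidean distances and uses the sum-of-squared-distances form of the statistic, which covers only the case of one grouping factor (M = 1). The function names (`pseudo_f`, `permutation_p_value`, `squared_euclidean_matrix`) are hypothetical.

```python
import random
from itertools import combinations
from math import dist


def squared_euclidean_matrix(rows):
    """Return the N x N matrix of squared Euclidean distances between rows."""
    return [[dist(r, s) ** 2 for s in rows] for r in rows]


def pseudo_f(d2, labels):
    """Pseudo-F for one categorical factor.

    d2[i][j] is the squared distance between individuals i and j;
    labels[i] is the group of individual i.
    """
    n = len(labels)
    groups = {}
    for i, g in enumerate(labels):
        groups.setdefault(g, []).append(i)
    a = len(groups)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(d2[i][j] for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = sum(
        sum(d2[i][j] for i, j in combinations(members, 2)) / len(members)
        for members in groups.values()
    )
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(d2, labels, n_perm=999, seed=0):
    """Shuffle the group labels to build a null distribution of pseudo-F."""
    rng = random.Random(seed)
    observed = pseudo_f(d2, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(d2, shuffled) >= observed:
            hits += 1
    # The observed labeling counts as one permutation.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: six individuals, one variable, two groups of three.
rows = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
labels = [0, 0, 0, 1, 1, 1]
f_stat, p_value = permutation_p_value(squared_euclidean_matrix(rows), labels)
print(f_stat, p_value)
```

For this toy data the statistic is 13.5, which matches the ordinary one-way ANOVA F: with a single variable and Euclidean distance the pseudo-F reduces to the classical F ratio. Other distance measures can be substituted by changing how `d2` is built.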

Figures

Figure 1
Permutation test-derived p-values plotted as a function of the F-statistic (gray), with the corresponding p-values derived from the F-distribution overlaid in black. Data comprise 100 samples and 10 random variables following a normal distribution with a mean of 0 and a variance of 1, simulated 1000 times. Fifty samples were coded as control (0) and 50 samples were coded as experiment (1).
Figure 2
Scatter plot of p-values from Figure 1 generated from permutation tests vs. those derived from the F-distribution (Pearson correlation coefficient = 0.99).
Figure 3
Permutation test-derived p-values plotted as a function of the F-statistic (gray), with the corresponding p-values derived from the F-distribution overlaid in black. Data comprise 10 samples (N = 10) and 10 random variables (P = 10) following a normal distribution with a mean of 0 and a variance of 1, simulated 1000 times. Five samples were coded as control (0) and five samples were coded as experiment (1).
Figure 4
Scatter plot of p-values obtained from the F-distribution vs. permutation tests for randomly chosen sample sizes between 4 and 100 (i.e., 4 ≤ N ≤ 100) and randomly chosen numbers of variables from 1 to 100 (i.e., 1 ≤ P ≤ 100) with a single continuous regressor variable (M = 1), simulated 1000 times. Outlying observations, shown as black squares lying away from the trend line, have sample sizes less than or equal to eight.
Figure 5
Power of the MDMR procedure as a function of signal-to-noise ratio obtained from 1000 simulated data sets for a wide variety of settings. Simulated data for 30 samples (N = 30) and 100 variables (P = 100) were generated with 15 samples assigned to a control group (independent variable = 0) and 15 samples assigned to an experimental group (independent variable = 1). Random data in the control group were generated as standard normal variates with a mean of 0 and variance of 1. Random data in the experimental group were generated as normal variates with a variance of 1 and means that took on values of 0–1.5 in increments of 0.001. The power of the permutation-based statistical test is presented. We generated different simulated data sets for which 100, 50, 25, 10, or 5% of the variables used in the construction of the distance matrix had means adjusted from 0 (in the appropriate increments) in the experimental group. The gray line shows the power of univariate Student’s t-tests performed on each of the 100 variables, with p-values Bonferroni corrected for the 100 statistical tests pursued.
Figure 6
Power of the MDMR procedure as a function of increasing sample size. Half of the samples for each sample size were assigned to a control (coded as 0) and half to an experimental group (coded as 1). For each sample 100 random variables were generated following a normal distribution with a mean of 0 and a variance of 1 for the control group and an assigned mean difference of 0.1, 0.2, or 0.3 and a variance of 1 for the experimental group.
Figure 7
Power of the proposed MDMR procedure as a function of the correlation of continuous regressor variables for a sample size of N = 100 with P = 100 variables. The x-axis displays the percentage of variables that are correlated with the regressor variable. Four different correlation strengths are shown, ranging from 0.1 to 0.4. P = 100 random variables were generated following a normal distribution with a mean of 0 and a variance of 1.
Figure 8
Comparison of the UPGMA hierarchical cluster algorithm to the matrix regression technique. Simulated data for N = 60 samples and P = 100 variables were generated with 30 samples assigned to the control group (independent variable = 0) and 30 samples assigned to the experimental group (independent variable = 1). Random data in the control group were generated as standard normal variates with a mean of 0 and variance of 1. At mean differences below 0.75, hierarchical clustering using the unweighted average distance (UPGMA) does not clearly differentiate two groups with different means. Shown above are five clusters for what visually appears to be two groups. The red asterisks (*) mark simulated samples that were misclassified. Two samples whose means were 0.5 were grouped with samples whose means were 0 (bottom two asterisks). The matrix regression technique shows that the correct grouping into two separate groups gives the highest F-statistic, 5.32, while the UPGMA solution of five distinct groups yields an F-statistic of only 5.28.
Figure A1
Power of the MDMR procedure as a function of non-normal population distributions. The black line shows power as calculated before for two populations with normal distributions. The green line displays power for populations with log-normal distributions. The pink line shows power for populations with bimodal distributions (equivalent to a normal distribution with 100% of the data having means altered) and the blue line shows power when only one mode of a bimodal population is different (equivalent to a normal distribution with 50% of the data having means altered). The red line shows the power of univariate Student’s t-tests performed on each of the 100 variables, with p-values Bonferroni corrected for the 100 statistical tests pursued.
Figure A2
Histogram of two log-normal distributions. The solid line has a mean of 1 and the dotted line has a mean of 1.225; this difference in means yields ~100% power for MDMR with two log-normal population distributions.
Figure A3
Histogram of two bimodal distributions. The solid line has two modes with a mean of 1 and a mean of 4 and the dotted line has two modes with a mean of 1.36 and a mean of 4.36 where the difference in the distributions yields ~100% power for MDMR.
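
The simulation design described in the Figure 5 caption (two groups of 15 samples, 100 normal variables, a mean shift applied to a chosen fraction of variables in the experimental group) can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: it assumes squared Euclidean distances, a two-group pseudo-F in its sum-of-squared-distances form, and a 0.05 significance level. The names `simulate_dataset` and `estimate_power` are hypothetical, and the default numbers of simulations and permutations are kept small so the sketch runs quickly in pure Python.

```python
import random
from itertools import combinations
from math import dist


def pseudo_f(d2, labels):
    """Pseudo-F for one grouping factor from squared pairwise distances."""
    n = len(labels)
    groups = {}
    for i, g in enumerate(labels):
        groups.setdefault(g, []).append(i)
    a = len(groups)
    ss_total = sum(d2[i][j] for i, j in combinations(range(n), 2)) / n
    ss_within = sum(
        sum(d2[i][j] for i, j in combinations(m, 2)) / len(m)
        for m in groups.values()
    )
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))


def simulate_dataset(rng, n_per_group, n_vars, delta, fraction):
    """Control rows are N(0, 1); experimental rows have `delta` added to
    the first `fraction` of their variables (variance stays 1)."""
    n_shifted = round(n_vars * fraction)
    rows, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            if label == 1:
                for k in range(n_shifted):
                    row[k] += delta
            rows.append(row)
            labels.append(label)
    return rows, labels


def estimate_power(delta, fraction, n_per_group=15, n_vars=100,
                   n_sims=50, n_perm=199, alpha=0.05, seed=0):
    """Fraction of simulated data sets whose permutation p-value <= alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        rows, labels = simulate_dataset(rng, n_per_group, n_vars,
                                        delta, fraction)
        d2 = [[dist(r, s) ** 2 for s in rows] for r in rows]
        observed = pseudo_f(d2, labels)
        shuffled = list(labels)
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(shuffled)
            if pseudo_f(d2, shuffled) >= observed:
                hits += 1
        if (hits + 1) / (n_perm + 1) <= alpha:
            rejections += 1
    return rejections / n_sims
```

Calling `estimate_power(delta, fraction)` over a grid of mean shifts and fractions would trace out curves of the kind the caption describes. With `delta = 0` the same function estimates the type I error rate, which is the level-accuracy question the abstract raises.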
