Statistical properties of large data sets with linear latent features

Phys Rev E. 2022 Jul;106(1-1):014102. doi: 10.1103/PhysRevE.106.014102.

Abstract

Analytical understanding of how low-dimensional latent features reveal themselves in large-dimensional data is still lacking. We study this by defining a probabilistic linear latent features model with additive noise and by analytically and numerically computing the statistical distributions of pairwise correlations and eigenvalues of the data correlation matrix. This allows us to resolve the latent feature structure across a wide range of data regimes set by the number of recorded variables, observations, latent features, and the signal-to-noise ratio. We find a characteristic imprint of latent features in the distribution of correlations and eigenvalues and provide an analytic estimate for the boundary between signal and noise, even in the absence of a spectral gap.
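
The kind of model described in the abstract can be illustrated with a minimal numerical sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the Gaussian latent features and loadings, the 1/sqrt(n_latent) scaling, the function names, and the parameter values. The Marchenko-Pastur upper edge is used as a standard noise-only reference for the eigenvalues of a correlation matrix of independent variables; it is not the paper's analytic estimate of the signal-noise boundary.

```python
import numpy as np


def simulate_latent_data(n_obs, n_vars, n_latent, snr, rng):
    """Draw data = sqrt(snr) * (features @ loadings) / sqrt(n_latent) + noise.

    Assumed for illustration: i.i.d. standard normal features, loadings,
    and noise. The paper's exact parametrization may differ.
    """
    features = rng.standard_normal((n_obs, n_latent))   # latent values per observation
    loadings = rng.standard_normal((n_latent, n_vars))  # weight of each feature on each variable
    signal = features @ loadings / np.sqrt(n_latent)
    noise = rng.standard_normal((n_obs, n_vars))
    return np.sqrt(snr) * signal + noise


def correlation_statistics(data):
    """Return the eigenvalues (ascending) and the off-diagonal entries
    of the empirical correlation matrix between the columns of data."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)
    pairwise = corr[np.triu_indices_from(corr, k=1)]
    return eigenvalues, pairwise


def marchenko_pastur_upper_edge(n_obs, n_vars):
    """Upper edge of the Marchenko-Pastur bulk for unit-variance pure noise."""
    return (1.0 + np.sqrt(n_vars / n_obs)) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_obs, n_vars, n_latent, snr = 2000, 200, 3, 4.0  # hypothetical regime
    data = simulate_latent_data(n_obs, n_vars, n_latent, snr, rng)
    eigenvalues, pairwise = correlation_statistics(data)
    edge = marchenko_pastur_upper_edge(n_obs, n_vars)
    print("eigenvalues above the noise-only edge:", int(np.sum(eigenvalues > edge)))
    print("std of pairwise correlations:", float(pairwise.std()))
```

In this strong-signal regime the leading eigenvalues separate clearly from the bulk. Lowering `snr` or raising `n_latent` moves toward the regime without a spectral gap that the abstract refers to, where a simple noise-only threshold like this one is no longer expected to be sufficient.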