PLoS One. 2013 Nov 25;8(11):e78913.
doi: 10.1371/journal.pone.0078913. eCollection 2013.

SVD Identifies Transcript Length Distribution Functions From DNA Microarray Data and Reveals Evolutionary Forces Globally Affecting GBM Metabolism

Free PMC article

Nicolas M Bertagnolli et al. PLoS One. 2013.

Abstract

To search for evolutionary forces that might act upon transcript length, we use the singular value decomposition (SVD) to identify the length distribution functions of sets and subsets of human and yeast transcripts from profiles of mRNA abundance levels across gel electrophoresis migration distances that were previously measured by DNA microarrays. We show that the SVD identifies the transcript length distribution functions as "asymmetric generalized coherent states" from the DNA microarray data and with no a-priori assumptions. Comparing subsets of human and yeast transcripts of the same gene ontology annotations, we find that in both disparate eukaryotes, transcripts involved in protein synthesis or mitochondrial metabolism are significantly shorter than typical, and in particular, significantly shorter than those involved in glucose metabolism. Comparing the subsets of human transcripts that are overexpressed in glioblastoma multiforme (GBM) or normal brain tissue samples from The Cancer Genome Atlas, we find that GBM maintains normal brain overexpression of significantly short transcripts, enriched in transcripts that are involved in protein synthesis or mitochondrial metabolism, but suppresses normal overexpression of significantly longer transcripts, enriched in transcripts that are involved in glucose metabolism and brain activity. These global relations among transcript length, cellular metabolism and tumor development suggest a previously unrecognized physical mode for tumor and normal cells to differentially regulate metabolism in a transcript length-dependent manner. The identified distribution functions support a previous hypothesis from mathematical modeling of evolutionary forces that act upon transcript length in the manner of the restoring force of the harmonic oscillator.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The SVD identifies the length distribution functions of the human and yeast global sets and subsets of transcripts as asymmetric generalized coherent states from the DNA microarray data and with no a-priori assumptions.
In general, it is not necessarily possible to identify a distribution function from data that sample the function, because identifying a distribution function is mathematically equivalent to estimating the infinite number of moments associated with it. The SVD of data that sample a distribution function, however, may approximately identify the distribution function from the data and with no a-priori assumptions, because identifying a distribution function is also equivalent to estimating its eigenfunctions and corresponding eigenvalues. (a) The SVD of Equation (1) of the matrix D, which tabulates the mRNA abundance levels of the human global set of transcripts, in increasing order of the transcript lengths as determined by Hurowitz et al., across X gel electrophoresis migration distances, uncovers X unique left singular vectors, X corresponding singular values and X corresponding right singular vectors. The orthonormal right singular vectors are also eigenvectors of the matrix $D^{T}D$, with eigenvalues equal to the squares of the corresponding singular values. The finite (and, possibly, few) most significant eigenvectors and corresponding eigenvalues, where significance is measured by the fraction of the information in the data that each captures, may approximate the data. (b) The finite and few most significant eigenvectors uncovered by the SVD of the human global transcript length distribution data fit a series of orthogonal asymmetric Hermite functions, where the (q+1)th eigenvector is proportional to the qth asymmetric Hermite function of Equations (2) and (3). (c) The corresponding eigenvalues and eigenvalue fractions fit a corresponding geometric series. (d) The series of asymmetric Hermite functions and the corresponding geometric series are known to be among the eigenfunctions and corresponding eigenvalues, respectively, of the asymmetric generalized coherent state of Equations (4) and (5). Therefore, the SVD identifies the asymmetric generalized coherent state, in which each transcript's profile fits an asymmetric Gaussian and the distribution of the peaks of these profiles also fits an asymmetric Gaussian, as the distribution function that the data sample.
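The linear-algebra relation used in panel (a) can be checked numerically. The following is a minimal sketch, not the authors' pipeline: it assumes NumPy, and the random matrix `D`, its dimensions (200 rows for transcripts, 12 columns for migration distances) and all variable names are hypothetical stand-ins. It confirms that the right singular vectors of a matrix D are orthonormal eigenvectors of the product of D transposed with D, and computes one common definition of the fraction of information captured by each eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for D: 200 transcripts (rows) across
# 12 gel migration distances (columns). Not the paper's data.
D = rng.normal(size=(200, 12))

# Thin SVD: D = U @ diag(s) @ Vt, with the right singular vectors as rows of Vt.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Each right singular vector v_k satisfies (D^T D) v_k = s_k^2 v_k.
gram = D.T @ D
eigenvalues = s ** 2
residual = gram @ Vt.T - Vt.T * eigenvalues  # column k is scaled by eigenvalues[k]
max_residual = float(np.abs(residual).max())

# Assumed definition: fraction of the overall information captured by
# each eigenvector, taken as its squared singular value over the total.
fractions = eigenvalues / eigenvalues.sum()

print("largest eigenvector residual:", max_residual)
print("eigenvalue fractions:", np.round(fractions, 3))
```

With real data the fractions typically decay quickly, which is what allows a few eigenvectors to approximate the whole matrix.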
Figure 2. The SVD of the transcript length distribution data of the human and yeast global sets and protein synthesis subsets.
(a) Raster display of the eigenvectors formula image of Equation (1) of the human global set, i.e., formula image patterns of mRNA abundance level variation across the 50 human DNA microarrays, with overabundance (red), no change in abundance (black) and underabundance (green) around the “ground state” of abundance, which is captured by the first, most significant eigenvector. The inflection points of the formula imageth eigenvector approximately sample the asymmetric parabola formula image (blue), where formula image is the generalized Hooke's constant of Equation (3). (b) Bar chart of the corresponding eigenvalue fractions formula image, with the normalized Shannon entropy formula image. The formula image eigenvalues formula image and eigenvalue fractions approximately fit the geometric series formula image (blue), with formula image. (c) Line-joined graphs of the first (red), second (orange), third (green), fourth (blue) and fifth (violet) most significant eigenvectors of the human global set. The formula imageth eigenvector is approximately proportional to the qth asymmetric Hermite function formula image of Equation (2), where the correlation is in the range of 0.75 to 0.84. The equilibrium formula image of the asymmetric parabola (dashed and shaded), and therefore also of the corresponding transcript length distribution function, is at the gel migration distance of 84 mm, corresponding to a transcript length of formula image1,700±100 nt. The asymmetry is formula image. (d) Graphs of the first (red) through fifth (violet) eigenvectors of the human translation (GO:0006412) subset. The equilibrium is shifted from that of the human global set to the greater migration distance of 96 mm and lesser transcript length of 1,125±75 nt. The width is lesser than that of the human global set, where the magnitude k of the generalized Hooke's constant formula image is twice that of the global set, while the asymmetry s is similar. 
(e) Eigenvectors of the human ribosome (GO:0005840) subset. The equilibrium is shifted from those of the global set and translation subset to the greater migration distance of 100 mm and lesser transcript length of 975±75 nt. The width is lesser than those of the global set or translation subset, where k is three times that of the global set, while s is similar. (f) Raster display of the formula image eigenvectors of the yeast global set. (g) Bar chart of the corresponding eigenvalue fractions. The formula image eigenvalues and eigenvalue fractions approximately fit the geometric series formula image (blue), with formula image for the yeast global set. (h) Line-joined graphs of the first (red) through fifth (violet) eigenvectors of the yeast global set. The formula imageth eigenvector is approximately proportional to the qth asymmetric Hermite function, where the correlation is in the range of 0.85 to 0.92. The equilibrium of the transcript length distribution function of the global yeast set is at the gel migration distance of 78 mm and the transcript length of formula image1,025±100 nt. The asymmetry formula image is similar to that of the human global set. (i) Eigenvectors of the yeast translation subset. The equilibrium is shifted from that of the yeast global set to the greater migration distance of 84 mm and lesser transcript length of 775±75 nt. The width is lesser than that of the yeast global set, where the magnitude k of the generalized Hooke's constant is twice that of the global set, while the asymmetry s is similar. (j) Eigenvectors of the yeast ribosome subset. The equilibrium is similar to that of the yeast translation subset. The width is lesser than those of the global set or translation subset, where k is three times that of the global set, while s is similar.
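Panels (b) and (g) summarize the eigenvalue fractions by a normalized Shannon entropy and a geometric-series fit. The formulas themselves are not recoverable from this page, so the sketch below rests on labeled assumptions: the fraction is each squared singular value divided by the sum of squares, the entropy is normalized by the logarithm of the number of eigenvalues, and the common ratio is estimated by a least-squares line through the logarithms of the fractions. The function names are hypothetical.

```python
import math

def eigenvalue_fractions(singular_values):
    """Assumed definition: squared singular value over the sum of squares."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    return [x / total for x in squares]

def normalized_shannon_entropy(fractions):
    """Entropy of the fractions divided by log(n): 0 if one eigenvector
    captures everything, 1 if all eigenvectors contribute equally."""
    n = len(fractions)
    entropy = -sum(p * math.log(p) for p in fractions if p > 0)
    return entropy / math.log(n)

def geometric_ratio(fractions):
    """Estimate the common ratio of a geometric series by fitting a
    least-squares line to log(fraction) against the index."""
    n = len(fractions)
    logs = [math.log(p) for p in fractions]
    mean_k = (n - 1) / 2
    mean_log = sum(logs) / n
    num = sum((k - mean_k) * (v - mean_log) for k, v in enumerate(logs))
    den = sum((k - mean_k) ** 2 for k in range(n))
    return math.exp(num / den)

# Illustrative singular values only: their squares form a geometric
# series with common ratio 0.25.
example = eigenvalue_fractions([1.0, 0.5, 0.25, 0.125])
print(geometric_ratio(example))
print(normalized_shannon_entropy(example))
```

A low normalized entropy indicates that a few eigenvectors capture most of the information, which is the regime the captions describe.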
Figure 3. Asymmetric generalized coherent states fit the transcript length distributions of the human and yeast global sets.
(a) The overall transcript profile of the human global set, i.e., the sum of the profiles of the human transcripts (line-joined), is approximately proportional to the asymmetric generalized coherent state formula image of Equation (4) with formula image, i.e., the asymmetric Gaussian formula image (dashed and shaded), with the equilibrium formula image at the migration distance of 84 mm, where the correlation is formula image0.99. Graphs of formula image describe the contributions of the subsets of transcript profiles, which peaks formula image are at the migration distances of 124 (red) through 34 (violet) mm, to the overall transcript profile of the human global set. (b) The profiles of the human genes COX7A2 (green), CDK4 (blue) and PFKP (red) are approximately proportional to the asymmetric Gaussians formula image (dashed and shaded) centered at the migration distances of 106, 86 and 72 mm, where the correlations are 0.99, 0.88 and 0.73, respectively. The transcript of COX7A2, which is involved in mitochondrial metabolism, is overexpressed in both the normal brain and GBM tumor, at each of the overexpression cutoffs of formula image. The transcript of CDK4 is overexpressed in the GBM tumor only. The transcript of PFKP, which is involved in glucose metabolism, is overexpressed in the normal brain only. (c) The overall transcript profile of the yeast global set (line-joined) is approximately proportional to the asymmetric Gaussian formula image (dashed and shaded), with the equilibrium formula image at the migration distance of 78 mm. Graphs of formula image describe the contributions of the subsets of transcript profiles, which peaks formula image are at the migration distances of 96 (red) through 42 (violet) mm, to the overall transcript profile of the yeast global set. 
(d) The profiles of the yeast genes COX9 (green), CDC28 (blue) and PFK2 (red) are approximately proportional to the asymmetric Gaussian formula image (dashed and shaded) centered at the migration distances of 90, 74 and 52 mm, where the correlations are 0.96, 0.83 and 0.89, respectively. Note that COX9 is involved in mitochondrial metabolism, whereas PFK2 is involved in glucose metabolism.
Figure 4. Eigenvectors and overall transcript profiles of the length distribution data of the subsets of human transcripts overexpressed in either the normal brain only, the GBM tumor only or both.
(a) Line-joined graphs of the first (red), second (orange), third (green), fourth (blue) and fifth (violet) most significant eigenvectors of the subset of human transcripts that are most abundant in the normal brain but not the GBM tumor (including, e.g., PFKP), at the overexpression cutoff of formula image. The formula imageth eigenvector is approximately proportional to the qth asymmetric Hermite function, where the correlation is in the range of 0.6 to 0.93. The inflection points of the formula imageth eigenvector approximately sample the asymmetric parabola formula image (dashed and shaded). The equilibrium formula image of the asymmetric parabola, and therefore also of the corresponding transcript length distribution function, is shifted from that of the human global set to the lesser migration distance of 80 mm and greater transcript length of formula image1,875±100 nt. (b) Eigenvectors of the subset of transcripts that are most abundant in the GBM tumor but not the normal brain (including, e.g., CDK4), at the cutoff of formula image. The equilibrium is shifted from those of the normal brain only subset and global set to the greater migration distance of 90 mm and lesser transcript length of 1,375±100 nt. The width of the corresponding length distribution function of the tumor only subset is lesser than that of the normal only subset, where the asymmetry formula image of the generalized Hooke's constant formula image of the GBM tumor only subset is twice that in the normal brain only subset, while the magnitude k is similar. (c) Eigenvectors of the subset of transcripts that are most abundant in both the normal and tumor (including, e.g., COX7A2), at the cutoff of formula image. The equilibrium is shifted to the greater migration distance of 96 mm and lesser transcript length of 1,125±75 nt. 
The width is lesser than those of the normal only subset as well as the tumor only subset, where the asymmetry is four times that in the normal only subset, while the magnitude is similar. (d) The asymmetric parabolas that fit the inflection points of the eigenvectors of the length distribution data of the subsets of human transcripts overexpressed in either the normal only (red and shaded), the tumor only (blue and shaded) or both (green and shaded). The equilibria of these parabolas are at increasing migration distances, corresponding to decreasing transcript lengths, and with decreasing widths. (e) The overall transcript profile of the subset of human transcripts that are most abundant in the normal brain only, i.e., the sum of the profiles of these transcripts (line-joined), is approximately proportional to the asymmetric Gaussian formula image (dashed and shaded), with the equilibrium formula image at the migration distance of 80 mm, where the correlation is >0.99. (f) The overall profile of the subset of human transcripts that are most abundant in the tumor only (line-joined) is approximately proportional to the asymmetric Gaussian formula image (dashed and shaded), with the equilibrium at 90 mm. (g) The overall profile of the subset of human transcripts that are most abundant in both the normal and tumor (line-joined) is approximately proportional to the asymmetric Gaussian formula image (dashed and shaded), with the equilibrium at 96 mm. (h) The asymmetric Gaussians that fit the overall transcript profiles of the length distribution data of the subsets of human transcripts overexpressed in either the normal only (red and shaded), the tumor only (blue and shaded) or both (green and shaded). The equilibria of these Gaussians are at increasing migration distances, corresponding to decreasing transcript lengths.
Figure 5. Average transcript and gene lengths of the human subsets overexpressed in the normal brain or the GBM tumor.
(a) Average transcript lengths of the human subsets that are overexpressed in the normal brain only (red), the normal brain overall (violet), the GBM tumor only (blue), the GBM tumor overall (orange) or both the normal brain and GBM tumor (green), at each of the overexpression cutoffs of formula image, relative to the average transcript length of the global set of 4,109 transcripts (black). (b) Average maximum gene lengths of the human subsets that are overexpressed in the normal brain or the GBM tumor at each of the cutoffs, relative to the average maximum gene length of the global set of 11,631 genes. (c) Average minimum gene lengths of the human subsets relative to that of the global set.
Figure 6. Overall transcript profiles and Venn diagrams of the subsets of human transcripts overexpressed in the normal brain or the GBM tumor.
(a) The overall transcript profiles of the subsets of human transcripts that are most abundant in the normal brain only (red), the normal brain overall (violet), the GBM tumor only (blue), the GBM tumor overall (orange) or both the normal brain and GBM tumor (green). The equilibria of the profiles of the normal only subset, the human global set, the tumor only subset and the subset of transcripts that are overexpressed in both the normal and tumor are at the increasing migration distances of 80 (red), 84 (black), 90 (blue) and 96 (green) mm, spanning a difference of 16 mm of gel migration distance (shaded), and corresponding to decreasing transcript lengths. (b) The average transcript lengths formula image of Equation (8) of the subsets of M transcripts each that are most abundant in the normal only (red), the normal overall (violet), the tumor only (blue), the tumor overall (orange) or both the normal and tumor (green), relative to the average transcript length formula image of Equation (6) of the human global set of N transcripts, at the overexpression cutoff of formula image. The relation between a gene's overexpression in either the normal overall, the tumor only, the tumor overall or both the normal and tumor and a transcript that is shorter than typical is statistically significant, with the P-value of Equation (11) <0.05 for the observed differences in the average transcript lengths of these subsets and that of the human global set (Table 1). (c) The overall transcript profiles of the subsets of human transcripts that are most abundant in the normal brain only (red), the normal brain overall (violet), the GBM tumor overall (orange) or both the normal brain and GBM tumor (green). 
(d) The average transcript length differences formula image of the subsets of L transcripts each that are most abundant in the normal only (red), the tumor overall (orange) or both the normal and tumor (green), relative to the average transcript length formula image of the normal overall subset of M transcripts, at the overexpression cutoff of formula image. The relations between a gene's overexpression in the tumor overall or in both the normal and tumor and a transcript that is shorter than typical for a gene that is overexpressed in the normal overall are statistically significant, with the P-value of Equation (12) <0.05 (Table 2). Similarly, the relation between a gene's overexpression in the normal only and a transcript that is longer than typical for a gene that is overexpressed in the normal overall is statistically significant. (e) The overall transcript profiles of the subsets of human transcripts that are most abundant in the normal brain but not the GBM tumor (red) or in both the normal brain and GBM tumor (green). (f) The average transcript length differences formula image of Equation (13) of the subsets of L transcripts that are most abundant in the normal only (red) or in both the normal and tumor (green), relative to the average transcript length formula image of the subsets of transcripts that are most abundant in both the normal and tumor (green) or in the normal only (red), respectively, at the overexpression cutoff of formula image. The relation between a gene's overexpression in the normal brain but not the GBM tumor and a transcript that is longer than typical for a gene that is overexpressed in both the normal brain and GBM tumor is statistically significant, with the P-value of Equation (15) <0.05.
