RNA Biol. 2013 Jul;10(7):1170-9.
doi: 10.4161/rna.25038. Epub 2013 May 20.

Computational Identification of Functional RNA Homologs in Metagenomic Data

Free PMC article

Eric P Nawrocki et al. RNA Biol.


A key step toward understanding a metagenomics data set is the identification of functional sequence elements within it, such as protein coding genes and structural RNAs. Relative to protein coding genes, structural RNAs are more difficult to identify because of their reduced alphabet size, lack of open reading frames, and short length. Infernal is a software package that implements "covariance models" (CMs) for RNA homology search, which harness both sequence and structural conservation when searching for RNA homologs. Thanks to the added statistical signal inherent in the secondary structure conservation of many RNA families, Infernal is more powerful than sequence-only methods such as BLAST and profile HMMs. Together with the Rfam database of CMs, Infernal is a useful tool for identifying RNAs in metagenomics data sets.
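A toy calculation can illustrate where the added signal from structure comes from. The sketch below is not Infernal's scoring scheme; it assumes a uniform background and a hypothetical pair of alignment columns that always form one of the four Watson-Crick pairs, each equally often. Each column on its own looks like background, so a sequence-only profile gains nothing from it, while a model that scores the two columns jointly gains 2 bits.

```python
import math

BASES = "ACGU"
BACKGROUND = 0.25  # assumption: uniform background frequency per base

# Hypothetical column pair: the four Watson-Crick pairs, equally frequent.
pair_freq = {("A", "U"): 0.25, ("U", "A"): 0.25, ("C", "G"): 0.25, ("G", "C"): 0.25}

def marginal(freqs, index):
    """Single-column frequencies obtained by summing over the partner column."""
    out = {b: 0.0 for b in BASES}
    for pair, p in freqs.items():
        out[pair[index]] += p
    return out

def column_bits(col):
    """Expected log-odds score (bits) of one column scored independently."""
    return sum(p * math.log2(p / BACKGROUND) for p in col.values() if p > 0)

def pair_bits(freqs):
    """Expected log-odds score (bits) of the two columns scored jointly."""
    return sum(p * math.log2(p / BACKGROUND**2) for p in freqs.values() if p > 0)

sequence_only_bits = column_bits(marginal(pair_freq, 0)) + column_bits(marginal(pair_freq, 1))
structure_bits = pair_bits(pair_freq)
extra_bits = structure_bits - sequence_only_bits
print(f"sequence-only: {sequence_only_bits:.2f} bits, with pairing: {structure_bits:.2f} bits")
```

Real families fall between this extreme and fully conserved columns, where the pairing term adds little beyond what the sequence profile already captures.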

Keywords: homology search; metagenomics; noncoding RNA; structural RNA.


Figure 1. Homology search improvement achieved by utilizing additional information for proteins and structured noncoding RNAs. Examples of identifying coding region homologies by amino acid sequence vs. nucleic acid sequence comparison (BLASTP vs. BLASTN, dashed lines), compared with identifying RNA homologies by primary sequence vs. structure/sequence comparison (BLASTN vs. Infernal, solid lines) for several ribonucleoprotein complexes. Filled circles correspond to BLASTP protein searches and Infernal RNA searches. Open circles correspond to BLASTN coding region searches and BLASTN RNA searches. Question marks indicate targets that were not found by the indicated search method. Each point is labeled with its E-value (“E”) and fractional coverage (“cov”), calculated as the fraction of query positions included in the hit alignment. For each query/target pair, the query sequence was searched against the target genome (for coding sequence and RNA searches) or predicted proteome (for amino acid sequence searches) using the indicated search programs. For example, in the leftmost column, when we use the SRP protein ffh in E. coli as a query against the B. subtilis proteome with BLASTP, the top scoring hit is to the ffh protein with an E-value of 6 × 10⁻¹⁵⁹ and spans 95% of the full protein sequence. Using the coding sequence of E. coli’s ffh protein as a BLASTN query against the B. subtilis genome returns a top hit comprising 61% of the ffh coding sequence with an E-value of 6 × 10⁻⁴³. Thus, using the protein sequence instead of the coding sequence increases the statistical significance of the ffh homology match by 116 orders of magnitude, indicated by the length of the dashed line in the leftmost box of the figure. Reported E-values are for these single genome/proteome searches, and so would be higher for searches of larger databases.
Query RNAs were selected from candidates found by Infernal in each listed query genome’s sequence using the Rfam 11.0 CM for the appropriate family (listed below). Each query RNA was used to build a CM using the Infernal-imposed Rfam structure, and each CM was calibrated and used to search the target genomes. Rfam family IDs for each family, in row order are: RF00169, RF00001, RF00168, RF00373, RF00234, RF00174. For riboswitches, the protein components are always immediately downstream of the RNA components.
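The two quantities plotted in Figure 1 are simple to compute. As a minimal sketch, the snippet below reproduces the 116 orders of magnitude quoted for ffh from the two E-values in the caption, and computes fractional coverage as defined there. The function names and the 61-of-100 alignment are illustrative assumptions, not values or code from the paper.

```python
import math

def orders_of_magnitude_gained(evalue_weak, evalue_strong):
    """Difference in log10 E-value between a weaker and a stronger hit."""
    return math.log10(evalue_weak) - math.log10(evalue_strong)

def fractional_coverage(aligned_query_positions, query_length):
    """Fraction of query positions included in the hit alignment."""
    return aligned_query_positions / query_length

# E-values quoted in the caption for E. coli ffh against B. subtilis.
blastn_evalue = 6e-43   # coding sequence vs. genome
blastp_evalue = 6e-159  # protein vs. predicted proteome
gain = orders_of_magnitude_gained(blastn_evalue, blastp_evalue)
print(round(gain))  # 116

# Hypothetical alignment covering 61 of 100 query positions.
coverage = fractional_coverage(61, 100)
print(coverage)
```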
Figure 2. Additional information (in bits) gained by structure/sequence profiles vs. sequence-only profiles for various RNA families. Structure/sequence profiles are most advantageous for families with less primary sequence information (toward left) and more secondary structure information (toward top), so Rfam families that gain the most from including secondary structure terms in a homology search are those toward the upper left quadrant. Data shown for the 164 Rfam release 11.0 families with 50 or more sequences in the “seed” alignment (with the exception of SSU rRNA bacteria and SSU rRNA eukarya which would have been outliers on the plot with x-axis values above 1,900 bits and y-axis values above 150 bits). For each family, the seed alignment was used to build two profile models, one with structure (sequence/structure profile CM model) and one without (sequence profile HMM model). From each model, 10,000 sequences were generated and scored, and the average score per sampled sequence was calculated. Several of the outlying points are labeled by the name of RNA family as given by Rfam. Note that the x-axis is drawn on a log scale. Models were built and sequences were generated and scored using Infernal version 1.1 programs cmbuild, cmemit and cmalign. A slightly modified version of this figure will appear in a book to be published by Springer Humana Press entitled “RNA sequence, structure and function: computational and bioinformatic methods,” edited by Jan Gorodkin and Walter Ruzzo, in chapter 9, entitled “Annotating functional RNAs in genomes using Infernal” as Figure 2. This figure is included here with kind permission from Springer Science+Business Media B.V.
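The sample-and-average procedure in the Figure 2 caption can be mimicked on a toy scale. The sketch below does not call cmbuild, cmemit or cmalign. It assumes a hypothetical family consisting of a single base pair with made-up pair frequencies and a uniform background, derives a sequence-only model from the per-column marginals, samples sequences from each model, scores each sample under the model that generated it, and averages the scores in bits.

```python
import math
import random

BASES = "ACGU"
BG = 0.25  # assumption: uniform background frequency per base

# Hypothetical toy family: one base pair with unevenly favored Watson-Crick pairs.
PAIR_MODEL = {"GC": 0.4, "CG": 0.3, "AU": 0.2, "UA": 0.1}

def marginals(pair_model):
    """Per-column base frequencies implied by the pair model."""
    left = {b: 0.0 for b in BASES}
    right = {b: 0.0 for b in BASES}
    for pair, p in pair_model.items():
        left[pair[0]] += p
        right[pair[1]] += p
    return left, right

LEFT, RIGHT = marginals(PAIR_MODEL)

def sample(dist, rng):
    """Draw one key from a frequency dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def pair_score(seq):
    """Log-odds score (bits) under the structure-aware toy model."""
    return math.log2(PAIR_MODEL[seq] / (BG * BG))

def profile_score(seq):
    """Log-odds score (bits) under the sequence-only toy model."""
    return math.log2(LEFT[seq[0]] / BG) + math.log2(RIGHT[seq[1]] / BG)

def average_scores(n, seed=0):
    """Sample n sequences from each model and average their own-model scores."""
    rng = random.Random(seed)
    with_structure = sum(pair_score(sample(PAIR_MODEL, rng)) for _ in range(n)) / n
    seq_only = sum(
        profile_score(sample(LEFT, rng) + sample(RIGHT, rng)) for _ in range(n)
    ) / n
    return with_structure, seq_only

with_structure, seq_only = average_scores(10_000)
print(f"structure/sequence: {with_structure:.2f} bits, sequence-only: {seq_only:.2f} bits")
```

In the toy model the difference between the two averages plays the role of the y-axis in Figure 2, and the sequence-only average plays the role of the x-axis.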
Figure 3. Secondary structure of three cobalamin riboswitches. Using the E. coli sequence as a query against their respective genomes, BLASTN detects the Y. enterocolitica cobalamin riboswitch with a significant E-value, but not the A. baumannii riboswitch. Infernal searches with a CM constructed from the E. coli sequence and structure (from the Rfam seed alignment for family RF00174; ref. 52) find both riboswitches with greater statistical significance. These example searches are also used in Figure 1. Note that the A. baumannii riboswitch prediction by Infernal is not full length, and excludes the 5′ and 3′ ends. Presumably, the A. baumannii riboswitch extends past the boundaries of the Infernal prediction, but is sufficiently diverged from the E. coli sequence and structure to not be included in the optimal hit alignment. Structures of the targets and percent identity figures were derived from the highest scoring CM alignment of each target to the query (E. coli). Sequence substitutions and insertions in the targets with respect to the query are shown in gray. Inserted residues with respect to the query are shown in lowercase. Basepairs in the Rfam annotated structure are connected by solid lines. All riboswitches are immediately upstream (5′; within 100 residues) of btuB vitamin B12 transporter protein coding genes in their respective genomes.
