Proc Natl Acad Sci U S A, 99 (26), 16899-903

Generation and Initial Analysis of More Than 15,000 Full-Length Human and Mouse cDNA Sequences

Robert L Strausberg, Elise A Feingold, Lynette H Grouse, Jeffery G Derge, Richard D Klausner, Francis S Collins, Lukas Wagner, Carolyn M Shenmen, Gregory D Schuler, Stephen F Altschul, Barry Zeeberg, Kenneth H Buetow, Carl F Schaefer, Narayan K Bhat, Ralph F Hopkins, Heather Jordan, Troy Moore, Steve I Max, Jun Wang, Florence Hsieh, Luda Diatchenko, Kate Marusina, Andrew A Farmer, Gerald M Rubin, Ling Hong, Mark Stapleton, M Bento Soares, Maria F Bonaldo, Tom L Casavant, Todd E Scheetz, Michael J Brownstein, Ted B Usdin, Shiraki Toshiyuki, Piero Carninci, Christa Prange, Sam S Raha, Naomi A Loquellano, Garrick J Peters, Rick D Abramson, Sara J Mullahy, Stephanie A Bosak, Paul J McEwan, Kevin J McKernan, Joel A Malek, Preethi H Gunaratne, Stephen Richards, Kim C Worley, Sarah Hale, Angela M Garcia, Laura J Gay, Stephen W Hulyk, Debbie K Villalon, Donna M Muzny, Erica J Sodergren, Xiuhua Lu, Richard A Gibbs, Jessica Fahey, Erin Helton, Mark Ketteman, Anuradha Madan, Stephanie Rodrigues, Amy Sanchez, Michelle Whiting, Anup Madan, Alice C Young, Yuriy Shevchenko, Gerard G Bouffard, Robert W Blakesley, Jeffrey W Touchman, Eric D Green, Mark C Dickson, Alex C Rodriguez, Jane Grimwood, Jeremy Schmutz, Richard M Myers, Yaron S N Butterfield, Martin I Krzywinski, Ursula Skalska, Duane E Smailus, Angelique Schnerch, Jacqueline E Schein, Steven J M Jones, Marco A Marra, Mammalian Gene Collection Program Team


Robert L Strausberg et al. Proc Natl Acad Sci U S A.


The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see


Fig. 1.
Tests for identifying putative full-ORF cDNA clones. In the first test, 5′ ESTs were compared with all available ORF-complete mRNA sequences from the same organism (human or mouse) in the RefSeq collection. When a 5′ EST aligned (>95% homology for 100 or more base pairs) at or upstream of an annotated translation start site, that clone was considered to contain a candidate full-ORF cDNA. However, if the 5′ EST aligned downstream from an annotated translational start site, that clone was eliminated from consideration, although some of these may be full-ORF clones with an alternate 5′ translational start site. Any 5′ ESTs that did not match a RefSeq sequence were subjected to additional tests. In the second test, the six possible frame translations of each EST were compared with the subset of GenBank protein records originating from Protein Information Resource (15), Protein Data Base (16), or SwissProt (17) that begin with methionine. This test identifies ESTs from genes with an N terminus similar but not identical to a known protein. Thus, in cases where a protein match (<90% identity but with an E value of less than or equal to 10⁻⁶) was detected and incorporated the known initiating methionine, the associated cDNA clone was considered a candidate to have a complete ORF. In the third test, we compared each 5′ EST to a collection of predicted genes derived from the human genome sequence by genomescan (18). When a 5′ EST aligned (95% identity for 100 or more bp) to a gene prediction that begins with ATG, the associated clone was considered a candidate. In the fourth test, we used the new program hkscan, which looks for evidence of a transition from noncoding to coding sequence (described in Materials and Methods).
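The decision rule of the first test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MGC pipeline: the `Hit` record, its field names, and the three return labels are hypothetical, and "at or upstream of the start site" is read here as "the alignment begins at or before the annotated start codon on the RefSeq mRNA".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Hypothetical summary of one 5' EST alignment to a RefSeq mRNA."""
    percent_identity: float  # identity over the aligned region, in percent
    aligned_length: int      # aligned length in base pairs
    mrna_start: int          # 1-based mRNA position where the alignment begins
    cds_start: int           # 1-based position of the annotated start codon


def classify_refseq_test(hit: Optional[Hit]) -> str:
    """Apply the first test as described in the caption (assumed reading)."""
    # No usable RefSeq match: the EST moves on to the remaining tests.
    if hit is None or hit.percent_identity <= 95.0 or hit.aligned_length < 100:
        return "further_tests"
    # Alignment starts at or upstream of the start codon: candidate clone.
    if hit.mrna_start <= hit.cds_start:
        return "candidate"
    # Alignment starts downstream of the start codon: clone is dropped.
    return "eliminated"
```

For example, `classify_refseq_test(Hit(98.0, 150, 10, 60))` returns `"candidate"`, while the same hit with `mrna_start=120` returns `"eliminated"`.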
Fig. 2.
Efficacy of cDNA clone selection algorithms used by the MGC Program. Three of the tests (protein homology, genomescan, and hkscan) were retroactively assessed for their ability to identify full-ORF clones within a set of 5,653 established full-ORF RefSeq sequences. Only 301 of the RefSeq sequences were identified by all three of the tests, whereas 2,002 were identified by only one of the three tests. When used in combination, the three tests were effective in identifying 5,601 (>99%) of the RefSeq sequences.
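The counts in this caption imply two figures it does not state directly. The short calculation below derives them, assuming only that every identified sequence was found by exactly one, exactly two, or all three tests; the variable names are illustrative.

```python
# Counts taken from the Fig. 2 caption.
refseq_total = 5653    # established full-ORF RefSeq sequences assessed
found_by_any = 5601    # identified by at least one of the three tests
found_by_all = 301     # identified by all three tests
found_by_one = 2002    # identified by exactly one test

# Sequences found by exactly two tests fill the rest of the union.
found_by_two = found_by_any - found_by_all - found_by_one  # 3298

# Sequences that none of the three tests identified.
missed = refseq_total - found_by_any  # 52

# Combined sensitivity of the three tests, as a fraction.
sensitivity = found_by_any / refseq_total
```

So 3,298 sequences were picked up by exactly two tests and 52 by none, which is consistent with the stated combined recovery of more than 99%.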
Fig. 3.
ORF sizes of MGC full-ORF genes compared with RefSeq genes. The ORFs of MGC full-ORF genes and RefSeq genes were binned in 100-nt increments. The absolute numbers of MGC and RefSeq genes are compared for each size increment. RefSeq genes are represented by a solid line, the total of MGC genes is shown with the dashed lines, and MGC genes within the RefSeq set are depicted with dotted lines.
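The 100-nt binning behind this comparison can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and each bin is assumed to be labeled by its lower bound and to cover the half-open range [lower, lower + 100).

```python
from collections import Counter
from typing import Iterable


def bin_orf_lengths(lengths: Iterable[int], width: int = 100) -> Counter:
    """Count ORF lengths (in nt) per bin, keyed by each bin's lower bound."""
    # Integer division maps e.g. 1234 -> 12, then back to the bound 1200.
    return Counter((n // width) * width for n in lengths)
```

Applying the function separately to the MGC and RefSeq ORF lengths gives two count tables that can be compared bin by bin; for example, `bin_orf_lengths([99, 100, 150, 1234])` counts one ORF in bin 0, two in bin 100, and one in bin 1200.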

