Effects of short read quality and quantity on a de novo vertebrate transcriptome assembly

Comp Biochem Physiol C Toxicol Pharmacol. 2012 Jan;155(1):95-101. doi: 10.1016/j.cbpc.2011.05.012. Epub 2011 Jun 1.


For many researchers, next-generation sequencing data holds the key to answering a category of questions that were previously out of reach. One of the important and challenging steps in achieving these goals is accurately assembling the massive quantity of short sequencing reads into full nucleic acid sequences. For research groups working with non-model or wild systems, short read assembly can pose a significant challenge because pre-existing EST or genome reference libraries are lacking. While many publications describe the overall process of sequencing and assembly, few address how many reads, and what types of reads, are best for assembly. The goal of this project was to use real-world data to explore the effects of read quantity and short read quality scores on the resulting de novo assemblies. Using several samples of short reads of various sizes and qualities, we produced many assemblies in an automated manner. We observe how read length, read quality, and read quantity affect the resulting assemblies and provide some general recommendations based on our real-world data set.
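As an illustration of how read sets of differing quality and quantity might be prepared before assembly, the following is a minimal sketch only. It is not the authors' pipeline: the function names (`mean_phred`, `parse_fastq`, `filter_and_subsample`), the Phred+33 quality encoding, the mean-quality filter, and the fixed random seed are all assumptions chosen for the example.

```python
import random


def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 is assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from FASTQ lines.

    Assumes well-formed input: exactly four lines per record, no blank lines.
    """
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def filter_and_subsample(records, min_mean_quality, n_reads, seed=0):
    """Keep reads at or above a mean quality, then draw up to n_reads at random."""
    kept = [r for r in records if mean_phred(r[2]) >= min_mean_quality]
    if n_reads >= len(kept):
        return kept
    # A seeded generator keeps each subsample reproducible across runs.
    return random.Random(seed).sample(kept, n_reads)


# Hypothetical two-read input: one high-quality read and one low-quality read.
example_lines = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "ACGT\n", "+\n", "!!!!\n",
]
example_records = list(parse_fastq(example_lines))
example_kept = filter_and_subsample(example_records, min_mean_quality=20, n_reads=10)
```

Varying `min_mean_quality` and `n_reads` over a grid would yield a series of read sets, each of which could be passed to an assembler of choice to compare the resulting assemblies.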

Publication types

  • Research Support, American Recovery and Reinvestment Act
  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Animals
  • Computational Biology / methods
  • Contig Mapping / methods
  • Cyprinodontiformes / genetics
  • Databases, Genetic
  • Gene Expression Profiling / methods*
  • Sequence Analysis / methods
  • Software*