Sampling strategies for distances between DNA sequences

Biometrics. 1990 Sep;46(3):551-82.

Abstract

An international effort is now underway to obtain the DNA sequence for the entire human genome (Watson and Jordan, 1989, Genomics 5, 654-656; Barnhart, 1989, Genomics 5, 657-660). This Human Genome Initiative will generate sequence data from several species other than humans, and will result in several copies per species of at least some regions of the genome. Although the project has generated much interest, it is but one aspect of the widespread effort to generate DNA sequence data. Published sequences are collected in common databases, and release 63 of GenBank in March 1990 contained 40,127,752 bases from 33,337 reported sequences (News from GenBank 3; Mountain View, California: Intelligenetics, Inc., 1990). Large though this database is, it is only about 1% of the number of bases in the human genome. Interpretations of data of such magnitude are going to require the collaborative efforts of biometricians and molecular biologists, and an aim of this paper is to show that there is also a role for readers of this journal in the design of surveys of DNA sequences. Discussion here will center on the use of sequence data in evolutionary studies, where some region of DNA is sequenced in several different species. The object is to infer the evolutionary history of that particular region, or of the species themselves. Statistical issues in the very important studies on sequences to locate and characterize regions responsible for human diseases will not be addressed here. We will discuss appropriate ways of measuring distances between DNA sequences and of predicting the sampling properties of the distances. There are procedures for inferring evolutionary histories for a set of elements that depend on a matrix of distances between each pair of elements, and the precision of resulting trees must be influenced by the precision of the distances. 
We will show that account needs to be taken of two sampling processes: the sampling of sequences by the investigator ("statistical sampling") and the sampling of genetic material involved in the formation of offspring from a parental population ("genetic sampling").
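
As a rough illustration of the "matrix of distances between each pair of elements" mentioned above, the following sketch computes one simple distance, the proportion of aligned sites at which two sequences differ, for every pair in a small set. The choice of distance, the function names `p_distance` and `distance_matrix`, and the toy sequences are all assumptions made for illustration; the abstract does not say which distance measures the paper adopts.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ.

    Assumes the sequences are already aligned and of equal length.
    This is an illustrative distance, not necessarily the paper's.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances, as a dict of dicts."""
    names = list(seqs)
    matrix = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = p_distance(seqs[x], seqs[y])
        matrix[x][y] = d
        matrix[y][x] = d
    return matrix


# Hypothetical toy data: one aligned region sequenced in three species.
toy = {
    "sp1": "ACGTACGTAC",
    "sp2": "ACGTACGTTC",
    "sp3": "ACGAACGTTG",
}
dm = distance_matrix(toy)
for name, row in dm.items():
    print(name, [round(row[other], 2) for other in toy])
```

Each entry of such a matrix is an estimate from a finite sample of sites and sequences, which is why the precision of the distances carries over to any tree built from them.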

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Analysis of Variance
  • Animals
  • Base Sequence
  • Biological Evolution*
  • Biometry*
  • DNA / genetics*
  • Humans
  • Species Specificity

Substances

  • DNA