BMC Evol Biol. 2007 Feb 8;7 Suppl 1(Suppl 1):S2.
doi: 10.1186/1471-2148-7-S1-S2.

SCaFoS: a tool for selection, concatenation and fusion of sequences for phylogenomics

Béatrice Roure et al. BMC Evol Biol. 2007.

Abstract

Background: Phylogenetic analyses based on datasets rich in both genes and species (phylogenomics) are becoming a standard approach to resolve evolutionary questions. However, several difficulties are associated with the assembly of large datasets, such as multiple copies of a gene per species (paralogous or xenologous genes), lack of some genes for a given species, or partial sequences. The use of undetected paralogous or xenologous genes in phylogenetic inference can lead to inaccurate results, and the use of partial sequences can lead to a lack of resolution. A tool that selects sequences, species, and genes, while dealing with these issues, is needed in a phylogenomics context.

Results: Here, we present SCaFoS, a tool that quickly assembles phylogenomic datasets containing maximal phylogenetic information while adjusting the amount of missing data in the selection of species, sequences and genes. Starting from individual sequence alignments, and using monophyletic groups defined by the user, SCaFoS creates chimeras from partial sequences, or selects, among multiple sequences, the orthologous and/or slowest evolving sequences. Once sequences representing each predefined monophyletic group have been selected, SCaFoS retains genes according to the level of missing data allowed by the user and generates files for super-matrix and super-tree analyses in several formats compatible with standard phylogenetic inference software. Because no clear-cut criteria exist for sequence selection, a semi-automatic mode is available to accommodate the user's expertise.

Conclusion: SCaFoS is able to deal with datasets of hundreds of species and genes, at either the amino acid or the nucleotide level. It has a graphical interface and can be integrated in an automatic workflow. Moreover, SCaFoS is the first tool that integrates the user's knowledge to select orthologous sequences, creates chimerical sequences to reduce missing data and selects genes according to their level of missing data. Finally, applying SCaFoS to different datasets, we show that the judicious selection of genes, species and sequences reduces tree reconstruction artefacts, especially if the dataset includes fast evolving species.

Figures

Figure 1
Flowchart of sequence selection and chimera construction for an OTU in a given gene. For each OTU of each gene, SCaFoS selects the sequence that best represents the OTU. See text for a detailed description of the process. Three thresholds (empty blue rhombus) with default or user-specified values are important: (i) the maximal percentage of characters present with respect to the longest sequence to keep a sequence, (ii) the minimal percentage of characters present with respect to the longest sequence to consider a sequence as complete and (iii) the maximum in-OTU/out-OTU distances ratio (see text) to keep an OTU. The user should select whether or not to create chimerical sequences and choose among the different sequence selection criteria (filled blue rhombus). If the selection criterion is the sequence size, no other options should be checked. If the selection criterion is the evolutionary rate of the sequences, the user must choose between a fully automatic and a semi-automatic choice of sequences and specify whether to use a previously defined selection.
Figure 2
Example of chimera assembly. Sequence fragments are combined from longest to shortest, the length being computed as the number of characters present. Selected parts are displayed in blue; the chimerical sequence results from concatenating the selected parts of the different sequences.
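The assembly order described in this caption can be illustrated with a minimal Python sketch. It is not SCaFoS's actual implementation: the function names, the choice of `-` and `?` as missing-data symbols, and the position-by-position filling rule are all assumptions made for illustration. The fragment with the most characters is taken first, and each shorter fragment only fills positions that are still missing.

```python
# Hypothetical sketch of chimera assembly; not the SCaFoS source code.
MISSING = frozenset("-?")  # assumption: symbols that count as missing data


def count_characters(seq: str) -> int:
    """Length of a fragment, computed as the number of characters present."""
    return sum(1 for c in seq if c not in MISSING)


def build_chimera(fragments: list[str]) -> str:
    """Combine aligned fragments of one OTU, from longest to shortest."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    width = len(fragments[0])
    if any(len(f) != width for f in fragments):
        raise ValueError("fragments must come from the same alignment")

    # Longest fragment first; ties keep their input order (sorted is stable).
    ordered = sorted(fragments, key=count_characters, reverse=True)
    chimera = list(ordered[0])
    for fragment in ordered[1:]:
        for i, char in enumerate(fragment):
            # Shorter fragments only fill positions that are still missing.
            if chimera[i] in MISSING and char not in MISSING:
                chimera[i] = char
    return "".join(chimera)


print(build_chimera(["MKT---", "--TAQL", "----QV"]))  # MKTAQL
```

In this toy example the last position keeps `L` from the longest fragment rather than `V` from the shortest one, which is what "combined from longest to shortest" implies under the assumed filling rule.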
Figure 3
Main steps to use SCaFoS. Steps 1, 3 and 5 are done by SCaFoS. 1. SPECIES PRESENCE: listing of all species present in the files of aligned sequences, followed by their frequency of presence and, if desired, classified into taxonomic groups (specified by TaxGp in the figure). 2. Definition by the user of the species to be selected and their respective OTUs. 3. FILE SELECTION: creation of files containing only the selected species. 4. Discarding ambiguously aligned positions (displayed in dark colour) with a tool such as GBlocks [33]; making phylogenetic trees (using PHYML [34] or PAUP [25], for example). 5. DATASETS ASSEMBLING: selection of sequences and chimera construction according to an OTU file and default sequence files; creation of single-gene files including chimeras and selected sequences, and creation of concatenated files, for super-tree and super-matrix approaches respectively. In the last step, three typical cases are represented: (i) construction of a chimera (OTU5) in the orange file, (ii) selection of the less divergent sequence within an OTU (Sp6 in OTU5) and elimination of a short sequence (Sp31) in the red file and (iii) elimination of potentially paralogous sequences by the user (Sp31 and Sp71) in the purple file. Eliminated sequences are drawn in grey. The corresponding default sequence files are displayed under their respective sequence files.
Figure 4
Evolution of missing data according to the threshold. For seven threshold values defining the maximal in-OTU/out-OTU distances ratio, the number of selected genes is plotted against the percentage of missing sites in the concatenated file. Subsets are extracted from the Metazoa dataset without constructing chimeras. The evolution of missing data is also displayed when the selection is made according to the size criterion only (black and grey curves, respectively with and without chimera construction); these last selections represent the minimal amount of missing data for the dataset.
Figure 5
Phylogenetic trees obtained for three subsets extracted from the Metazoa dataset. Maximum Likelihood inferences were performed with the JTT+Γ (4 categories) model by TreeFinder [27] on two datasets based on the Philippe et al. [22] Metazoa dataset and constructed as follows. The species were grouped into 12 OTUs. Sequences with at least 90% of the total number of positions were considered complete, and sequences or chimeras shorter than 10% of the total number of positions were removed. The two datasets differ in the main selection criterion, A: longest sequence (LC) and B: smaller evolutionary distances (SC). Numbers above branches indicate bootstrap support values obtained by analysing 100 bootstrap replicates under the same conditions.
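The two length thresholds used for these datasets (complete at 90% or more of the positions, removed below 10%) amount to a simple classification rule, sketched below in Python. The function name, the `partial` label, the missing-data symbols and the use of the alignment width as "total number of positions" are assumptions for illustration; this is not how SCaFoS itself is written.

```python
# Hypothetical sketch of the length thresholds; not the SCaFoS source code.
MISSING = frozenset("-?")  # assumption: symbols that count as missing data


def classify_sequence(seq: str, complete_pct: int = 90, keep_pct: int = 10) -> str:
    """Classify an aligned sequence (or chimera) by its share of positions present.

    Returns "complete", "partial" or "removed". Integer arithmetic is used so
    that sequences sitting exactly on a threshold are handled predictably.
    """
    total = len(seq)
    if total == 0:
        raise ValueError("empty sequence")
    present = sum(1 for c in seq if c not in MISSING)
    if present * 100 < keep_pct * total:
        return "removed"   # shorter than 10% of the positions
    if present * 100 >= complete_pct * total:
        return "complete"  # at least 90% of the positions
    return "partial"       # kept, but not considered complete


print(classify_sequence("A" * 9 + "-"))    # complete
print(classify_sequence("AAAAA-----"))     # partial
print(classify_sequence("A" + "-" * 19))   # removed
```

Both percentages are parameters because the caption of Figure 1 describes them as thresholds with default or user-specified values.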
Figure 6
Comparison of evolutionary distances. The datasets are the same as in Figure 5, and the phylogenetic inferences were obtained in the same way. Pairwise patristic distances are plotted in blue (dots including Arthropoda in orange).
Figure 7
Difficulty of determining the correct orthologs according to the evolutionary distance. Schematic tree representing two paralogous groups, α and β, including the same species, A and B. In this example, choosing the two slowest evolving sequences, Aβ and Bα, keeps one sequence from each paralogous group.

Cited by

  • Phylogenomics reveals deep molluscan relationships.
    Kocot KM, Cannon JT, Todt C, Citarella MR, Kohn AB, Meyer A, Santos SR, Schander C, Moroz LL, Lieb B, Halanych KM. Nature. 2011 Sep 4;477(7365):452-6. doi: 10.1038/nature10382. PMID: 21892190 Free PMC article.
  • Gene and genome trees conflict at many levels.
    Haggerty LS, Martin FJ, Fitzpatrick DA, McInerney JO. Philos Trans R Soc Lond B Biol Sci. 2009 Aug 12;364(1527):2209-19. doi: 10.1098/rstb.2009.0042. PMID: 19571241 Free PMC article.
  • iPhy: an integrated phylogenetic workbench for supermatrix analyses.
    Jones MO, Koutsovoulos GD, Blaxter ML. BMC Bioinformatics. 2011 Jan 24;12:30. doi: 10.1186/1471-2105-12-30. PMID: 21261969 Free PMC article.
  • Comparative genome analysis of entomopathogenic fungi reveals a complex set of secreted proteins.
    Staats CC, Junges A, Guedes RL, Thompson CE, de Morais GL, Boldo JT, de Almeida LG, Andreis FC, Gerber AL, Sbaraini N, da Paixão RL, Broetto L, Landell M, Santi L, Beys-da-Silva WO, Silveira CP, Serrano TR, de Oliveira ES, Kmetzsch L, Vainstein MH, de Vasconcelos AT, Schrank A. BMC Genomics. 2014 Sep 29;15:822. doi: 10.1186/1471-2164-15-822. PMID: 25263348 Free PMC article.
  • A Guide to Phylogenomic Inference.
    Patané JSL, Martins J Jr, Setubal JC. Methods Mol Biol. 2024;2802:267-345. doi: 10.1007/978-1-0716-3838-5_11. PMID: 38819564

References

    1. Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005;6:361–375. doi: 10.1038/nrg1603.
    2. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW. Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol. 2000;17:164–178.
    3. Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005;39:309–338. doi: 10.1146/annurev.genet.39.073003.114725.
    4. Pearson WR, Sierk ML. The limits of protein sequence comparison? Curr Opin Struct Biol. 2005;15:254–260. doi: 10.1016/j.sbi.2005.05.005.
    5. Philip GK, Creevey CJ, McInerney JO. The Opisthokonta and the Ecdysozoa may not be clades: Stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol Biol Evol. 2005;22:1175–1184. doi: 10.1093/molbev/msi102.
