Towards a Universal SMILES Representation - A Standard Method to Generate Canonical SMILES Based on the InChI

Noel M O'Boyle. J Cheminform, 4 (1), 22.

Abstract

Background: There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.

Results: I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 million compounds in the ChEMBL database, and on a 1 million compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. With Universal SMILES, 99.79% of the ChEMBL database and 99.77% of the PubChem subset were canonicalised successfully.
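The core idea, letting an external set of canonical atom labels decide how the SMILES string is written, can be illustrated with a minimal sketch. The abstract does not give the traversal rules, so the sketch assumes a simple one: start at the atom with the lowest label and visit neighbours in order of increasing label. The function name `write_smiles`, the data layout and the hand-assigned labels are all hypothetical; in the method described here the labels would come from the InChI, and rings, bond orders, charges and stereochemistry would also have to be handled.

```python
from typing import Dict, List, Optional, Tuple


def write_smiles(symbols: Dict[int, str],
                 bonds: List[Tuple[int, int]],
                 labels: Dict[int, int]) -> str:
    """Write SMILES for an acyclic molecule with single bonds only.

    symbols: atom id -> element symbol (organic subset assumed)
    bonds:   pairs of bonded atom ids
    labels:  atom id -> canonical label; assumed to be supplied by an
             external canonicaliser (assigned by hand in this sketch)
    """
    neighbours: Dict[int, List[int]] = {atom: [] for atom in symbols}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def visit(atom: int, parent: Optional[int]) -> str:
        # Order the unvisited neighbours by canonical label (assumed rule).
        children = sorted((n for n in neighbours[atom] if n != parent),
                          key=labels.__getitem__)
        out = symbols[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        if children:
            out += visit(children[-1], atom)
        return out

    # Start from the atom with the lowest canonical label (assumed rule).
    start = min(symbols, key=labels.__getitem__)
    return visit(start, None)


# Propan-2-ol entered with two different atom orders. The labels follow
# the atoms, not the input positions, so both inputs give the same string.
first = write_smiles({0: "C", 1: "C", 2: "C", 3: "O"},
                     [(0, 1), (1, 2), (1, 3)],
                     {0: 1, 2: 2, 1: 3, 3: 4})
second = write_smiles({0: "O", 1: "C", 2: "C", 3: "C"},
                      [(1, 0), (3, 1), (1, 2)],
                      {2: 1, 3: 2, 1: 3, 0: 4})
print(first, second)
```

Because the output depends only on the labels and the connectivity, any toolkit that obtains the same labels and applies the same rules would write the same string, which is the property a common standard needs.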

Conclusions: The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain - such as the development of a standard aromatic model for SMILES - the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.

Figures

Figure 1
An overview of the steps involved in generating Universal and Inchified SMILES. The normalisation step just applies to Inchified SMILES. To simplify the diagram a Standard InChI is shown, but in practice a non-standard InChI (options FixedH and RecMet) is used for Universal SMILES.
Figure 2
The effect of the number of repetitions used in the shuffle test on the number of canonicalisation failures found for Universal SMILES. This figure is based on the data from the ChEMBL database.
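A shuffle test of this kind can be sketched as follows, assuming it means reordering the atoms of an input structure at random and checking that the canonical output does not change. The molecule layout, the function names and the two toy canonicalisers are hypothetical stand-ins, not the code used in the paper.

```python
import random
from typing import Callable, Hashable, List, Tuple

# A molecule is (atom symbols, bonds as pairs of atom indices).
Molecule = Tuple[List[str], List[Tuple[int, int]]]


def shuffle_atoms(mol: Molecule, rng: random.Random) -> Molecule:
    """Return the same molecule with its atoms listed in a random order."""
    symbols, bonds = mol
    order = list(range(len(symbols)))
    rng.shuffle(order)  # order[new_index] == old_index
    new_index = {old: new for new, old in enumerate(order)}
    new_symbols = [symbols[old] for old in order]
    new_bonds = [(new_index[a], new_index[b]) for a, b in bonds]
    return new_symbols, new_bonds


def shuffle_test(mol: Molecule,
                 canonicalise: Callable[[Molecule], Hashable],
                 repetitions: int,
                 seed: int = 0) -> bool:
    """True if every shuffled copy gives the same output as the original."""
    rng = random.Random(seed)
    reference = canonicalise(mol)
    return all(canonicalise(shuffle_atoms(mol, rng)) == reference
               for _ in range(repetitions))


def order_independent(mol: Molecule) -> Hashable:
    """Toy stand-in: sorted (symbol, neighbour symbols) for every atom."""
    symbols, bonds = mol
    nbrs: List[List[str]] = [[] for _ in symbols]
    for a, b in bonds:
        nbrs[a].append(symbols[b])
        nbrs[b].append(symbols[a])
    return tuple(sorted((s, tuple(sorted(n))) for s, n in zip(symbols, nbrs)))


def order_dependent(mol: Molecule) -> Hashable:
    """Deliberately broken stand-in: echoes the input atom order."""
    return tuple(mol[0])


ethanol: Molecule = (["C", "C", "O"], [(0, 1), (1, 2)])
print(shuffle_test(ethanol, order_independent, repetitions=10))
print(shuffle_test(ethanol, order_dependent, repetitions=50))
```

A failure that only shows up for particular atom orders can be missed when few shuffles are tried, which is consistent with the figure relating the number of repetitions to the number of failures found.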
Figure 3
The structure of CHEMBL1229272.
Figure 4
Two canonicalisation failures from the PubChem subset.
Figure 5
Three arrangements of the same atoms attached to a chiral carbon that differ in the angle between the two planar bonds that includes the stereobond. (a) The angle is > 180°: the hydrogen is considered to be below the plane and behind the wedge. (b) The angle is ~180°: some software will treat the stereochemistry as undefined depending on how close the angle is to 180°. (c) The angle is < 180°: the hydrogen is considered to be opposite the wedge. This is an enantiomer of (a).
Figure 6
Two entries in the ChEMBL database with seemingly identical structures but whose InChIs are distinct. The InChIs differ in the double bond stereo layer: /b31-27+,32-28? versus /b31-27-,32-28+. The origin of the difference in InChIs is shown by the images to the right of the main structures which were generated by the winchi application (part of the InChI distribution).
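The difference between the two /b layers quoted in this caption can be made explicit with a small parser. This is a minimal sketch that works only on the layer fragments given above, not on full InChI strings; the function names are hypothetical, and each trailing mark is treated as an opaque label rather than interpreted.

```python
import re
from typing import Dict, Optional, Tuple

Bond = Tuple[int, int]


def parse_double_bond_layer(layer: str) -> Dict[Bond, str]:
    """Parse an InChI /b layer fragment such as '/b31-27+,32-28?'.

    Returns {(atom1, atom2): mark}, where mark is the character that
    follows each atom pair.
    """
    body = layer.lstrip("/")
    if not body.startswith("b"):
        raise ValueError("not a /b layer: %r" % layer)
    entries: Dict[Bond, str] = {}
    for a, b, mark in re.findall(r"(\d+)-(\d+)([-+?u])", body[1:]):
        entries[(int(a), int(b))] = mark
    return entries


def diff_layers(first: str,
                second: str) -> Dict[Bond, Tuple[Optional[str], Optional[str]]]:
    """Return the bonds whose marks differ between two /b layers."""
    p = parse_double_bond_layer(first)
    q = parse_double_bond_layer(second)
    return {bond: (p.get(bond), q.get(bond))
            for bond in sorted(set(p) | set(q))
            if p.get(bond) != q.get(bond)}


differences = diff_layers("/b31-27+,32-28?", "/b31-27-,32-28+")
for bond, marks in differences.items():
    print(bond, marks)
```

For the two fragments in the caption, both listed double bonds carry different marks, so the structures are distinct as far as the InChI is concerned even though the drawings look identical.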
Figure 7
Two tautomers identified as duplicates by Inchified SMILES but not by the InChI.
Figure 8
Two structures of different charge states normalised to the same neutral molecule by Inchified SMILES.


