Motivation: Protein structure alignment is key to transferring information from well-studied proteins to less studied ones. Structural alignment identifies the most precise mapping of equivalent residues, as structures are more conserved during evolution than sequences. Among the methods for aligning protein structures, maximum Contact Map Overlap (CMO) has received sustained attention during the past decade. Yet, known algorithms exhibit modest performance and are not applicable to large-scale comparison.
Results: Graphlets are small induced subgraphs that are used to design sensitive topological similarity measures between nodes and networks. By generalizing graphlets to ordered graphs, we introduce GR-Align, a CMO heuristic that is suited for database searches. On the Proteus_300 set (44 850 protein domain pairs), GR-Align is several orders of magnitude faster than the state-of-the-art CMO solvers Apurva, MSVNS and AlEigen7, and its similarity score is in better agreement with the structural classification of proteins. In a large-scale experiment on the Gold-standard benchmark dataset (3 207 270 protein domain pairs), GR-Align is several orders of magnitude faster than the state-of-the-art protein structure comparison tools TM-Align, DaliLite, MATT and Yakusa, while achieving similar classification performance. Finally, we illustrate the difference between GR-Align's flexible alignments and the traditional ones by querying a flexible protein in the Astral-40 database (11 154 protein domains). In this experiment, GR-Align's top-scoring alignments are not only in better agreement with the structural classification of proteins, but also allow more information to be transferred across proteins.
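
Illustration: The quantity that CMO methods maximize can be made concrete with a minimal sketch. It assumes the common definition of a contact map (two residues are in contact when their representative atoms lie within a distance threshold) and of overlap (the number of contacts in one protein that an order-preserving alignment maps onto contacts in the other). The function names, the toy coordinates and the thresholds are hypothetical choices for illustration. The sketch only scores a given alignment; it does not search for the best one and is not GR-Align's heuristic.

```python
from itertools import combinations
from math import dist


def contact_map(coords, threshold=7.5):
    """Return the set of residue index pairs (i, j), i < j, whose
    coordinates lie within `threshold` of each other.
    The default threshold is an illustrative assumption."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(coords), 2)
        if dist(p, q) <= threshold
    }


def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of protein A that the alignment maps onto contacts
    of protein B. `alignment` maps residue indices of A to residue
    indices of B and is assumed to be order-preserving, so i < j
    implies alignment[i] < alignment[j]."""
    return sum(
        1
        for (i, j) in contacts_a
        if i in alignment
        and j in alignment
        and (alignment[i], alignment[j]) in contacts_b
    )


# Toy data (hypothetical): A is a straight chain, B is a square.
coords_a = [(0, 0, 0), (5, 0, 0), (10, 0, 0), (15, 0, 0)]
coords_b = [(0, 0, 0), (5, 0, 0), (5, 5, 0), (0, 5, 0)]

contacts_a = contact_map(coords_a, threshold=6.0)
contacts_b = contact_map(coords_b, threshold=6.0)

full = {0: 0, 1: 1, 2: 2, 3: 3}   # align every residue in order
partial = {0: 0, 1: 2, 2: 3}      # skip residues, still order-preserving

print(contact_overlap(contacts_a, contacts_b, full))     # 3
print(contact_overlap(contacts_a, contacts_b, partial))  # 1
```

Scoring one alignment is cheap; the difficulty of CMO lies in searching over all order-preserving alignments for the one with the largest overlap, which is why heuristics are of interest for database-scale comparison.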