Transformer-based models are rapidly becoming foundational tools for analyzing and integrating multiscale biological data. This Perspective examines recent advances in transformer architectures, tracing their evolution from unimodal and augmented unimodal models to large-scale multimodal foundation models operating across genomic sequences, single-cell transcriptomics and spatial data. We categorize these models into three tiers and evaluate their capabilities for structural learning, representation transfer and tasks such as cell annotation, prediction and imputation. In discussing challenges of tokenization, interpretability and scalability, we highlight emerging approaches that leverage masked modeling, contrastive learning and large language models. To support broader adoption, we provide practical guidance through code-based primers that use public datasets and open-source implementations. Finally, we propose a modular 'Super Transformer' architecture that uses cross-attention mechanisms to integrate heterogeneous modalities. This Perspective serves as a resource and roadmap for applying transformer models to multiscale, multimodal genomics.
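To make the cross-attention idea concrete, the following is a minimal, illustrative sketch of fusing two modality token streams (for example, per-cell expression tokens attending to spatial or chromatin tokens). It is not the architecture proposed in the Perspective; all names, dimensions and the choice of PyTorch's built-in multi-head attention are assumptions for illustration only.

```python
# Illustrative sketch only: one modality's tokens (queries) attend to
# another modality's tokens (keys/values), then a residual + layer norm.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens: (batch, n_query, dim), e.g. per-cell expression token embeddings
        # context_tokens: (batch, n_context, dim), e.g. spatial or chromatin token embeddings
        fused, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + fused)  # residual connection keeps the query stream

# Toy usage with hypothetical shapes: 100 "expression" tokens and 50 "spatial" tokens per cell.
rna = torch.randn(8, 100, 64)
spatial = torch.randn(8, 50, 64)
fusion = CrossModalFusion(dim=64, num_heads=4)
out = fusion(rna, spatial)
print(out.shape)  # torch.Size([8, 100, 64])
```

In this sketch the fused output keeps the query modality's token count, so several such blocks could be stacked or mirrored (each modality attending to the others) to integrate heterogeneous inputs.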