Traditional molecular screening methods are often limited by high computational cost, long design cycles, and a strong reliance on high-quality 3D protein structures, which are not always available or reliable. To address these limitations, we propose CoDrug, an innovative multimodal fusion framework that integrates textual information with structural representations of proteins and compounds. CoDrug employs two complementary fusion strategies: text-protein sequence fusion, in which SciBERT encodes functional descriptions and ESM extracts sequence-level features, and text-compound structure fusion, in which ChemFormer encodes SMILES and SciBERT processes compound-related textual descriptions. Using contrastive learning, CoDrug aligns textual and structural embeddings in a shared latent space, enabling effective cross-modal representation learning. This architecture supports novel functionalities, including text-driven virtual screening and text-driven molecular optimization, enhancing representation expressiveness and generalization while delivering strong performance under zero-shot settings. Evaluations on diverse benchmarks demonstrate that CoDrug achieves competitive or superior results compared with state-of-the-art baselines, particularly when 3D structural data are incomplete or unavailable. The framework's natural language interface lowers the technical barrier for AI-assisted drug discovery, allowing chemists to efficiently navigate and optimize chemical space without specialized computational expertise. By bridging language-driven hypotheses and structure-guided molecular design, CoDrug offers a scalable and flexible paradigm for accelerating the early stages of drug discovery.
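The contrastive alignment step and the text-driven screening it enables can be illustrated with a minimal sketch. The abstract does not specify the loss, so the code below assumes a CLIP-style symmetric InfoNCE objective over paired text and structure embeddings; the temperature value and the function names (`symmetric_contrastive_loss`, `rank_by_text`) are hypothetical, and toy vectors stand in for the outputs of the SciBERT, ESM, and ChemFormer encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(text_emb, struct_emb, temperature=0.07):
    """Assumed InfoNCE-style loss: row i of text_emb is paired with row i of struct_emb.

    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    t = [l2_normalize(v) for v in text_emb]
    s = [l2_normalize(v) for v in struct_emb]
    n = len(t)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(t[i], s[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Text -> structure direction: each text should pick out its own structure.
    loss_t2s = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Structure -> text direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_s2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_t2s + loss_s2t)

def rank_by_text(query_emb, candidate_embs):
    """Text-driven screening sketch: rank candidates by cosine similarity to a text query."""
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings: matched pairs point in the same direction.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(text, aligned)
shuffled_loss = symmetric_contrastive_loss(text, shuffled)
ranking = rank_by_text([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
```

Under this assumed objective, correctly paired embeddings give a loss near zero while mismatched pairs are penalized heavily. Once both modalities share a latent space, zero-shot screening reduces to a nearest-neighbor ranking of candidate embeddings against the embedded text query.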