Local Alignment for Medical Vision-Language Pre-Training

IEEE Trans Image Process. 2025;34:7321-7334. doi: 10.1109/TIP.2025.3628469.

Abstract

Establishing local semantic correspondences between medical images and their corresponding reports is crucial for effective medical vision-language pre-training. However, existing methods encounter two major challenges: (1) lesion regions in radiological images are often small, blurry, or lack clear boundaries, complicating accurate localization; and (2) medical reports typically contain redundant or non-diagnostic words, hindering precise semantic alignment. To overcome these issues, we propose MedAligner, a specialized local alignment network for medical vision-language pre-training. MedAligner employs dual encoders to extract both global and local representations and uses global contrastive learning to maintain coarse semantic consistency. To enhance local alignment, we introduce a Word-Region Alignment module, which generates a learnable word-pixel similarity matrix that is sparsified to identify salient lesion regions accurately. Additionally, our Diagnostic Term Filtering dynamically samples high-importance diagnostic terms from reports, aligning them with identified lesion areas via a local contrastive loss. Importantly, we adopt a progressive training strategy that gradually refines both the input text and the semantic alignment. This is achieved by reconstructing concise diagnostic reports and progressively updating the word-pixel similarity, generating increasingly accurate image-text pairs. Extensive experiments demonstrate that MedAligner significantly surpasses existing approaches on tasks such as phrase grounding, image-text retrieval, and zero-shot classification, setting new benchmarks in medical vision-language pre-training.
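The core mechanism described above (a word-pixel similarity matrix that is sparsified to surface salient regions) can be illustrated with a minimal sketch. The function names, the cosine-similarity formulation, and the top-k sparsification rule below are illustrative assumptions, not the paper's exact formulation, which is not given in the abstract.

```python
import numpy as np

def word_pixel_similarity(word_emb, pixel_emb):
    """Cosine similarity between L word embeddings (L, d) and
    P pixel embeddings (P, d); returns an (L, P) similarity matrix.
    (Illustrative stand-in for the learnable similarity in the paper.)"""
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=1, keepdims=True)
    return w @ p.T

def sparsify_topk(sim, k):
    """Keep only the k most similar pixels per word, zeroing the rest --
    one plausible sparsification scheme for isolating salient regions."""
    out = np.zeros_like(sim)
    top_idx = np.argsort(sim, axis=1)[:, -k:]
    rows = np.arange(sim.shape[0])[:, None]
    out[rows, top_idx] = sim[rows, top_idx]
    return out
```

After sparsification, each word attends only to its k best-matching pixels, which is the kind of concentrated word-region correspondence a local contrastive loss could then be applied to.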

MeSH terms

  • Algorithms
  • Humans
  • Image Interpretation, Computer-Assisted* / methods
  • Image Processing, Computer-Assisted* / methods
  • Semantics