Emergence of human-like attention and distinct head clusters in self-supervised vision transformers: A comparative eye-tracking study

Neural Netw. 2025 Sep;189:107595. doi: 10.1016/j.neunet.2025.107595. Epub 2025 May 21.

Abstract

Visual attention models aim to predict human gaze behavior, yet traditional saliency models and deep gaze prediction networks face limitations. Saliency models rely on handcrafted low-level visual features, often failing to capture human gaze dynamics, while deep learning-based gaze prediction models lack biological plausibility. Vision Transformers (ViTs), which use self-attention mechanisms, offer an alternative, but when trained with conventional supervised learning, their attention patterns tend to be dispersed and unfocused. This study demonstrates that ViTs trained with self-supervised DINO (self-Distillation with NO labels) develop structured attention that closely aligns with human gaze behavior when viewing videos. Our analysis reveals that self-attention heads in later layers of DINO-trained ViTs autonomously differentiate into three distinct clusters: (1) G1 heads (20%), which focus on key points within figures (e.g., the eyes of the main character) and resemble human gaze; (2) G2 heads (60%), which distribute attention over entire figures with sharp contours (e.g., the bodies of all characters); and (3) G3 heads (20%), which primarily attend to the background. These findings provide insights into how human overt attention and figure-ground segregation emerge in visual perception. Our work suggests that self-supervised learning enables ViTs to develop attention mechanisms that are more aligned with biological vision than traditional supervised training.
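The head clustering described above can be illustrated with a toy sketch. The following is not the paper's code: the attention maps are synthetic, and the two criteria (share of attention mass on figure tokens, and attention entropy as a proxy for "focused on key points" versus "spread over the figure") together with their thresholds are assumptions chosen only to demonstrate how heads could be sorted into G1/G2/G3-style clusters.

```python
# Illustrative sketch (NOT the paper's method): classify synthetic ViT
# attention heads into G1 (focused on a key point within the figure),
# G2 (spread over the whole figure), and G3 (background) clusters.
# Token grid size, masks, and thresholds are hypothetical.
import numpy as np

n_tokens = 196                       # e.g. a 14x14 ViT patch grid
figure = np.zeros(n_tokens, bool)
figure[40:100] = True                # hypothetical "figure" tokens
key_point = 55                       # hypothetical key point (e.g. the eyes)

def make_head(kind):
    """Synthesize one head's attention distribution over tokens."""
    a = np.full(n_tokens, 1e-3)      # small uniform floor
    if kind == "G1":                 # sharp peak on a key point
        a[key_point] = 5.0
    elif kind == "G2":               # uniform mass over the whole figure
        a[figure] = 1.0
    else:                            # G3: mass on the background
        a[~figure] = 1.0
    return a / a.sum()               # normalize to a distribution

def classify(attn, figure_mask, ratio_thresh=0.5, entropy_thresh=3.0):
    """Cluster a head by (i) attention mass on the figure and
    (ii) entropy: low entropy = concentrated, high = spread out."""
    fig_ratio = attn[figure_mask].sum()
    entropy = -(attn * np.log(attn + 1e-12)).sum()
    if fig_ratio < ratio_thresh:
        return "G3"                  # mostly background
    return "G1" if entropy < entropy_thresh else "G2"

heads = {k: make_head(k) for k in ["G1", "G2", "G3"]}
labels = {k: classify(a, figure) for k, a in heads.items()}
print(labels)
```

On this synthetic input, each head is assigned back to its intended cluster; with a real DINO-trained ViT one would instead extract the last-layer per-head attention maps from the [CLS] token and apply a clustering of this kind over many frames.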

Keywords: Attention; DINO; Eye-tracking; Figure-ground separation; Self-supervised learning; Vision transformer.

Publication types

  • Comparative Study

MeSH terms

  • Attention* / physiology
  • Deep Learning
  • Eye-Tracking Technology*
  • Fixation, Ocular / physiology
  • Head
  • Humans
  • Neural Networks, Computer
  • Supervised Machine Learning*