Hierarchical graph contrastive learning of local and global presentation for multimodal sentiment analysis

Sci Rep. 2024 Mar 4;14(1):5335. doi: 10.1038/s41598-024-54872-6.

Abstract

Multimodal sentiment analysis (MSA) aims to regress or classify the overall sentiment of utterances from acoustic, visual, and textual cues. However, most existing work focuses on increasing the expressive power of neural networks to learn representations of multimodal information within a single utterance, without considering the global co-occurrence characteristics of the dataset. To alleviate this issue, we propose a novel hierarchical graph contrastive learning framework for MSA that explores the local and global representations of a single utterance, and the intricate relations between them, for multimodal sentiment extraction. Specifically, for each modality, we extract a discrete embedding representation that captures that modality's global co-occurrence features. Based on these embeddings, we build two graphs for each utterance, a local-level graph and a global-level graph, to account for level-specific sentiment implications. Two graph contrastive learning strategies are then adopted, one per level, to explore the different latent representations based on graph augmentations. Furthermore, we design a cross-level contrastive learning scheme to learn the complex relationships between the local and global latent representations.
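
As a rough illustration of how a cross-level objective between local and global representations might look, the sketch below computes an InfoNCE-style contrastive loss over a batch of utterances. The abstract does not specify the loss, so InfoNCE, the cosine similarity, the temperature value, and the names `info_nce`, `local_emb`, and `global_emb` are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE-style loss (assumed, not taken from the paper).

    view_a[i] and view_b[i] are treated as the positive pair for
    utterance i; every view_b[j] with j != i serves as a negative.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-softmax of the positive pair.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings: one row per utterance.
local_emb = [[1.0, 0.0], [0.0, 1.0]]    # stand-in for local-level graph outputs
global_emb = [[0.9, 0.1], [0.1, 0.9]]   # stand-in for global-level graph outputs

aligned = info_nce(local_emb, global_emb)          # positives match
shuffled = info_nce(local_emb, global_emb[::-1])   # positives mismatched
```

In this toy setup the loss is lower when each utterance's local embedding is paired with its own global embedding than when the pairs are shuffled, which is the behaviour a contrastive objective is meant to reward. The same function could be applied to two augmented views of a single graph level.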