Weakly Supervised Multimodal Sentiment Analysis Based on Deep Fine-Grained Alignment of Image and Text

Introduction

The advent of the information age has transformed social media into a vital platform for individuals to share opinions and express emotions. Multimodal data, including images and text, contain rich emotional information that is crucial for understanding user sentiments, attitudes, and perspectives. In commercial applications, sentiment analysis of product reviews helps businesses better comprehend consumer needs and tailor their offerings. In political contexts, public opinion analysis enables governments to gauge societal sentiments and respond effectively.

Traditional sentiment analysis methods often focus on a single modality, such as text, while neglecting the complementary information provided by images. Images convey emotions intuitively, whereas text offers nuanced descriptions that enrich emotional understanding. Multimodal sentiment analysis leverages both modalities to achieve more accurate and comprehensive emotion recognition. However, existing approaches face challenges in fine-grained alignment between modalities and often lose critical information during deep feature fusion.

This paper introduces RADM (Residual Attention-based cross-modal Deep interaction Model), a novel image-text deep interaction model that addresses these limitations. Unlike previous methods that rely on strongly supervised object detection in images, RADM divides images into fine-grained region sequences and aligns them with text sequences through a dual-path fusion mechanism. Additionally, an adaptive gating mechanism optimizes the flow of residual attention information across fusion layers. Experiments on the MSED and MVSA datasets demonstrate that RADM outperforms baseline models, with gains of up to 1.06% in accuracy and 0.74% in F1-score on MSED, and 0.75% and 0.63% on MVSA.

Related Work

Early Approaches to Multimodal Sentiment Analysis

Early research in multimodal sentiment analysis relied on matrix-based fusion techniques and machine learning algorithms. Tensor Fusion Networks (TFN) employed tensor outer products to model inter-modal interactions but suffered from high-dimensional matrices that were difficult to train. Low-rank Multimodal Fusion (LMF) reduced dimensionality through linear transformations but still faced semantic loss. Kernel-based methods, such as Multiple Kernel Learning (MKL), combined features using Support Vector Machines (SVM), while some studies incorporated Conditional Random Fields (CRF) to enhance classification accuracy. However, these methods heavily depended on handcrafted features and lacked the expressive power of deep learning models.

Deep Learning in Multimodal Sentiment Analysis

With the rise of deep learning, Convolutional Neural Networks (CNNs), Faster R-CNN, and Transformer architectures became dominant in feature extraction. Early CNN-based approaches extracted image and text features independently before concatenating them, but this simplistic fusion ignored inter-modal interactions. The introduction of attention mechanisms improved contextual learning by dynamically weighting relevant features. Subsequent models built on VGG19 features incorporated object and scene information to enrich cross-modal attention.

Recent advancements focus on deep fusion networks and cross-modal alignment. Memory-augmented networks stacked multiple attention and GRU layers to enhance fusion depth. However, these models often lacked interpretability, prompting researchers to explore explicit alignment between text and image regions.

Cross-Modal Alignment Techniques

Strongly supervised methods, such as CAMP and INIT, used Faster R-CNN to detect image objects and aligned them with text via cross-modal attention. SCAN and Oscar further improved alignment by stacking multiple attention layers or integrating object tags with text in Transformer encoders. LXMERT employed self-attention for intra-modal feature refinement before cross-modal fusion.

Despite their effectiveness, these methods required extensive annotations or pre-trained object detectors. ViLT pioneered weakly supervised alignment by dividing images into patch sequences and processing them alongside text using self-attention, significantly reducing training overhead. mPLUG enhanced this approach by introducing text-centric fusion paths for intra-modal enhancement.

Methodology

Model Overview

The proposed RADM consists of three main components:

  1. Feature Encoders – Extract visual and textual features.
  2. Cross-Modal Deep Interaction Network – Aligns modalities through residual attention mechanisms.
  3. Self-Fusion Encoder – Refines fused features for sentiment prediction.

Feature Extraction

Text Encoding:
Input text is tokenized into word sequences and processed using BERT, which generates contextualized word embeddings. A special [CLS] token aggregates global text features.

Image Encoding:
Images are resized to 224×224 and split into 16×16 patches. These patches are linearly projected into embeddings using Vision Transformer (ViT), which captures global dependencies more effectively than CNNs.
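The patch-splitting step above can be sketched as follows. This is a minimal NumPy illustration: the random projection matrix stands in for ViT's learned linear embedding, and positional embeddings and the CLS token are omitted.

```python
import numpy as np

def patch_embed(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and linearly
    project each flattened patch to embed_dim.

    The projection weights here are random stand-ins for ViT's learned
    embedding matrix (hypothetical, for illustration only)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size      # 14 x 14 grid for 224x224
    # Rearrange pixels into a (num_patches, patch_size*patch_size*C) matrix.
    patches = image.reshape(ph, patch_size, pw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, -1)  # (196, 768)
    proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    return patches @ proj                           # (196, embed_dim)

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

For a 224×224 RGB image, each 16×16 patch flattens to 16·16·3 = 768 values, giving a sequence of 196 patch tokens that can be processed alongside the text token sequence.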

Adaptive Gating for Residual Attention

Inspired by RealFormer, RADM introduces Multi-Head Residual Attention (MHRA) to preserve critical information across deep fusion layers. Unlike standard residual connections, an adaptive gate dynamically controls the flow of attention scores between layers:

  1. The gate computes a similarity score between current and previous attention maps.
  2. A fully connected layer determines which information to retain.
  3. The retained features are averaged with the previous attention scores to stabilize training.

This mechanism prevents information degradation in deep networks while enhancing cross-modal alignment.
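The three steps above can be sketched as follows, assuming a sigmoid gate driven by the similarity between the current and previous attention maps (the paper's exact gate parameterization is not specified here, so `w_gate` is a hypothetical stand-in for its fully connected layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_residual_attention(scores, prev_scores, w_gate):
    """Blend current raw attention scores with the previous layer's scores.

    scores, prev_scores: (L, L) pre-softmax attention score matrices.
    w_gate: (1, 1) weight of a tiny fully connected gate (hypothetical)."""
    # 1. Similarity between the current and previous attention maps.
    sim = (softmax(scores) * softmax(prev_scores)).sum(axis=-1, keepdims=True)
    # 2. A learned gate in [0, 1] decides how much current signal to retain.
    gate = 1.0 / (1.0 + np.exp(-(sim @ w_gate)))   # sigmoid
    # 3. Average the gated current scores with the previous layer's scores.
    return 0.5 * (gate * scores + prev_scores)

blended = gated_residual_attention(np.eye(4), np.eye(4), np.ones((1, 1)))
print(blended.shape)  # (4, 4)
```

The blended score matrix is then passed through softmax and used as the attention weights of the current layer, so salient alignments discovered early are not washed out by depth.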

Cross-Modal Interaction Network

The interaction network comprises multiple layers, each containing:

  1. Intra-Modal Self-Attention – Refines text and image features independently.
  2. Cross-Modal Attention – Aligns visual and textual sequences by computing token-patch affinities.
  3. Feed-Forward Layers – Apply non-linear transformations using GELU activation, which outperforms ReLU in modeling complex relationships.
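The three sub-layers above can be condensed into one simplified, single-head interaction layer. This is a sketch only: projection matrices, multi-head splitting, layer normalization, and the residual attention gate are omitted, and `w1`/`w2` are hypothetical feed-forward weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU, as used in BERT/ViT.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def interaction_layer(text, image, w1, w2):
    """One simplified cross-modal interaction layer.

    text:  (T, d) token embeddings;  image: (P, d) patch embeddings."""
    text = text + attention(text, text, text)      # 1. intra-modal self-attention
    image = image + attention(image, image, image)
    text = text + attention(text, image, image)    # 2. token-patch cross-attention
    return text + gelu(text @ w1) @ w2             # 3. GELU feed-forward block
```

Here the cross-attention step computes a (T, P) affinity matrix between word tokens and image patches, which is exactly the fine-grained token-patch alignment the model relies on.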

Sentiment Prediction

After fusion, features are pooled and concatenated for classification. Cross-entropy loss is used to train the model end-to-end.
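A minimal sketch of this prediction head, assuming mean pooling per modality and a single hypothetical linear classifier (`w`, `b` are illustrative, not the paper's parameters):

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    logits = logits - logits.max()
    return np.log(np.exp(logits).sum()) - logits[label]

def predict(text_feats, image_feats, w, b):
    """Mean-pool each modality, concatenate, and map to class logits.

    text_feats: (T, d), image_feats: (P, d); w: (2d, num_classes)."""
    pooled = np.concatenate([text_feats.mean(axis=0), image_feats.mean(axis=0)])
    return pooled @ w + b
```

During training, the cross-entropy of the predicted logits against the gold sentiment label is back-propagated through the classifier, fusion layers, and encoders end-to-end.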

Experiments

Datasets

  1. MSED – Contains 9,190 social media posts annotated for sentiment, emotion, and desire analysis.
  2. MVSA – Includes MVSA-Single (5,129 samples) and MVSA-Multiple (19,600 samples), labeled for sentiment polarity.

Implementation Details

• Pre-training: RADM is pre-trained on VQA v2 to improve visual-linguistic understanding.

• Training: Fine-tuned with AdamW, dropout (0.2), and learning rate decay.

• Evaluation Metrics: Accuracy and F1-score (to handle class imbalance).

Results

RADM outperforms baselines across all tasks:
• MSED Sentiment: +0.65% accuracy, +0.51% F1 over MMTF-DES.

• MSED Emotion: +1.06% accuracy, +0.74% F1.

• MVSA-Multiple: +0.75% accuracy, +0.63% F1.

Ablation Study

Removing the adaptive gate (rm_rg) or cross-modal interaction (rm_c) degrades performance, confirming their necessity. The full model (RADM) consistently outperforms ablated variants.

Case Study

Visualizations show that RADM successfully aligns phrases like “blue sky & snow” with corresponding image regions (Figure 3). Misalignments (e.g., “enthusiastic athletes”) occur when textual references lack visual counterparts; in such cases, the model falls back on contextual cues to infer sentiment.

Conclusion

RADM advances multimodal sentiment analysis by:

  1. Replacing strong supervision with weakly supervised patch-text alignment.
  2. Introducing adaptive gating to preserve salient features in deep networks.
  3. Achieving state-of-the-art results on MSED and MVSA.

Future work will focus on improving abstract reasoning (e.g., philosophical text-image pairs) and extending the model to broader vision-language tasks.

doi.org/10.19734/j.issn.1001-3695.2024.07.0285
