Introduction
The rapid development of the internet has led to diverse ways for users to express emotions. Increasingly, people share their lives and opinions online through multimodal content, including text, images, audio, and emojis. Among these, text-image combinations have become particularly popular, making multimodal sentiment analysis an essential research area. Multimodal aspect-based sentiment analysis (MABSA) is a fine-grained task that aims to predict sentiment polarity for specific aspect terms within a given text-image pair.
Despite significant progress in MABSA, challenges remain due to the heterogeneity between modalities and the emotional confusion caused by multiple aspect terms. Traditional methods often fail to align visual features with specific aspects, leading to noise from irrelevant image regions. Additionally, syntactic dependencies and sentiment information in text are frequently overlooked, further complicating accurate sentiment prediction.
To address these issues, this paper proposes AAK, an Aspect-oriented Affective Knowledge enhanced model for Multimodal Aspect-Based Sentiment Analysis. The model introduces a fine-grained image-aspect cross-modal attention mechanism to enhance visual representations related to specific aspects. It also incorporates sentiment scores from external knowledge to enrich textual sentiment features. Experimental results on the Twitter-2015 and Twitter-2017 datasets demonstrate that AAK outperforms baseline models, confirming the effectiveness of fine-grained image-text matching and sentiment enhancement.
Background and Related Work
Aspect-Based Sentiment Analysis with Sentiment Knowledge
Aspect-based sentiment analysis (ABSA) focuses on identifying sentiment polarities for specific aspects within a sentence. Due to the complexity of sentiment expressions and sentence structures, researchers have explored various methods to improve ABSA performance. Sentiment lexicons, for instance, have been widely used to enhance sentiment understanding. Some studies employ dependency parsing and graph convolutional networks (GCNs) to capture syntactic relationships between words. Others integrate external knowledge, such as adjective-noun pairs (ANPs), to distinguish sentiment tendencies for different nouns in multimodal contexts.
Multimodal Aspect-Based Sentiment Analysis
MABSA extends ABSA by incorporating visual information from images to improve sentiment prediction. The task is commonly divided into three sub-tasks:
- Multimodal Aspect Term Extraction (MATE): Identifying aspect terms in text with the help of images.
- Multimodal Aspect-Oriented Sentiment Classification (MASC): Predicting sentiment polarity for extracted aspects.
- Joint Multimodal Aspect-Sentiment Analysis (JMASA): Simultaneously extracting aspects and predicting their sentiments.
Recent studies leverage cross-modal attention mechanisms to align text and image features. Some models use facial expressions or ANPs to enhance sentiment understanding. However, many methods still struggle with irrelevant visual features that introduce noise. Additionally, syntactic dependencies and sentiment cues in text are often underutilized.
Proposed Model: AAK
The AAK model consists of four main components:
- Feature Extraction Module: Encodes text and image features using RoBERTa and ResNet-152.
- Fine-Grained Image-Aspect Cross-Modal Attention Mechanism: Aligns visual features with specific aspects to reduce noise.
- Text Sentiment Convolution Module: Incorporates sentiment scores from SenticNet and syntactic dependencies to enrich textual representations.
- Modality Fusion Module: Combines text, sentiment, and image features for final sentiment prediction.
Feature Extraction
Text Feature Extraction
Given a sentence and its aspect terms, the model concatenates them with special tokens and encodes them using RoBERTa. This produces contextualized representations for both the aspect terms and the entire sentence.
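The sentence-aspect input described above can be sketched as follows. This is a minimal, hypothetical illustration: the exact token layout is an assumption based on RoBERTa's standard sentence-pair format ("<s> A </s></s> B </s>"), not a confirmed detail of the paper.

```python
def build_input(sentence: str, aspect: str) -> str:
    """Concatenate a sentence and one aspect term with RoBERTa-style
    special tokens, producing the string fed to the encoder."""
    return f"<s> {sentence} </s></s> {aspect} </s>"

example = build_input("The pasta was great but the service was slow", "service")
```

Encoding this pair lets the model produce contextualized vectors both for the aspect term and for the sentence as a whole.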
Image Feature Extraction
The input image is resized and processed through ResNet-152 to extract visual features. A linear transformation projects these features into the same dimensional space as the text embeddings.
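The projection step can be illustrated with a toy linear map. The sketch below uses small stand-in dimensions (the paper projects 2048-dimensional ResNet-152 features into the text embedding space); weights and sizes here are assumptions for illustration only.

```python
import random

def linear_project(x, W, b):
    """Apply a linear layer: project visual feature x into the text
    embedding space using weight matrix W (d_text x d_vis) and bias b."""
    return [sum(w_row[j] * x[j] for j in range(len(x))) + b_i
            for w_row, b_i in zip(W, b)]

random.seed(0)
d_vis, d_text = 8, 4  # toy sizes; the real model maps 2048 -> text dim
x = [random.random() for _ in range(d_vis)]
W = [[random.uniform(-0.1, 0.1) for _ in range(d_vis)] for _ in range(d_text)]
b = [0.0] * d_text
v = linear_project(x, W, b)
```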
Fine-Grained Image-Aspect Cross-Modal Attention
This module ensures that visual features align with relevant aspect terms, reducing noise from unrelated image regions.
- Cross-Modal Feature Interaction: A multi-head cross-modal Transformer (MC-Transformer) fuses text and image features. The image features serve as queries, while the text features act as keys and values.
- Image Auxiliary Information: ANPs extracted from the image are compared with aspect terms to identify the most relevant visual regions.
- Fine-Grained Attention: A max-pooling operation selects the most discriminative visual features, followed by a sigmoid function to filter irrelevant regions.
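The attention-then-gating flow above can be sketched in miniature. This is a simplified single-head version under stated assumptions: image regions act as queries over text tokens, then each attended region is max-pooled and passed through a sigmoid to produce a relevance gate. The toy features are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each image region (query) attends
    over the text tokens (keys/values)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(region_feats):
    """Max-pool each region's feature vector, squash it with a sigmoid,
    and scale the region by the resulting relevance gate."""
    gates = [sigmoid(max(r)) for r in region_feats]
    return [[g * x for x in r] for g, r in zip(gates, region_feats)]

img = [[0.2, 0.1], [0.9, 0.4]]  # two toy image regions (queries)
txt = [[0.3, 0.2], [0.8, 0.5]]  # two toy text tokens (keys/values)
fused = gate(cross_attention(img, txt, txt))
```

The gate pushes weakly matching regions toward zero, which is how irrelevant visual content is suppressed before fusion.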
Text Sentiment Convolution Module
To mitigate emotional confusion from multiple aspects, this module integrates sentiment scores from SenticNet.
- Sentiment Score Integration: Each word’s sentiment score is projected into the same space as text embeddings.
- Graph Convolution Network (GCN): Syntactic dependencies from dependency trees are used to propagate sentiment information through a GCN, enhancing contextual sentiment understanding.
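One propagation step of the sentiment-aware GCN can be sketched as follows. The adjacency matrix, toy embeddings, and the convention of appending a SenticNet-style score as an extra feature dimension are all illustrative assumptions, not the paper's exact formulation.

```python
def gcn_layer(A, H):
    """One graph-convolution step: each node averages the features of its
    neighbours (self-loops included in A), spreading sentiment signal
    along dependency edges."""
    n = len(A)
    out = []
    for i in range(n):
        deg = sum(A[i])
        row = [0.0] * len(H[0])
        for j in range(n):
            if A[i][j]:
                for k in range(len(H[0])):
                    row[k] += H[j][k] / deg
        out.append(row)
    return out

# 3-word sentence; edges follow a toy dependency tree, plus self-loops
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
# each word: [embed_dim0, embed_dim1, sentiment_score]
H = [[0.2, 0.1, 0.0], [0.5, 0.3, 0.8], [0.1, 0.4, -0.6]]
H1 = gcn_layer(A, H)
```

After one step, a neutral word syntactically linked to a strongly polar word inherits part of its sentiment score, which helps disentangle sentiments when several aspects share a sentence.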
Modality Fusion
The final step combines text, sentiment, and image representations using a multimodal Transformer. The fused features are passed through a softmax layer for sentiment classification.
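The fusion-and-classify step can be reduced to the following sketch. It swaps the multimodal Transformer for plain concatenation plus a linear layer to keep the example self-contained; the weights and toy vectors are assumptions, and the three classes stand for {negative, neutral, positive}.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def classify(text_vec, sent_vec, img_vec, W):
    """Concatenate the three modality vectors, apply a linear layer W
    (n_classes x fused_dim), and return class probabilities via softmax."""
    fused = text_vec + sent_vec + img_vec
    logits = [sum(w * f for w, f in zip(row, fused)) for row in W]
    return softmax(logits)

# toy 2-d vectors per modality; W maps the 6-d fused vector to 3 classes
W = [[0.1] * 6, [0.2] * 6, [0.3] * 6]
probs = classify([0.5, 0.1], [0.2, 0.3], [0.4, 0.0], W)
```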
Experiments
Datasets and Settings
Experiments were conducted on Twitter-2015 and Twitter-2017 datasets, which contain text-image pairs labeled with aspect terms and sentiment polarities (positive, neutral, negative). The model was trained using PyTorch with an AdamW optimizer, a learning rate of 2e-5, and a batch size of 32.
Baseline Comparisons
AAK was compared against single-modal (ResNet, BERT, RoBERTa) and multimodal models (CapRoBERTa, TMSC, ESAFN). Results showed that AAK achieved the highest accuracy on both datasets, outperforming TMSC by 0.25% on Twitter-2015 and 0.16% on Twitter-2017.
Ablation Studies
Removing any module (image auxiliary information, fine-grained attention, or sentiment convolution) led to performance drops, confirming their importance. The fine-grained attention mechanism had the most significant impact, highlighting its role in reducing noise.
Hyperparameter Analysis
- ANP Selection (K): Performance peaked at K=5, as larger values introduced noise.
- Importance Weights (λ, λₙ): Optimal values were {0.5, 0.3} for Twitter-2015 and {0.3, 0.3} for Twitter-2017.
Case Studies
AAK correctly predicted sentiments in complex cases where baseline models failed. For example, in a tweet mentioning “Madonna,” the model used visual cues (smiling face) and sentiment-enhanced text to classify the sentiment accurately.
Conclusion
The AAK model introduces a fine-grained image-aspect attention mechanism and sentiment-enhanced text representations to improve MABSA performance. By aligning visual features with specific aspects and leveraging sentiment knowledge, the model reduces noise and enhances classification accuracy. Future work will explore extending the model to other modalities and improving cross-modal feature fusion.
DOI: 10.19734/j.issn.1001-3695.2024.08.0294