Introduction
Emotion Recognition in Conversation (ERC) is a critical research direction in dialogue systems and artificial intelligence more broadly. The task involves analyzing various modalities—such as text, speech tone, and facial expressions—to identify a speaker’s emotional state. This capability enables dialogue systems to better understand both the content and intent behind conversations, thereby enhancing user satisfaction and experience in human-computer interaction. Applications of ERC span social media, customer service, and mental health, significantly advancing the development of interactive technologies.
Early ERC research primarily focused on textual analysis. However, human emotional expression is inherently multimodal, incorporating vocal intonation, facial movements, and other cues. Consequently, multimodal ERC emerged to leverage these diverse information sources, improving both the accuracy and robustness of emotion recognition. A key challenge in this domain is cross-modal fusion, which aims to effectively integrate features from different modalities. While prior studies have made notable progress, several limitations persist.
First, many approaches treat all modalities as equally important, neglecting the inherent differences in their representational capacities. This oversight can lead to underutilization of weaker modalities, leaving valuable emotional information unexploited. Additionally, inconsistencies between modalities often introduce noise during fusion, potentially degrading performance. Second, existing models frequently fail to adequately consider speaker-specific emotional cues. The dynamic interplay between a speaker’s own emotional state and the influence of other participants’ emotions is crucial for accurate recognition but remains underexplored.
To address these challenges, this work introduces the Knowledge-enhanced Cross-modal Fusion network (KCF). The model incorporates two novel components: an external knowledge-enhanced cross-modal fusion module and a directed graph-based emotional clue enhancement module. The former leverages commonsense knowledge to strengthen weaker modalities and reduce fusion interference, while the latter captures speaker-specific emotional dynamics through structured graph representations. Extensive experiments on benchmark datasets demonstrate KCF’s superiority over state-of-the-art baselines, validating the effectiveness of its design.
Background and Related Work
Contextual Dependency in ERC
Early approaches to ERC relied on sequential models like LSTMs to capture contextual dependencies in conversations. For instance, BC-LSTM utilized bidirectional LSTMs to encode dialogue history but ignored speaker information. Subsequent work introduced graph-based methods such as DialogueGCN, which modeled speaker dependencies via graph convolutional networks. While effective, these models struggled with sequential information preservation. DAG-ERC addressed this by employing directed acyclic graphs to encode conversational structure, though further improvements were needed to account for emotional fluctuations.
Multimodal Fusion Techniques
Multimodal ERC models aim to integrate text, audio, and visual features. MMGCN constructed intra- and inter-modal graphs to enhance cross-modal interactions but faced limitations in contextual understanding. MM-DFN improved upon this by reducing redundant information across semantic spaces. Other approaches, like GMGCN and JOYFUL, emphasized speaker-aware feature learning or global-local representation alignment. Despite these advances, challenges persisted in handling modality-specific noise and maximizing complementary information.
External Knowledge Integration
Incorporating external knowledge—such as ATOMIC or ConceptNet—has proven beneficial for ERC. KET dynamically fused commonsense knowledge with textual features but overlooked speaker influences. COSMIC utilized multiple GRUs to model mental states derived from commonsense reasoning, while KI-Net employed self-attention for knowledge selection. However, these methods often failed to fully exploit speaker-centric emotional cues.
Methodology
Problem Definition
Given a dataset containing multiple dialogues, each dialogue consists of sequentially ordered utterances from two or more speakers. Each utterance includes text, audio, and visual modalities, represented as feature vectors. The goal is to predict emotion labels for every utterance by leveraging multimodal fusion and speaker-specific emotional dynamics.
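The problem setup above can be sketched as a simple data structure. This is an illustrative encoding, not the paper's actual implementation; all field and class names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    """One turn in a dialogue; field names are illustrative."""
    speaker: str                   # speaker identifier
    text_feat: List[float]         # text-modality feature vector
    audio_feat: List[float]        # audio-modality feature vector
    visual_feat: List[float]       # visual-modality feature vector
    label: Optional[str] = None    # emotion label, the prediction target

# A dialogue is an ordered list of utterances from two or more speakers;
# the model must assign an emotion label to every utterance.
dialogue = [
    Utterance("A", [0.1, 0.2], [0.3], [0.5], "happy"),
    Utterance("B", [0.0, 0.4], [0.1], [0.2], "neutral"),
]
assert all(u.label is not None for u in dialogue)
```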
Model Architecture
KCF comprises four key components: feature encoding, knowledge-enhanced cross-modal fusion, emotional clue enhancement, and emotion prediction.
Feature Encoding
Text features are extracted using RoBERTa, followed by bidirectional LSTM encoding to capture contextual information. Audio features are processed via OpenSmile and similarly encoded, while visual features are derived from DenseNet. External knowledge is obtained using COMET, which extracts six emotion-relevant commonsense features from text. These features are categorized into intra-speaker (e.g., xReact, xEffect) and inter-speaker (e.g., oReact, oEffect) emotional clues.
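The split of commonsense features into speaker-centric groups can be sketched as a simple partition. The paper names xReact/xEffect and oReact/oEffect as examples; the additional relations listed here (xWant, oWant) are illustrative assumptions, as is the function name.

```python
# Grouping of COMET commonsense relations into speaker-centric clue sets.
# x* relations describe effects on the speaker; o* relations describe
# effects on other participants.
INTRA_SPEAKER = ["xReact", "xEffect", "xWant"]   # illustrative set
INTER_SPEAKER = ["oReact", "oEffect", "oWant"]   # illustrative set

def split_clues(knowledge: dict):
    """Partition COMET outputs into intra- and inter-speaker emotional clues."""
    intra = {r: v for r, v in knowledge.items() if r in INTRA_SPEAKER}
    inter = {r: v for r, v in knowledge.items() if r in INTER_SPEAKER}
    return intra, inter

intra, inter = split_clues({"xReact": "feels glad", "oReact": "feels surprised"})
assert "xReact" in intra and "oReact" in inter
```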
Knowledge-Enhanced Cross-Modal Fusion
This module addresses modality imbalance by hierarchically fusing weaker modalities (audio/visual) with text and external knowledge. Cross-modal attention computes interaction scores between modalities, while a knowledge-enhanced variant further refines these interactions using commonsense features. Regularization techniques aggregate features before final fusion via self-attention, ensuring complementary and consistent representations.
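The cross-modal attention step can be illustrated with a minimal single-head sketch in numpy, where a weaker modality (e.g., audio) queries a stronger one (text). This is a generic scaled dot-product formulation under assumed shapes, not the paper's exact module.

```python
import numpy as np

def cross_modal_attention(query_mod: np.ndarray, key_mod: np.ndarray) -> np.ndarray:
    """One modality (queries) attends over another (keys/values).
    Shapes: (seq_len, d). Minimal single-head sketch."""
    d = query_mod.shape[-1]
    scores = query_mod @ key_mod.T / np.sqrt(d)            # (Lq, Lk) interaction scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ key_mod                               # attended features

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))   # weaker modality as queries
text = rng.normal(size=(4, 8))    # stronger modality as keys/values
fused = cross_modal_attention(audio, text)
assert fused.shape == (4, 8)
```

In the knowledge-enhanced variant described above, the commonsense features would additionally condition these interaction scores before fusion.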
Directed Graph-Based Emotional Clue Enhancement
Speaker-specific emotional dynamics are modeled using two directed graphs: one for intra-speaker dependencies and another for inter-speaker influences. Nodes represent utterances, with edges indicating temporal or speaker-based relationships. Multi-head self-attention operates over these graphs, enriched by gating mechanisms to dynamically weight intra- and inter-speaker features.
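The two directed graphs can be sketched as adjacency masks over the utterance sequence, which an attention layer would then use to restrict which past utterances each node attends to. This is a structural sketch under the assumption that edges point only from earlier to later utterances.

```python
import numpy as np

def build_speaker_masks(speakers):
    """Directed adjacency masks: an edge j -> i exists only if j precedes i.
    intra[i, j] marks past utterances by the same speaker as utterance i;
    inter[i, j] marks past utterances by other speakers."""
    n = len(speakers)
    intra = np.zeros((n, n), dtype=bool)
    inter = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i):                  # only earlier utterances (directed)
            if speakers[j] == speakers[i]:
                intra[i, j] = True
            else:
                inter[i, j] = True
    return intra, inter

intra, inter = build_speaker_masks(["A", "B", "A", "B"])
assert intra[2, 0]          # A's utterance 2 sees A's earlier utterance 0
assert inter[2, 1]          # ...and B's earlier utterance 1 via the inter graph
assert not intra[0, 2]      # no backward edges
```

A gating mechanism, as described above, would then weight the features aggregated from each of the two masked graphs.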
Emotion Prediction
The fused features are passed through a multilayer perceptron for emotion classification, trained using cross-entropy loss with L2 regularization.
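The training objective can be written out as softmax cross-entropy plus an L2 penalty. A minimal numpy sketch; the regularization coefficient `lam` is an illustrative value, not one reported in the paper.

```python
import numpy as np

def loss_with_l2(logits, labels, weights, lam=1e-4):
    """Cross-entropy over softmax probabilities plus an L2 penalty on the
    classifier weights. logits: (batch, num_classes); labels: (batch,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(labels)), labels]).mean()
    l2 = lam * sum((w ** 2).sum() for w in weights)        # L2 regularization
    return nll + l2

logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.2]])
labels = np.array([0, 1])
loss = loss_with_l2(logits, labels, [np.ones((3, 3))])
assert loss > 0.0
```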
Experiments
Datasets and Metrics
Evaluations were conducted on IEMOCAP and MELD, which contain dyadic and multi-party conversations, respectively. IEMOCAP includes six emotion labels, while MELD has seven. Performance was measured using weighted F1-score (W-F1) and accuracy (ACC).
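The weighted F1 metric used above averages per-class F1 scores with weights proportional to each class's support. A small self-contained sketch of that computation:

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged with weights equal to the
    fraction of true labels belonging to that class (the W-F1 metric)."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (sum(t == c for t in y_true) / total) * f1
    return score

# Two classes, one confusion: both classes get F1 = 2/3.
assert abs(weighted_f1([0, 0, 1], [0, 1, 1]) - 2 / 3) < 1e-9
```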
Results
KCF outperformed all baselines, achieving W-F1 scores of 73.69% (IEMOCAP) and 64.32% (MELD). Ablation studies confirmed the contributions of each module, with the knowledge-enhanced fusion and speaker-aware graphs being particularly impactful. Case studies highlighted KCF’s ability to track emotional shifts in multi-speaker scenarios, though challenges remained in distinguishing similar emotions (e.g., happiness and excitement).
Analysis
Modality combination experiments revealed text as the most influential, though audio and visual features provided complementary gains. The model’s robustness to noisy modalities was attributed to knowledge-enhanced fusion, while speaker graphs improved handling of long-range dependencies. Hyperparameter analysis showed optimal performance with three cross-modal attention heads and six self-attention heads.
Conclusion
KCF advances multimodal ERC by addressing two critical challenges: modality imbalance and speaker-aware emotional dynamics. Its novel fusion strategy leverages external knowledge to enhance weaker modalities, while graph-based speaker modeling captures complex emotional interactions. Future work may explore topic-aware emotion reasoning and advanced noise reduction techniques.
DOI: 10.19734/j.issn.1001-3695.2024.08.0322