Multimodal Micro-Expression Recognition Based on Improved 3D ResNet18
Micro-expressions are brief, involuntary facial movements that reveal genuine emotions, typically lasting only 1/25 to 1/5 of a second. Their subtle and fleeting nature makes them particularly challenging to detect and analyze, yet they hold significant value in fields such as psychology, law enforcement, and human-computer interaction. Traditional methods for micro-expression recognition often struggle with capturing these rapid temporal changes, fusing spatial and temporal information effectively, and overcoming the limitations posed by small datasets.
To address these challenges, this paper introduces a novel multimodal micro-expression recognition framework called IM3DR-MFER (Improved 3D ResNet-based Multimodal Micro-Expression Recognition). The proposed method leverages an enhanced 3D ResNet18 architecture, integrating parameter reduction strategies, multi-scale context-aware fusion, and a newly designed three-dimensional attention mechanism (CASANet). By combining global facial features with dynamic optical flow information, the model achieves superior performance in recognizing micro-expressions across different datasets.
Challenges in Micro-Expression Recognition
Micro-expression recognition faces several key challenges:
- Transience and Temporal Feature Extraction – Due to their extremely short duration, capturing micro-expressions requires precise temporal modeling. Traditional 2D convolutional neural networks (CNNs) struggle to encode these rapid changes effectively.
- Spatiotemporal Information Fusion – Micro-expressions involve both spatial (facial muscle movements) and temporal (evolution over frames) components. Most existing methods either focus on static spatial features or fail to integrate temporal dynamics efficiently.
- Data Scarcity and Overfitting – Publicly available micro-expression datasets are limited in size, making deep learning models prone to overfitting.
- Static Feature Extraction Limitations – Many approaches rely on handcrafted features (e.g., LBP-TOP, optical flow), which may not generalize well across different datasets.
- Impact of Preprocessing – Variations in illumination, head pose, and facial occlusions can degrade recognition performance if not properly handled.
To overcome these issues, IM3DR-MFER introduces several innovations, including an improved 3D ResNet18 backbone, a dual-modal input framework, and a novel attention mechanism.
Methodology Overview
The proposed framework consists of three main components:
- Improved 3D ResNet18 Architecture – The backbone network is enhanced with parameter reduction techniques and multi-scale feature fusion to improve efficiency and performance.
- Dual-Modal Feature Extraction – The model processes both raw facial frames (capturing global appearance) and optical flow sequences (encoding motion dynamics).
- CASANet Attention Mechanism – A novel 3D attention module that adaptively highlights critical spatiotemporal features in micro-expression sequences.
1. Enhanced 3D ResNet18 for Micro-Expression Recognition
The standard 3D ResNet18 processes video sequences using 3D convolutions, which simultaneously capture spatial and temporal features. However, the increased dimensionality leads to higher computational costs and potential overfitting on small datasets. To mitigate this, the following improvements are introduced:
• Parameter Reduction via Grouped Convolutions and Channel Shuffling
• Grouped convolutions reduce computational complexity by dividing input channels into separate groups, each processed independently.
• Channel shuffling ensures cross-group information exchange, preventing feature degradation.
• Multi-Scale Context-Aware Fusion (MSCAF)
• Inspired by Atrous Spatial Pyramid Pooling (ASPP), MSCAF employs dilated convolutions with varying receptive fields to capture both fine-grained and global contextual features.
• This module is inserted between the third and fourth residual blocks to enhance feature diversity without excessive parameter growth.
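The channel-shuffle step above is the same reshape-transpose-reshape trick popularized by ShuffleNet. A minimal NumPy sketch (the 5-D shape and the toy 6-channel input are illustrative assumptions, not values from the paper):

```python
import numpy as np

def channel_shuffle(x, groups):
    # x: feature map of shape (N, C, T, H, W); C must be divisible by groups.
    n, c, t, h, w = x.shape
    assert c % groups == 0
    # Split channels into groups, swap the group and per-group axes,
    # then flatten back so channels from different groups interleave.
    x = x.reshape(n, groups, c // groups, t, h, w)
    x = x.transpose(0, 2, 1, 3, 4, 5)
    return x.reshape(n, c, t, h, w)

x = np.arange(6).reshape(1, 6, 1, 1, 1)   # channels labeled 0..5
y = channel_shuffle(x, groups=2)
print(y.reshape(-1))                      # [0 3 1 4 2 5]
```

After the shuffle, the next grouped convolution sees channels drawn from every previous group, which is what prevents the feature-degradation problem the bullet describes.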
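The benefit of the dilated convolutions in MSCAF is that the receptive field grows as `(k - 1) * dilation + 1` without adding parameters. A 1-D NumPy sketch of this effect (the kernel values and input are toy assumptions; the paper's module uses learned 3-D kernels):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    # 'valid' 1-D dilated convolution (cross-correlation form).
    k = len(kernel)
    span = (k - 1) * dilation + 1   # effective receptive field of one output
    out = [sum(kernel[j] * x[i + j * dilation] for j in range(k))
           for i in range(len(x) - span + 1)]
    return np.array(out)

x = np.arange(10, dtype=float)
k = np.ones(3)
a = dilated_conv1d(x, k, dilation=1)   # each output sees 3 inputs
b = dilated_conv1d(x, k, dilation=2)   # same kernel now sees a span of 5
```

Running several such branches with different dilation rates in parallel and concatenating their outputs is the ASPP-style fusion the bullet refers to.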
2. Dual-Modal Input Framework
Micro-expressions are best characterized by both appearance and motion cues. The proposed model processes two complementary input modalities:
• Global Facial Features – Extracted directly from raw video frames, these features encode static facial attributes such as texture and shape.
• Optical Flow Features – Computed using LiteFlowNet, a lightweight yet efficient optical flow estimation network. Optical flow captures subtle motion patterns between consecutive frames, which are crucial for detecting micro-expressions.
The two feature streams are fused at an intermediate stage using L2 normalization and concatenation, ensuring balanced contributions from both modalities.
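The fusion step can be sketched as follows. L2-normalizing each stream before concatenation keeps one modality from dominating purely because its activations have a larger scale (the 2-D toy vectors below are illustrative assumptions, not the model's real feature dimensions):

```python
import numpy as np

def fuse(global_feat, flow_feat, eps=1e-8):
    # L2-normalize each modality along the feature axis, then concatenate,
    # so both streams contribute on a comparable scale.
    g = global_feat / (np.linalg.norm(global_feat, axis=-1, keepdims=True) + eps)
    f = flow_feat / (np.linalg.norm(flow_feat, axis=-1, keepdims=True) + eps)
    return np.concatenate([g, f], axis=-1)

g = np.array([[3.0, 4.0]])   # stand-in appearance features (norm 5)
f = np.array([[0.6, 0.8]])   # stand-in motion features (norm 1)
fused = fuse(g, f)           # both halves now unit-norm: ~[0.6, 0.8, 0.6, 0.8]
```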
3. CASANet: 3D Attention for Spatiotemporal Feature Enhancement
Attention mechanisms have proven effective in highlighting relevant features while suppressing noise. However, most existing attention modules (e.g., SE, CBAM, ECA) are designed for 2D images and do not fully exploit temporal relationships in videos.
CASANet (Channel and Spatiotemporal Attention Network) extends attention to three dimensions:
• Channel-Time Attention (CAM)
• Computes adaptive weights across both channel and temporal dimensions, emphasizing discriminative features at critical time steps.
• Unlike traditional channel attention, CAM avoids dimensionality reduction, preserving feature richness.
• Spatial-Time Attention (SAM)
• Aggregates spatial information across frames using max and average pooling, followed by a 3D convolution to refine spatiotemporal saliency.
• The final attention map is generated via a sigmoid activation, highlighting regions with significant motion or texture changes.
CASANet is integrated into the final residual block of the 3D ResNet18, ensuring that high-level features receive the most refinement before classification.
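The SAM branch described above can be sketched in NumPy as follows. This is a simplified stand-in: the paper's learned 3D convolution over the pooled maps is replaced here by a plain sum, and the tensor shape is a toy assumption, but the pooling-then-sigmoid reweighting pattern is the same:

```python
import numpy as np

def spatial_time_attention(x):
    # x: (N, C, T, H, W) feature tensor.
    # Pool across the channel axis (CBAM-style spatial attention), but keep
    # the temporal axis so saliency can differ from frame to frame.
    max_pool = x.max(axis=1, keepdims=True)    # (N, 1, T, H, W)
    avg_pool = x.mean(axis=1, keepdims=True)   # (N, 1, T, H, W)
    pooled = max_pool + avg_pool               # stand-in for the learned 3D conv
    attn = 1.0 / (1.0 + np.exp(-pooled))       # sigmoid -> weights in (0, 1)
    return x * attn                            # reweight every spatiotemporal position

x = np.random.randn(1, 8, 4, 6, 6)
y = spatial_time_attention(x)
print(y.shape)   # same shape as the input: attention only rescales features
```

Because the attention map keeps the T axis, a frame containing the micro-expression apex can be weighted more heavily than its neighbors, which 2D attention modules like SE or CBAM cannot express.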
Experimental Results and Analysis
The proposed method is evaluated on three widely used micro-expression datasets:
- CASME II – Contains 145 spontaneous micro-expressions from 26 subjects, labeled into five emotion categories.
- SAMM – Comprises 133 high-resolution micro-expressions with diverse ethnicities and age groups.
- Composite Dataset (CD) – A merged collection of CASME II, SAMM, and SMIC samples, totaling 442 sequences.
Performance Comparison with State-of-the-Art Methods
IM3DR-MFER achieves the following recognition accuracies:
• CASME II: 93.2%
• SAMM: 88.7%
• Composite Dataset (CD): 84.6%
These results outperform existing approaches, including:
• Traditional Methods (LBP-TOP, MDMO) – Limited by handcrafted feature design.
• Graph-Based Models (Graph-TCN, STA-GCN) – Effective but computationally intensive.
• Deep Learning Methods (CapsuleNet, STST-Net, GACNN) – Some achieve high accuracy but lack efficient spatiotemporal modeling.
Ablation Studies
To validate the contributions of each component, several ablation experiments were conducted:
- Baseline 3D ResNet18 – Achieves moderate accuracy but suffers from overfitting.
- LiteFlowNet-only Model – Performs well on motion cues but lacks static appearance context.
- Improved 3D ResNet18 (with MSCAF) – Shows better generalization than the baseline.
- Dual-Modal Fusion (without CASANet) – Improves accuracy but misses fine-grained temporal details.
- Full IM3DR-MFER (with CASANet and Eulerian Video Magnification (EVM) preprocessing) – Delivers the highest performance, confirming the effectiveness of the proposed enhancements.
Visualization and Interpretability
Error heatmaps and feature distribution analyses reveal that IM3DR-MFER focuses on biologically relevant facial regions (e.g., eyes, mouth corners) while suppressing irrelevant background noise. The dual-modal approach ensures robustness against illumination changes and partial occlusions.
Conclusion and Future Work
This paper presents a robust and efficient framework for micro-expression recognition, combining an improved 3D ResNet18 architecture, dual-modal feature fusion, and a novel 3D attention mechanism. The experimental results demonstrate significant improvements over existing methods, particularly in handling the challenges of spatiotemporal feature extraction and small dataset limitations.
Future research directions include:
• Advanced Data Augmentation – Exploring generative models (e.g., GANs) to synthesize realistic micro-expression samples.
• Cross-Dataset Generalization – Enhancing model adaptability across different recording conditions and populations.
• Transformer-Based Architectures – Investigating vision transformers for long-range temporal modeling.
The proposed method holds promise for real-world applications in emotion-aware AI, deception detection, and clinical psychology.
For further details, refer to the full paper: https://doi.org/10.19734/j.issn.1001-3695.2024.04.0216