Scene Text Recognition Based on Multimodal Feature Fusion
Introduction
Scene text recognition has become increasingly important in today's information-driven, intelligent applications, ranging from autonomous driving and road-sign recognition to natural scene translation. However, recognizing text in natural scenes remains challenging due to occlusion, distortion, uneven character distribution, cluttered backgrounds, and varying fonts. Traditional image-processing algorithms struggle in such complex environments. Early deep learning approaches treated text recognition as a classification problem: characters were first segmented, then recognized individually, and finally concatenated into strings. These methods relied solely on visual information and ignored the relationships between characters, which limited their performance, especially on low-quality images.
To address these challenges, modern scene text recognition algorithms leverage semantic information from text, treating the task as a sequence prediction problem. Current methods can be broadly categorized into non-semantic and semantic approaches. Non-semantic methods rely exclusively on visual features, making them vulnerable to missing visual cues (e.g., occlusion). In contrast, semantic methods extract contextual information such as vocabulary and grammar, combining it with visual features to improve recognition accuracy. However, effectively fusing visual and semantic information remains a challenge.
This paper introduces the Multimodal Scene Text Recognition (MMSTR) network, which integrates visual and semantic features to enhance recognition performance. MMSTR employs a shared-weight autoregressive permutation language model for flexible decoding strategies. Additionally, it introduces a Residual Attention Encoder (REA-Encoder) to improve shallow feature extraction and mitigate feature collapse in Vision Transformers. Finally, a Decision Fusion Module (DFM) enhances the integration of semantic and visual features during decoding. Experiments demonstrate that MMSTR achieves state-of-the-art performance across multiple benchmark datasets, particularly in handling challenging cases such as occlusion and distortion.
MMSTR Network Architecture
The MMSTR network follows an encoder-decoder framework, consisting of two main components: the Residual Attention Encoder (REA-Encoder) and the Decision Fusion Decoder. The encoder processes input text images, while the decoder combines visual and semantic features to generate recognition results.
Residual Attention Encoder (REA-Encoder)
The REA-Encoder is designed to overcome the limitations of traditional Vision Transformers (ViTs) in capturing shallow image features. Unlike standard ViTs, which may suffer from feature collapse in deeper layers, the REA-Encoder introduces a novel Residual Multi-Head Attention (ReMHA) mechanism.
The input image is divided into patches, which are flattened into embedding vectors. The ReMHA module processes these vectors using a gated attention mechanism that dynamically controls the flow of attention information between layers. This allows shallow features to propagate to deeper layers, preserving both low-level and high-level visual information.
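The patch-embedding step can be sketched as follows. This is a minimal PyTorch sketch assuming a standard ViT-style non-overlapping patchify; the values of `img_size`, `patch_size`, and `embed_dim` are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding vector."""
    def __init__(self, img_size=(32, 128), patch_size=4, in_chans=3, embed_dim=384):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size[0] // patch_size) * (img_size[1] // patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, embed_dim)
        return x + self.pos_embed
```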
Key innovations of the REA-Encoder include:
- Attention Residual Learning: Instead of relying solely on feature forwarding, ReMHA propagates attention scores between layers, enhancing feature diversity.
- Adaptive Gating: A learnable gating variable balances attention contributions from different layers, preventing over-reliance on low-level features.
- Multi-Layer Perceptron (MLP) Enhancement: Each ReMHA layer is followed by an MLP that further refines extracted features.
By incorporating these mechanisms, the REA-Encoder effectively mitigates feature collapse and improves the model’s ability to recognize distorted or occluded text.
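One plausible reading of the ReMHA mechanism is sketched below: each layer's attention score map is blended with the previous layer's map through a learnable gate before being applied to the values, so that shallow attention patterns propagate to deeper layers. The gating formula, dimensions, and block layout here are illustrative assumptions, not the paper's exact equations.

```python
import torch
import torch.nn as nn

class ReMHA(nn.Module):
    """Residual Multi-Head Attention (illustrative reconstruction): blends the previous
    layer's attention map with the current one through a learnable gate."""
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.tensor(0.5))    # adaptive gating variable

    def forward(self, x, prev_attn=None):              # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                    # (B, heads, N, N)
        if prev_attn is not None:
            g = torch.sigmoid(self.gate)               # keep the mix in [0, 1]
            attn = g * attn + (1.0 - g) * prev_attn    # attention residual learning
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out), attn                     # pass scores on to the next layer

class REABlock(nn.Module):
    """One encoder block: ReMHA followed by an MLP, each with layer norm and a residual path."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = ReMHA(dim, num_heads)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, prev_attn=None):
        y, attn = self.attn(self.norm1(x), prev_attn)
        x = x + y
        x = x + self.mlp(self.norm2(x))
        return x, attn
```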
Decision Fusion Decoder
The decoder integrates visual features from the REA-Encoder with semantic information to generate recognition results. It consists of a Multi-Head Attention (MHA) layer, a Decision Fusion Module (DFM), and an MLP.
The decoder receives three inputs:
- Position Queries (Pq): These predict character positions, similar to query streams in dual-stream attention models.
- Attention Masks (Am): Generated during training using permutation language modeling.
- Context Input (Ic): Derived from ground-truth text labels.
The MHA layer first fuses semantic features, which are then processed by the DFM. The DFM employs cascaded attention mechanisms to deeply integrate visual and semantic information, producing both shallow and deep fusion features. These features are further refined through an MLP and linear projection to generate final character predictions.
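The description above suggests a decoder block of roughly the following shape. This is a schematic reconstruction: the exact way the DFM cascades its attention stages and combines shallow and deep fusion features is not specified here, so the wiring, dimensions, and vocabulary size below are assumptions.

```python
import torch
import torch.nn as nn

class DecisionFusionDecoder(nn.Module):
    """Schematic decoder: MHA fuses semantic context into the position queries,
    then a cascaded-attention DFM injects visual features before prediction."""
    def __init__(self, dim=384, num_heads=6, num_classes=95):   # num_classes is illustrative
        super().__init__()
        self.sem_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # DFM: two cascaded cross-attention stages over the encoder's visual features.
        self.fuse_shallow = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse_deep = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, pos_queries, context, visual, attn_mask=None):
        # Semantic fusion: position queries attend to the context input (ground-truth
        # tokens at training time), restricted by the PLM attention mask.
        sem, _ = self.sem_attn(pos_queries, context, context, attn_mask=attn_mask)
        # Shallow fusion: semantic features attend to the visual features.
        shallow, _ = self.fuse_shallow(sem, visual, visual)
        # Deep fusion: the shallow result is refined against the visual features again.
        deep, _ = self.fuse_deep(shallow, visual, visual)
        out = deep + self.mlp(deep)
        return self.head(out)        # per-position character logits
```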
Permutation Language Modeling
MMSTR leverages Permutation Language Modeling (PLM) to support multiple decoding strategies, including autoregressive (AR) and non-autoregressive (NAR) approaches. PLM enables the model to learn from arbitrary subsets of input context, allowing for flexible decoding.
During training, MMSTR samples multiple permutations of text sequences, ensuring robust learning without excessive computational overhead. This approach enhances the model’s ability to handle diverse text structures and improves recognition accuracy.
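A minimal sketch of how PLM-style attention masks can be generated during training is shown below. The sampling policy (how many permutations are drawn, whether the canonical left-to-right order is always included, and whether a position may attend to itself) is an assumption made for illustration.

```python
import torch

def plm_attention_masks(seq_len: int, num_perms: int = 6) -> torch.Tensor:
    """Sample permutations of the target positions and build one attention mask per
    permutation. Each position may attend to itself and to positions earlier in that
    permutation (True = masked out, matching nn.MultiheadAttention's convention)."""
    perms = [torch.arange(seq_len)]                           # keep the left-to-right order
    perms += [torch.randperm(seq_len) for _ in range(num_perms - 1)]
    masks = []
    for perm in perms:
        rank = torch.empty(seq_len, dtype=torch.long)
        rank[perm] = torch.arange(seq_len)                    # rank[i] = order of position i
        masks.append(rank.unsqueeze(0) > rank.unsqueeze(1))   # mask strictly-later positions
    return torch.stack(masks)                                 # (num_perms, seq_len, seq_len)
```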
Experimental Results
Datasets and Evaluation Metrics
MMSTR was evaluated on a combination of synthetic (SynthText, MJSynth) and real-world datasets (IIIT5K, ICDAR13, ICDAR15, CUTE80, SVT, SVTP, ArT, COCO-Text, Uber). Performance was measured using Word Accuracy and Normalized Edit Distance (1-NED).
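For reference, the two metrics are commonly computed as sketched below (a minimal sketch; dataset loading and normalization choices such as case folding or punctuation filtering are omitted):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def evaluate(preds, labels):
    """Word accuracy: fraction of exact matches.
    1-NED: 1 minus the edit distance normalized by the longer string's length."""
    acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    ned = sum(levenshtein(p, l) / max(len(p), len(l), 1)
              for p, l in zip(preds, labels)) / len(labels)
    return acc, 1.0 - ned
```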
Ablation Studies
Ablation experiments confirmed the contributions of each component:
- REA-Encoder: Improved shallow feature extraction, increasing accuracy by 0.3% on synthetic data and 0.2% on real data.
- DFM: Enhanced semantic-visual fusion, boosting accuracy by 1.2% on synthetic data and 0.5% on real data.
Comparative Analysis
MMSTR was compared against state-of-the-art methods (CRNN, ViTSTR, TRBA, ABINet, Parseq) across three character sets (36-char, 62-char, 94-char). Key findings include:
- Benchmark Performance: MMSTR achieved an average word accuracy of 96.6%, outperforming Parseq (96.0%) and ABINet (95.2%).
- Challenging Datasets: On ArT, COCO-Text, and Uber, MMSTR surpassed Parseq by 1.5% in accuracy.
- Direction Robustness: When tested on rotated images (90°, 180°, 270°), MMSTR maintained an average accuracy of 88.4%, with only a 5.8% drop compared to the original orientation.
Qualitative Results
Visual comparisons demonstrated MMSTR’s superiority in recognizing distorted, occluded, and multi-oriented text. However, like other methods, it struggled with severely blurred images due to insufficient visual feature extraction.
Conclusion
MMSTR advances scene text recognition by effectively integrating visual and semantic features. Key contributions include:
- REA-Encoder: Mitigates feature collapse in Vision Transformers, improving shallow feature retention.
- DFM: Enhances multimodal fusion through cascaded attention mechanisms.
- PLM: Supports flexible decoding strategies, improving robustness.
Future work will focus on improving recognition for larger character sets and enhancing performance on blurred images.
DOI: 10.19734/j.issn.1001-3695.2024.05.0250