Robust Speech Keyword Spotting Based on Dual-Branch Fusion and Time-Frequency Squeeze and Excitation
Introduction
Speech keyword spotting (KWS) plays a crucial role in human-machine interaction, enabling devices to respond to wake-up commands such as “Hey Siri” or control instructions like “turn on” and “turn off.” These applications typically run on resource-constrained edge devices, requiring continuous monitoring of specific keywords to trigger corresponding functions. To ensure a seamless user experience, KWS systems must exhibit strong noise robustness while maintaining low memory usage.
Despite significant advancements in deep learning-based KWS models, performance degradation in noisy environments remains a major challenge. Real-world noise interferes with both temporal and frequency-domain information in speech, leading to reduced recognition accuracy. Traditional convolutional neural networks (CNNs) often employ sequential processing of temporal and spectral features, which may result in information loss. Additionally, while attention mechanisms have been explored to enhance feature extraction, their high computational cost makes them unsuitable for deployment on edge devices.
To address these challenges, this paper introduces a Parallel Time-Frequency Convolution Network (PTFNet), incorporating two novel components: the Dual-Branch Fusion Unit (DBF) and the Time-Frequency Squeeze and Excitation (TFSE) module. The DBF extracts temporal and spectral features in parallel, reducing information loss caused by sequential processing, while the TFSE enhances model robustness by adaptively weighting important time-frequency regions. Experimental results demonstrate that PTFNet achieves superior recognition accuracy across various signal-to-noise ratios (SNRs) while maintaining a lightweight architecture.
Background and Related Work
Deep Learning in Keyword Spotting
Recent KWS models leverage deep neural networks (DNNs), CNNs, and recurrent neural networks (RNNs) to improve recognition performance. Among these, CNNs have gained popularity due to their balance between accuracy and computational efficiency. For instance, TC-ResNet employs one-dimensional temporal convolutions, while MatchBoxNet utilizes depthwise separable convolutions to reduce parameters. However, one-dimensional convolutions may not fully capture frequency-domain characteristics, leading to suboptimal performance in noisy conditions.
To address this limitation, BC-ResNet combines one-dimensional and two-dimensional convolutions to exploit both temporal and spectral features. Similarly, ConvMixer integrates feature interaction layers and curriculum learning to enhance robustness. While these approaches improve performance, their sequential processing of time and frequency features may still result in information loss.
Attention Mechanisms in KWS
Attention mechanisms have been widely adopted to improve feature representation by selectively focusing on relevant speech segments. Self-attention models, such as Transformer-based architectures, effectively capture long-range dependencies but suffer from high computational costs. Lightweight attention variants, such as those proposed by Kwon, reduce complexity while maintaining performance. However, most existing attention mechanisms do not explicitly model the distinct impacts of noise in the time and frequency domains.
Challenges in Noisy Environments
Noise affects speech signals differently in the temporal and spectral domains. In the time domain, noise introduces irregular distortions, while in the frequency domain, it adds extraneous spectral components. Traditional KWS models often fail to account for these variations, leading to degraded performance in real-world scenarios. Multi-condition training strategies, which expose models to diverse noise conditions during training, partially mitigate this issue but remain limited by the model’s inherent learning capacity.
Proposed Methodology
Parallel Time-Frequency Convolution Network (PTFNet)
PTFNet is designed to efficiently extract and fuse temporal and spectral features while minimizing parameter overhead. The architecture consists of three main components:
- Pre-Convolution Block: Composed of two depthwise separable 2D convolutions, this block performs initial feature extraction from the input log-mel filterbank (FBank) features.
- Residual Blocks: Four residual blocks, each containing a DBF and TFSE module, process the features to enhance robustness.
- Post-Convolution Block: Three depthwise separable 1D convolutions further refine the features before classification.
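The reliance on depthwise separable convolutions in all three blocks is what keeps the parameter count low. As a quick sanity check on the arithmetic (with illustrative channel sizes, not the paper's exact configuration): a standard k×k convolution needs C_in·C_out·k² weights, while the depthwise separable variant needs only C_in·k² (depthwise) + C_in·C_out (pointwise):

```python
def standard_conv_params(c_in, c_out, k):
    # full convolution: every output channel mixes all input channels
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # depthwise: one k x k filter per input channel,
    # followed by a 1 x 1 pointwise convolution to mix channels
    return c_in * k * k + c_in * c_out

# illustrative sizes: 64 -> 64 channels, 3 x 3 kernel
std = standard_conv_params(64, 64, 3)        # 36864 weights
sep = depthwise_separable_params(64, 64, 3)  # 576 + 4096 = 4672 weights
print(std, sep, round(std / sep, 1))         # roughly 7.9x fewer parameters
```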
Dual-Branch Fusion Unit (DBF)
The DBF addresses the limitations of sequential feature extraction by processing temporal and spectral information in parallel. The unit consists of two branches:
- Temporal Branch: Uses 1D depthwise separable convolutions to capture time-domain patterns.
- Frequency Branch: Employs 2D depthwise separable convolutions to extract frequency-domain features.
To enhance feature interaction, the DBF applies bidirectional pooling along both time and frequency axes, followed by cross-fusion. This allows the model to integrate complementary information from both domains, reducing information loss.
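The pooling-and-fusion step can be sketched in NumPy. This is a minimal illustration under assumed tensor shapes and one plausible fusion scheme (additive injection of each branch's pooled summary into the other); the paper's exact fusion operator and channel sizes are not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, T = 8, 40, 101                        # channels, mel bins, frames (illustrative)
temporal = rng.standard_normal((C, T))      # temporal-branch output (1D convs over time)
spectral = rng.standard_normal((C, F, T))   # frequency-branch output (2D convs)

# bidirectional pooling: average the 2D map along each axis
freq_pooled = spectral.mean(axis=1)   # (C, T): frequency content summarized per frame
time_pooled = spectral.mean(axis=2)   # (C, F): temporal content summarized per mel bin

# cross-fusion: each branch receives the other branch's pooled summary
fused_temporal = temporal + freq_pooled               # (C, T)
fused_spectral = spectral + time_pooled[:, :, None]   # broadcast summary over frames

print(fused_temporal.shape, fused_spectral.shape)
```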
Time-Frequency Squeeze and Excitation (TFSE)
The TFSE module enhances model robustness by adaptively weighting important time-frequency regions. It operates in two steps:
- Squeeze: Global average pooling aggregates features along the time and frequency dimensions, generating compact representations.
- Excitation: Two separate fully connected layers compute attention weights for time and frequency components, which are then applied to the original features.
By selectively emphasizing informative segments, TFSE improves the model’s ability to suppress noise and focus on relevant speech cues.
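The squeeze-excite-reweight flow can be sketched as follows. Random matrices stand in for the learned fully connected layers, and the layer dimensions are assumptions for illustration; only the two-step structure (axis-wise pooling, then per-axis sigmoid gating) mirrors the description above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
C, F, T = 8, 40, 101                 # channels, mel bins, frames (illustrative)
x = rng.standard_normal((C, F, T))

# squeeze: global average pooling along each axis separately
s_time = x.mean(axis=1)              # (C, T): per-frame descriptor
s_freq = x.mean(axis=2)              # (C, F): per-mel-bin descriptor

# excitation: one fully connected layer per axis
# (random weights stand in for learned parameters)
W_t = rng.standard_normal((T, T)) * 0.1
W_f = rng.standard_normal((F, F)) * 0.1
a_time = sigmoid(s_time @ W_t)       # (C, T): attention weights over frames
a_freq = sigmoid(s_freq @ W_f)       # (C, F): attention weights over mel bins

# reweight the original features along both axes
y = x * a_freq[:, :, None] * a_time[:, None, :]
print(y.shape)
```

Because both gates pass through a sigmoid, every weight lies in (0, 1), so the module can only attenuate regions it deems uninformative, never amplify them unboundedly.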
Experimental Results
Dataset and Training Setup
Experiments were conducted on the Google Speech Commands v2 dataset, which includes 35 command keywords. A 12-class subset was used, covering common instructions such as “yes,” “no,” and “stop.” Noise samples from the MUSAN dataset were mixed in to simulate real-world conditions, at SNRs ranging from clean speech down to -10 dB.
The model was trained with the Adam optimizer at an initial learning rate of 6e-3, and data augmentation techniques such as time shifting and spectrogram masking were applied. Early stopping was employed to prevent overfitting.
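Mixing noise at a target SNR follows the standard formula: scale the noise so that 10·log₁₀(P_speech / P_noise) equals the requested value. A minimal sketch (the paper's exact MUSAN mixing pipeline is not specified; a sine tone stands in for a speech clip):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture speech + noise has the requested SNR in dB."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # solve 10 * log10(p_speech / (scale^2 * p_noise)) = snr_db for scale
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noise = rng.standard_normal(16000)

mixed = mix_at_snr(speech, noise, snr_db=-10)

# verify the achieved SNR of the mixture
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((mixed - speech) ** 2))
print(round(achieved, 1))  # -10.0
```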
Performance Comparison
PTFNet was evaluated against several state-of-the-art models, including MHAtt-RNN, MatchBoxNet, BC-ResNet, and ConvMixer. Key findings include:
- Accuracy: PTFNet achieved the highest recognition accuracy across all SNR conditions, outperforming BC-ResNet-8 by 0.58% in clean conditions and 1.7% at -10 dB SNR.
- Parameter Efficiency: With only 77K parameters, PTFNet is significantly lighter than competitors like BC-ResNet-8 (353K parameters) while delivering better performance.
- Generalization: PTFNet exhibited strong generalization to unseen noise conditions, achieving 96.24% accuracy at 20 dB SNR, a scenario not included in training.
Ablation Studies
Ablation experiments confirmed the contributions of DBF and TFSE:
- Removing DBF led to a 1.68% drop in accuracy at -10 dB SNR, highlighting its role in feature fusion.
- Without TFSE, performance declined by 1.56% at -5 dB SNR, demonstrating its importance in noise suppression.
- Replacing average pooling with max pooling in DBF reduced accuracy, indicating that average pooling better preserves useful features.
Conclusion
This paper presents PTFNet, a lightweight and robust KWS model that leverages parallel time-frequency processing and adaptive attention mechanisms. The DBF module mitigates information loss by fusing temporal and spectral features in parallel, while the TFSE module enhances noise robustness through selective feature weighting. Experimental results demonstrate that PTFNet achieves superior accuracy across diverse noise conditions while maintaining a compact architecture suitable for edge deployment.
Future work will explore the model’s adaptability to reverberant environments and investigate noise-specific enhancements for broader real-world applicability.
DOI: 10.19734/j.issn.1001-3695.2024.04.0121