Cross-Domain Human Action Generalization Recognition Model Based on Small Sample Size and Randomization
Introduction
Human action recognition has long been a critical research topic in computer vision and wireless sensing. With the exponential growth of online video data, traditional machine learning methods, such as those based on human joint points, spatiotemporal interest points, and dense trajectories, can no longer meet increasing application demands. Consequently, the key technologies for human action recognition have shifted toward integrating deep learning with WiFi sensing. The main challenges are: (a) Scene complexity, which primarily affects recognition accuracy. The same action can look very different under varying viewing angles and lighting conditions, and large-scale body movements, variations in body shape, self-occlusion, and object occlusion add further complexity. (b) Action boundary ambiguity. Real-time video may contain multiple actions with varying durations and speeds, making it difficult to locate action boundaries precisely in time or to perform fine-grained temporal analysis, which reduces both recognition accuracy and efficiency.
WiFi-based sensing utilizes fine-grained physical layer attributes like Channel State Information (CSI) to measure specific changes in wireless signals caused by human actions (such as walking or elbow movements). However, wireless signals undergo multipath propagation, meaning collected CSI samples contain both human action information and environmental information. Figure 1 illustrates CSI amplitude variations when the same action is performed across different domains (different users, locations, and directions) with two repetitions. The significant differences in CSI amplitude patterns demonstrate what is known as domain shift. Specifically, Figure 1(a) shows CSI amplitude variations when only the user differs while location and direction remain constant; Figure 1(b) shows variations when only the location changes; and Figure 1(c) shows variations when only the direction changes.
Current deep learning-based WiFi sensing solutions primarily train and test using labeled data collected from the same domain, making them vulnerable to domain shift. For example, the ARIL (Activity Recognition and Indoor Localization) model achieves over 80% accuracy when trained and tested at the same location. However, when trained on locations 1-10 and tested on locations 11-16, its recognition accuracy drops to 36.79%. To further illustrate domain shift’s impact, Figure 2 uses t-SNE to visualize vectorized representations (embeddings) output by ARIL’s encoding layer for both source and target domain data. When tested on unseen target domains, embeddings of the same action class appear highly scattered, indicating the model fails to learn common feature representations for actions in the target domain.
Problem Analysis and System Model
WiFi channels are time-varying under multipath propagation, so environmental changes are ultimately reflected in the channel parameters of each subcarrier in an OFDM system. Human actions in a WiFi environment therefore cause CSI to vary across both time and subcarriers. WiFi sensing first requires a data collection setup with a transmitter and a receiver, for example a commercial router as the transmitter and a computer with an Intel 5300 network card as the receiver. Collected CSI is typically noisy, and packets may be lost at the receiver, so signal processing is needed to reduce noise and interpolation is needed to fill in lost packets. Next, a model suited to the specific scenario must be designed. Finally, the model is trained on the data to find parameters that give robust sensing, and the choice of training method determines the direction in which the model is optimized.
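The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the paper: linear interpolation and a moving-average filter are assumed stand-ins for the unspecified interpolation and noise-reduction methods, and the function names `fill_lost_packets` and `moving_average` are hypothetical.

```python
import numpy as np

def fill_lost_packets(timestamps, csi, target_times):
    """Linearly interpolate each subcarrier onto a uniform time grid.

    csi has shape (num_received_packets, num_subcarriers) and holds amplitudes.
    """
    return np.stack(
        [np.interp(target_times, timestamps, csi[:, k]) for k in range(csi.shape[1])],
        axis=1,
    )

def moving_average(csi, window=5):
    """Smooth each subcarrier along time with a simple moving-average filter."""
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(csi[:, k], kernel, mode="same") for k in range(csi.shape[1])],
        axis=1,
    )

# Toy data (assumed): 10 packets expected, packets 3 and 7 were lost.
expected = np.arange(10, dtype=float)
received = np.delete(expected, [3, 7])
amplitude = np.outer(received, np.array([1.0, 2.0]))  # 2 subcarriers, linear ramps
resampled = fill_lost_packets(received, amplitude, expected)
smoothed = moving_average(resampled, window=3)
```

On real CSI, the filter type and window length would need tuning to the sampling rate and the speed of the actions being sensed.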
To address WiFi cross-domain sensing challenges, academia has proposed various solutions, including domain-independent features, transfer learning, domain adversarial training, and few-shot learning. Domain-independent feature methods often require manual feature design and multiple feature extractors. Transfer learning leverages pretrained models to transfer knowledge from source to target domains, but negative transfer may occur when the domains differ significantly. Domain adversarial training helps feature extractors learn generalizable representations for better target-domain performance, but current methods require extensive source and target domain samples. Few-shot learning methods adapt from source domains to target domains with minimal labeled data and without retraining, but few-shot learners cannot learn domain-independent features; they only find good starting points across base-set tasks for rapid adaptation to test tasks.
Problem Analysis
Few-shot learning (FSL) aims to enable effective learning with scarce data. Traditional deep learning relies on massive labeled datasets, whereas FSL learns effective features from a few labeled samples so that it can predict accurately on new, small datasets. FSL is commonly implemented through meta-learning and has been widely applied in computer vision. Traditional deep learning models learn feature representations in order to predict the corresponding class labels directly. In contrast, an FSL feature extractor is pretrained on a base set (the training set); a few labeled samples then form a support set, and predictions are made from the similarity between query features and support-set features. Recently, FSL has been applied to WiFi sensing to address domain shift.
More generally, machine learning aims to design models that learn from training data and generalize to test data. Traditional models assume that training and test data are independently and identically distributed, but real-world scenarios often violate this assumption, especially in wireless sensing, where the distributions frequently differ enough to reduce model performance drastically. Domain generalization targets robustness when training and test distributions differ and the test data is completely unseen. Current domain generalization approaches include: (a) Data manipulation: Enhances generalization by modifying input data through augmentation (randomization, transformations) or generation (diverse synthetic samples). (b) Representation learning: The most common approach, including domain-invariant representation learning (kernel methods, adversarial training, invariant risk minimization) and feature disentanglement (separating domain-shared and domain-specific features). (c) Learning strategies: Improve generalization through ensemble learning (combining multiple models), meta-learning (simulating domain shifts), distributionally robust optimization (worst-case distribution training), gradient manipulation, or self-supervised learning (designing pretext tasks).
This paper proposes SSRCD-Fi, a Small Sample and Randomized Cross-Domain human action generalization recognition model. First, a feature extractor maps input samples to a vector space, clustering same-action samples and separating different-action samples. Then, for new domains, prototype representations are computed using randomization and few labeled samples. Finally, action classification is achieved by measuring distances between query samples and prototypes.
System Model
To help SSRCD-Fi’s feature extractor learn action-relevant but domain-independent features, we enhance the prototype network by: (a) Adding subcarrier attention to focus on subcarrier signals more affected by human actions; (b) Incorporating a shared action classifier to better learn class prototypes; (c) Using embedding means and variances as domain styles, then randomizing features to force domain-independent learning.
Figure 3 illustrates SSRCD-Fi’s learning process, where colors indicate action classes and shapes represent domains. The model first learns class prototypes, with the classifier helping cluster same-class actions while separating different classes. However, prototypes may be influenced by domain signals (same-color but different-shape vectors cluster together). Thus, domain randomization obscures domain styles to learn domain-independent prototypes.
SSRCD-Fi’s architecture is shown in Figure 4. First, preprocessed CSI data from source domains is split into support and query sets (training phase; prediction on new domains also uses support/query sets). CSI data (with Nt transmit antennas, Nr receive antennas, K subcarriers, and T sequence length) is encoded into feature representations by an attention-based feature extractor. Means and variances of these representations serve as domain styles, which are randomized. For action learning, domain-randomized features undergo adaptive instance normalization (AdaIN) before linear mapping for classification loss. For prototype learning, AdaIN-processed features compute similarities between queries and support prototypes for prototype loss.
Key Techniques and Algorithms
Feature Extractor
SSRCD-Fi’s core task is forcing the feature extractor to learn domain-independent features, so an effective feature extraction network is fundamental. Figure 5 shows our network structure, which adapts residual networks by: (a) Using 1D operations (convolution, batch normalization, pooling) suited to CSI time-series data; (b) Adding channel attention to focus on the more significant subcarrier signals; (c) Employing adaptive average pooling (AVP) at the end to map the 2D feature map directly to a 1D vector, reducing redundancy.
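To make items (b) and (c) concrete, here is a small NumPy sketch of channel attention followed by global average pooling. The text does not specify the exact attention design, so a squeeze-and-excitation style gate is assumed; the weights `w1` and `w2`, the `reduction` ratio, and the tensor sizes are all illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (channels, time). Returns reweighted features and the channel weights."""
    squeeze = x.mean(axis=1)                # (channels,) global average over time
    hidden = np.maximum(w1 @ squeeze, 0.0)  # bottleneck projection + ReLU
    weights = sigmoid(w2 @ hidden)          # (channels,) gates in (0, 1)
    return x * weights[:, None], weights

def global_average_pool(x):
    """Adaptive average pooling to length 1: (channels, time) -> (channels,)."""
    return x.mean(axis=1)

rng = np.random.default_rng(0)
channels, time_steps, reduction = 8, 32, 4   # assumed sizes
x = rng.normal(size=(channels, time_steps))
w1 = rng.normal(size=(channels // reduction, channels))
w2 = rng.normal(size=(channels, channels // reduction))

attended, weights = channel_attention(x, w1, w2)
embedding = global_average_pool(attended)    # 1D vector per sample
```

Channels that receive a gate close to 1 pass through almost unchanged, while channels with a gate near 0 are suppressed before pooling.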
Domain Randomization
Domain randomization obscures feature domain characteristics by combining random interpolation and style transfer. Random interpolation linearly blends features from different samples to create ambiguous domain styles. Style transfer applies AdaIN to transfer statistical features between samples, replacing original subcarrier styles with randomized ones.
During training, the support and query sets (with M and O samples respectively) are passed through the feature extractor, and the resulting support and query features both undergo domain randomization (DR).
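One plausible reading of the randomization step is sketched below: per-sample means and standard deviations act as the domain style, styles are linearly blended between randomly paired samples, and AdaIN re-applies the blended style to the normalized features. The pairing scheme, the uniform blending weights, and the function name `domain_randomize` are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

def domain_randomize(features, rng, eps=1e-6):
    """features: (batch, channels, time). Blend per-sample styles, then apply AdaIN."""
    mu = features.mean(axis=2, keepdims=True)           # (B, C, 1) style mean
    sigma = features.std(axis=2, keepdims=True) + eps   # (B, C, 1) style std
    normalized = (features - mu) / sigma                # content with style removed

    batch = features.shape[0]
    perm = rng.permutation(batch)                       # random partner for each sample
    lam = rng.uniform(size=(batch, 1, 1))               # random interpolation weights
    mixed_mu = lam * mu + (1.0 - lam) * mu[perm]
    mixed_sigma = lam * sigma + (1.0 - lam) * sigma[perm]

    return normalized * mixed_sigma + mixed_mu          # AdaIN with the blended style

rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(4, 5, 16))  # toy batch (assumed sizes)
randomized = domain_randomize(feats, rng)
```

The normalized content of each sample is preserved; only its mean and scale change, which is what pushes the downstream losses to rely on style-independent structure.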
Linear-Based Loss
For linear prediction, a mapping layer and softmax convert randomized features into class probability distributions. Cross-entropy loss is computed between predictions and true labels for both support and query sets.
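As a rough sketch of this loss, the snippet below applies a linear map, a numerically stable log-softmax, and mean cross-entropy. The feature dimension, the number of classes, and the function names are illustrative assumptions.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """logits: (N, num_classes); labels: (N,) integer class ids. Returns mean loss."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def linear_loss(features, labels, weight, bias):
    """Map randomized features to class logits, then apply cross-entropy."""
    return softmax_cross_entropy(features @ weight + bias, labels)

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))       # 5 samples, 4-dim features (assumed sizes)
labels = np.array([0, 1, 2, 3, 4])
weight = np.zeros((4, 6))                # untrained layer: all 6 classes equally likely
bias = np.zeros(6)
initial_loss = linear_loss(features, labels, weight, bias)
```

With an all-zero layer the predicted distribution is uniform, so the loss equals ln 6, the usual starting point for a 6-class problem.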
Prototype-Based Loss
In prototype learning, class prototypes are computed as mean vectors of randomized support features per class. Query samples are classified by measuring similarity (e.g., cosine) to prototypes, with cross-entropy loss between predictions and true labels.
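The prototype computation and similarity-based classification can be illustrated as follows. This is a sketch under assumptions: the temperature `scale` applied to the cosine similarities and the toy 3-way 2-shot embeddings are not taken from the paper.

```python
import numpy as np

def class_prototypes(support, support_labels, num_classes):
    """Mean support embedding per class: returns (num_classes, dim)."""
    return np.stack([support[support_labels == c].mean(axis=0) for c in range(num_classes)])

def cosine_logits(queries, prototypes, scale=10.0, eps=1e-8):
    """Scaled cosine similarity between every query and every prototype."""
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return scale * (q @ p.T)

def prototype_loss(queries, query_labels, prototypes):
    """Cross-entropy over similarity logits; also returns predicted classes."""
    logits = cosine_logits(queries, prototypes)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(query_labels)), query_labels].mean()
    return loss, logits.argmax(axis=1)

# Toy 3-way 2-shot episode with 3-dim embeddings (assumed values).
support = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
                    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1, 2, 2])
queries = np.array([[2.0, 0.1, 0.0], [0.0, 0.2, 3.0]])
query_labels = np.array([0, 2])

prototypes = class_prototypes(support, support_labels, num_classes=3)
loss, predictions = prototype_loss(queries, query_labels, prototypes)
```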
Pretraining Algorithm
For c-way k-shot learning, SSRCD-Fi first learns domain-independent prototypes, then updates feature extractor parameters via combined linear and prototype losses (Algorithm 1). Each iteration randomly samples support/query sets per class, computes randomized features and prototypes, then minimizes total loss.
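The episodic sampling in each iteration might look like the sketch below. Algorithm 1 is not reproduced in this text, so the helper names are hypothetical, and the additive form of the total loss (prototype loss plus λ times linear loss) is an assumption about how the two losses are combined.

```python
import random

def sample_episode(labels, c_way, k_shot, n_query, rng):
    """Return (support_idx, query_idx) index lists for one c-way k-shot episode."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    classes = rng.sample(sorted(by_class), c_way)      # pick c classes
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += picked[:k_shot]                     # k labeled shots per class
        query += picked[k_shot:]                       # disjoint query samples
    return support, query

def total_loss(proto_loss, linear_loss, lam=1.0):
    """Assumed combination: prototype loss plus lam-weighted linear loss."""
    return proto_loss + lam * linear_loss

# Toy label list (assumed): 60 samples, 6 action classes, 10 samples each.
labels = [i % 6 for i in range(60)]
support_idx, query_idx = sample_episode(labels, c_way=6, k_shot=1, n_query=3,
                                        rng=random.Random(0))
```

In a full training loop, the sampled indices would select CSI samples, which are then encoded, randomized, and scored with the two losses before the feature extractor is updated.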
Experimental Design and Results
Dataset Selection
We evaluate SSRCD-Fi’s cross-domain performance on three datasets:
- ARIL Dataset: Supports joint action recognition and indoor localization, containing 6 actions (hand movements) at 16 locations in one room. CSI samples (1×52×192) were collected via USRP and preprocessed for noise removal and normalization.
- CSIDA Dataset: Collected using the Atheros platform (1 transmitter and 1 receiver at 5 GHz), with 114 subcarriers at a 1 kHz sampling rate. Preprocessed CSI samples have shape 3×114×1800 and cover multiple locations, users, and rooms.
- Our Dataset: Collected using 1 router (transmitter) and a computer equipped with an Intel 5300 network card (receiver), containing 6 actions (slide, push, circle, raise, clap, fist). Processed CSI samples have shape 3×300×500 and span multiple locations and users.
Experimental Setup
Experiments cover several cross-domain scenarios (location, user, scene), changing only one domain factor per test. Mean accuracy over multiple trials serves as the evaluation metric. Training uses the Adam optimizer (lr=5e-4) with λ=1 for 300 epochs on PyTorch-Lightning 1.1.8 (PyTorch 1.8.0). Hardware includes an NVIDIA RTX 3080Ti GPU and 128 GB of RAM.
Within-Domain Recognition
First, we verify SSRCD-Fi’s few-shot learning within single domains. Figure 7 shows 6-way 1-shot confusion matrices for the three datasets, with accuracies of 83.3% (ARIL), 91.67% (CSIDA), and 87.5% (our dataset). In our dataset (Figure 7c), slide, raise, and fist achieve 100% accuracy; push and circle reach 88% accuracy (12% misclassified as slide and fist respectively); clap reaches only 50% accuracy (25% each misclassified as push and fist), likely due to similar motion features or insufficient feature extraction for subtle hand-contact patterns. Despite these errors, overall performance remains robust.
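As a sanity check, the per-class figures quoted for our dataset are consistent with the 87.5% overall accuracy if the query set is balanced. The sketch below rebuilds a confusion matrix under the assumption of 8 query samples per class; that count is hypothetical and chosen only because it reproduces the rounded percentages (7/8 ≈ 88%, 1/8 ≈ 12%, 2/8 = 25%).

```python
import numpy as np

actions = ["slide", "push", "circle", "raise", "clap", "fist"]
n = 8  # hypothetical number of query samples per class (not stated in the text)

confusion = np.zeros((6, 6), dtype=int)  # rows: true action, columns: predicted
for name in ("slide", "raise", "fist"):
    i = actions.index(name)
    confusion[i, i] = n                  # 100% correct
confusion[1, 1], confusion[1, 0] = 7, 1  # push: one sample predicted as slide
confusion[2, 2], confusion[2, 5] = 7, 1  # circle: one sample predicted as fist
confusion[4, 4], confusion[4, 1], confusion[4, 5] = 4, 2, 2  # clap: half correct

per_class = confusion.diagonal() / confusion.sum(axis=1)
overall = confusion.trace() / confusion.sum()  # 42 correct out of 48
```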
Cross-Domain Recognition
Cross-Location
This scenario is tested on ARIL (16 locations) and our dataset (4 locations). Figure 8(a) shows 1-shot average accuracy of around 60% (ARIL) and 78% (our dataset). Accuracy improves with more support samples, reaching comparable performance at 3-shot, demonstrating effective real-world applicability.
Cross-User
This scenario is tested on CSIDA (5 users) and our dataset (4 users). Figure 8(b) shows 1-shot accuracy of around 70% (CSIDA) and 90% (our dataset), exceeding 95% (CSIDA) and nearly 90% (our dataset) at 3-shot, confirming strong generalization across users.
Cross-Scene
This scenario is tested only on CSIDA. Figure 8(c) shows 1-shot to 3-shot accuracy between 80% and 90%.
Comparison with Baselines
We compare SSRCD-Fi with WiGr (few-shot) and with EI and JADA (domain adversarial). Table 2 shows: (a) JADA uses 4-stage training and has the largest parameter count, while SSRCD-Fi’s attention and randomization modules give it more parameters than WiGr; (b) SSRCD-Fi and WiGr outperform the adversarial methods in accuracy; (c) SSRCD-Fi surpasses WiGr, indicating better domain-independent feature learning.
Input Type Analysis
Testing amplitude-only, phase-only, and both on CSIDA and our dataset (Figure 9) reveals: (a) Amplitude-only performs worst; (b) Phase changes, more sensitive to gesture-induced path variations, significantly improve results in both datasets.
Similarity Metric Comparison
Comparing Euclidean distance and cosine similarity (Figure 10), cosine consistently outperforms across datasets. Notably, WiGr uses orthogonal regularization to optimize cosine similarity spaces, while SSRCD-Fi achieves this through linear prediction tasks without such regularization.
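One property that separates the two metrics is that cosine similarity ignores embedding magnitude while Euclidean distance does not. The toy example below (made-up 2D vectors, not data from the experiments) shows the two metrics picking different prototypes for the same query; it illustrates the difference only and is not offered as the explanation for the results in Figure 10.

```python
import numpy as np

def nearest_by_euclidean(query, prototypes):
    """Index of the prototype with the smallest Euclidean distance to the query."""
    return int(np.argmin(np.linalg.norm(prototypes - query, axis=1)))

def nearest_by_cosine(query, prototypes):
    """Index of the prototype with the largest cosine similarity to the query."""
    sims = prototypes @ query / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(query))
    return int(np.argmax(sims))

prototypes = np.array([[1.0, 0.0],    # class 0: nearly the query's direction, small norm
                       [2.5, 1.5]])   # class 1: different direction, closer in norm
query = np.array([4.0, 0.2])

euclidean_choice = nearest_by_euclidean(query, prototypes)  # driven by magnitude
cosine_choice = nearest_by_cosine(query, prototypes)        # driven by direction
```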
Model Ablation
We test three SSRCD-Fi variants on our dataset (Figure 11): “w/o DR&SA” (neither domain randomization nor subcarrier attention), “w/o SA”, and “w/o DR”. The full SSRCD-Fi, with both components, achieves the best performance, confirming that each component contributes.
Conclusion
This paper proposes SSRCD-Fi, a few-shot learning model enhanced with auxiliary linear action classification and domain randomization for action-relevant but domain-independent feature learning. Since prototype-based few-shot learning requires target domain support data, SSRCD-Fi’s encoder cannot yet learn fully generalized representations independent of target domains. Future work will explore using subcarrier statistics as domain styles and incorporating adversarial domain learning to extract generalized features from multiple source domains.
For more details, visit the original paper: https://doi.org/10.19734/j.issn.1001-3695.2024.05.0308