Action Recognition Based on Multi-Level Graph Topology Comparison and Refinement
Introduction

Action recognition is a critical research area in computer vision with applications spanning fitness tracking, public safety, healthcare monitoring, and human-computer interaction. Traditional approaches relied on handcrafted features, which required extensive manual tuning and lacked generalization capabilities. With advancements in deep learning, methods such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Graph Convolutional Networks (GCNs) have been widely adopted. Among these, GCNs have emerged as the dominant approach for skeleton-based action recognition due to their ability to process non-Euclidean data effectively.

Despite their success, existing GCN-based methods suffer from limitations, particularly in distinguishing ambiguous samples—actions with highly similar motion trajectories. These methods primarily focus on intra-sequence feature extraction while neglecting cross-sequence contextual information. To address these challenges, this paper introduces a Graph Topology Contrast Refinement Block (GTCR-Block), which leverages contrastive learning to enhance the discriminative power of GCNs. The proposed method refines graph topologies by enforcing intra-class compactness and inter-class separation, thereby improving recognition accuracy, especially for ambiguous actions.

Background and Related Work

Graph Convolutional Networks for Action Recognition

GCNs have revolutionized skeleton-based action recognition by modeling the natural connections between human joints. Yan et al. first introduced the Spatio-Temporal Graph Convolutional Network (ST-GCN), which processes spatial and temporal features separately. However, ST-GCN uses a fixed graph topology shared across all channels, limiting its ability to capture diverse motion patterns.
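The fixed-topology constraint can be sketched as follows: a minimal spatial graph convolution in which every channel aggregates joint features through the same normalized adjacency, as ST-GCN does. The function name, tensor shapes, and the identity adjacency used for the demo are illustrative assumptions, not the paper's implementation.

```python
import torch

def graph_conv_fixed(x, A, W):
    """One spatial graph convolution with a fixed topology shared by all channels.

    x: (N, C_in, T, V) batch of skeleton features
    A: (V, V) normalized adjacency, identical for every channel (as in ST-GCN)
    W: (C_in, C_out) pointwise feature transform
    """
    # Aggregate each joint's neighbors with the same adjacency for all channels.
    x = torch.einsum('nctv,vw->nctw', x, A)
    # Pointwise channel transform.
    return torch.einsum('nctv,cd->ndtv', x, W)

N, C_in, C_out, T, V = 2, 3, 8, 4, 25
x = torch.randn(N, C_in, T, V)
A = torch.eye(V)                      # placeholder for the normalized skeleton graph
W = torch.randn(C_in, C_out)
out = graph_conv_fixed(x, A, W)
```

Because `A` appears only once and outside any channel index, no channel can learn its own connectivity; this is exactly the limitation that later adaptive and channel-wise methods relax.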

To enhance flexibility, Shi et al. proposed the Two-Stream Adaptive Graph Convolutional Network (2s-AGCN), which learns a dynamic graph topology alongside the fixed one. Despite improvements, 2s-AGCN still shares the same topology across all channels within a sample. Chen et al. further refined this approach with the Channel-wise Topology Refinement Graph Convolutional Network (CTR-GCN), which learns both a shared topology and channel-specific correlations, achieving per-channel topology refinement. While CTR-GCN improves feature aggregation, it remains limited in distinguishing highly similar actions due to its reliance on local sequence information.

Contrastive Learning in Action Recognition

Contrastive learning has gained prominence in various domains by learning representations that maximize similarity between positive pairs and dissimilarity between negative pairs. In action recognition, contrastive learning has been applied to distinguish motion regions from static regions and learn cross-skeleton representations. However, existing methods often focus on raw joint descriptors or sequence-level representations without integrating fine-grained joint features with global semantic information.
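The positive/negative-pair objective described above is commonly instantiated as an InfoNCE-style loss. The sketch below is a generic version for single anchor/positive embeddings, assuming cosine similarity as the score; only the temperature value (τ = 0.8) comes from the paper's experiments.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.8):
    """Generic InfoNCE loss: pull the anchor toward its positive, push it
    away from the negatives.

    anchor, positive: (D,) embeddings; negatives: (K, D) embeddings.
    tau is the temperature coefficient.
    """
    a = F.normalize(anchor, dim=0)
    pos = F.normalize(positive, dim=0)
    negs = F.normalize(negatives, dim=1)
    # Similarity of the anchor to the positive (index 0) and each negative.
    logits = torch.cat([(a * pos).sum().view(1), negs @ a]) / tau
    # Cross-entropy against label 0 maximizes the positive's relative score.
    return F.cross_entropy(logits.view(1, -1), torch.zeros(1, dtype=torch.long))

loss = info_nce(torch.randn(16), torch.randn(16), torch.randn(5, 16))
```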

Methodology

Overview

The proposed GTCR-Block is a plug-and-play module compatible with existing GCN architectures such as ST-GCN, 2s-AGCN, and CTR-GCN. It operates during training to refine graph topologies without introducing additional parameters during inference. The module is integrated at multiple stages of the backbone network to enable hierarchical feature learning.

Sample Classification

The GTCR-Block categorizes samples into two groups:

  1. Reliable Samples: Correctly classified samples, considered true positives (TP). These samples exhibit strong intra-class consistency.
  2. Ambiguous Samples: Misclassified samples, further divided into:
    • False Negatives (FN): Samples of action k misclassified as other actions.
    • False Positives (FP): Samples of other actions misclassified as action k.

For reliable samples, a global graph topology is computed for each action class, serving as a reference for contrastive learning.
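The TP/FN/FP partition above can be sketched directly from a batch's predictions; the helper below is a minimal illustration using index lists, not the paper's actual bookkeeping code.

```python
def partition_samples(preds, labels, k):
    """Split a batch into reliable and ambiguous samples w.r.t. action class k.

    preds, labels: sequences of predicted / ground-truth class ids.
    Returns index lists (tp, fn, fp) following the paper's definitions:
    TP = class-k samples classified as k, FN = class-k samples classified
    as something else, FP = other-class samples classified as k.
    """
    tp = [i for i, (p, y) in enumerate(zip(preds, labels)) if y == k and p == k]
    fn = [i for i, (p, y) in enumerate(zip(preds, labels)) if y == k and p != k]
    fp = [i for i, (p, y) in enumerate(zip(preds, labels)) if y != k and p == k]
    return tp, fn, fp

tp, fn, fp = partition_samples([0, 1, 0, 2], [0, 0, 1, 2], k=0)
```

Only the TP samples of class k contribute to that class's global graph topology; the FN and FP indices feed the sample-level memory bank described next.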

Memory Banks

Two memory banks are employed to store cross-batch graph topologies:

  1. Sample-Level Memory Bank (B_sam): Stores graph topologies of ambiguous samples (FN and FP) to enforce separation between misclassified samples and their incorrect labels.
  2. Global-Level Memory Bank (B_glo): Maintains global graph topologies for each action class, updated via momentum to ensure stability.

The global graph topology for action k is computed as a weighted average of historical topologies, preventing abrupt changes and preserving long-term context.
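A momentum-weighted average of this kind has a standard one-line form; the sketch below assumes the topology is stored as a (V, V) matrix and uses the momentum value reported in the experiments (α = 0.95).

```python
import torch

def momentum_update(global_topo, batch_topo, alpha=0.95):
    """Momentum update of one class's global graph topology in B_glo.

    The stored topology moves only a small step toward the current batch's
    topology, preserving long-term context and preventing abrupt changes.
    """
    return alpha * global_topo + (1.0 - alpha) * batch_topo

g = torch.zeros(25, 25)          # stored global topology (V = 25 joints)
b = torch.ones(25, 25)           # topology averaged over this batch's TP samples
g = momentum_update(g, b)        # moves 5% of the way toward the batch topology
```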

Contrastive Learning

The GTCR-Block applies contrastive loss functions to refine graph topologies:
• Sample-Level Loss (L_sam): Encourages FN samples to align with their true class and pushes FP samples away from their mispredicted class.

• Global-Level Loss (L_glo): Ensures that each sample’s graph topology aligns with its true class’s global topology while diverging from other classes.

The combined contrastive loss is computed hierarchically across multiple network stages, enabling multi-level feature refinement.
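As a concrete illustration of the global-level term, the sketch below scores a sample's flattened graph topology against every class's global topology in B_glo and applies a softmax cross-entropy toward the true class. The flattening, cosine similarity, and function signature are assumptions for illustration; only the temperature (τ = 0.8) is taken from the paper.

```python
import torch
import torch.nn.functional as F

def global_level_loss(topo, label, bank_glo, tau=0.8):
    """Sketch of L_glo: align a sample's graph topology with its true class's
    global topology while diverging from the other classes' topologies.

    topo: (V, V) sample topology; bank_glo: (num_classes, V, V) global bank.
    """
    t = F.normalize(topo.flatten(), dim=0)
    bank = F.normalize(bank_glo.flatten(1), dim=1)   # (num_classes, V*V)
    logits = (bank @ t) / tau
    # Cross-entropy toward the true class pulls the sample's topology to its
    # class prototype and pushes it from the rest.
    return F.cross_entropy(logits.view(1, -1), torch.tensor([label]))

loss = global_level_loss(torch.randn(25, 25), 3, torch.randn(60, 25, 25))
```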

Training and Optimization

The overall loss function combines the backbone’s cross-entropy loss (L_CE) with the contrastive loss (L_CL), weighted by a hyperparameter to balance their contributions. The model is trained using stochastic gradient descent with learning rate warmup and decay.
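The overall objective can be sketched as follows, assuming λ_1 to λ_4 weight the per-stage contrastive losses and λ_CL balances the contrastive term against cross-entropy (the hyperparameter values are the ones reported in the experiments; the exact combination formula is an assumption).

```python
def total_loss(l_ce, stage_cl_losses,
               stage_weights=(0.1, 0.3, 0.6, 1.0), lambda_cl=0.2):
    """Sketch of the training objective: L = L_CE + lambda_CL * L_CL,
    where L_CL is a weighted sum of the contrastive losses from the
    GTCR-Blocks at each network stage.
    """
    l_cl = sum(w * l for w, l in zip(stage_weights, stage_cl_losses))
    return l_ce + lambda_cl * l_cl

# Example with unit per-stage losses: L_CL = 0.1 + 0.3 + 0.6 + 1.0 = 2.0,
# so the total is 1.0 + 0.2 * 2.0 = 1.4.
loss = total_loss(1.0, [1.0, 1.0, 1.0, 1.0])
```

Because later stages carry larger weights, refinement of the final feature representations dominates, consistent with the ablation finding that GTCR-Blocks at later stages have the greatest impact.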

Experiments and Results

Datasets

The proposed method is evaluated on two large-scale skeleton-based action recognition datasets:

  1. NTU RGB+D: Contains 60 action classes performed by 40 subjects, captured from three camera angles. Evaluations follow cross-subject (X-Sub) and cross-view (X-View) protocols.
  2. NTU RGB+D 120: An extended version with 120 action classes and 106 subjects, evaluated under cross-subject (X-Sub) and cross-setup (X-Set) protocols.

Implementation Details

Experiments are conducted on an NVIDIA GeForce RTX 3090 GPU using PyTorch. The model is trained for 75 epochs with a batch size of 64. Key hyperparameters include:
• Temperature coefficient (τ): 0.8

• Momentum term (α): 0.95

• Loss weights (λ_1 to λ_4): 0.1, 0.3, 0.6, 1

• Contrastive loss weight (λ_CL): 0.2

Performance Comparison

The GTCR-Block is tested on two backbone networks, 2s-AGCN and CTR-GCN, achieving significant improvements:

NTU RGB+D
• 2s-AGCN + GTCR-Block: X-Sub accuracy improves from 88.5% to 91.9%; X-View accuracy rises from 95.1% to 96.1%.

• CTR-GCN + GTCR-Block: X-Sub accuracy increases from 92.4% to 93.3%; X-View accuracy improves from 96.8% to 97.4%.

NTU RGB+D 120
• 2s-AGCN + GTCR-Block: X-Sub accuracy rises from 82.9% to 87.5%; X-Set accuracy improves from 84.9% to 89.2%.

• CTR-GCN + GTCR-Block: X-Sub accuracy increases from 88.9% to 89.4%; X-Set accuracy rises from 90.6% to 91.2%.

Ablation Studies

  1. Effect of g(·): The projection function g(·) enhances feature quality, contributing to a 0.2% accuracy improvement.
  2. Memory Banks: Using both B_sam and B_glo yields better results than either alone, demonstrating their complementary roles.
  3. Multi-Level Refinement: GTCR-Blocks placed at later stages (e.g., TGN-10) have a greater impact, as they refine final feature representations.

Analysis of Ambiguous Samples

The GTCR-Block significantly improves recognition accuracy for ambiguous actions:
• “Crossing hands in front” improves by 5.6%.

• “Clapping” improves by 12.1%.

• “Reading” improves by 11.8%.

• “Typing on a keyboard” improves by 8.6%.

Visualizations of feature embeddings confirm that GTCR-Block enhances intra-class compactness and inter-class separation, particularly for challenging samples.

Conclusion

This paper presents a novel GTCR-Block that enhances skeleton-based action recognition by refining graph topologies through contrastive learning. By categorizing samples into reliable and ambiguous groups and leveraging memory banks for cross-sequence learning, the method achieves state-of-the-art performance on NTU RGB+D and NTU RGB+D 120 datasets. The GTCR-Block is a versatile module that can be integrated into existing GCN architectures without additional inference costs.

Future work will focus on model optimization techniques such as knowledge distillation and pruning to improve deployment efficiency while maintaining high accuracy.

doi.org/10.19734/j.issn.1001-3695.2024.04.0167
