ASGC-STT: Adaptive Spatial Graph Convolution and Spatio-Temporal Transformer for Human Action Recognition

Human action recognition has emerged as a significant interdisciplinary research direction in computer vision and pattern recognition, attracting widespread attention in recent years. Human actions are embedded in vast amounts of visual data, carrying rich semantic information. The ability to accurately recognize and interpret these actions has broad applications in adaptive recognition, human-computer interaction, and computational behavioral science, making it a valuable area of study in both academia and industry.

Current research in human action recognition primarily falls into two categories: RGB video-based methods and skeleton-based methods. RGB-based approaches require substantial computational resources to process pixel-level information from images or optical flow sequences. In contrast, skeleton-based methods are computationally more efficient, as they represent human structure using 2D or 3D coordinates of a few dozen joints. Additionally, skeleton data provides high-level structural information about human motion, exhibiting greater robustness against appearance variations and environmental noise, such as background clutter and lighting changes. Consequently, skeleton-based action recognition has become an increasingly popular research focus.

To extract discriminative spatial and temporal features from skeleton data, researchers have explored deep learning techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These methods manually construct skeleton sequences into grid-like structures, such as pseudo-images or coordinate vector sequences. However, since skeletons naturally exist as graph-structured data in non-Euclidean space, these approaches fail to fully exploit the inherent relationships between human joints. Recent advances in graph convolutional networks (GCNs) have addressed these limitations by modeling skeleton sequences as spatio-temporal graphs, where joints serve as nodes and natural connections in both space and time define the edges. This representation effectively embeds spatial and temporal relationships into the adjacency matrix of the graph.

Despite their success, GCNs still struggle to capture long-range dependencies between joints. These dependencies can be categorized into explicit and implicit relationships. In a skeletal spatial graph, joints such as the wrist and hand are treated as vertices, while their natural connections form edges. The edges between locally adjacent joints are considered explicit dependencies. However, there also exist learnable implicit dependencies between non-adjacent joints. For example, in the action “drinking water,” the movement of the hand and elbow joints is highly significant. According to biomechanics, hand motion is driven by the elbow, indicating that even topologically distant joints may exhibit implicit dependencies. To effectively capture these crucial relationships, this work introduces an adaptive spatial graph convolution with non-shared graph topology, where each network layer employs a unique graph structure to extract more diverse features. Additionally, multi-scale temporal convolutions are utilized to model high-level temporal dynamics.

Most existing GCN-based methods underestimate long-range temporal dependencies due to the local neighborhood constraints of temporal convolutions. To mitigate this issue, researchers have incorporated attention mechanisms after graph convolution layers to enhance long-range relationship modeling. While this sequential combination improves recognition accuracy, each node is processed independently, and the skeleton is treated as a fully connected graph where every joint connects to all others. This approach introduces unnecessary relationships, potentially reducing discriminative power. For instance, in the action “sitting,” the relationship between the left and right hands is irrelevant and may introduce noise. To address this, a spatio-temporal Transformer module is proposed, which accurately captures correlations between arbitrary joints within and across frames, modeling both local and global joint relationships.

Furthermore, multi-scale features are essential for action recognition. For example, “walking” requires full-body coordination to maintain balance, whereas “waving” only involves hand movement. Since different actions demand varying degrees of joint coordination, a multi-scale feature extractor is necessary to capture dependencies at different ranges. Additionally, actions like “putting on glasses” and “taking off glasses” may appear similar in short sequences, requiring the model to distinguish subtle temporal differences. Thus, capturing both short-range and long-range spatial dependencies, as well as short-term and long-term temporal patterns, is crucial for skeleton-based action recognition. Inspired by residual architectures, a multi-scale residual aggregation module is introduced, which hierarchically expands the receptive field in both spatial and temporal dimensions, effectively modeling multi-scale dependencies.

Adaptive Spatial Graph Convolution Network

The spatial skeleton graph is defined as a structure where vertices represent joints and edges denote natural connections between them. The adjacency matrix encodes these relationships, with entries set to 1 if a connection exists and 0 otherwise. Traditional GCNs propagate features through layers using a shared adjacency matrix, limiting their ability to adapt to different action types. For instance, in “kicking,” interactions between the legs are critical despite their topological separation. Similarly, actions like “brushing teeth” or “covering ears” require stronger connections between the hands and head than with other body parts. A fixed, shared graph topology restricts feature diversity and model flexibility.
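The graph construction described above can be sketched as follows. This is a minimal NumPy illustration with a hypothetical five-joint skeleton and symmetric normalization (a common choice in standard GCNs); the joint count, edge list, and feature sizes are toy values, not the paper's configuration.

```python
import numpy as np

# Hypothetical toy skeleton: 5 joints, edges as (parent, child) pairs.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
num_joints = 5

# Binary adjacency matrix with self-loops (entries 1 if connected, 0 otherwise).
A = np.eye(num_joints)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in standard GCNs.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt

# One graph-convolution step: aggregate neighbor features, then project.
X = np.random.randn(num_joints, 8)   # per-joint input features
W = np.random.randn(8, 16)           # learnable projection weights
X_out = A_norm @ X @ W               # shape (5, 16)
```

Because the adjacency matrix is fixed and shared, every layer aggregates along the same neighborhoods, which is exactly the rigidity the adaptive variant below relaxes.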

To overcome these limitations, an adaptive spatial graph convolution network is proposed, where each layer learns a unique graph topology. This non-shared strategy enhances flexibility by allowing different layers to aggregate distinct features. The adjacency matrix is decomposed into three components: a fixed skeleton-based graph, a learnable layer-specific graph, and an action-specific graph. During training, the layer-specific graph gradually adapts to the data, while the action-specific graph tailors the topology to individual actions. This dynamic adjustment enables the model to capture both explicit and implicit joint dependencies effectively.
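The three-component decomposition can be sketched as below. This is an illustrative NumPy version under assumed toy sizes: the fixed graph is reduced to self-loops, the layer-specific graph is a small learnable matrix initialized near zero, and the action-specific graph is computed as a softmax-normalized pairwise affinity of embedded features (one common data-dependent construction; the paper's exact parameterization may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_in = 5, 8                               # joints, input channels (toy sizes)

A_skel = np.eye(N)                           # fixed skeleton-based graph (self-loops only here)
B = rng.normal(scale=0.01, size=(N, N))      # learnable layer-specific graph, initialized near zero
X = rng.normal(size=(N, C_in))               # per-joint input features

# Action-specific graph: pairwise joint affinity from two feature embeddings,
# softmax-normalized over neighbors so each row forms a distribution.
theta = rng.normal(size=(C_in, 4))
phi = rng.normal(size=(C_in, 4))
logits = (X @ theta) @ (X @ phi).T
logits -= logits.max(axis=1, keepdims=True)  # numerical stability
C_act = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

A_adapt = A_skel + B + C_act                 # per-layer adaptive topology
W = rng.normal(size=(C_in, 16))
X_out = A_adapt @ X @ W                      # adaptive spatial graph convolution
```

Since B is unique per layer and C_act is recomputed per input, different layers aggregate along different graphs, which is the non-shared topology strategy described above.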

Additionally, a multi-scale temporal convolution module is introduced to model temporal relationships across different scales. The module consists of four branches, each processing input features at varying resolutions. The first two branches employ standard convolutions, the third incorporates max pooling, and the fourth uses a simplified convolution path. By concatenating outputs from all branches, the module captures diverse temporal patterns, enhancing the model’s ability to distinguish actions with subtle motion differences.
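A four-branch temporal module of this kind can be sketched as follows. This NumPy version uses shared depthwise filter taps with two dilation rates, a max-pooling branch, and an identity path as the simplified branch; the kernel sizes, dilations, and channel counts are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def temporal_conv(x, kernel, dilation=1):
    """Depthwise 1D convolution along the time axis with 'same' padding.
    x: (T, C) features; kernel: (K,) filter taps shared across channels."""
    K = len(kernel)
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T = x.shape[0]
    out = np.zeros_like(x)
    for k in range(K):
        out += kernel[k] * xp[k * dilation : k * dilation + T]
    return out

T, C = 16, 8
x = np.random.randn(T, C)
k = np.ones(3) / 3.0                           # toy filter taps

# Four branches at different temporal scales, concatenated on channels.
b1 = temporal_conv(x, k, dilation=1)           # local motion
b2 = temporal_conv(x, k, dilation=2)           # wider receptive field
b3 = np.stack([np.max(np.pad(x, ((1, 1), (0, 0)))[t:t + 3], axis=0)
               for t in range(T)])             # max-pooling branch
b4 = x                                         # simplified identity path
y = np.concatenate([b1, b2, b3, b4], axis=1)   # shape (16, 32)
```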

Spatio-Temporal Transformer Module

While adaptive graph convolution improves flexibility, long-range dependencies may still be underestimated. Moreover, increasing the number of GCN layers risks over-smoothing, where node features become indistinguishable. Traditional Transformer-based approaches ignore the inherent skeleton topology, forcing unnecessary relationships between joints and introducing noise. To address these issues, a spatio-temporal Transformer module is proposed, combining spatial and temporal self-attention mechanisms to model both local and global dependencies.

In the spatial dimension, the module computes query, key, and value vectors for each joint in a frame, using shared linear transformations. Multi-head attention then calculates attention scores, weighting the importance of different joints. This allows the model to focus on relevant joints while suppressing irrelevant ones. In the temporal dimension, joints across frames are similarly processed, enabling the model to capture long-range motion patterns. The multi-head attention mechanism aggregates features from multiple subspaces, enhancing representation diversity.
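The spatial attention step can be sketched as standard multi-head scaled dot-product attention over the joints of one frame. Dimensions and weight initializations here are toy assumptions; the temporal branch works identically with frames, rather than joints, as the sequence axis.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
N, d_model, heads = 5, 16, 4                  # joints per frame, model dim, heads
d_k = d_model // heads
X = rng.normal(size=(N, d_model))             # joint features of one frame

# Shared linear maps producing queries, keys, and values.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

def split_heads(M):
    return M.reshape(N, heads, d_k).transpose(1, 0, 2)   # (heads, N, d_k)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
scores = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k))  # (heads, N, N)
out = (scores @ Vh).transpose(1, 0, 2).reshape(N, d_model)   # merge heads
```

Each row of `scores` is the learned weighting over all joints for one query joint, which is how relevant joints are emphasized and irrelevant ones suppressed.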

The Transformer module is integrated into the network through a hierarchical structure, where spatial and temporal attention are applied sequentially. This design ensures that both intra-frame and inter-frame relationships are captured, improving the model’s ability to recognize complex actions.

Multi-Scale Residual Aggregation Module

Capturing multi-scale joint dependencies is crucial for action recognition. Existing methods either introduce additional modules or use high-order polynomial expansions of the adjacency matrix, often struggling to balance performance and computational cost. To address this, a multi-scale residual aggregation module is proposed, which hierarchically processes input features through cascaded residual connections.

The input features are split into four segments, each processed by adaptive graph convolution and Transformer modules. Residual connections between adjacent segments enrich the receptive field, enabling the model to capture both local and non-local dependencies. A gating mechanism is introduced to adaptively filter features, emphasizing important relationships while suppressing noise. This selective aggregation enhances the model’s discriminative power.

By concatenating outputs from all segments and applying a final residual connection, the module generates a comprehensive feature representation that spans multiple scales. This design effectively expands the receptive field in both spatial and temporal dimensions, improving the model’s ability to recognize actions with varying complexity.
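The split-and-cascade scheme can be sketched as follows. This NumPy illustration replaces the full adaptive-graph-convolution and Transformer sub-blocks with simple linear transforms so the aggregation structure itself is visible; segment widths, the sigmoid gate, and the transforms are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
N, C = 5, 32                                   # joints, channels (toy sizes)
X = rng.normal(size=(N, C))
segs = np.split(X, 4, axis=1)                  # four channel segments of width 8

Ws = [rng.normal(size=(8, 8)) for _ in range(4)]   # stand-ins for per-segment sub-blocks
Wg = [rng.normal(size=(8, 8)) for _ in range(4)]   # gating weights

outs, prev = [], 0.0
for s, W, G in zip(segs, Ws, Wg):
    h = (s + prev) @ W                         # cascaded residual from the previous segment
    gate = sigmoid(h @ G)                      # adaptive gate filters aggregated features
    prev = gate * h
    outs.append(prev)

Y = np.concatenate(outs, axis=1) + X           # concat all segments, final residual connection
```

Because segment i receives the output of segment i-1, later segments see progressively larger effective receptive fields, which is how the module expands its scale hierarchically.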

Multi-Stream Feature Fusion

Skeleton data is typically represented using joint coordinates, but certain actions, such as “standing” and “sitting,” may be difficult to distinguish based solely on spatial features. To enhance discriminative power, second-order motion features are incorporated, capturing the relative movement of joints and bones between consecutive frames. These motion features provide complementary temporal information, improving the model’s robustness.

Four input streams are utilized: joint coordinates, joint motion, bone coordinates, and bone motion. Each stream is processed independently, and their outputs are combined through weighted fusion. This multi-stream approach ensures that both spatial and temporal dynamics are fully exploited, leading to more accurate action recognition.
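The four streams and their fusion can be sketched as below. The parent-joint list and fusion weights are hypothetical; bones are taken as joint-minus-parent vectors and motion streams as frame-to-frame differences, with late fusion of per-stream class scores (a common multi-stream setup).

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 16, 5
parents = [0, 0, 1, 2, 1]                      # hypothetical parent of each joint

joints = rng.normal(size=(T, N, 3))            # (frames, joints, xyz)
bones = joints - joints[:, parents, :]         # bone vector = joint minus its parent
joint_motion = np.diff(joints, axis=0)         # frame-to-frame displacement
bone_motion = np.diff(bones, axis=0)

# Late fusion: each stream's network yields class scores; combine with weights.
num_classes = 10
scores = rng.normal(size=(4, num_classes))     # one score vector per stream (stand-in)
weights = np.array([0.6, 0.6, 0.4, 0.4])       # hypothetical fusion weights
fused = (weights[:, None] * scores).sum(axis=0)
pred = int(np.argmax(fused))
```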

Experimental Results

The proposed ASGC-STT model is evaluated on three large-scale datasets: NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-Skeleton 400. On NTU-RGB+D 60, the model achieves 92.7% accuracy in the cross-subject setting and 96.9% in the cross-view setting. On NTU-RGB+D 120, it reaches 88.2% (cross-subject) and 89.5% (cross-setup). For Kinetics-Skeleton 400, top-1 and top-5 accuracies are 38.6% and 61.4%, respectively. These results demonstrate the model's superior performance and generalization capability.

Ablation studies confirm the contributions of each module. Removing the adaptive graph convolution or multi-scale temporal convolution leads to significant performance drops, highlighting their importance. Similarly, the spatio-temporal Transformer and multi-scale residual aggregation modules consistently improve accuracy across all datasets. Visualization of feature responses further validates that the model effectively focuses on action-relevant joints.

Conclusion

The ASGC-STT framework introduces a novel approach to skeleton-based action recognition by addressing the limitations of static graph topologies and local convolutional operators. The adaptive spatial graph convolution enables dynamic feature extraction, while the spatio-temporal Transformer captures long-range dependencies. The multi-scale residual aggregation module hierarchically models joint relationships at varying scales, enhancing discriminative power. Extensive experiments demonstrate state-of-the-art performance across multiple datasets, confirming the model’s effectiveness and robustness.

doi:10.19734/j.issn.1001-3695.2024.07.0255
