Logistics Driver Dangerous Behavior Recognition Based on Edge Features
Introduction

Ensuring the safe operation of logistics vehicles is crucial for maintaining industrial productivity. However, prolonged driving can lead to driver fatigue, distraction, and dangerous behaviors such as phone usage, smoking, or yawning, which may result in accidents. Traditional monitoring methods relying on RGB cameras struggle in industrial environments due to lighting variations, occlusions, and complex backgrounds. To address these challenges, skeleton-based action recognition has emerged as a robust alternative, leveraging human pose estimation to analyze driver behavior independently of visual noise.

Recent advances in graph convolutional networks (GCNs) have improved skeleton-based action recognition by modeling joint relationships in both spatial and temporal dimensions. However, existing methods often fail to adequately capture subtle yet critical movements, particularly those involving distant joints (e.g., hand gestures during phone calls). Additionally, similar arm motions (e.g., smoking vs. phone usage) are frequently misclassified due to insufficient feature discrimination.

This paper introduces EF-GCN (Edge Feature Graph Convolutional Network), a novel framework designed to enhance dangerous behavior recognition in logistics drivers. The key contributions include:

  1. Spatial Perception Module: Dynamically adjusts joint importance based on movement patterns, emphasizing distant joints that are often overlooked.
  2. Spatio-Temporal Edge Attention (ST-edge): Enhances feature extraction for edge joints, improving recognition of subtle but critical motions.
  3. Separable Convolution (SC Block): Reduces computational complexity while maintaining accuracy.
  4. Similar Feature Recognition Network (SF-RN): Differentiates between visually similar actions (e.g., phone calls vs. smoking) using contrastive learning.

Extensive experiments on public datasets (NTU RGB+D 60, NW-UCLA) and a custom industrial dataset demonstrate that EF-GCN outperforms existing methods, achieving higher accuracy with lower computational costs.

Methodology

1. Overview of EF-GCN

EF-GCN processes skeletal data in two main stages: spatial modeling and temporal modeling.

  • Spatial Modeling: Focuses on joint relationships within each frame. The Spatial Perception Module identifies key joints (e.g., hands, elbows) that contribute most to dangerous behaviors. The ST-edge module further refines edge joint features to capture fine-grained motions.
  • Temporal Modeling: Analyzes joint movements across frames. The SC Block reduces redundancy in temporal convolutions, while SF-RN distinguishes between similar actions by learning discriminative features.

2. Spatial Perception Module

Traditional skeleton-based models treat all joints equally, ignoring that some (e.g., hands) are more informative for behavior recognition. The Spatial Perception Module dynamically assigns higher weights to distant joints (e.g., wrists, ankles) that exhibit significant motion during dangerous actions.

This module uses adaptive graph convolution to learn joint dependencies, ensuring that edge joints receive appropriate attention. For example, when a driver raises a hand to answer a phone, the wrist and elbow joints should be prioritized over less relevant ones (e.g., knees).
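The weighting idea can be sketched in NumPy (a toy illustration, not the paper's implementation; the chain skeleton, joint weights, and dimensions are invented for demonstration):

```python
import numpy as np

def spatial_perception_conv(x, adjacency, joint_weights, w):
    """One adaptive graph-convolution step: a learned per-joint weight
    vector boosts distant joints (e.g., wrists) before aggregation.
    x: (num_joints, in_dim) joint features for one frame
    adjacency: (num_joints, num_joints) row-normalized skeleton graph
    joint_weights: (num_joints,) learned importance, >1 for edge joints
    w: (in_dim, out_dim) feature transform
    """
    x_weighted = joint_weights[:, None] * x      # emphasize key joints
    aggregated = adjacency @ x_weighted          # neighborhood aggregation
    return aggregated @ w                        # feature transform

# toy example: 4 joints in a chain; the "wrist" (index 3) gets extra weight
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
a = np.eye(4) + np.eye(4, k=1) + np.eye(4, k=-1)  # chain skeleton + self-loops
a = a / a.sum(axis=1, keepdims=True)               # row-normalize
weights = np.array([1.0, 1.0, 1.0, 2.0])           # boost the wrist joint
w = rng.standard_normal((8, 16))
out = spatial_perception_conv(x, a, weights, w)
print(out.shape)  # (4, 16)
```

In a trained model the weight vector would be a learnable parameter updated by backpropagation; here it is fixed to show the mechanism.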

3. Spatio-Temporal Edge Attention (ST-edge)

While standard attention mechanisms focus on central joints, ST-edge explicitly enhances edge joint features by:

  • Spatial Attention: Identifies which joints are most active in each frame.
  • Temporal Attention: Tracks how joint movements evolve over time.

By combining these, ST-edge ensures that subtle but critical motions (e.g., a hand moving toward the ear for a phone call) are not overlooked.
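The two attention branches above can be combined as a minimal sketch (a simplified stand-in using mean-pooled softmax scores; the paper's actual attention is learned, and the tensor shapes here are assumed):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def st_edge_attention(x):
    """x: (frames, joints, dim). A spatial score reweights joints within
    each frame; a temporal score reweights frames across the clip, so
    active edge joints in active frames are amplified."""
    spatial_score = softmax(x.mean(axis=-1), axis=1)       # (frames, joints)
    temporal_score = softmax(x.mean(axis=(1, 2)), axis=0)  # (frames,)
    x = x * spatial_score[..., None]       # joint-wise reweighting
    x = x * temporal_score[:, None, None]  # frame-wise reweighting
    return x

x = np.random.default_rng(1).standard_normal((10, 4, 8))  # 10 frames, 4 joints
y = st_edge_attention(x)
print(y.shape)  # (10, 4, 8)
```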

4. Separable Convolution (SC Block)

Deep learning models often suffer from high computational costs due to redundant operations. The SC Block replaces standard 5×1 convolutions with depthwise separable convolutions, which:

  • First apply lightweight filters per channel (depthwise convolution).
  • Then combine features across channels (pointwise convolution).

This reduces parameters while preserving accuracy, making EF-GCN suitable for real-time industrial deployment.
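The parameter savings are easy to verify with a back-of-the-envelope count (channel count of 64 is an arbitrary example, not a figure from the paper):

```python
# Parameter count of a standard kx1 temporal convolution vs. its
# depthwise-separable replacement (depthwise kx1 + pointwise 1x1).
def standard_params(c_in, c_out, k=5):
    return c_in * c_out * k

def separable_params(c_in, c_out, k=5):
    return c_in * k + c_in * c_out  # depthwise filters + pointwise mixing

c = 64
print(standard_params(c, c))   # 20480
print(separable_params(c, c))  # 4416  (~4.6x fewer parameters)
```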

5. Similar Feature Recognition Network (SF-RN)

Many dangerous behaviors involve nearly identical arm motions (e.g., smoking vs. holding a phone). To address this, SF-RN employs contrastive learning:

  • Clusters training samples into “clear” (easily recognizable) and “ambiguous” (easily confused) groups.
  • Adjusts feature weights to maximize differences between similar actions.

For instance, it learns that phone usage typically involves longer hand-to-ear durations, whereas smoking involves repetitive hand-to-mouth motions.
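The contrastive objective behind SF-RN can be illustrated with a standard margin-based pair loss (a generic formulation, not necessarily the paper's exact loss; the 2-D feature vectors are toy values):

```python
import numpy as np

def contrastive_pair_loss(f1, f2, same_class, margin=1.0):
    """Margin contrastive loss on a feature pair: pull together features
    of the same action, push apart features of similar-looking but
    different actions (e.g., smoking vs. phone use)."""
    d = np.linalg.norm(f1 - f2)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

phone = np.array([1.0, 0.0])
smoke = np.array([0.9, 0.1])   # nearly identical features -> large push-apart loss
loss_diff = contrastive_pair_loss(phone, smoke, same_class=False)
loss_same = contrastive_pair_loss(phone, phone, same_class=True)
print(loss_diff > loss_same)  # True
```

Minimizing this loss over the "ambiguous" group drives the network to separate action pairs whose raw skeletal features nearly coincide.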

Experiments and Results

1. Datasets

  • NTU RGB+D 60: Large-scale dataset with 60 action classes, used to evaluate general action recognition.
  • NW-UCLA: Captures multi-view skeletal data, testing robustness to viewpoint changes.
  • Custom Industrial Dataset: Real-world logistics driver recordings, featuring occlusion and lighting variations.

2. Performance Comparison

A. Accuracy on Public Datasets

Model               NTU RGB+D 60 (X-Sub)   NW-UCLA
ST-GCN (baseline)   81.5%                  88.3%
CTR-GCN             88.7%                  93.7%
HD-GCN              89.4%                  94.4%
EF-GCN (ours)       91.9%                  95.8%

On NTU RGB+D 60 (X-Sub), EF-GCN is 3.2 percentage points above CTR-GCN and 2.5 points above HD-GCN, demonstrating superior feature learning.

B. Ablation Study

Configuration          Top-1 Accuracy   Parameters
Baseline (CTR-GCN)     88.7%            1.46M
+ Spatial Perception   90.3% (+1.6%)    1.79M
+ ST-edge              89.8% (+1.1%)    1.88M
+ SC Block             88.3% (-0.4%)    1.05M
+ SF-RN                89.9% (+1.2%)    1.99M
Full EF-GCN            91.9%            2.21M

The spatial perception and SF-RN modules contribute most to accuracy, while SC Block reduces parameters with minimal performance loss.

3. Industrial Deployment Results

On the custom logistics dataset, EF-GCN achieves 82.3% accuracy, outperforming:

  • CTR-GCN (74.5%)
  • BlockGCN (77.9%)

The confusion matrix reveals:

  • Phone calls: 87.27% accuracy (high due to distinct hand-to-ear motion).
  • Smoking: 82.18% accuracy (sometimes confused with eating).
  • Looking backward: 73.82% accuracy (challenging due to camera angle).

Conclusion

EF-GCN introduces several innovations to improve logistics driver behavior recognition:

  1. Spatial Perception Module emphasizes distant joints critical for detecting dangerous actions.
  2. ST-edge Attention enhances edge joint feature extraction.
  3. SC Block reduces computational costs without sacrificing accuracy.
  4. SF-RN effectively distinguishes between similar arm motions.

Experiments confirm that EF-GCN outperforms existing methods on both public and industrial datasets. Future work will explore few-shot learning to improve recognition for rare behaviors and multi-modal fusion (e.g., combining skeleton data with RGB) for even higher robustness.

doi:10.19734/j.issn.1001-3695.2024.06.0251