3D Multi-Object Tracking with Multi-Modal Embedding and Trajectory Correction

Introduction

Three-dimensional (3D) multi-object tracking is an important research direction in computer vision, with applications in autonomous driving and intelligent transportation systems. Its task is to identify objects and follow them continuously across sequential frames. In autonomous driving, 3D multi-object tracking helps estimate each object's motion, speed, and direction, which supports path planning and collision avoidance.

Despite recent progress, tracking remains difficult because scenes are complex and object states are uncertain. Existing methods often rely on geometric or appearance features for data association, and these approaches struggle with occlusion, false detections, and identity switches. To address these issues, this paper presents a 3D multi-object tracking algorithm that integrates multi-modal embedding learning, multi-feature association, and dual-stream trajectory correction.

Methodology

The proposed framework consists of three main components:

  1. Multi-Modal Embedding Learning Network
    • Multi-Scale Image Semantic Feature Enhancement: A feature pyramid network processes image features at different scales, enhancing discriminative power for small objects. Spatial and channel attention mechanisms refine feature representations.

    • Coordinate Information Integration: Image position features are combined with 3D point cloud mappings to preserve spatial context.

    • Multi-Modal Re-Fusion: Image and point cloud features are fused adaptively using attention weights, improving tracking embeddings.

  2. Multi-Feature Data Association Module
    • Affinity Matrix Construction: Four affinity matrices are computed (embedding similarity, IoU, distance, and angle) to measure how well each detection matches each trajectory.

    • Angle Error Correction: Misaligned angle predictions are adjusted based on IoU thresholds to prevent incorrect associations.

    • Mixed-Integer Linear Programming: A solver optimizes the matching between detections and trajectories using the combined affinity scores.

  3. Dual-Stream Trajectory Correction and Management
    • Provisional Trajectory Validation: New detections are initialized as provisional trajectories and promoted to confirmed status after consistent matches.

    • Disappeared Trajectory Recovery: Temporarily lost trajectories are predicted forward and backward using Kalman filtering. Corrected trajectories are merged to fill gaps.

    • Error Pruning: Invalid trajectories are deleted if unmatched beyond a threshold.
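
The data association step in component 2 can be illustrated with a small sketch. It is not the paper's implementation: the equal weights, the 0.5 IoU threshold, the rule that flips a roughly 180-degree heading error when boxes overlap strongly, and the minimum score are all assumptions, and every function name is hypothetical. An exhaustive search over one-to-one assignments stands in for the integer-programming solver, so it is only practical for a handful of objects.

```python
import math
from itertools import permutations

def angle_affinity(det_yaw, trk_yaw, iou, iou_thresh=0.5):
    """Heading similarity in [0, 1] (1 = same heading).

    Assumed correction rule: if the boxes overlap strongly but the headings
    are nearly opposite, treat the heading as flipped by pi.
    """
    diff = abs(det_yaw - trk_yaw) % (2 * math.pi)
    diff = min(diff, 2 * math.pi - diff)           # wrap into [0, pi]
    if iou >= iou_thresh and diff > math.pi / 2:
        diff = math.pi - diff
    return 1.0 - diff / math.pi

def combine(emb, iou, dist, ang, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four same-shaped, non-empty affinity matrices."""
    rows, cols = len(emb), len(emb[0])
    mats = (emb, iou, dist, ang)
    return [[sum(w * m[i][j] for w, m in zip(weights, mats))
             for j in range(cols)] for i in range(rows)]

def best_assignment(score, min_score=0.3):
    """One-to-one (detection, trajectory) pairs maximizing the total score.

    Pairs scoring below min_score are left unmatched. Brute force is used
    here in place of a mixed-integer linear programming solver.
    """
    n_det, n_trk = len(score), len(score[0])
    if n_det <= n_trk:
        candidates = (list(enumerate(p))
                      for p in permutations(range(n_trk), n_det))
    else:
        candidates = ([(d, t) for t, d in enumerate(p)]
                      for p in permutations(range(n_det), n_trk))
    best_total, best_pairs = -1.0, []
    for pairs in candidates:
        kept = [(d, t) for d, t in pairs if score[d][t] >= min_score]
        total = sum(score[d][t] for d, t in kept)
        if total > best_total:
            best_total, best_pairs = total, kept
    return sorted(best_pairs)

# Three detections, two trajectories; the third detection stays unmatched.
combined = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
matches = best_assignment(combined)
```

Unmatched detections from such a step would start provisional trajectories, and unmatched trajectories would be handed to the correction stage.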

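The forward and backward prediction used to recover disappeared trajectories can be sketched in the same spirit. The version below is a simplification under stated assumptions: constant-velocity extrapolation replaces the Kalman filter, the two streams are merged with a linear blend that shifts weight toward the backward stream as the re-detection frame approaches, and the function name and arguments are hypothetical.

```python
def fill_gap(last_pos, last_vel, next_pos, next_vel, gap):
    """Estimate positions for `gap` missing frames between two observations.

    last_pos, last_vel: position and per-frame velocity at the final frame
                        before the trajectory was lost.
    next_pos, next_vel: position and per-frame velocity at the first frame
                        after the trajectory was matched again.
    """
    filled = []
    for k in range(1, gap + 1):
        # Forward stream: extrapolate k frames from the last observation.
        forward = [p + v * k for p, v in zip(last_pos, last_vel)]
        # Backward stream: extrapolate back from the re-detection.
        steps_back = gap + 1 - k
        backward = [p - v * steps_back for p, v in zip(next_pos, next_vel)]
        # Assumed merge rule: linear blend between the two streams.
        w = k / (gap + 1)
        filled.append([(1 - w) * f + w * b for f, b in zip(forward, backward)])
    return filled

# A target moving 1 m per frame along x, missing for three frames.
positions = fill_gap([0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [4.0, 0.0, 0.0], [1.0, 0.0, 0.0], gap=3)
```

When the two streams agree, as in this example, the blend reproduces the straight-line path; when they disagree, the merged estimate moves smoothly from one to the other.
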
Experiments

Dataset and Setup
The KITTI dataset, comprising LiDAR and camera data from urban, highway, and campus scenes, was used for evaluation. Training involved 21 sequences, while testing used 29 sequences. The model was implemented in PyTorch, with EPNet as the pretrained detector.

Ablation Studies

  1. Global Ablation:
    • Adding the multi-modal embedding network improved HOTA by 1.63% and MOTA by 0.73%.

    • Incorporating multi-feature association further boosted HOTA to 78.57% and MOTA to 86.46%.

    • The full model with trajectory correction achieved 79.91% HOTA and 89.13% MOTA, demonstrating significant reductions in ID switches and fragments.

  2. Parameter Analysis:
    • Optimal trajectory correction was achieved with an 8-frame threshold.

    • A 2-frame validation window balanced accuracy and computational efficiency.
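
One way these two thresholds could enter trajectory management is sketched below. The state names, the `Track` class, and the exact rules are assumptions made for illustration: a provisional trajectory is confirmed after 2 consecutive matches, dropped on its first miss, and a confirmed trajectory is kept as lost for up to 8 consecutive unmatched frames before it is deleted.

```python
from dataclasses import dataclass

CONFIRM_HITS = 2   # validation window from the parameter analysis
MAX_MISSES = 8     # correction threshold from the parameter analysis

@dataclass
class Track:
    track_id: int
    state: str = "provisional"   # provisional, confirmed, lost, or deleted
    hits: int = 0                # consecutive matched frames
    misses: int = 0              # consecutive unmatched frames

    def update(self, matched: bool) -> str:
        """Advance the trajectory by one frame and return its new state."""
        if self.state == "deleted":
            return self.state
        if matched:
            self.hits += 1
            self.misses = 0
            if self.state == "lost":
                self.state = "confirmed"    # recovered; the gap can be filled
            elif self.state == "provisional" and self.hits >= CONFIRM_HITS:
                self.state = "confirmed"
        else:
            self.misses += 1
            self.hits = 0
            if self.state == "provisional":
                self.state = "deleted"      # assumed: never validated, so drop
            elif self.misses > MAX_MISSES:
                self.state = "deleted"      # unmatched beyond the threshold
            else:
                self.state = "lost"
        return self.state

track = Track(track_id=1)
history = [track.update(m) for m in (True, True, False, False, True)]
```

Here the trajectory is confirmed on its second match, marked lost for two frames, and restored to confirmed status when it is matched again.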

Comparative Results
The proposed method outperformed state-of-the-art approaches:
• HOTA: 77.72% (vs. 75.46% for DeepFusionMOT).

• MOTA: 88.24% (vs. 87.82% for EagerMOT).

• ID Switches: 71 (vs. 113 for AB3DMOT).

• Fragments: 210 (vs. 390 for EagerMOT).

Qualitative results showed robust tracking under occlusion, with minimal identity switches. Compared to baseline methods, the algorithm effectively recovered missing trajectories and reduced false positives.
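
To put the reported comparison figures on a common footing, the short calculation below converts the count-based metrics into relative reductions and the percentage metrics into point differences. It uses only the numbers quoted above; the helper name is hypothetical.

```python
def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

id_switch_cut = relative_reduction(71, 113)    # vs. AB3DMOT
fragment_cut = relative_reduction(210, 390)    # vs. EagerMOT
hota_gain = 77.72 - 75.46                      # vs. DeepFusionMOT
mota_gain = 88.24 - 87.82                      # vs. EagerMOT

print(f"ID switches: {id_switch_cut:.1f}% fewer")   # 37.2% fewer
print(f"Fragments:   {fragment_cut:.1f}% fewer")    # 46.2% fewer
print(f"HOTA gain:   {hota_gain:.2f} points")       # 2.26 points
print(f"MOTA gain:   {mota_gain:.2f} points")       # 0.42 points
```

The gains are largest on the identity-related counts, which fits the emphasis on trajectory correction, while the MOTA margin over EagerMOT is comparatively small.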

Conclusion

This paper introduced a comprehensive 3D multi-object tracking framework leveraging multi-modal embedding, enhanced data association, and trajectory correction. Key contributions include:
• A discriminative embedding network combining image and point cloud features.

• A robust association strategy integrating geometric and appearance cues.

• A dual-stream trajectory manager that repairs fragmented tracks.

Experiments on KITTI validated the method’s superiority in accuracy and robustness. Future work may explore real-time optimizations and extensions to more complex environments.

doi.org/10.19734/j.issn.1001-3695.2024.01.0066