DepthMamba: A Multi-Scale Vision Mamba Architecture for Monocular Depth Estimation
Introduction
Monocular depth estimation is a fundamental task in computer vision with broad applications in autonomous driving, 3D reconstruction, and virtual reality. Traditional methods rely on expensive hardware such as LiDAR or depth cameras, which limits their widespread adoption. In contrast, deep learning-based approaches enable depth estimation from a single RGB image, eliminating the need for specialized equipment. Over the years, convolutional neural networks (CNNs) and Transformers have dominated this field, but both have inherent limitations: CNNs struggle to capture long-range dependencies because of their localized receptive fields, while Transformers, despite excelling at global feature extraction, incur quadratic computational complexity.
To address these challenges, this paper introduces DepthMamba, an end-to-end model based on the Vision Mamba architecture. DepthMamba leverages Visual State Space (VSS) modules to efficiently model long-range dependencies while maintaining linear computational complexity. Additionally, the model incorporates an MLP-Bins depth prediction module to generate smooth and precise depth maps. Extensive experiments on the NYU Depth V2 (indoor) and KITTI (outdoor) datasets demonstrate that DepthMamba outperforms existing Transformer-based approaches while significantly reducing computational overhead.
Background and Motivation
Limitations of CNNs and Transformers
CNNs have been widely used in monocular depth estimation due to their hierarchical feature extraction capabilities. However, their reliance on local receptive fields restricts their ability to capture global context, leading to suboptimal performance in complex scenes. Early works like Eigen et al. introduced multi-scale networks to mitigate this issue, but their models still lacked sufficient global awareness. Subsequent improvements, such as DSPP modules and AdaBins, attempted to expand receptive fields but remained constrained by the fundamental limitations of convolution operations.
Transformers, on the other hand, revolutionized vision tasks by leveraging self-attention mechanisms to model long-range dependencies. Models like Depthformer and DPT-Hybrid demonstrated superior performance by integrating Transformer-based encoders with CNN decoders. However, the quadratic computational cost of self-attention makes these models impractical for high-resolution images.
The Rise of State Space Models
State Space Models (SSMs), particularly Mamba, have emerged as a promising alternative to Transformers. Mamba introduces selective state spaces that enable efficient sequence modeling with linear complexity. VMamba further adapted this approach for vision tasks, demonstrating that SSMs can match or exceed Transformer performance while being computationally more efficient.
Motivated by these advancements, DepthMamba integrates VSS modules into an encoder-decoder framework, combining the strengths of SSMs with multi-scale feature fusion. The result is a model that efficiently captures global context while maintaining computational efficiency.
DepthMamba Architecture
Overview
DepthMamba follows an encoder-decoder structure, designed to extract multi-scale features and reconstruct high-quality depth maps. The encoder progressively reduces spatial resolution while increasing channel dimensions, whereas the decoder gradually recovers spatial details through skip connections and upsampling.
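A minimal PyTorch sketch of this layout is shown below. The channel widths, patch-embedding stem, and convolutional down/upsampling layers are illustrative assumptions rather than the paper's exact configuration, and the encoder stages are left as identity placeholders for the VSS blocks described in the next subsections.

```python
import torch
import torch.nn as nn

class DepthMambaSkeleton(nn.Module):
    """Illustrative encoder-decoder layout: the encoder halves spatial
    resolution while widening channels; the decoder upsamples and fuses
    encoder features back in through skip connections."""

    def __init__(self, channels=(96, 192, 384, 768)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, channels[0], kernel_size=4, stride=4)
        # Each encoder stage would stack VSS blocks (sketched below); identity here.
        self.enc_stages = nn.ModuleList(nn.Identity() for _ in channels)
        self.downs = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], kernel_size=2, stride=2)
            for i in range(len(channels) - 1)
        )
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(channels[i + 1], channels[i], kernel_size=2, stride=2)
            for i in reversed(range(len(channels) - 1))
        )
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * channels[i], channels[i], kernel_size=3, padding=1)
            for i in reversed(range(len(channels) - 1))
        )

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.patch_embed(x)
        skips = []
        for i, stage in enumerate(self.enc_stages):
            x = stage(x)
            if i < len(self.downs):           # keep features for skip connections
                skips.append(x)
                x = self.downs[i](x)
        for up, fuse, skip in zip(self.ups, self.fuse, reversed(skips)):
            x = fuse(torch.cat([up(x), skip], dim=1))
        return x                              # decoded features for the depth head
```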
Visual State Space (VSS) Module
The core building block of DepthMamba is the VSS module, derived from VMamba. Unlike CNNs, which process local neighborhoods, or Transformers, which compute pairwise attention, VSS employs 2D selective scanning (SS2D) to efficiently capture global dependencies.
The VSS module consists of two branches:
- Branch 1 processes features through normalization, linear layers, depthwise convolution, and SiLU activation before applying SS2D.
- Branch 2 applies a linear transformation followed by SiLU activation.
The outputs of both branches are element-wise multiplied and combined with the original input via a residual connection. This design ensures that the model retains both local and global information while minimizing computational overhead.
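A hedged PyTorch sketch of this two-branch design follows. The expansion ratio and projection layers are assumptions, and the SS2D operator is injected as a placeholder module since its internals are described in the next subsection.

```python
import torch
import torch.nn as nn

class VSSBlockSketch(nn.Module):
    """Two-branch VSS block: branch 1 runs norm -> linear -> depthwise conv
    -> SiLU -> SS2D; branch 2 runs linear -> SiLU. The branch outputs are
    multiplied element-wise and added back to the input via a residual."""

    def __init__(self, dim, ss2d: nn.Module, expand=2):
        super().__init__()
        hidden = dim * expand
        self.norm = nn.LayerNorm(dim)
        self.in_proj1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.ss2d = ss2d                          # placeholder for the SS2D operator
        self.in_proj2 = nn.Linear(dim, hidden)
        self.act = nn.SiLU()
        self.out_proj = nn.Linear(hidden, dim)    # back to dim for the residual add

    def forward(self, x):                         # x: (B, H, W, C), channels-last
        residual = x
        x = self.norm(x)
        # Branch 1: linear -> depthwise conv -> SiLU -> SS2D
        b1 = self.in_proj1(x).permute(0, 3, 1, 2)         # (B, hidden, H, W)
        b1 = self.ss2d(self.act(self.dwconv(b1)))         # SS2D over (B, C, H, W)
        b1 = b1.permute(0, 2, 3, 1)                       # back to channels-last
        # Branch 2: linear -> SiLU, acting as a gate
        b2 = self.act(self.in_proj2(x))
        # Element-wise multiplication of the branches, then residual connection
        return residual + self.out_proj(b1 * b2)
```

For a quick shape check, `VSSBlockSketch(96, nn.Identity())(torch.randn(1, 32, 32, 96))` returns a tensor of the same (B, H, W, C) shape.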
2D Selective Scanning (SS2D)
SS2D is a key innovation that enables efficient long-range modeling. It operates in three steps:
- Cross-Scanning: The input feature map is scanned along four different paths to capture spatial relationships.
- Parallel Processing: The flattened sequences are processed by S6 modules, which selectively retain relevant information.
- Cross-Merging: The outputs from different scanning directions are merged to reconstruct the original spatial dimensions.
This mechanism allows DepthMamba to efficiently aggregate global context without the computational burden of self-attention.
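The cross-scan and cross-merge plumbing can be sketched as follows; the S6 selective-scan core is represented by an arbitrary stand-in sequence module, since reproducing the full selective state space recurrence is beyond the scope of this summary.

```python
import torch
import torch.nn as nn

def cross_scan(x):
    """Flatten a feature map (B, C, H, W) along four scan paths:
    row-major, column-major, and their reverses."""
    row = x.flatten(2)                                   # (B, C, H*W), row-major
    col = x.transpose(2, 3).flatten(2)                   # column-major
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)  # (B, 4, C, L)

def cross_merge(seqs, H, W):
    """Undo the four scan orders and sum the results back into (B, C, H, W)."""
    B, K, C, L = seqs.shape
    row, col, row_r, col_r = seqs.unbind(dim=1)
    out = (row + row_r.flip(-1)).view(B, C, H, W)        # row-major paths
    col_sum = col + col_r.flip(-1)                       # column-major paths
    return out + col_sum.view(B, C, W, H).transpose(2, 3)

class SS2DSketch(nn.Module):
    """Cross-scan -> per-direction sequence model (stand-in for S6) -> cross-merge."""

    def __init__(self, seq_model: nn.Module):
        super().__init__()
        self.seq_model = seq_model    # placeholder for the selective scan (S6)

    def forward(self, x):             # x: (B, C, H, W)
        B, C, H, W = x.shape
        seqs = cross_scan(x)                              # (B, 4, C, H*W)
        seqs = self.seq_model(seqs)                       # four paths processed in parallel
        return cross_merge(seqs, H, W)                    # (B, C, H, W)
```

With `seq_model = nn.Identity()`, the output equals 4·x (each scan path contributes the input once), which is a quick way to verify that `cross_merge` inverts the four scan orders.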
MLP-Bins Depth Prediction
Instead of relying on computationally expensive Transformer-based depth prediction modules (e.g., AdaBins), DepthMamba introduces a lightweight MLP-Bins module. This module discretizes the continuous depth range into adaptive bins and predicts depth values through a simple yet effective pipeline:
- Feature Aggregation: The decoder’s output is flattened and averaged to produce a compact representation.
- Bin Prediction: A three-layer MLP predicts the probability distribution over the depth bins.
- Depth Calculation: The final depth is computed as a weighted sum of bin centers and their corresponding probabilities.
This approach ensures smooth and accurate depth maps while reducing model complexity.
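A minimal sketch of such a head follows, under explicit assumptions: the bin count, hidden width, and depth range are illustrative, and the per-pixel bin probabilities are taken from a 1×1 convolution over the decoder output in the style of AdaBins, which is an assumption not stated above.

```python
import torch
import torch.nn as nn

class MLPBinsSketch(nn.Module):
    """Sketch of an MLP-Bins style depth head: a pooled decoder feature is
    mapped by a three-layer MLP to adaptive bin widths; per-pixel probabilities
    over those bins weight the bin centers to give the final depth."""

    def __init__(self, in_channels=128, num_bins=256, hidden=256,
                 min_depth=1e-3, max_depth=10.0):
        super().__init__()
        self.min_depth, self.max_depth = min_depth, max_depth
        # Three-layer MLP on the globally averaged decoder feature -> bin widths.
        self.bin_mlp = nn.Sequential(
            nn.Linear(in_channels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bins),
        )
        # Assumption: per-pixel bin probabilities from a 1x1 conv (AdaBins-style).
        self.prob_head = nn.Conv2d(in_channels, num_bins, kernel_size=1)

    def forward(self, feat):                      # feat: (B, C, H, W) decoder output
        B, C, H, W = feat.shape
        # 1) Feature aggregation: flatten spatially and average.
        pooled = feat.flatten(2).mean(dim=2)      # (B, C)
        # 2) Bin prediction: normalized widths -> cumulative edges -> centers.
        widths = torch.softmax(self.bin_mlp(pooled), dim=1)          # (B, K)
        widths = widths * (self.max_depth - self.min_depth)
        edges = self.min_depth + torch.cumsum(widths, dim=1)
        centers = edges - 0.5 * widths                               # (B, K)
        # 3) Depth calculation: probability-weighted sum of bin centers.
        probs = torch.softmax(self.prob_head(feat), dim=1)           # (B, K, H, W)
        depth = (probs * centers.view(B, -1, 1, 1)).sum(dim=1, keepdim=True)
        return depth                              # (B, 1, H, W)
```

Applied to a (B, C, h, w) decoder feature map, this returns a (B, 1, h, w) depth map, which would typically be upsampled to the input resolution.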
Experimental Results
Datasets and Evaluation Metrics
DepthMamba is evaluated on two standard benchmarks:
- NYU Depth V2 (Indoor): Contains 24,231 training and 654 test images captured with Kinect sensors.
- KITTI (Outdoor): Comprises 23,158 training and 652 test images from driving scenarios.
Performance is measured using:
- RMSE (Root Mean Squared Error)
- Absolute Relative Error (Abs Rel)
- Threshold Accuracy (δ₁, δ₂, δ₃)
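For reference, these metrics are conventionally computed over the valid ground-truth pixels as in the short NumPy sketch below (the 1.25^i thresholds for δᵢ follow the standard protocol).

```python
import numpy as np

def depth_metrics(pred, gt):
    """RMSE, Abs Rel, and threshold accuracies over valid (gt > 0) pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]

    rmse = np.sqrt(np.mean((pred - gt) ** 2))        # Root Mean Squared Error
    abs_rel = np.mean(np.abs(pred - gt) / gt)        # Absolute Relative Error

    ratio = np.maximum(pred / gt, gt / pred)         # per-pixel max ratio
    d1 = np.mean(ratio < 1.25)                       # delta_1
    d2 = np.mean(ratio < 1.25 ** 2)                  # delta_2
    d3 = np.mean(ratio < 1.25 ** 3)                  # delta_3
    return {"rmse": rmse, "abs_rel": abs_rel, "d1": d1, "d2": d2, "d3": d3}
```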
Comparison with State-of-the-Art Methods
NYU Depth V2 Results
DepthMamba achieves an RMSE of 0.324, outperforming Depthformer (0.345) and AdaBins (0.364). Notably, it reduces parameters by 27.75% while improving δ₁ accuracy by 1.51%. Qualitative results show that DepthMamba preserves finer details, such as distant bookshelves and window frames, where Depthformer tends to lose structural information.
KITTI Results
On KITTI, DepthMamba achieves an RMSE of 2.225, surpassing Depthformer (2.285) and AdaBins (2.360). The model excels in recovering small objects like signboards and railings, demonstrating superior edge preservation.
Ablation Studies
- Model Scaling: Experiments with small, medium, and large variants confirm that deeper networks (27 VSS blocks) improve performance, particularly in complex indoor scenes.
- Initialization: Using VMamba-B pretrained weights further enhances accuracy, validating the importance of proper initialization.
- MLP-Bins Effectiveness: Replacing MLP-Bins with AdaBins shows comparable performance, but the lightweight MLP design reduces computational overhead.
Conclusion
DepthMamba represents a significant advancement in monocular depth estimation by leveraging state space models for efficient global context modeling. The integration of VSS modules and MLP-Bins ensures high accuracy while maintaining computational efficiency. Experimental results on NYU Depth V2 and KITTI demonstrate that DepthMamba outperforms Transformer-based approaches with fewer parameters, making it a practical solution for real-world applications.
Future work may explore extending this framework to other dense prediction tasks, such as semantic segmentation or optical flow estimation. The success of DepthMamba underscores the potential of SSMs in computer vision, paving the way for more efficient and scalable architectures.
For more details, refer to the full paper: https://doi.org/10.19734/j.issn.1001-3695.2024.05.0226