Introduction
Semantic segmentation of remote sensing images is a critical task in computer vision, with applications ranging from land cover classification to urban planning and environmental monitoring. Traditional methods based on convolutional neural networks (CNNs) extract local features well but struggle to capture long-range dependencies because of the limited receptive fields of convolutional operations. Transformer-based models, by contrast, excel at modeling global context but suffer from high computational complexity, making them inefficient for high-resolution remote sensing imagery.
To address these challenges, this paper introduces CVNet, a novel semantic segmentation network that combines the strengths of CNNs and Visual State Space (VSS) models. The VSS model, inspired by Mamba, offers linear computational complexity while effectively capturing long-range dependencies. CVNet leverages a dual-branch encoder architecture, where one branch processes local features via CNN and the other captures global context using VSS. Additionally, a Co-Modulation Module (CMM) is designed to adaptively fuse features from both branches, enhancing semantic understanding. An auxiliary loss further refines the model by emphasizing critical regions during training.
Experimental results on the LoveDA and Vaihingen datasets demonstrate that CVNet outperforms existing state-of-the-art methods, achieving superior performance in both segmentation accuracy and computational efficiency.
Background and Motivation
Challenges in Remote Sensing Semantic Segmentation
Remote sensing images present unique challenges due to their high resolution, complex spatial distributions, and diverse land cover categories. Traditional CNN-based approaches, such as FCN and DeepLab, rely on local receptive fields, limiting their ability to model long-range dependencies. While multi-scale context fusion techniques (e.g., ASPP in DeepLab) help mitigate this issue, they still fall short in fully capturing global relationships.
Transformer-based models, such as UNetFormer and CMTFNet, have been introduced to address this limitation by leveraging self-attention mechanisms. However, self-attention scales quadratically with the number of tokens, which makes Transformers expensive to run, particularly on large-scale remote sensing images.
The Emergence of Visual State Space Models
The Visual State Space (VSS) model, derived from Mamba, offers a promising alternative. Unlike Transformers, VSS models achieve linear computational complexity while maintaining strong global modeling capabilities. Recent works like VMamba and Samba have demonstrated the effectiveness of VSS in vision tasks, including remote sensing image segmentation. However, existing approaches still retain some Transformer-based components, leaving room for further optimization.
This paper proposes CVNet, a CNN-VSS hybrid architecture that removes the reliance on Transformer components while combining the benefits of local and global feature extraction.
Methodology
Network Architecture
CVNet consists of four main components:
- Dual-Branch Encoder
• CNN Branch: A pre-trained ResNet50 extracts multi-scale local features at 1/4, 1/8, 1/16, and 1/32 of the input resolution.
• VSS Branch: The input is processed by a series of VSS blocks built on 2D Selective Scan (SS2D), which scans the feature map in four directions to aggregate spatial context while maintaining linear complexity.
- Co-Modulation Module (CMM)
• Because the CNN and VSS branches produce features with different semantic representations, direct fusion may introduce redundancy. The CMM dynamically adjusts feature importance using channel and spatial attention mechanisms.
• The module first generates coarse attention maps via global average and max pooling, then refines them with a group-convolution-based refinement step. The fused features emphasize the most discriminative regions from both branches (a minimal sketch of this fusion follows the list).
- Decoder with VSS Blocks
• Unlike traditional Transformer-based decoders, CVNet's decoder employs VSS blocks enhanced with channel and spatial attention.
• Features from different scales are upsampled and weighted before being progressively refined through the decoder.
- Auxiliary Loss
• An auxiliary segmentation head provides additional supervision during training. It processes multi-scale features from the CMM outputs, keeping the model focused on critical regions.
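The paper does not publish reference code, so the following PyTorch sketch only illustrates how a dual-branch feature pair and a CMM-style fusion could be wired together. The class names, channel sizes, and especially the simplified VSS stand-in (a real VSS block would implement SS2D's four-direction selective scan) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSSBlockStub(nn.Module):
    """Placeholder for a VSS block. A faithful version would apply SS2D,
    i.e. a selective state space scan over four spatial directions; here a
    depthwise-separable convolution with a residual connection stands in."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.GELU(),
        )

    def forward(self, x):
        return x + self.mix(x)

class CoModulationSketch(nn.Module):
    """CMM-like fusion: pooled statistics of the concatenated CNN/VSS
    features produce a channel gate, followed by a 1x1 projection."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2 * channels, 1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_cnn, f_vss):
        f = torch.cat([f_cnn, f_vss], dim=1)
        pooled = F.adaptive_avg_pool2d(f, 1) + F.adaptive_max_pool2d(f, 1)
        return self.proj(f * self.gate(pooled))

# Fuse one encoder scale (e.g. 1/8 resolution with 128 channels, both assumed).
f_cnn = torch.randn(2, 128, 64, 64)                       # local features (CNN branch)
f_vss = VSSBlockStub(128)(torch.randn(2, 128, 64, 64))    # global-context features (VSS branch)
fused = CoModulationSketch(128)(f_cnn, f_vss)              # -> (2, 128, 64, 64)
```

In the full network, one such fusion would be applied at each encoder scale, with the fused features feeding both the VSS-based decoder and the auxiliary head.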
Key Innovations
- Linear Complexity with VSS
• By replacing Transformer blocks with VSS, CVNet reduces computational overhead while maintaining strong global modeling.
- Adaptive Feature Fusion via CMM
• The CMM ensures seamless integration of local and global features, improving segmentation accuracy in complex scenes.
- Enhanced Training with Auxiliary Supervision
• The auxiliary loss helps the model learn more discriminative features, particularly for challenging categories such as agricultural and barren land (a sketch of how this loss can be combined with the main loss follows the list).
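The paper does not state the exact form or weight of the auxiliary term, so the weighting factor (0.4), the ignore label, and the upsampling step below are assumptions; this is only a minimal sketch of the common pattern of adding a weighted auxiliary cross-entropy term to the main segmentation loss.

```python
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss(ignore_index=255)  # ignore label is an assumption

def segmentation_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Main cross-entropy plus a weighted auxiliary term; both predictions
    are upsampled to the label resolution before the loss is computed."""
    size = target.shape[-2:]
    main_up = F.interpolate(main_logits, size=size, mode="bilinear", align_corners=False)
    aux_up = F.interpolate(aux_logits, size=size, mode="bilinear", align_corners=False)
    return criterion(main_up, target) + aux_weight * criterion(aux_up, target)
```

Only the main head is used at inference time, so the auxiliary supervision adds no test-time cost.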
Experimental Results
Datasets and Evaluation Metrics
CVNet is evaluated on two benchmark datasets:
- LoveDA: Contains urban and rural scenes with seven land cover categories.
- Vaihingen: Features high-resolution aerial images with five foreground classes.
Performance is measured using mF1 (mean F1-score) and mIoU (mean Intersection over Union).
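For reference, mIoU and mF1 follow the standard per-class definitions derived from a confusion matrix; the helper below is a generic utility (not from the paper) that computes both.

```python
import numpy as np

def miou_mf1(confusion: np.ndarray) -> tuple[float, float]:
    """Compute (mIoU, mF1) from a C x C confusion matrix whose entry
    (i, j) counts pixels of true class i predicted as class j."""
    confusion = confusion.astype(np.float64)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp          # predicted as class c but actually another class
    fn = confusion.sum(axis=1) - tp          # actually class c but predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    return float(iou.mean()), float(f1.mean())
```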
Comparative Analysis
- LoveDA Results
• CVNet achieves 69.61% mF1 and 53.95% mIoU, outperforming RS3Mamba by 1.53% and 1.85%, respectively.
• Notably, CVNet excels on challenging categories such as agricultural land (57.31% IoU), demonstrating its ability to capture fine-grained distinctions.
- Vaihingen Results
• CVNet attains 90.53% mF1 and 83.13% mIoU, surpassing RS3Mamba by 0.98% and 1.6%.
• The model performs strongly on small objects (e.g., cars) and irregular structures, owing to the VSS-based decoder.
Ablation Study
Ablation experiments confirm the contributions of each component:
- VSS in Decoder: Improves mIoU by 0.44% over the baseline.
- CMM Module: Boosts mIoU by 1.88%, highlighting its role in feature fusion.
- Auxiliary Loss: Further increases mIoU by 2.16%, validating its effectiveness in training.
Computational Efficiency
Despite its dual-branch design, CVNet maintains a reasonable computational footprint:
- FLOPs: 14.53G (lower than RS3Mamba’s 27.02G).
- Parameters: 65.4M (roughly half of RS3Mamba’s 129.56M).
This efficiency makes CVNet suitable for real-world applications where computational resources are limited.
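Parameter counts like these can be reproduced with a few lines of PyTorch; FLOPs typically require an external profiler and are omitted here. The helper below is a generic utility, not part of the paper.

```python
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Total trainable parameters, reported in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```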
Conclusion
CVNet represents a significant advancement in remote sensing image segmentation by effectively combining CNNs and VSS models. Its dual-branch encoder captures both local and global features, while the CMM adaptively fuses the two feature streams. The auxiliary loss further enhances model performance by emphasizing critical regions.
Experimental results demonstrate that CVNet outperforms existing methods on LoveDA and Vaihingen datasets, achieving superior segmentation accuracy with lower computational costs. Future work will explore further optimizations of VSS for other remote sensing tasks, such as object detection and change detection.
DOI: 10.19734/j.issn.1001-3695.2024.07.0269