Joint Spatial-Temporal Differential Attention and Hierarchical Detail Enhancement for Remote Sensing Image Change Detection
Introduction
Remote sensing image change detection is a critical process that involves identifying and analyzing changes in the same geographical area over different time periods. This technique has widespread applications in urban planning, disaster monitoring, land use analysis, and environmental management. With the rapid advancement of optical sensors, high-resolution remote sensing images have become increasingly accessible, leading to significant progress in change detection methodologies. Traditional approaches relied heavily on pixel-level comparisons and handcrafted features, which often struggled with semantic understanding and robustness against noise. In contrast, deep learning-based methods, particularly those leveraging convolutional neural networks (CNNs), have demonstrated superior performance by automatically learning discriminative features from large-scale datasets.
Among deep learning architectures, U-Net has emerged as a popular choice for change detection due to its encoder-decoder structure and skip connections, which help preserve spatial details during feature extraction and reconstruction. However, existing U-Net-based change detection methods still face several challenges: (1) insufficient modeling of spatio-temporal differences between bi-temporal images, leading to false alarms and missed detections; (2) inadequate interaction between hierarchical features during decoding, resulting in loss of fine-grained details; and (3) poor handling of multi-scale objects and boundary refinement, causing jagged edges in detection results.
To address these limitations, this paper proposes SHNet, a novel change detection network that integrates Spatial-Temporal Differential Attention (STDA), Hierarchical Detail Enhancement (HDE), and Multi-Scale Boundary Refinement (MSBR). The proposed method enhances feature representation by emphasizing change regions while suppressing irrelevant variations, improves feature fusion across different levels, and refines boundary details for more accurate detection. Extensive experiments on four benchmark datasets demonstrate that SHNet outperforms existing state-of-the-art methods in both visual quality and quantitative evaluation metrics.
Methodology
Network Architecture
SHNet adopts a U-shaped encoder-decoder structure, which is widely used in segmentation and change detection tasks. The encoder consists of twin branches with shared weights to extract features from bi-temporal images independently. Each encoder layer employs two 3×3 convolutional blocks with residual connections, followed by a max-pooling operation for downsampling. This design ensures efficient feature extraction while mitigating overfitting.
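To make the encoder design concrete, the following is a minimal PyTorch sketch of one such stage. The channel widths, normalization layers, and the 1×1 shortcut projection are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One encoder stage: two 3x3 conv blocks with a residual
    connection, followed by max-pooling for downsampling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the residual addition matches channel counts
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.act(self.body(x) + self.shortcut(x))
        return feat, self.pool(feat)  # skip feature + downsampled feature

# Siamese use: the same stage (shared weights) processes both temporal images.
stage = EncoderStage(3, 64)
t1, t2 = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
(skip1, down1), (skip2, down2) = stage(t1), stage(t2)
```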
The decoder progressively reconstructs the change map by upsampling and combining features from different levels. Unlike traditional U-Net, which directly concatenates encoder and decoder features via skip connections, SHNet introduces the HDE module to facilitate richer interactions between hierarchical features. Additionally, the STDA module strengthens the decoding path by leveraging spatio-temporal differences between the two input images. Finally, the MSBR module refines the boundaries of detected changes using lightweight multi-scale feature extraction.
Spatial-Temporal Differential Attention (STDA)
A major challenge in change detection is distinguishing real changes from pseudo-changes caused by illumination variations, seasonal differences, or misregistration. The STDA module addresses this by computing two complementary attention maps: Euclidean distance-based attention and difference-based attention.
The Euclidean distance attention measures pixel-wise dissimilarity between bi-temporal features: regions with larger distances indicate greater dissimilarity and suggest potential changes, while regions with smaller distances are likely unchanged. The difference-based attention takes the absolute feature discrepancy and aggregates it with average and max pooling along the channel dimension. The two pooled maps are concatenated and processed through a 7×7 convolution to generate a difference attention map.
The two attention maps are adaptively fused using a learnable weight parameter, which balances their contributions based on their importance. The resulting STDA map highlights change regions while suppressing irrelevant variations, thereby improving the model’s discriminative capability.
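A hedged PyTorch sketch of how such a module could be assembled is shown below. It assumes CBAM-style pooling along the channel dimension for the difference branch and a single learnable scalar for the adaptive fusion; the paper's exact layer configuration may differ.

```python
import torch
import torch.nn as nn

class STDA(nn.Module):
    """Sketch of Spatial-Temporal Differential Attention."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        # Learnable scalar balancing the two attention maps (assumption).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, f1, f2):
        # Euclidean distance attention: per-pixel L2 distance across channels.
        dist = torch.sqrt(torch.sum((f1 - f2) ** 2, dim=1, keepdim=True) + 1e-6)
        dist_att = torch.sigmoid(dist)

        # Difference attention: pool |f1 - f2| along channels, fuse via 7x7 conv.
        diff = torch.abs(f1 - f2)
        avg = torch.mean(diff, dim=1, keepdim=True)
        mx, _ = torch.max(diff, dim=1, keepdim=True)
        diff_att = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

        # Adaptive fusion of the two maps with a learnable weight.
        return self.alpha * dist_att + (1 - self.alpha) * diff_att

# Typical usage: modulate fused bi-temporal features with the map,
# e.g. refined = STDA()(f1, f2) * (f1 + f2).
```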
Hierarchical Detail Enhancement (HDE)
In conventional U-Net architectures, skip connections directly pass encoder features to the decoder, often leading to suboptimal feature fusion due to semantic gaps between different levels. The HDE module mitigates this issue by enabling bidirectional information flow between adjacent hierarchical features.
For a given high-level feature, the module first applies transposed convolution to align its spatial dimensions with the corresponding low-level feature. The two features are then combined through element-wise addition. To enhance meaningful interactions, hybrid spatial-channel attention is employed. Spatial attention is derived by concatenating channel-wise average- and max-pooled features, followed by a 7×7 convolution. Channel attention is computed using two 1×1 convolutions applied to globally averaged features.
The spatial and channel attention maps are fused and normalized using a sigmoid function to produce a detail enhancement weight map. This map is used to adaptively blend the high-level and low-level features, ensuring that semantic information from deeper layers complements the fine-grained details from shallower layers. The output is further refined using a 1×1 convolution to reduce channel dimensions.
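The sketch below illustrates one plausible PyTorch realization of this flow. The blending rule, the channel-reduction ratio, and the additive fusion of the two attention maps before the sigmoid are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HDE(nn.Module):
    """Sketch of Hierarchical Detail Enhancement between two adjacent levels."""
    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        # Transposed conv aligns the high-level map with the low-level one.
        self.up = nn.ConvTranspose2d(high_ch, low_ch, kernel_size=2, stride=2)
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.channel = nn.Sequential(  # two 1x1 convs on globally averaged features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(low_ch, low_ch // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(low_ch // 4, low_ch, 1),
        )
        self.reduce = nn.Conv2d(low_ch, low_ch, 1)  # final 1x1 refinement

    def forward(self, high, low):
        high = self.up(high)                        # spatial alignment
        fused = high + low                          # element-wise combination
        avg = torch.mean(fused, dim=1, keepdim=True)
        mx, _ = torch.max(fused, dim=1, keepdim=True)
        sa = self.spatial(torch.cat([avg, mx], dim=1))  # spatial attention
        ca = self.channel(fused)                        # channel attention
        w = torch.sigmoid(sa + ca)                  # detail-enhancement weights
        out = w * low + (1 - w) * high              # adaptive blend of levels
        return self.reduce(out)
```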
Multi-Scale Boundary Refinement (MSBR)
Change detection often involves objects of varying shapes and sizes, requiring multi-scale feature extraction. Additionally, repeated downsampling operations in the encoder can lead to blurred or jagged boundaries in the predicted change maps. The MSBR module tackles these challenges using a lightweight design inspired by Res2Net.
The input features are divided into four equal partitions. The first partition remains unchanged to preserve original information. The subsequent partitions are processed using atrous strip convolutions (ASCs), which consist of sequential 1×3 and 3×1 dilated convolutions. These ASCs capture long-range dependencies while maintaining computational efficiency. Features from preceding partitions are progressively integrated to enhance multi-scale representation.
The concatenated features are compressed using a 1×1 convolution, followed by a squeeze-and-excitation (SE) block. The SE block recalibrates channel-wise feature responses by modeling interdependencies between channels. This ensures that the most discriminative features are emphasized, further improving detection accuracy.
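Below is a minimal sketch of this Res2Net-style design. The dilation rates and the SE reduction ratio are not specified in the summary and are chosen here for illustration.

```python
import torch
import torch.nn as nn

class ASC(nn.Module):
    """Atrous strip convolution: sequential 1x3 and 3x1 dilated convolutions."""
    def __init__(self, ch: int, dilation: int = 1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, (1, 3), padding=(0, dilation), dilation=dilation),
            nn.Conv2d(ch, ch, (3, 1), padding=(dilation, 0), dilation=dilation),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class MSBR(nn.Module):
    """Sketch of Multi-Scale Boundary Refinement over four partitions."""
    def __init__(self, ch: int):
        super().__init__()
        assert ch % 4 == 0
        part = ch // 4
        # One ASC per processed partition; growing dilations give
        # progressively larger receptive fields (assumed rates).
        self.ascs = nn.ModuleList([ASC(part, d) for d in (1, 2, 3)])
        self.compress = nn.Conv2d(ch, ch, 1)
        self.se = nn.Sequential(  # squeeze-and-excitation recalibration
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)
        y1 = x1                       # first partition passes through unchanged
        y2 = self.ascs[0](x2)         # later partitions get ASCs, each fed
        y3 = self.ascs[1](x3 + y2)    # the preceding partition's output
        y4 = self.ascs[2](x4 + y3)    # (Res2Net-style progressive fusion)
        out = self.compress(torch.cat([y1, y2, y3, y4], dim=1))
        return out * self.se(out)     # channel-wise recalibration
```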
Experiments and Results
Datasets and Implementation
SHNet is evaluated on four publicly available datasets: WHU, Google, LEVIR, and GVLM. The WHU dataset focuses on building changes, while the Google and LEVIR datasets cover urban expansion scenarios. The GVLM dataset specializes in landslide detection. All images are cropped into 256×256 patches and augmented using rotations, flips, and scaling to prevent overfitting.
The model is implemented using PyTorch and trained for 200 epochs with a batch size of 4. Stochastic gradient descent (SGD) with momentum is used for optimization, and the learning rate is decayed linearly. Binary cross-entropy loss serves as the primary training objective.
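As a sketch, this recipe translates into a short PyTorch loop like the one below. The base learning rate, momentum, and the toy stand-in model and data are assumptions, since the summary does not specify them.

```python
import torch
import torch.nn as nn

# Stand-in for SHNet: a single conv over the concatenated bi-temporal pair.
model = nn.Conv2d(6, 1, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.01, total_iters=200)
criterion = nn.BCEWithLogitsLoss()  # numerically stable BCE on raw logits

for epoch in range(200):
    # Toy batch of 4: bi-temporal 256x256 patches plus binary change masks.
    pair = torch.randn(4, 6, 256, 256)
    label = torch.randint(0, 2, (4, 1, 256, 256)).float()
    optimizer.zero_grad()
    loss = criterion(model(pair), label)
    loss.backward()
    optimizer.step()
    scheduler.step()  # linear learning-rate decay, once per epoch
```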
Comparative Analysis
SHNet is compared against eight state-of-the-art methods, including FC-EF, FC-Conc, IFN, SNUNet, BIT, MSCANet, LightCDNet, and HANet. Evaluation metrics include precision, recall, F1-score, and intersection-over-union (IoU).
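For reference, these metrics can be computed from a binary change map as follows. This is the standard formulation with the change class treated as positive; it may differ in minor details from the paper's evaluation code.

```python
import torch

def change_metrics(pred: torch.Tensor, label: torch.Tensor, thresh: float = 0.5):
    """Precision, recall, F1, and IoU for binary change maps."""
    p = (pred > thresh).float()
    tp = (p * label).sum()          # changed pixels correctly detected
    fp = (p * (1 - label)).sum()    # false alarms
    fn = ((1 - p) * label).sum()    # missed changes
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)
    return precision, recall, f1, iou
```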
Qualitative results show that SHNet produces the cleanest change maps with minimal false positives and negatives. For instance, on the WHU dataset, competing methods either miss small buildings or generate spurious detections, whereas SHNet accurately identifies changes with well-defined boundaries. Similar trends are observed on the Google and LEVIR datasets, where SHNet outperforms in detecting scattered and irregularly shaped buildings. On the GVLM dataset, SHNet provides the most precise landslide boundaries despite the complex terrain.
Quantitatively, SHNet achieves the highest F1 and IoU scores across all datasets. On WHU, it attains an F1 of 91.80% and IoU of 84.84%, surpassing the second-best method by 3.9% and 4.47%, respectively. Similar improvements are seen on Google (F1: 86.13%, IoU: 75.63%) and LEVIR (F1: 91.06%, IoU: 83.59%). These results underscore SHNet’s robustness across diverse scenarios.
Ablation Study
Ablation experiments confirm the contributions of each module. Removing STDA leads to the most significant performance drop (F1: -0.76%, IoU: -1.28%), highlighting its role in suppressing pseudo-changes. Disabling HDE or MSBR also degrades results, though to a lesser extent, validating their importance in feature fusion and boundary refinement.
Complexity Analysis
Despite its superior accuracy, SHNet maintains reasonable computational complexity. With 11.99 million parameters and 27.1 GFLOPs, it is lighter than IFN (35.99M, 82.35G) and SNUNet (12.03M, 54.88G) while being more accurate. Although LightCDNet (1.18M, 2.81G) and HANet (3.03M, 17.14G) are more efficient, their performance lags significantly behind SHNet.
Conclusion
This paper presents SHNet, an advanced change detection network that effectively addresses key challenges in high-resolution remote sensing imagery. By integrating spatial-temporal differential attention, hierarchical detail enhancement, and multi-scale boundary refinement, SHNet achieves state-of-the-art performance across multiple datasets. Future work may explore incorporating twin-branch features into the decoding process for further improvements.
For more details, refer to the original paper: https://doi.org/10.19734/j.issn.1001-3695.2024.05.0218