Tumor Segmentation Based on Multi-Scale Visual Information and Non-Local Target Mining
Introduction
Cancer remains one of the most significant global health challenges, with millions of new cases and deaths reported annually. Accurate tumor segmentation in medical imaging, particularly in computed tomography (CT) scans, is crucial for diagnosis, treatment planning, and biopsy guidance. Traditionally, this task relies on manual annotation by radiologists, which is time-consuming and subject to inter-observer variability. While computer-aided diagnosis (CAD) systems have shown promise in automating tumor segmentation, existing methods primarily focus on visually distinct and large-scale lesions, such as those in the lungs, liver, or brain. However, segmenting small, visually inconspicuous tumors remains a formidable challenge due to their low contrast with surrounding tissues, irregular shapes, and high positional variability.
To address these limitations, this paper introduces a novel tumor segmentation framework inspired by clinical diagnostic workflows. The proposed method leverages multi-scale feature extraction and non-local target mining to enhance the detection and segmentation of non-salient small tumors. By mimicking the diagnostic process of radiologists—initial screening, localization, refinement, and segmentation—the framework achieves superior performance compared to existing state-of-the-art approaches.
Challenges in Non-Salient Small Tumor Segmentation
Small tumors present unique challenges that complicate automated segmentation:
- Size and Position Variability – Small tumors occupy minimal regions in CT scans, making them susceptible to being overlooked by deep learning models. Additionally, their irregular shapes and unpredictable locations further hinder accurate detection.
- Low Visual Saliency – Many tumors exhibit intensity and texture similarities to surrounding tissues, leading to ambiguous boundaries and high false-positive or false-negative rates.
- Background Interference – Complex anatomical structures and noise in CT scans introduce additional confounding factors, reducing segmentation precision.
Existing methods attempt to mitigate these issues through multi-scale feature extraction, attention mechanisms, and specialized loss functions. However, most approaches are tailored to specific tumor types and struggle with generalization. Moreover, they often fail to adequately suppress background noise while preserving critical tumor details.
Proposed Framework
The proposed framework, named CDI-NSTSEG (Clinical-inspired Non-Salient Tumor Segmentation), consists of three key components:
- Triple Feature Extraction Network – Utilizes a U-Net backbone to extract multi-scale features at 0.5×, 1.0×, and 1.5× resolutions. The 0.5× scale captures global context, while the 1.5× scale enhances local detail perception.
- Scale Fusion Module (SFM) – Hierarchically integrates multi-scale features using an attention-based fusion mechanism. This ensures comprehensive tumor characterization by selectively combining discriminative features from different scales.
- Layer-wise Target Mining Network – Comprises a Global Localization Module (GLM) and a Layered Focusing Module (LFM). The GLM identifies tumor regions via non-local attention, while the LFM iteratively refines segmentation by suppressing false positives and recovering missed tumor areas.
Triple Feature Extraction Network
The backbone employs a U-Net architecture due to its effectiveness in medical image segmentation with limited training data. Three parallel branches process input images at varying scales (0.5×, 1.0×, 1.5×) to capture diverse tumor characteristics. Each branch includes a Channel Compression Module (CHM) to reduce computational overhead while preserving essential features.
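The triple-scale input preparation can be illustrated with a minimal sketch. This is not the paper's implementation: the U-Net branches and the Channel Compression Module are omitted, and nearest-neighbor resizing stands in for whatever resampling the authors use.

```python
import numpy as np

def resize_nn(img: np.ndarray, scale: float) -> np.ndarray:
    """Nearest-neighbor resize of a 2-D image by a scale factor."""
    h, w = img.shape
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    return img[np.ix_(rows, cols)]

def triple_scale_inputs(ct_slice: np.ndarray) -> dict:
    """Produce the 0.5x, 1.0x, and 1.5x views fed to the three branches."""
    return {s: resize_nn(ct_slice, s) for s in (0.5, 1.0, 1.5)}

views = triple_scale_inputs(np.zeros((128, 128)))
# views[0.5] is 64x64 (global context), views[1.5] is 192x192 (local detail)
```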
Scale Fusion Module (SFM)
Unlike conventional pyramid-based methods that lose fine-grained details, SFM adaptively fuses multi-scale features using an attention mechanism. Features from 0.5× and 1.5× scales are resized to match the 1.0× resolution via mixed pooling (for downsampling) and bilinear interpolation (for upsampling). An attention generator computes scale-specific weights, enabling dynamic feature aggregation. This process enhances tumor localization by emphasizing relevant structures across scales.
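The attention-weighted aggregation step can be sketched as follows. The scale weights here come from a simple global-average-pooling descriptor passed through a softmax; the paper's attention generator is learned, so this is only a structural stand-in that assumes the per-scale features have already been resized to the 1.0× resolution.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = 0) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scales(feats: list) -> np.ndarray:
    """Attention-weighted fusion of per-scale feature maps.

    feats: list of (C, H, W) arrays, one per scale, already resized
    to the 1.0x resolution (mixed pooling / bilinear in the paper).
    """
    stacked = np.stack(feats)                # (S, C, H, W)
    scores = stacked.mean(axis=(1, 2, 3))    # crude per-scale descriptor
    weights = softmax(scores)                # scale-specific attention weights
    return np.tensordot(weights, stacked, axes=1)  # weighted sum -> (C, H, W)
```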
Global Localization Module (GLM)
The GLM employs non-local attention to model long-range dependencies in both channel and spatial dimensions. It consists of two sub-modules:
• Channel Attention – Computes interdependencies between feature channels to highlight tumor-specific patterns.
• Spatial Attention – Captures contextual relationships across spatial locations, improving tumor boundary delineation.
By combining these mechanisms, GLM generates an initial coarse segmentation map, which is further refined by subsequent modules.
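The two attention directions can be sketched with plain affinity matrices. This follows the common non-local pattern (a Gram-matrix affinity plus row-softmax); the GLM's exact projections and normalization are not specified here, so treat this as an illustrative skeleton rather than the paper's module.

```python
import numpy as np

def _row_softmax(a: np.ndarray) -> np.ndarray:
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def channel_attention(feat: np.ndarray) -> np.ndarray:
    """feat: (C, H, W). Reweight channels by their pairwise affinity."""
    c = feat.shape[0]
    flat = feat.reshape(c, -1)            # (C, HW)
    attn = _row_softmax(flat @ flat.T)    # (C, C) channel interdependencies
    return (attn @ flat).reshape(feat.shape)

def spatial_attention(feat: np.ndarray) -> np.ndarray:
    """Non-local spatial attention: every position attends to all others."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)         # (C, HW)
    attn = _row_softmax(flat.T @ flat)    # (HW, HW) position affinities
    return (flat @ attn.T).reshape(c, h, w)
```

Note the (HW, HW) spatial affinity is quadratic in the number of pixels, which is why non-local blocks are usually applied to downsampled feature maps rather than full-resolution CT slices.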
Layered Focusing Module (LFM)
The LFM addresses false positives and negatives by progressively refining segmentation results across network layers. It operates on foreground and background features separately:
- Foreground Exploration – Identifies missing tumor regions (false negatives) through contextual analysis.
- Background Suppression – Eliminates erroneously detected areas (false positives) by contrasting tumor features against background noise.
Each LFM layer incorporates a Context Exploration Unit (CE) with multi-branch convolutions to capture local and global context. The refined features are combined via element-wise operations to produce the final segmentation.
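One refinement step of the dual-path idea can be sketched on probability maps. The 0.5 recovery factor and 0.2 suppression threshold below are arbitrary illustrative constants, and the element-wise arithmetic stands in for the CE unit's multi-branch convolutions.

```python
import numpy as np

def refine_step(prev_mask: np.ndarray, feat_score: np.ndarray) -> np.ndarray:
    """One LFM-style refinement of a coarse segmentation map.

    prev_mask  : (H, W) tumor probabilities from the previous layer.
    feat_score : (H, W) per-pixel tumor evidence at the current layer.
    """
    fg = prev_mask * feat_score               # foreground path: confirmed tumor
    bg = (1.0 - prev_mask) * feat_score       # background path: candidate misses
    recovered = np.clip(fg + 0.5 * bg, 0, 1)  # recover false negatives
    return recovered * (feat_score > 0.2)     # suppress weak false positives
```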
Experimental Results
The framework was evaluated on two datasets:
- Small Intestinal Stromal Tumor Dataset (SISD) – Contains 372 CT images from 41 patients, augmented to 2,167 training samples.
- Pancreatic Tumor Dataset (PTD) – Comprises 2,537 annotated CT images from a public challenge.
Performance Metrics
Six standard metrics were used:
• Dice Coefficient (Dice) – Measures overlap between predicted and ground truth masks.
• Accuracy (ACC) – Proportion of correctly classified pixels.
• Specificity (SPE) – True negative rate.
• F1-Score (F1) – Harmonic mean of precision and recall.
• Structure Measure (SM) – Evaluates global and regional structural similarity.
• Enhanced Alignment Measure (EM) – Assesses pixel-level and image-level errors.
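The Dice coefficient, the headline metric in the comparisons below, is straightforward to compute on binary masks:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|P intersect G| / (|P| + |G|) on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

p = np.array([[1, 1], [0, 0]])
g = np.array([[1, 0], [0, 0]])
# dice(p, g) = 2*1 / (2 + 1), roughly 0.667
```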
Comparative Analysis
On SISD, the proposed method achieved a Dice score of 58.37%, outperforming 10 state-of-the-art methods, including U-Net (48.62%), PraNet (49.87%), and SINet-V2 (50.99%). Similarly, on PTD, it attained 57.64% Dice, surpassing U-Net (49.13%) and SINet-V2 (53.57%). Visual comparisons demonstrated superior boundary precision and reduced false positives compared to competing approaches.
Ablation Studies
- Module Contributions – Removing SFM, GLM, or LFM led to significant performance drops, confirming their individual importance.
- Scale Strategy – The 0.5×, 1.0×, 1.5× combination outperformed alternatives (e.g., single-scale or extreme scales), validating its effectiveness.
- Attention Mechanism – Non-local attention surpassed SE, ECA, and SAM in capturing tumor-relevant features.
- Interference Handling – Joint foreground-background refinement outperformed single-path approaches, highlighting the necessity of dual-path processing.
Conclusion
This paper presents a clinically inspired framework for segmenting non-salient small tumors in CT scans. By integrating multi-scale feature fusion with non-local target mining, the method achieves robust and precise segmentation, addressing key challenges such as low contrast and background interference. Extensive experiments on SISD and PTD demonstrate its superiority over existing techniques, offering a promising solution for clinical applications. Future work will explore extending this approach to other medical imaging modalities and tumor types.
doi.org/10.19734/j.issn.1001-3695.2024.01.0063