Introduction
Adolescent idiopathic scoliosis (AIS) is a three-dimensional spinal deformity occurring during adolescence, affecting approximately 0.5% to 5.2% of the global population. Accurate assessment of spinal structure in a standing posture is crucial for effective treatment. Traditional imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI) face limitations, including high radiation exposure, inability to capture standing postures, and high costs. In contrast, biplane radiographs (BR) offer a low-cost, low-radiation alternative that allows imaging in a standing position. However, BR provide only two-dimensional projections, making it challenging for clinicians to assess the three-dimensional spinal structure directly.
To address this challenge, researchers have explored various methods for reconstructing three-dimensional spinal models from BR. Early approaches relied on statistical shape models (SSM), which use prior knowledge from annotated datasets to infer three-dimensional structures. While effective, SSM-based methods require extensive manual annotations and large datasets to capture shape variations accurately. More recently, deep learning techniques have been employed to automate and improve the reconstruction process. However, existing deep learning methods still face challenges, including inefficient feature extraction from BR, semantic inconsistencies between different radiographic views, and high computational costs due to the large number of parameters in three-dimensional deconvolutional networks.
This study introduces a novel convolutional neural network called 2XR3DS-Net, designed to reconstruct three-dimensional spinal models from BR efficiently and accurately. The network incorporates three key innovations: a dual-channel feature enhancement module to improve spine-specific feature extraction, a feature translation fusion module to resolve semantic inconsistencies between different radiographic views, and a parameter-sharing three-dimensional deconvolution module to reduce computational overhead. Experimental results demonstrate that 2XR3DS-Net outperforms existing methods in reconstruction accuracy while significantly reducing training time.
Methodology
Network Architecture
2XR3DS-Net consists of three main modules: the Dual-Channel Feature Enhancement (DCFE) module, the Feature Translation Fusion (FTF) module, and the Separate Three-Dimensional Reconstruction (S3DR) module. The network processes BR inputs, extracts and enhances features, resolves semantic conflicts, and reconstructs a three-dimensional spinal model.
Dual-Channel Feature Enhancement Module
The DCFE module is designed to improve the extraction of spine-specific features from BR. Traditional convolutional networks often struggle to focus on the relevant anatomy because of interference from surrounding structures such as the ribs and soft-tissue organs. To address this, the DCFE module integrates a residual neural network (ResNet) with a squeeze-and-excitation (SE) attention mechanism. The ResNet architecture allows for efficient feature propagation through skip connections, while the SE mechanism dynamically adjusts channel-wise feature weights to emphasize spine-related information.
The SE module is embedded within the skip connections of the ResNet, enabling the network to prioritize informative features while suppressing irrelevant background noise. This design enhances the network’s ability to capture fine details of the spinal structure, leading to more accurate reconstructions. The DCFE module processes anteroposterior (AP) and lateral (LAT) radiographic views separately, generating two distinct feature vectors that encode spinal geometry from different perspectives.
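A minimal PyTorch sketch of the idea: an SE gate placed on the residual branch of a ResNet-style block, so channel weights are recalibrated before the skip addition. The channel count, reduction ratio, and layer layout here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global average pool
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # excitation: weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)                            # channel-wise rescaling

class SEResidualBlock(nn.Module):
    """Residual block with SE attention applied before the skip addition."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.se(self.body(x)))
```

In the DCFE module, two such encoder branches would run in parallel, one per radiographic view, each producing its own feature vector.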
Feature Translation Fusion Module
A major challenge in reconstructing three-dimensional models from BR is the semantic inconsistency between AP and LAT views. Directly concatenating feature vectors from these views can lead to conflicting interpretations of the same anatomical region, degrading reconstruction quality. The FTF module addresses this issue by harmonizing feature representations across different views.
The FTF module operates in two stages: feature channel fusion and feature space refinement. In the first stage, a 1×1 convolutional layer extracts preliminary features, followed by an adaptive average pooling layer to generate channel-wise weight sequences. These weights are further processed through adaptive convolutional layers to produce a refined feature representation. In the second stage, a 3×3 convolutional layer combined with a sigmoid activation function highlights spatially significant features. The final output is a unified three-dimensional feature vector that preserves anatomical consistency across different radiographic views.
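The two stages can be sketched as follows. This is a simplified reading of the description above: the exact number of adaptive layers and channel widths in the paper are not specified here, so they are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FTFModule(nn.Module):
    """Fuse AP and LAT feature maps in two stages:
    channel fusion, then spatial refinement."""
    def __init__(self, channels):
        super().__init__()
        # Stage 1: a 1x1 conv mixes the concatenated views, then pooled
        # channel statistics produce per-channel fusion weights.
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Stage 2: a 3x3 conv with a sigmoid highlights spatially
        # significant locations.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, ap_feat, lat_feat):
        fused = self.mix(torch.cat([ap_feat, lat_feat], dim=1))
        fused = fused * self.channel_gate(fused)   # channel-wise reweighting
        fused = fused * self.spatial_gate(fused)   # spatial reweighting
        return fused
```

The output keeps the single-view channel width, so the downstream reconstruction module sees one unified feature tensor rather than a raw concatenation of the two views.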
Separate Three-Dimensional Reconstruction Module
Reconstructing high-quality three-dimensional models typically requires stacking multiple deconvolutional layers, which increases computational complexity and training time. The S3DR module mitigates this issue by employing a parameter-sharing mechanism. Instead of using a single high-channel deconvolutional layer, the module splits it into two lower-channel layers that share parameters. This design reduces the total number of parameters while maintaining reconstruction quality.
The S3DR module consists of four groups of parameter-shared deconvolutional layers, progressively upsampling the feature maps to reconstruct the spinal model in voxel format. Voxel-based representation ensures uniform spatial coverage, facilitating precise volumetric analysis. The module’s efficiency allows for faster training and inference without compromising reconstruction accuracy.
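One way to realize the parameter-sharing split described above: apply a single half-width transposed convolution to both channel halves of the input and re-concatenate, instead of one full-width layer. The kernel size and stride below are typical 2x-upsampling choices, assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn

class SharedDeconv3D(nn.Module):
    """Parameter-sharing 3D deconvolution: one low-channel transposed
    convolution is reused on both halves of the input channels, in place
    of a single high-channel layer."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        assert in_channels % 2 == 0 and out_channels % 2 == 0
        # One shared layer at half the channel width (2x spatial upsampling).
        self.deconv = nn.ConvTranspose3d(
            in_channels // 2, out_channels // 2,
            kernel_size=4, stride=2, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)   # split channels into two halves
        out = torch.cat([self.deconv(a), self.deconv(b)], dim=1)  # same weights twice
        return self.act(out)
```

Because a transposed convolution's weight tensor scales with `in_channels * out_channels`, halving both and sharing the layer cuts its parameter count to roughly a quarter of the full-width equivalent, at the cost of blocking cross-talk between the two channel groups within that layer.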
Experimental Setup
Dataset Preparation
The study utilized a dataset of 33 spinal CT scans from scoliosis patients, approved for research use with patient and clinician consent. Since acquiring synchronized BR and CT scans is impractical, simulated BR were generated from the CT volumes by cone-beam projection, in the manner of digitally reconstructed radiographs. Each CT scan produced 180 pairs of BR, corresponding to different projection angles. The dataset was divided into training (30 cases) and testing (3 cases) sets, with each vertebral level evaluated separately.
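For intuition, a radiograph-like projection of a CT volume is a line integral of attenuation along the ray direction. The toy version below uses parallel rays at the canonical 0° pose (real cone-beam geometry uses divergent rays from a point source, and the 180 angle pairs come from rotating the geometry); it is only an illustration of the principle.

```python
import numpy as np

def drr_pair(ct_volume):
    """Parallel-beam sketch of a simulated biplane pair.

    ct_volume: array of attenuation values, axes ordered (z, y, x).
    Returns AP-like and LAT-like projections at the canonical angle.
    """
    ap = ct_volume.sum(axis=1)    # integrate along the anteroposterior axis
    lat = ct_volume.sum(axis=2)   # integrate along the lateral axis
    return ap, lat
```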
Evaluation Metrics
Three metrics were used to assess reconstruction quality:
- Hausdorff Distance (HD): Measures the maximum surface deviation between the reconstructed and ground-truth models. Lower values indicate better alignment.
- Average Surface Distance (ASD): Computes the mean distance between corresponding surface points, reflecting overall reconstruction accuracy.
- 3D Intersection over Union (3DIOU): Quantifies volumetric overlap between reconstructed and ground-truth models, with higher values indicating better reconstruction.
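The three metrics can be stated concretely as follows. The surface-distance routine is a brute-force sketch over point sets (practical pipelines extract surface voxels and use KD-trees); it is meant to pin down the definitions, not to match the paper's implementation.

```python
import numpy as np

def iou_3d(pred, gt):
    """Voxel-wise 3D intersection over union for boolean volumes."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def surface_distances(pred_pts, gt_pts):
    """Hausdorff distance (HD) and average surface distance (ASD)
    between two (N, 3) point sets, brute-force."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    d_pg = d.min(axis=1)                    # each pred point to nearest gt point
    d_gp = d.min(axis=0)                    # each gt point to nearest pred point
    hd = max(d_pg.max(), d_gp.max())        # worst-case surface deviation
    asd = (d_pg.mean() + d_gp.mean()) / 2   # symmetric mean surface deviation
    return hd, asd
```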
Implementation Details
The experiments were conducted on a system with an NVIDIA GeForce RTX 3080 GPU, using PyTorch 2.0.1. The network was trained for 70 epochs with a batch size of 1, an initial learning rate of 0.00005, and the Adam optimizer.
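The reported hyperparameters translate to a training setup along these lines. The model and loss below are stand-ins (the paper's loss function is not stated in this summary; binary voxel-occupancy loss is an assumption), and only a single step is shown.

```python
import torch
import torch.nn as nn

# Hyperparameters reported above.
EPOCHS, BATCH_SIZE, LR = 70, 1, 5e-5

model = nn.Conv3d(1, 1, kernel_size=3, padding=1)  # stand-in for 2XR3DS-Net
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = nn.BCEWithLogitsLoss()                 # assumed voxel-occupancy loss

for _ in range(1):                                 # one step shown; training runs EPOCHS epochs
    pred = model(torch.randn(BATCH_SIZE, 1, 8, 8, 8))
    target = torch.randint(0, 2, (BATCH_SIZE, 1, 8, 8, 8)).float()
    loss = criterion(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```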
Results and Discussion
Ablation Study
To validate the contributions of each module, five model variants were compared:
- Model 1 (Baseline): A basic network without DCFE, FTF, or S3DR modules.
- Model 2: Baseline with the DCFE module added.
- Model 3: Baseline with the FTF module added.
- Model 4: Baseline with both DCFE and FTF modules.
- Model 5 (2XR3DS-Net): Baseline with all three modules.
Results showed that Model 2 improved 3DIOU by 2% and reduced HD and ASD by 19% and 14%, respectively, demonstrating the DCFE module’s effectiveness in feature extraction. Model 3 also improved reconstruction quality with minimal parameter increase, confirming the FTF module’s role in resolving semantic conflicts. Model 4 showed stable convergence but no significant gains over Models 2 and 3.
Model 5 achieved the best performance, with a 3DIOU of 0.60, HD of 3.20 mm, and ASD of 0.55 mm. Notably, it reduced training time by 9% compared to Model 1, highlighting the efficiency of the S3DR module.
Comparative Analysis
2XR3DS-Net was compared against state-of-the-art methods, including Li et al.’s stacked deconvolutional approach and Chen et al.’s three-dimensional encoder-decoder network. The proposed method outperformed both in reconstruction accuracy while using fewer parameters and significantly less training time. For instance, compared to Li et al.’s method, 2XR3DS-Net reduced parameters by 27% and training time by 89%, with superior 3DIOU and lower surface errors.
Vertebral-Level Performance
Reconstruction quality varied across vertebral levels. Thoracic vertebrae (T3-T12) generally exhibited better results (average 3DIOU: 0.64, ASD: 0.42 mm, HD: 2.76 mm) than upper thoracic (T1-T2) and lumbar vertebrae (L1-L5) (average 3DIOU: 0.55, ASD: 0.70 mm, HD: 4.11 mm). This discrepancy was attributed to anatomical complexity, such as rib interference in T1-T2 and organ artifacts in lumbar regions.
Visualization
Qualitative results demonstrated high-fidelity reconstructions, particularly for challenging cases like T1, T2, and L5. The voxel-based outputs were further refined into smooth mesh models using Poisson surface reconstruction, showcasing detailed spinal structures consistent with ground-truth anatomy.
Conclusion
2XR3DS-Net presents an efficient and accurate solution for reconstructing three-dimensional spinal models from biplane radiographs. By integrating feature enhancement, semantic fusion, and parameter-sharing deconvolution, the network achieves superior performance while reducing computational costs. Future work will explore incorporating positional priors and transfer learning to further enhance reconstruction quality, particularly for anatomically complex vertebrae.
doi: 10.19734/j.issn.1001-3695.2024.07.0272