Global Optimization of Visual SLAM Based on Neural Radiance Fields

Introduction

Recent advancements in neural radiance fields (NeRF) have significantly impacted dense simultaneous localization and mapping (SLAM), enabling high-fidelity 3D scene reconstruction. However, a critical challenge persists: the accumulation of tracking errors during camera pose estimation and scene reconstruction. Traditional SLAM systems rely on discrete representations such as point clouds, voxels, or meshes, which often struggle to capture fine geometric details and maintain global consistency. In contrast, NeRF-based SLAM leverages implicit neural representations to model scenes as continuous functions, offering greater flexibility in surface extraction. Despite these advantages, existing NeRF-SLAM methods often lack robust global optimization mechanisms, leading to drift in camera trajectories and degraded reconstruction quality over time.

This paper introduces GN-SLAM, a novel dense visual SLAM framework that integrates global optimization strategies to minimize tracking errors and enhance scene reconstruction. The proposed method combines loop closure detection and global bundle adjustment (BA) to correct accumulated drift while continuously updating an implicit neural scene representation. By leveraging the full history of input frames, GN-SLAM ensures long-term consistency in both pose estimation and 3D reconstruction. Experimental results on synthetic and real-world datasets demonstrate significant improvements in tracking accuracy and reconstruction fidelity compared to state-of-the-art baselines.

Methodology

Loop Closure Detection

Loop closure detection is essential for correcting drift in long-term SLAM operations. GN-SLAM employs a two-stage approach to identify and validate loop closures. First, it constructs a keyframe graph where nodes represent keyframes and edges denote spatial relationships. Keyframes are selected based on motion thresholds derived from optical flow analysis. If the average flow between the current frame and the previous keyframe exceeds a predefined threshold, the current frame is designated as a new keyframe and added to the optimization buffer.
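The keyframe decision described above can be sketched as a simple mean-flow test. The threshold value and function name here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def is_new_keyframe(flow, threshold=2.4):
    """Decide whether the current frame becomes a keyframe.

    flow: (H, W, 2) array of per-pixel optical flow vectors between the
    current frame and the previous keyframe.
    threshold: mean-flow threshold in pixels (hypothetical value).
    """
    mean_flow = np.linalg.norm(flow, axis=-1).mean()  # average flow magnitude
    return mean_flow > threshold
```

Frames passing this test would then be appended to the optimization buffer.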

The keyframe graph is built by establishing connections between highly covisible keyframes. A covisibility matrix is computed to measure the geometric overlap between keyframes, and edges are filtered based on rigid flow consistency. To avoid redundancy, GN-SLAM suppresses redundant connections within a local temporal radius. Loop candidates are sampled in descending order of covisibility, and only those with sufficiently low flow residuals are accepted. This selective approach ensures efficient optimization while maintaining robustness against false loop closures.
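The candidate-selection logic above can be illustrated as follows. The function name, the temporal suppression radius, and the residual threshold are hypothetical stand-ins for the paper's actual parameters:

```python
import numpy as np

def select_loop_candidates(covis, flow_resid, cur, radius=10,
                           max_resid=1.5, k=3):
    """Pick loop-closure candidates for keyframe `cur` (sketch).

    covis: (N, N) covisibility scores between keyframes.
    flow_resid: (N,) rigid-flow residual of each keyframe w.r.t. `cur`.
    radius: temporal window within which candidates are suppressed.
    """
    order = np.argsort(-covis[cur])      # descending covisibility
    picked = []
    for j in order:
        if abs(j - cur) <= radius:       # suppress temporally close frames
            continue
        if flow_resid[j] > max_resid:    # reject geometrically inconsistent pairs
            continue
        picked.append(int(j))
        if len(picked) == k:
            break
    return picked
```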

Once loop closures are detected, a differentiable dense bundle adjustment (DBA) layer optimizes camera poses and depth estimates. The optimization minimizes reprojection errors while accounting for sensor noise and occlusions. By jointly refining poses and scene geometry, GN-SLAM achieves globally consistent reconstructions even in large-scale environments.
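The core quantity minimized by the DBA layer is the reprojection error. A minimal pinhole-camera residual, written without the differentiable machinery or the noise/occlusion weighting the paper adds, might look like:

```python
import numpy as np

def reprojection_residual(K, R, t, points3d, obs2d):
    """Per-point reprojection error used in the dense BA objective (sketch).

    K: (3, 3) camera intrinsics; R, t: world-to-camera rotation and
    translation; points3d: (N, 3) world points; obs2d: (N, 2) observed pixels.
    """
    cam = points3d @ R.T + t           # transform points into the camera frame
    proj = cam @ K.T                   # apply intrinsics
    pix = proj[:, :2] / proj[:, 2:3]   # perspective division
    return pix - obs2d                 # residual to be minimized over R, t
```

In the full system this residual would also be weighted by per-pixel confidences and backpropagated through poses and depths jointly.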

Instant Neural Mapping

GN-SLAM employs an instant neural mapping strategy to update the 3D scene representation in real time. Unlike traditional methods that store full keyframe images, this approach uses a compact multi-resolution hash encoding to represent scene geometry and appearance. Given a set of keyframes with estimated poses, depth, and RGB data, the system randomly samples pixels and generates points along viewing rays. Each 3D point is encoded using a trainable hash table, enabling efficient feature retrieval.
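The hash-encoding lookup can be sketched as below, following the common multi-resolution scheme. This version uses nearest-vertex lookup instead of the trilinear interpolation used in practice, and the resolution schedule and table sizes are assumptions:

```python
import numpy as np

def hash_encode(xyz, tables, base_res=16, growth=2.0):
    """Minimal multi-resolution hash encoding lookup (sketch).

    xyz: (N, 3) points normalized to [0, 1]^3; tables: list of (T, F)
    trainable feature tables, one per resolution level.
    """
    primes = np.array([1, 2654435761, 805459861], dtype=np.uint64)
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)          # grid resolution at this level
        grid = np.floor(xyz * res).astype(np.uint64)   # integer grid vertex
        h = (grid[:, 0] * primes[0]
             ^ grid[:, 1] * primes[1]
             ^ grid[:, 2] * primes[2])                 # spatial hash
        idx = h % np.uint64(table.shape[0])
        feats.append(table[idx])                       # feature lookup
    return np.concatenate(feats, axis=-1)              # (N, levels * F)
```

The concatenated features would then be fed to the geometry and appearance decoders.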

The scene is represented using two neural networks: one predicts signed distance functions (SDF) to model geometry, and the other decomposes appearance into diffuse and specular components. The diffuse component captures view-independent albedo, while the specular term accounts for view-dependent reflections. This decomposition improves reconstruction quality in scenes with complex lighting and reflective surfaces.
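The diffuse/specular decomposition can be illustrated with two linear heads standing in for the trained appearance network; only the specular head sees the view direction. The zero-centered specular term (so it can brighten or darken the albedo) is an illustrative assumption:

```python
import numpy as np

def appearance(feat, view_dir, W_d, W_s):
    """Decompose colour into diffuse + specular components (sketch).

    feat: (N, F) scene features; view_dir: (N, 3) unit ray directions.
    W_d, W_s: weights of hypothetical diffuse and specular heads.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    c_d = sigmoid(feat @ W_d)                                  # view-independent albedo
    c_s = sigmoid(np.concatenate([feat, view_dir], axis=-1)    # view-dependent term,
                  @ W_s) - 0.5                                 # centred at zero
    return np.clip(c_d + c_s, 0.0, 1.0)                        # final colour in [0, 1]
```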

Rendering is performed via volumetric integration, where color and depth are computed as weighted sums along each ray. The weights are derived from SDF predictions, ensuring smooth transitions between surfaces. To optimize the neural representation, GN-SLAM minimizes a composite loss function that includes terms for RGB reconstruction, depth consistency, SDF approximation, and specular regularization. This holistic optimization ensures accurate and detailed scene reconstructions.
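The SDF-weighted integration along a ray can be sketched as below. The bell-shaped weight peaking at the SDF zero crossing is a common choice in SDF-based SLAM renderers and is assumed here, as is the truncation distance:

```python
import numpy as np

def render_ray(sdf, colors, depths, trunc=0.05):
    """Volumetric integration along one ray from SDF predictions (sketch).

    sdf: (M,) predicted signed distances at the ray samples;
    colors: (M, 3) predicted colours; depths: (M,) sample depths;
    trunc: SDF truncation distance (hypothetical value).
    """
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    w = sig(sdf / trunc) * sig(-sdf / trunc)   # peaks where the SDF crosses zero
    w = w / (w.sum() + 1e-8)                   # normalise along the ray
    color = (w[:, None] * colors).sum(0)       # rendered RGB
    depth = (w * depths).sum()                 # rendered depth
    return color, depth
```

The rendered colour and depth would then enter the RGB and depth terms of the composite loss, alongside the SDF and specular regularizers.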

Global Bundle Adjustment

Most NeRF-based SLAM systems perform bundle adjustment on a small subset of keyframes, limiting their ability to correct long-term drift. GN-SLAM introduces a more scalable approach by maintaining a global set of pixel samples collected across all frames. Instead of storing entire keyframes, the system periodically extracts a fraction of pixels from incoming frames and adds them to a global pool. During optimization, random subsets of these pixels are used to refine both the scene representation and camera poses.
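The global sample pool can be sketched as follows; the class name, stored tuple layout, and keep fraction are illustrative assumptions:

```python
import numpy as np

class GlobalPixelPool:
    """Global pixel-sample pool for bundle adjustment (sketch).

    Stores a random fraction of pixels (frame id, pixel coords, colour,
    depth) from every incoming frame instead of whole keyframes.
    """
    def __init__(self, keep_frac=0.05, rng=None):
        self.keep_frac = keep_frac
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.samples = []  # list of (frame_id, u, v, rgb, depth) tuples

    def add_frame(self, frame_id, rgb, depth):
        """Subsample an incoming (H, W, 3) frame into the pool."""
        h, w, _ = rgb.shape
        n = max(1, int(h * w * self.keep_frac))
        idx = self.rng.choice(h * w, size=n, replace=False)
        for i in idx:
            u, v = int(i) % w, int(i) // w
            self.samples.append((frame_id, u, v, rgb[v, u], depth[v, u]))

    def minibatch(self, size):
        """Draw a random subset of stored samples for one BA step."""
        take = min(size, len(self.samples))
        idx = self.rng.choice(len(self.samples), size=take, replace=False)
        return [self.samples[int(i)] for i in idx]
```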

The global BA process alternates between updating the neural scene parameters and adjusting camera poses. By leveraging the full history of observations, GN-SLAM achieves more accurate pose estimates and reduces cumulative errors. This strategy is particularly effective in large environments where local BA alone is insufficient to maintain global consistency.
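The alternating scheme reduces to a simple loop; the `step` interfaces and stub classes below are hypothetical stand-ins for one gradient step on the rendering loss:

```python
class SceneStub:
    """Hypothetical stand-in: one gradient step on the rendering loss."""
    def __init__(self):
        self.steps = 0

    def step(self, rays, other):
        self.steps += 1  # a real implementation would update parameters here

class PoolStub:
    """Hypothetical pool exposing the global sample pool's minibatch interface."""
    def minibatch(self, n):
        return list(range(n))

def global_ba(pool, scene, poses, rounds=10, batch=4096):
    """Alternate scene and pose refinement over the global pool (sketch)."""
    for _ in range(rounds):
        rays = pool.minibatch(batch)  # random rays drawn across all frames
        scene.step(rays, poses)       # refine neural scene parameters
        poses.step(rays, scene)       # refine camera poses
    return scene, poses
```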

Experiments

Tracking Accuracy

GN-SLAM was evaluated on the Replica and TUM RGB-D datasets, which are widely used benchmarks for visual SLAM. On Replica, the method achieved an average absolute trajectory error (ATE) of 0.39 cm, outperforming NICE-SLAM by 80.0% and Vox-Fusion by 27.8%. Similar improvements were observed on TUM RGB-D, where GN-SLAM reduced ATE by 43.2% compared to NICE-SLAM. These results highlight the effectiveness of global optimization in minimizing drift.

Qualitative comparisons further demonstrate GN-SLAM’s superiority in reconstruction quality. Unlike baseline methods that often produce blurry surfaces or missing details, GN-SLAM generates sharp and complete reconstructions, even for small objects and fine structures. The inclusion of specular reflection modeling also enhances realism in scenes with complex lighting.

Ablation Studies

Ablation experiments validate the contributions of individual components. Disabling loop closure detection led to significant trajectory drift and distorted reconstructions, while removing global BA resulted in blurred surfaces and misaligned geometry. These findings underscore the importance of both modules in achieving robust and accurate SLAM.

Computational Efficiency

GN-SLAM processes frames at 3.63 FPS on average; while short of full frame-rate operation, this is sufficient for near-real-time use. While memory usage is higher than some baselines, the trade-off enables superior reconstruction quality. Future work will focus on optimizing memory efficiency for deployment on resource-constrained platforms.

Conclusion

GN-SLAM advances dense visual SLAM by integrating neural radiance fields with robust global optimization. The proposed loop closure detection and global BA strategies effectively mitigate tracking drift, while the instant neural mapping pipeline ensures high-fidelity 3D reconstruction. Experimental results demonstrate significant improvements in both accuracy and robustness, establishing GN-SLAM as a state-of-the-art solution for real-time scene understanding. Future directions include reducing memory footprint and extending the framework to dynamic environments.

DOI: 10.19734/j.issn.1001-3695.2024.06.0274
