NICE-SLAM: Neural Implicit Scalable Encoding for SLAM

CVPR 2022

Zihan Zhu1,2 *     Songyou Peng2,4 *     Viktor Larsson3     Weiwei Xu1     Hujun Bao1    
Zhaopeng Cui1 #     Martin R. Oswald2,5         Marc Pollefeys2,6        

* Equal Contribution

1State Key Lab of CAD&CG, Zhejiang University   2ETH Zurich   3Lund University  
4MPI for Intelligent Systems, Tübingen   5University of Amsterdam   6Microsoft

NICE-SLAM produces accurate dense geometry and camera tracking on large-scale scenes.

(The black / red lines are the ground truth / predicted camera trajectory)



TL;DR: We present NICE-SLAM, a dense RGB-D SLAM system that combines neural implicit decoders with hierarchical grid-based representations, which can be applied to large-scale scenes.

Neural implicit representations have recently shown encouraging results in various domains, including promising progress in simultaneous localization and mapping (SLAM). Nevertheless, existing methods produce over-smoothed scene reconstructions and have difficulty scaling up to large scenes. These limitations are mainly due to their simple fully-connected network architecture that does not incorporate local information in the observations. In this paper, we present NICE-SLAM, a dense SLAM system that incorporates multi-level local information by introducing a hierarchical scene representation. Optimizing this representation with pre-trained geometric priors enables detailed reconstruction on large indoor scenes. Compared to recent neural implicit SLAM systems, our approach is more scalable, efficient, and robust. Experiments on five challenging datasets demonstrate competitive results of NICE-SLAM in both mapping and tracking quality.
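To give a feel for the hierarchical scene representation, here is a minimal sketch of querying multi-level feature grids at a continuous 3D point via trilinear interpolation. This is an illustration only, not the paper's implementation: grid resolutions, the 32-channel feature size, and all function names are our assumptions.

```python
import numpy as np

def trilinear_interp(grid, pt):
    """Query a dense feature grid of shape (X, Y, Z, C) at a continuous
    point with normalized coordinates in [0, 1]^3."""
    X, Y, Z, C = grid.shape
    p = np.array(pt) * (np.array([X, Y, Z]) - 1)   # continuous voxel coords
    lo = np.floor(p).astype(int)
    hi = np.minimum(lo + 1, [X - 1, Y - 1, Z - 1])
    w = p - lo                                      # per-axis blend weights
    out = np.zeros(C)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                idx = (hi[0] if dx else lo[0],
                       hi[1] if dy else lo[1],
                       hi[2] if dz else lo[2])
                weight = ((w[0] if dx else 1 - w[0]) *
                          (w[1] if dy else 1 - w[1]) *
                          (w[2] if dz else 1 - w[2]))
                out += weight * grid[idx]
    return out

def query_hierarchy(grids, pt):
    """Concatenate interpolated features from all levels of the hierarchy;
    a decoder network would map this feature to occupancy/color."""
    return np.concatenate([trilinear_interp(g, pt) for g in grids])

# Illustrative coarse/mid/fine grids (resolutions and channels are made up).
coarse = np.random.rand(8, 8, 8, 32)
mid = np.random.rand(32, 32, 32, 32)
fine = np.random.rand(64, 64, 64, 32)
feat = query_hierarchy([coarse, mid, fine], (0.3, 0.7, 0.5))
print(feat.shape)  # (96,)
```

Because each query only touches the eight grid corners surrounding the point at each level, gradients during optimization flow only into locally relevant features, which is what makes the representation scalable.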


NICE-SLAM takes an RGB-D image stream as input and outputs both the camera pose and a learned scene representation in the form of a hierarchical feature grid. From right to left, our pipeline can be interpreted as a generative model that renders depth and color images from a given scene representation and camera pose. At test time, we estimate both the scene representation and the camera pose by solving the inverse problem, backpropagating the image and depth reconstruction losses through a differentiable renderer (left to right). Both entities are estimated within an alternating optimization: during mapping, backpropagation only updates the hierarchical scene representation; during tracking, it only updates the camera pose. For better readability, we merge the fine-scale geometry grid with the equally-sized color grid and show them as one grid with two attributes (red and orange).
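The depth-rendering step of this generative model can be sketched as follows: along a camera ray, per-sample occupancy probabilities are converted into ray-termination weights, and the rendered depth is their weighted sum. This is a simplified, hedged illustration of occupancy-based volume rendering; the sample values and function names below are ours.

```python
import numpy as np

def render_depth(occupancies, depths):
    """Ray-termination weights from per-sample occupancies along a ray:
    w_i = o_i * prod_{j < i} (1 - o_j), then depth = sum_i w_i * d_i."""
    o = np.asarray(occupancies, dtype=float)
    # Probability that the ray has not terminated before sample i.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - o[:-1]]))
    w = o * transmittance
    return np.sum(w * depths), w

# Eight samples along one ray (illustrative depths in meters); the ray
# hits a surface around the fifth sample, where occupancy jumps.
d = np.linspace(0.1, 4.0, 8)
o = np.array([0.0, 0.0, 0.1, 0.3, 0.9, 1.0, 1.0, 1.0])
depth, w = render_depth(o, d)
```

Because this rendering is differentiable, the depth and color reconstruction losses can be backpropagated either into the grid features (mapping) or into the camera pose (tracking), exactly as the alternating optimization above describes.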

Additional Results

(The black / red lines are the ground truth / predicted camera trajectory)


Replica Dataset


(our re-implementation of iMAP)





ScanNet Dataset

As can be seen, NICE-SLAM produces sharper and cleaner geometry. Moreover, unlike iMAP, which updates the entire scene representation globally, our system can update the map locally thanks to the grid-based hierarchical representation.
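As an illustration of local updates (not the paper's actual implementation), a frustum test like the following could restrict mapping gradients to voxels observed in the current frame; all names and camera parameters here are our assumptions.

```python
import numpy as np

def visible_voxel_mask(centers, w2c, K, width, height, near, far):
    """Mark which voxel centers (N, 3, world coords) fall inside the
    current camera frustum. During mapping, only the features of visible
    voxels would receive gradients, so the map is updated locally."""
    pts_h = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]      # world -> camera coordinates
    z = cam[:, 2]
    uv = (K @ cam.T).T                  # pinhole projection
    u = uv[:, 0] / np.maximum(z, 1e-8)
    v = uv[:, 1] / np.maximum(z, 1e-8)
    in_depth = (z > near) & (z < far)
    in_image = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return in_depth & in_image

# Toy camera: identity pose, 100 px focal length, 100x100 image.
K = np.array([[100., 0., 50.], [0., 100., 50.], [0., 0., 1.]])
mask = visible_voxel_mask(np.array([[0., 0., 1.], [0., 0., -1.]]),
                          np.eye(4), K, 100, 100, 0.1, 10.0)
print(mask)  # [ True False]
```

A single-MLP representation like iMAP has no such locality: every weight affects every point, so each update perturbs the whole scene.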


(our re-implementation of iMAP)






Multi-room Apartment

To further evaluate the scalability of our method, we captured a sequence in a large apartment with multiple rooms.


(our re-implementation of iMAP)



Final 3D Reconstruction Fly-through



Co-fusion Dataset (Robustness to Dynamic Objects)

NICE-SLAM is able to handle dynamic objects. Note that the airship and the toy car are not wrongly reconstructed.



Robustness to Frame Loss

Here, we simulate a large frame loss. The video shows the current camera pose estimate as well as the rendered images at each tracking iteration. The ground-truth camera is shown in black; the currently tracked camera is shown in red. Note that NICE-SLAM quickly recovers the camera pose thanks to the prediction from the coarse level (shown in cyan).


(our re-implementation of iMAP)



@inproceedings{Zhu2022CVPR,
      author    = {Zhu, Zihan and Peng, Songyou and Larsson, Viktor and Xu, Weiwei and Bao, Hujun and Cui, Zhaopeng and Oswald, Martin R. and Pollefeys, Marc},
      title     = {NICE-SLAM: Neural Implicit Scalable Encoding for SLAM},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month     = {June},
      year      = {2022}
}


The authors thank the Max Planck ETH Center for Learning Systems (CLS) for supporting Songyou Peng. We also thank Edgar Sucar for providing additional implementation details about iMAP. Special thanks to Chi Wang for offering the data collection site. The thumbs-up logo was created by Freepik - Flaticon.