Vision-based perception for autonomous driving requires explicit modeling of the 3D space, where 2D latent representations are mapped into 3D and subsequent 3D operators are applied. However, operating on dense 3D latent spaces incurs cubic time and space complexity, which limits scalability in perception range and spatial resolution. Existing approaches compress the dense representation with projections such as Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections lose information, especially for tasks like semantic occupancy prediction.
To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It operates on a lossless sparse latent representation with three key innovations. First, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels (see the sketch below). Second, a feature pyramid with sparse interpolation enriches each scale with information from the others. Third, the transformer head is redesigned as a sparse variant.
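For intuition, here is a minimal PyTorch sketch of the spatial kernel decomposition behind the latent diffuser: one k×k×k convolution is replaced by three 1D convolutions along the z, y, and x axes. It is written with dense tensors for readability, whereas SparseOcc applies the same decomposition with 3D sparse convolutions; the module and argument names are illustrative, not the official API.

```python
import torch
import torch.nn as nn

class DecomposedDiffuserBlock(nn.Module):
    """Illustrative block: a k x k x k conv decomposed into three 1D convs.

    Parameter count drops from C*C*k^3 to 3*C*C*k, and each 1D pass
    diffuses latent features toward neighboring (possibly empty) voxels.
    """
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        p = k // 2
        self.conv_z = nn.Conv3d(channels, channels, (k, 1, 1), padding=(p, 0, 0))
        self.conv_y = nn.Conv3d(channels, channels, (1, k, 1), padding=(0, p, 0))
        self.conv_x = nn.Conv3d(channels, channels, (1, 1, k), padding=(0, 0, p))
        self.act = nn.ReLU(inplace=True)

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, C, Z, Y, X) latent volume
        out = self.act(self.conv_z(voxel_feats))
        out = self.act(self.conv_y(out))
        out = self.act(self.conv_x(out))
        return out

# Usage: shape is preserved while features are diffused along each axis.
feats = torch.randn(1, 64, 16, 100, 100)
print(DecomposedDiffuserBlock(64)(feats).shape)  # torch.Size([1, 64, 16, 100, 100])
```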
SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which can in part be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
Images captured by monocular or surround-view cameras are first passed to a 2D encoder, yielding 2D latent features. These features are then lifted to 3D using a predicted depth map, following LSS (a lifting sketch is given below). SparseOcc adopts a sparse representation for this latent space. On top of it, we introduce three key building blocks: a latent diffuser that performs completion, a feature pyramid that enlarges the receptive field, and a transformer head that predicts semantic occupancy.
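The LSS-style lifting step can be summarized in a few lines: each image feature vector is spread along its camera ray, weighted by a predicted per-pixel depth distribution. The sketch below is dense for readability and the function name is illustrative; in the full pipeline the resulting frustum features are splatted into the voxel grid, and SparseOcc keeps only the non-empty voxels as its sparse latent representation.

```python
import torch

def lift_features(feats_2d: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    """feats_2d: (B, C, H, W) image features; depth_logits: (B, D, H, W).

    Returns a frustum feature volume of shape (B, C, D, H, W).
    """
    depth_prob = depth_logits.softmax(dim=1)                 # (B, D, H, W)
    # Outer product over the depth axis: each pixel's feature vector is
    # distributed over D depth bins according to its depth distribution.
    return depth_prob.unsqueeze(1) * feats_2d.unsqueeze(2)   # (B, C, D, H, W)

# Usage: 64-channel features lifted over 40 depth bins.
frustum = lift_features(torch.randn(1, 64, 28, 50), torch.randn(1, 40, 28, 50))
print(frustum.shape)  # torch.Size([1, 64, 40, 28, 50])
```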
Semantic occupancy prediction results on the nuScenes-Occupancy validation set. For accuracy, we report the geometric metric IoU, the semantic metric mIoU, and the IoU for each semantic class. For efficiency, we report FLOPs, training GPU memory, and 3D/overall inference latency. "C" denotes camera input, and bold numbers indicate the best results.
Semantic scene completion results on the SemanticKITTI validation set. For accuracy, we report the geometric metric IoU, the semantic metric mIoU, and the IoU for each semantic class. For efficiency, we report FLOPs. "C" denotes camera input, and bold numbers indicate the best results. Methods marked with "*" are RGB-input variants reported in prior work for fair comparison.
Qualitative results of 3D semantic occupancy on the nuScenes-Occupancy validation set. The input multi-view images are shown on the left, followed by the occupancy predictions of C-CONet, our SparseOcc, and the ground truth. Compared to the dense-representation-based C-CONet, SparseOcc achieves better completion and segmentation, as highlighted by the red circles.
```bibtex
@inproceedings{tang2024sparseocc,
  title     = {SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction},
  author    = {Tang, Pin and Wang, Zhongdao and Wang, Guoqing and Zheng, Jilai and Ren, Xiangxuan and Feng, Bailan and Ma, Chao},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2024}
}
```