Paving the Way for 3D Foundation Models with a Universal Pre-training Paradigm

Abstract

In contrast to numerous NLP and 2D vision foundational models, learning a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation, thereby establishing a pathway to 3D foundational models. Considering that informative 3D features should encode rich geometry and appearance cues that can be utilized to render realistic images, we propose to learn 3D representations by differentiable neural rendering. We train a 3D backbone with a devised volumetric neural renderer by comparing the rendered with the real images. Notably, our approach seamlessly integrates the learned 3D encoder into various downstream tasks. These tasks encompass not only high-level challenges such as 3D detection and segmentation but also low-level objectives like 3D reconstruction and image synthesis, spanning both indoor and outdoor scenarios. Besides, we also illustrate the capability of pre-training a 2D backbone using the proposed methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness. Code and models are available at https://github.com/OpenGVLab/PonderV2.

Index Terms: 
3D pre-training, 3D vision, neural rendering, foundation model, point cloud, LiDAR, RGB-D image, multi-view image

1 Introduction

Foundation models hold immense importance within the domain of computer vision and have seen extensive development across various fields, including NLP [1, 2, 3, 4, 5, 6, 7], 2D computer vision [8, 9, 10, 11, 12], multimodal domains [13, 14, 15, 16, 17, 18, 19] and embodied AI [20, 21, 22, 23, 24]. These models allow researchers to leverage and fine-tune them for specific tasks, saving substantial computational resources and time. One of the keys to their success is pre-training which helps the model capture general knowledge and learn meaningful features and representations from raw data. Although promising, building such a model in 3D is far more challenging. Firstly, the diversity of 3D representations, such as 3D point clouds and multi-view images, introduces complexities in designing a universal pre-training approach. Moreover, the inherent sparsity of 3D data and the variability arising from sensor placement and occlusions by other scene elements present unique obstacles in the pursuit of acquiring generalizable features.

Figure 1: The radar chart of PonderV2, showing its effectiveness on over ten benchmarks in both indoor and outdoor scenarios. Abbreviations: sem. for semantic, ins. for instance, seg. for segmentation, eff. for efficient, L.R. for limited reconstructions, L.A. for limited annotations, obj. for object, rec. for reconstruction, cam. for camera, det. for detection. Note that the SOTA in the figure denotes the state-of-the-art performance with the same backbone as ours on validation sets.

Previous pre-training methods for obtaining effective 3D representations can be roughly categorized into two groups: contrast-based [25, 26, 27, 28, 29, 30, 31] and masked autoencoding-based (or completion-based) [32, 33, 34, 35, 36, 37, 38]. Contrast-based methods are designed to maintain invariant representations under different transformations. To achieve this, informative samples are required. In the 2D image domain, this challenge is addressed by (1) introducing efficient positive/negative sampling methods, (2) using a large batch size and storing representative samples, and (3) applying various data augmentation policies. Inspired by these works, many methods [25, 26, 27, 28, 29, 30, 31] have been proposed to learn geometry-invariant features on 3D point clouds.

Methods using masked autoencoding are another line of research for 3D representation learning. They use the pre-training task of reconstructing the masked point cloud from partial observations. By maintaining a high masking ratio, such a simple task encourages the model to learn a holistic understanding of the input beyond low-level statistics. Although masked autoencoders have been successfully applied to 2D images [12] and videos [39, 40], applying them to point clouds remains challenging and is still being explored, due to the inherent irregularity and sparsity of point cloud data.

Figure 2: The proposed unified 3D pre-training approach, termed PonderV2, is directly trained with RGB-D rendered image supervision, and can be used for various indoor or outdoor 3D downstream applications, e.g., 3D object detection, 3D semantic segmentation, 3D scene reconstruction, and image synthesis.

Different from the two groups of methods above, we propose point cloud pre-training via neural rendering (PonderV2), an extension of our preliminary work published at ICCV 2023 [41], as shown in Fig. 2. Our motivation is that neural rendering, one of the most notable advances and domain-specific designs in 3D vision, can be leveraged to force the point cloud features to encode rich geometry and appearance cues. Why should rendering be considered the optimal choice for 3D representation learning? Drawing inspiration from the human visual system's ability to intelligently perceive the 3D world through engagement with a 2D canvas, we realize that understanding and cognition inherently manifest in the 2D 'rendering' of the original 3D environment. Based on this principle, a rendering task arises as the most intuitive bridge between the physical 3D world and the perceptual 2D realm, and therefore as a natural foundation for 3D pre-training. Specifically, 3D points are forwarded to a 3D encoder to learn the geometry and appearance of the scene via a neural representation, which is leveraged to render RGB or depth images in a differentiable way. The network is trained to minimize the difference between rendered and observed 2D images. In doing so, our method implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method enables seamless integration into 2D frameworks that take multi-view images as input. Furthermore, because the loss terms are defined in 2D, diverse signals can be used for supervision, including RGB images, depth images, and annotated semantic labels.

Our methodology is simple and easy to implement, yet performs robustly. To validate it, we conducted extensive experiments on more than 9 indoor and outdoor tasks, including both high-level tasks such as indoor/outdoor segmentation and detection, and low-level tasks such as image synthesis and indoor scene/object reconstruction. We achieve state-of-the-art performance on over 11 indoor/outdoor benchmarks. Part of PonderV2's validation set performance compared to baselines and state-of-the-art methods with the same backbone is shown in Fig. 1. These results indicate the effectiveness of the proposed universal methodology. Specifically, we first evaluate PonderV2 with different backbones on various popular indoor benchmarks with multi-frame RGB-D images as inputs, demonstrating its flexibility. Furthermore, we pre-train a single backbone for various downstream tasks, namely SparseUNet [42], which takes whole-scene point clouds as input, and remarkably surpasses the state-of-the-art method with the same backbone on various indoor 3D benchmarks. For example, PonderV2 reaches 77.0 val mIoU on the ScanNet semantic segmentation benchmark and ranks 1st on the ScanNet benchmark with a test mIoU of 78.5. PonderV2 also ranks 1st on the ScanNet200 semantic segmentation benchmark with a test mIoU of 34.6. Finally, we conduct extensive experiments in outdoor autonomous driving scenarios, reaching SOTA validation performance as well. For example, we achieve 73.2 NDS for 3D detection and 79.4 mIoU for 3D segmentation on the nuScenes validation set, 3.0 and 6.1 higher than the baseline, respectively. These results demonstrate the efficacy of PonderV2.

To summarize, our contributions are listed below.

  • We propose to utilize differentiable neural rendering as a novel universal pre-training paradigm tailored for the 3D vision realm. This paradigm, named PonderV2, captures the natural relationship between the 3D physical world and the 2D perception canvas.

  • Our approach excels in acquiring efficient 3D representations, capable of encoding intricate geometric and visual cues through the utilization of neural rendering. This versatile framework extends its applicability to a range of modalities, encompassing both 3D and 2D domains, including but not limited to point clouds and multi-view images.

  • The proposed methodology reaches state-of-the-art performance on many popular indoor and outdoor benchmarks, and can be flexibly integrated into various backbones. Besides high-level perception tasks, PonderV2 can also boost low-level tasks such as image synthesis and scene and object reconstruction. The effectiveness and flexibility of PonderV2 showcase its potential to pave the way for a 3D foundation model.

2 Related Works

Self-supervised learning in point clouds. Current methods can be roughly divided into two categories: contrast-based and masked autoencoding-based. Inspired by the works [43, 44] from the 2D image domain, PointContrast [25] is one of the pioneering works for 3D contrastive learning. Similarly, it encourages the network to learn invariant 3D representations under different transformations. Some works [26, 27, 28, 29, 30, 31, 45] follow the pipeline by either devising new sampling strategies to select informative positive/negative training pairs, or exploring various types of data augmentations. Another line of work is masked autoencoding-based methods [33, 34, 35, 36, 37, 38], which draw inspiration from Masked Autoencoders [12] (MAE). PointMAE [35] proposes restoring the masked points via a set-to-set Chamfer Distance. PointM2AE [37] utilizes a multiscale strategy to capture both high-level semantics and fine-grained details. VoxelMAE [38] instead recovers the underlying geometry by distinguishing whether a voxel contains points. GD-MAE [46] applies a generative decoder for hierarchical MAE-style pre-training. Besides the above, ALSO [47] regards surface reconstruction as the pretext task for representation learning. Different from the above pipelines, we propose a novel framework for point cloud pre-training via neural rendering. Unlike previous works primarily designed for point clouds, our pre-training framework is applicable to both image- and point-based models.

Representation learning on images. Representation learning has been well developed in the 2D domain [12, 48, 49, 43, 50, 40], and has shown its capabilities in all kinds of downstream tasks as the backbone initialization. Contrast-based methods, such as MoCo [43] and MoCov2 [50], learn image representations by discriminating the similarities between different augmented samples. MAE-based methods, including MCMAE [51] and SparK [52], obtain promising generalization ability by recovering the masked patches. Within the realm of 3D applications, models pre-trained on ImageNet [53] are widely utilized in image-related tasks [54, 55, 56, 57, 58, 59]. For example, to compensate for the insufficiency of 3D priors in tasks like 3D object detection, depth estimation [60] and monocular 3D detection [61] are usually used as extra pre-training techniques.

Neural rendering. Neural rendering is a type of rendering technology that uses neural networks to render images from a 3D scene representation in a differentiable way. NeRF [62] is one of the representative neural rendering methods, which represents the scene as a neural radiance field and renders images via volume rendering. Based on NeRF, a series of works [63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 41] try to improve the NeRF representation, including accelerating NeRF training, boosting the quality of geometry, and so on. Another type of neural rendering leverages neural point clouds as the scene representation. The methods in [73, 74] take point locations and corresponding descriptors as input, rasterize the points with a z-buffer, and use a rendering network to get the final image. Later works such as PointNeRF [75] and X-NeRF [76] render realistic images from a neural point representation using a NeRF-like rendering process. Our work is inspired by the recent progress of neural rendering.

3 Neural Rendering as a Universal Pre-training Paradigm

In this section, we present the details of our methodology. We first give an overview of our pipeline in Sec. 3.1, which is visualized in Fig. 3. Then, we detail some specific differences and trials for indoor and outdoor scenarios due to the different input data types and settings in Sec. 3.2 and Sec. 3.3.

Figure 3: The overall pipeline of PonderV2. Taking a raw point cloud that can be constructed from multi-frame RGB-D images, scene scans, LiDAR, etc., we first apply augmentations including masking and grid sampling to obtain a quantized sparse tensor. Then, a sparse backbone is utilized to extract the sparse features; it serves as the encoder to be pre-trained and later fine-tuned. After that, the sparse features are densified and fed into a shallow dense convolutional network, which outputs a dense feature volume. Next, the rendering decoder queries features from the volume and applies shallow MLPs to output each point's SDF and color value. Finally, the SDF and color outputs are integrated to render 2D RGB-D images, which are supervised by the ground truth.

3.1 Universal Pipeline Overview

As shown in Fig. 3, the input of our pipeline is an original sparse point cloud $\mathcal{X} = \{\mathcal{C}_{in}, \mathcal{F}_{in}\}$ comprising a set of $n$ coordinates $\mathcal{C}_{in} \in \mathbb{R}^{n \times 3}$ and their corresponding $ch_{in}$-channel features $\mathcal{F}_{in} \in \mathbb{R}^{n \times ch_{in}}$, which may include attributes such as colors or intensities. These point clouds can be generated from a variety of sources, including RGB-D images, scans, or LiDAR data. Before feeding the data into our backbone, we first apply augmentations to the input and quantize it using a specific grid size $\boldsymbol{g} = [g_x, g_y, g_z] \in \mathbb{R}^3$. This process can be expressed as:

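$$\hat{\mathcal{X}} = \mathcal{G}(\mathcal{T}(\mathcal{X}), \boldsymbol{g}) = \{\hat{\mathcal{C}}_{in}, \hat{\mathcal{F}}_{in}\}, \tag{1}$$

where $\mathcal{G}(\cdot, \boldsymbol{g})$ is a grid sampling function designed to ensure that each grid cell has only one point sampled, $\mathcal{T}(\cdot)$ denotes the augmentation transform function, and $\hat{\mathcal{X}}$ denotes the sampled points.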
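The grid sampling step of Eq. 1 can be illustrated with a minimal NumPy sketch. It assumes that "one point per grid cell" is realized by keeping a random representative of every occupied voxel; the function name `grid_sample` and the choice of a random representative are illustrative assumptions, not details confirmed by the text.

```python
import numpy as np

def grid_sample(coords, feats, grid_size, rng=None):
    """Keep one randomly chosen point per occupied voxel (hypothetical helper).

    coords: (n, 3) float coordinates; feats: (n, ch_in) features;
    grid_size: voxel edge lengths [g_x, g_y, g_z].
    Returns integer voxel coordinates and the features of the kept points.
    """
    rng = np.random.default_rng() if rng is None else rng
    voxel_idx = np.floor(coords / np.asarray(grid_size)).astype(np.int64)
    # Shuffle first so that the representative kept per voxel is random.
    perm = rng.permutation(len(coords))
    _, first = np.unique(voxel_idx[perm], axis=0, return_index=True)
    keep = perm[first]
    return voxel_idx[keep], feats[keep]
```

The returned coordinates are sorted lexicographically because `np.unique` sorts its output; a sparse backbone normally does not depend on point order.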

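Then, we feed $\hat{\mathcal{X}}$ into a sparse backbone $f_e^{(s)}(\cdot)$, which serves as our pre-training encoder. The outputs are obtained by:

$$\hat{\mathcal{F}} = f_e^{(s)}(\hat{\mathcal{X}}) = \{\hat{\mathcal{C}}_{out}, \hat{\mathcal{F}}_{out}\}, \tag{2}$$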

where $\hat{\mathcal{C}}_{out}$ and $\hat{\mathcal{F}}_{out}$ are the coordinates and features of the sparse outputs, respectively. To make the sparse features compatible with our volume-based decoder, we encode them into a volumetric representation by a densification process. Specifically, we first discretize the 3D space at a resolution of $l_x \times l_y \times l_z$ voxel grids. Subsequently, the sparse features that fall into the same voxel are aggregated by applying average pooling based on their corresponding sparse coordinates. The aggregation results in the dense volume features $\mathcal{F}_{dense} \in \mathbb{R}^{l_x \times l_y \times l_z \times ch_{out}}$, where empty voxel grids are padded with zeros. A shallow dense convolutional network $f_d^{(d)}(\cdot)$ is then applied to obtain the enhanced 3D feature volume $\mathcal{V} \in \mathbb{R}^{l_x \times l_y \times l_z \times ch_{vol}}$, which can be expressed as:

$$\mathcal{V} = f_d^{(d)}(\mathcal{F}_{dense}). \tag{3}$$
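The densification described above (average pooling into voxels, zeros for empty voxels) can be sketched as follows. This is only an illustration under the assumption that the sparse coordinates have already been rescaled to integer indices of the target $l_x \times l_y \times l_z$ grid; `densify` is a hypothetical name and the dense convolutional network of Eq. 3 is not included.

```python
import numpy as np

def densify(sparse_coords, sparse_feats, resolution):
    """Average-pool sparse features into a dense (l_x, l_y, l_z, ch) volume.

    sparse_coords: (m, 3) integer voxel indices inside the target grid;
    sparse_feats: (m, ch) features. Empty voxels remain zero.
    """
    lx, ly, lz = resolution
    ch = sparse_feats.shape[1]
    flat = np.ravel_multi_index(tuple(sparse_coords.T), (lx, ly, lz))
    sums = np.zeros((lx * ly * lz, ch), dtype=np.float64)
    counts = np.zeros(lx * ly * lz, dtype=np.float64)
    np.add.at(sums, flat, sparse_feats)  # unbuffered scatter-add handles repeats
    np.add.at(counts, flat, 1.0)
    occupied = counts > 0
    sums[occupied] /= counts[occupied, None]
    return sums.reshape(lx, ly, lz, ch)
```

`np.add.at` is used instead of `sums[flat] += ...` because plain fancy-index assignment would count a voxel only once when several sparse features fall into it.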

Given the dense 3D volume $\mathcal{V}$, we make novel use of differentiable volume rendering to reconstruct the projected color images and depth images as the pretext task. Inspired by [65], we represent a scene as an implicit signed distance function (SDF) field, which is capable of representing high-quality geometric details. Specifically, given a camera pose $\mathbf{P}$ and sampled pixels $\mathbf{x}$, we shoot rays $\mathbf{r}$ from the camera's projection center $\mathbf{o}$ in direction $\mathbf{d}$ towards the pixels, which can be derived from the camera's intrinsics and extrinsics. Along each ray, we sample $D$ points $\{\mathbf{p}_j = \mathbf{o} + t_j \cdot \mathbf{d} \mid j = 1, \dots, D \wedge 0 \le t_j < t_{j+1}\}$, where $t_j$ is the distance from each point to the camera center, and query each point's 3D feature $\mathbf{f}_j$ from $\mathcal{V}$ by trilinear interpolation. An SDF value $s_j$ is predicted for each point $\mathbf{p}_j$ using a shallow MLP $\phi_{SDF}$:

$$s_j = \phi_{SDF}(\mathbf{p}_j, \mathbf{f}_j). \tag{4}$$
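The ray sampling and trilinear feature query that precede Eq. 4 can be sketched in NumPy. The sketch makes several assumptions that the text does not specify: depths are spaced uniformly between hypothetical `near` and `far` bounds, query points are expressed in voxel-index units with integer coordinates at voxel centers, out-of-range points are clamped, and the volume has at least two voxels per axis.

```python
import numpy as np

def sample_along_ray(origin, direction, near, far, num_samples):
    """Return increasing depths t_j and points p_j = o + t_j * d."""
    d = direction / np.linalg.norm(direction)
    t = np.linspace(near, far, num_samples)
    return t, origin[None, :] + t[:, None] * d[None, :]

def trilinear_query(volume, points):
    """Interpolate a (l_x, l_y, l_z, ch) volume at (D, 3) continuous positions."""
    res = np.array(volume.shape[:3])
    p = np.clip(points, 0.0, res - 1.0)
    lo = np.minimum(np.floor(p).astype(int), res - 2)  # lower corner index
    w = p - lo                                         # offsets in [0, 1]
    out = np.zeros((len(p), volume.shape[3]))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                corner = volume[lo[:, 0] + dx, lo[:, 1] + dy, lo[:, 2] + dz]
                weight = (
                    (w[:, 0] if dx else 1 - w[:, 0])
                    * (w[:, 1] if dy else 1 - w[:, 1])
                    * (w[:, 2] if dz else 1 - w[:, 2])
                )
                out += weight[:, None] * corner
    return out
```

In a training setup the same interpolation would be done with a differentiable operator so that gradients flow back into the feature volume; the NumPy version only shows the arithmetic.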

To determine the color value, our approach draws inspiration from [66] and conditions the color field on the surface normal $\mathbf{n}_j$ (i.e., the gradient of the SDF value at ray point $\mathbf{p}_j$) together with a geometry feature vector $\mathbf{h}_j$ derived from $\phi_{SDF}$. This yields the color representation:

$$c_j = \phi_{RGB}(\mathbf{p}_j, \mathbf{f}_j, \mathbf{d}, \mathbf{n}_j, \mathbf{h}_j), \tag{5}$$

where $\phi_{RGB}$ is parameterized by another shallow MLP. Subsequently, we render 2D colors $\hat{C}(\mathbf{r})$ and depths $\hat{D}(\mathbf{r})$ by integrating predicted colors and sampled depths along rays $\mathbf{r}$ using the following equations:

$$\hat{C}(\mathbf{r}) = \sum_{j=1}^{D} w_j c_j, \qquad \hat{D}(\mathbf{r}) = \sum_{j=1}^{D} w_j t_j. \tag{6}$$

The weight $w_j$ in these equations is an unbiased, occlusion-aware factor, as illustrated in [65], and is computed as $w_j = T_j \alpha_j$. Here, $T_j = \prod_{k=1}^{j-1} (1 - \alpha_k)$ represents the accumulated transmittance, while $\alpha_j$ is the opacity value computed by:

$$\alpha_j = \max\left( \frac{\sigma_s(s_j) - \sigma_s(s_{j+1})}{\sigma_s(s_j)}, 0 \right), \tag{7}$$

where $\sigma_s(x) = (1 + e^{-sx})^{-1}$ is the Sigmoid function modulated by a learnable parameter $s$.
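A small NumPy sketch of Eqs. 6 and 7 for a single ray may help make the weighting concrete. It assumes that one extra SDF sample is available beyond the last point so that $s_{j+1}$ exists for $j = D$, and it treats the sharpness $s$ as a fixed number rather than a learned parameter; both are simplifications for illustration, and `render_ray` is a hypothetical name.

```python
import numpy as np

def sigmoid_s(x, s):
    """Sigmoid modulated by the sharpness s."""
    return 1.0 / (1.0 + np.exp(-s * x))

def render_ray(sdf, colors, t, s):
    """Composite colors and depths along one ray.

    sdf: (D + 1,) SDF values at consecutive samples (last one is the extra
    sample); colors: (D, 3); t: (D,) sample depths; s: sharpness scalar.
    """
    cdf = sigmoid_s(sdf, s)
    alpha = np.maximum((cdf[:-1] - cdf[1:]) / cdf[:-1], 0.0)   # Eq. 7
    # T_j = prod_{k < j} (1 - alpha_k), with T_1 = 1.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    w = trans * alpha
    color = (w[:, None] * colors).sum(axis=0)                  # Eq. 6
    depth = (w * t).sum()
    return color, depth, w
```

For a ray that crosses a surface, the weights concentrate on the samples around the zero crossing of the SDF, so the rendered depth lands near the surface and samples behind it receive almost no weight.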

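Finally, our optimization target is to minimize the L1 reconstruction loss in the rendered 2D pixel space, with factors $\lambda_C$ and $\lambda_D$ adjusting the weight between color and depth, namely:

$$\mathcal{L} = \frac{1}{|\mathbf{r}|} \sum_{r \in \mathbf{r}} \left( \lambda_C \cdot \| \hat{C}(r) - C(r) \| + \lambda_D \cdot \| \hat{D}(r) - D(r) \| \right). \tag{8}$$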
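Eq. 8 can be written out directly for a batch of rays. In the sketch below the default values of `lambda_c` and `lambda_d` are placeholders, since the text does not state the weights used, and `rendering_loss` is a hypothetical name.

```python
import numpy as np

def rendering_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, lambda_c=1.0, lambda_d=1.0):
    """Mean over rays of the weighted L1 color and depth errors (Eq. 8).

    pred_rgb, gt_rgb: (R, 3) colors; pred_depth, gt_depth: (R,) depths.
    """
    color_err = np.abs(pred_rgb - gt_rgb).sum(axis=1)  # L1 norm per ray
    depth_err = np.abs(pred_depth - gt_depth)
    return float(np.mean(lambda_c * color_err + lambda_d * depth_err))
```

Because both terms live in 2D pixel space, other per-pixel signals can be added as further weighted terms of the same form.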

3.2 Indoor Scenario

While the proposed rendering-based pretext task operates in a fully unsupervised manner, the framework can be easily extended to supervised learning by incorporating off-the-shelf labels to further improve the learned representation. Due to the large amount of synthetic data with available annotations for indoor scenes, the rendering decoder could additionally render semantic labels in 2D. Specifically, we employ an additional shallow MLP, denoted as $\phi_{SEMANTIC}$, to predict semantic features for each query point:

$$\mathbf{l}_j = \phi_{SEMANTIC}(\mathbf{p}_j, \mathbf{f}_j, \mathbf{n}_j, \mathbf{h}_j). \tag{9}$$

These semantic features can be projected onto a 2D canvas using a similar weighting scheme as described in Eq. 6. For supervision, we utilize the CLIP [13] features of each pixel’s text label which is a readily available attribute in most indoor datasets.

It is worth noting that when handling vast amounts of unlabeled RGB-D data, the semantic rendering can also be switched to an unsupervised manner by leveraging existing 2D segmentation models. For example, Segment-Anything [8] or diffusion [77] features could provide pseudo semantic features as supervision, which would also distill knowledge from 2D foundational models into 3D backbones. Further research on this aspect is left as future work, as the primary focus of this paper is the novel pre-training methodology itself.

3.3 Outdoor Scenario

Figure 4: The overall pipeline for multi-image input. The rendering decoder is the same as in Fig. 3. The image encoder takes augmented multi-view images as inputs and outputs the multi-view features. The image features are then zero-padded and transformed to obtain the 3D dense feature volume.

To further show the generalization ability of our pre-training paradigm, we also apply our methodology to the outdoor scenario, where multi-view images and LiDAR point clouds are usually available. To make the pre-training method suitable for these inputs, we convert them into the 3D volumetric space.

Specifically, for the LiDAR point clouds, we follow the same process in Sec. 3.1 to augment the point clouds and voxelize the point features extracted by a 3D backbone.

For multi-view images $\mathcal{I} = \{\boldsymbol{I}_1, \boldsymbol{I}_2, \dots\}$, inspired by MAE [12], we first mask out partial pixels as the data augmentation to get $\hat{\mathcal{I}}$. Then we leverage a 2D backbone $f_e^{(2d)}$ to extract multi-view image features $\mathcal{F}_{image} = f_e^{(2d)}(\hat{\mathcal{I}})$. The 2D features are subsequently transformed into the 3D ego-car coordinate system to obtain the 3D dense volume features. Concretely, we first pre-define the 3D voxel coordinates $X_p \in \mathbb{N}^{l_x \times l_y \times l_z \times 3}$, and then project $X_p$ onto the multi-view images to index the corresponding 2D features. The process can be calculated by:

$$\mathcal{F}_{dense} = \mathcal{B}(T_{c2i} T_{l2c} X_p, \mathcal{F}_{image}), \tag{10}$$

where $T_{l2c}$ and $T_{c2i}$ denote the transformation matrices from the LiDAR coordinate system to the camera frame and from the camera frame to image coordinates, respectively, and $\mathcal{B}$ represents the bilinear interpolation. The encoding process of the multi-view image case is shown in Fig. 4.
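The lifting in Eq. 10 can be illustrated for a single camera with a short NumPy sketch. It assumes that $T_{c2i}$ is a pinhole projection given by a $3 \times 3$ intrinsic matrix `K`, that pixel coordinates coincide with array indices, and that voxels projecting outside the image or behind the camera receive zero features; the function name and these conventions are illustrative assumptions, and fusing several views is not shown.

```python
import numpy as np

def lift_image_features(voxel_xyz, T_l2c, K, feat_map):
    """Sample 2D features at the projections of 3D voxel centers.

    voxel_xyz: (N, 3) voxel centers in the LiDAR frame;
    T_l2c: (4, 4) LiDAR-to-camera transform; K: (3, 3) intrinsics;
    feat_map: (H, W, ch) image features with H, W >= 2.
    """
    H, W, ch = feat_map.shape
    homo = np.concatenate([voxel_xyz, np.ones((len(voxel_xyz), 1))], axis=1)
    cam = (T_l2c @ homo.T).T[:, :3]
    z = cam[:, 2]
    in_front = z > 1e-6
    uvw = (K @ cam.T).T
    safe_z = np.where(in_front, z, 1.0)      # avoid dividing by z <= 0
    u, v = uvw[:, 0] / safe_z, uvw[:, 1] / safe_z
    valid = in_front & (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    out = np.zeros((len(voxel_xyz), ch))
    uu, vv = u[valid], v[valid]
    u0 = np.minimum(np.floor(uu).astype(int), W - 2)
    v0 = np.minimum(np.floor(vv).astype(int), H - 2)
    du, dv = uu - u0, vv - v0
    # Bilinear blend of the four neighbouring feature vectors.
    out[valid] = (
        feat_map[v0, u0] * ((1 - du) * (1 - dv))[:, None]
        + feat_map[v0, u0 + 1] * (du * (1 - dv))[:, None]
        + feat_map[v0 + 1, u0] * ((1 - du) * dv)[:, None]
        + feat_map[v0 + 1, u0 + 1] * (du * dv)[:, None]
    )
    return out
```

Reshaping the `(N, ch)` result to `(l_x, l_y, l_z, ch)` gives a dense volume in the same layout as the point cloud branch, which is what lets both modalities share the rendering decoder.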

4 Experiments for Indoor Scenarios

We first conduct comprehensive experiments on indoor datasets, which consist of two parts. In the first part, we use a lightweight backbone for ablation studies, which takes only multi-frame RGB-D as input. We call this variant Ponder-RGBD. The second part focuses on a single, unified pre-trained model that pushes the limits of performance, surpassing the previous SOTA pre-training pipeline substantially.

4.1 Indoor Scene Multi-frame RGB-D Images as Inputs

4.1.1 Experimental Setup

We use ScanNet [78] RGB-D images as our pre-training dataset. ScanNet is a widely used real-world indoor dataset, which contains more than 1500 indoor scenes. Each scene is carefully scanned by an RGB-D camera, leading to about 2.5 million RGB-D frames in total. We follow the same train/val split as VoteNet [79]. Semantic rendering is not used in this part; it is introduced in the part that takes scene-level point clouds as input.

4.1.2 Implementation and Training Details

For the RGB-D input version, the 3 RGB channels are taken as input features, with grid size $\boldsymbol{g} = 0.02^3$. After densification to a $64 \times 64 \times 64$ volume with 96 channels, we apply a dense UNet with 128 output channels. $\phi_{SDF}$ is designed as a 5-layer MLP, while $\phi_{RGB}$ is a 3-layer MLP.

During pre-training, a mini-batch of batch size 8 includes point clouds from 8 scenes. Each point cloud input to our sparse backbone is back-projected from 5 RGB-D frames taken from a ScanNet raw video at a frame interval of 20. The 5 frames are also used as the supervision of the network. We randomly down-sample the input point cloud to 20,000 points and follow the masking strategy used in MaskPoint [36].
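The back-projection of RGB-D frames into an input point cloud can be sketched as below. The sketch assumes a pinhole camera with intrinsics `K`, depth measured along the optical axis with 0 marking missing pixels, and a camera-to-world pose per frame; the function names are hypothetical and ScanNet-specific details such as depth units or file loading are left out.

```python
import numpy as np

def backproject_depth(depth, rgb, K, cam_to_world):
    """Lift one RGB-D frame to a colored point cloud in world coordinates.

    depth: (H, W) float depths (0 = missing); rgb: (H, W, 3);
    K: (3, 3) intrinsics; cam_to_world: (4, 4) camera pose.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, rgb[valid]

def merge_and_subsample(clouds, num_points, rng=None):
    """Concatenate per-frame clouds and randomly keep at most num_points."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.concatenate([p for p, _ in clouds])
    cols = np.concatenate([c for _, c in clouds])
    if len(pts) > num_points:
        idx = rng.choice(len(pts), size=num_points, replace=False)
        pts, cols = pts[idx], cols[idx]
    return pts, cols
```

With the settings stated above, `clouds` would hold the 5 back-projected frames and `num_points` would be 20,000.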

We train the proposed pipeline for 100 epochs using an AdamW optimizer [80] with a weight decay of 0.05. The learning rate is initialized as 1e-4 with an exponential schedule. For the rendering process, we randomly choose 128 rays for each image and sample 128 points for each ray.

4.1.3 Comparison Experiments

TABLE I: Indoor Ponder-RGBD 3D object detection mAP@25 and mAP@50 on ScanNet and SUN RGB-D with VoteNet [79] backbone. The DepthContrast [31] and Point-BERT [33] results are adopted from IAE [34] and MaskPoint [36]. Ponder-RGBD outperforms both VoteNet-based and 3DETR-based point cloud pre-training methods with fewer training epochs.

| Method | Detection Model | Pre-training Type | Pre-training Epochs | ScanNet Val mAP@50 | ScanNet Val mAP@25 | SUN RGB-D Val mAP@50 | SUN RGB-D Val mAP@25 |
|---|---|---|---|---|---|---|---|
| 3DETR [81] | 3DETR | - | - | 37.5 | 62.7 | 30.3 | 58 |
| + Point-BERT [33] | 3DETR | Masked Auto-Encoding | 300 | 38.3 | 61.0 | - | - |
| + MaskPoint [36] | 3DETR | Masked Auto-Encoding | 300 | 40.6 | 63.4 | - | - |
| VoteNet [79] | VoteNet | - | - | 33.5 | 58.6 | 32.9 | 57.7 |
| + STRL [28] | VoteNet | Contrast | 100 | 38.4 | 59.5 | 35.0 | 58.2 |
| + RandomRooms [30] | VoteNet | Contrast | 300 | 36.2 | 61.3 | 35.4 | 59.2 |
| + PointContrast [25] | VoteNet | Contrast | - | 38.0 | 59.2 | 34.8 | 57.5 |
| + PC-FractalDB [82] | VoteNet | Contrast | - | 38.3 | 61.9 | 33.9 | 59.4 |
| + DepthContrast [31] | VoteNet | Contrast | 1000 | 39.1 | 62.1 | 35.4 | 60.4 |
| + IAE [34] | VoteNet | Masked Auto-Encoding | 1000 | 39.8 | 61.5 | 36.0 | 60.4 |
| + Ponder-RGBD (Ours) | VoteNet | Rendering | 100 | 41.0 (↑7.5) | 63.6 (↑5.0) | 36.6 (↑3.7) | 61.0 (↑3.3) |

TABLE II: Indoor Ponder-RGBD 3D object detection mAP@25 and mAP@50 on ScanNet validation set with H3DNet [83] backbone. Ponder-RGBD significantly boosts the accuracy by a margin of +2.8 and +1.2 for mAP@50 and mAP@25, respectively.

| Method | mAP@50 | mAP@25 |
|---|---|---|
| VoteNet [79] | 33.5 | 58.7 |
| 3DETR [81] | 37.5 | 62.7 |
| 3DETR-m [81] | 47.0 | 65.0 |
| H3DNet [83] | 48.1 | 67.2 |
| + Ponder-RGBD (Ours) | 50.9 (↑2.8) | 68.4 (↑1.2) |

Object Detection. We select two representative approaches, VoteNet [79] and H3DNet [83], as the baselines. VoteNet leverages a voting mechanism to obtain object centers, which are used for generating 3D bounding box proposals. By introducing a set of geometric primitives, H3DNet achieves a significant improvement in accuracy compared to previous methods. Two datasets are used to verify the effectiveness of our method: ScanNet [78] and SUN RGB-D [84]. Different from ScanNet, which contains fully reconstructed 3D scenes, SUN RGB-D is a single-view RGB-D dataset with 3D bounding box annotations. It has 10,335 RGB-D images for 37 object categories. For pre-training, we use PointNet++ as the point cloud encoder $f_e^{(s)}$, which is identical to the backbone used in VoteNet and H3DNet. We pre-train $f_e^{(s)}$ on the ScanNet dataset and transfer the weights as the downstream initialization. Following [79], we use average precision with 3D detection IoU thresholds of 0.25 and 0.5 as the evaluation metrics.
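To make the two IoU thresholds concrete, here is a minimal sketch of 3D IoU for axis-aligned boxes. This is an illustrative simplification: the function name and box layout are assumptions, oriented boxes (as annotated in SUN RGB-D) need a rotated-overlap computation that is not shown, and the full mAP computation (ranking detections by score and integrating precision over recall) is also omitted.

```python
import numpy as np

def box3d_iou_axis_aligned(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Overlap extent per axis, clamped at zero for disjoint boxes.
    overlap = np.clip(np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3]), 0.0, None)
    inter = overlap.prod()
    vol_a = (a[3:] - a[:3]).prod()
    vol_b = (b[3:] - b[:3]).prod()
    return float(inter / (vol_a + vol_b - inter))
```

For instance, two 2 × 2 × 2 boxes shifted by 1 along one axis overlap with an IoU of 1/3, so such a prediction would count as a match at the 0.25 threshold but not at 0.5.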

The 3D detection results are shown in Tab. I. Our method improves the baseline of VoteNet without pre-training by a large margin, boosting mAP@50 by 7.5% and 3.7% for ScanNet and SUN RGB-D, respectively. IAE [34] is a pre-training method that represents the inherent 3D geometry in a continuous manner. Our learned point cloud representation achieves higher accuracy because it is able to recover both the geometry and appearance of the scene. The mAP@50 and mAP@25 of our method are higher than those of IAE by 1.2% and 2.1% on ScanNet, respectively. Besides, we observe that our method surpasses the recent point cloud pre-training approach, MaskPoint [36], even when using a less sophisticated backbone (PointNet++ vs. 3DETR), as presented in Tab. I. To verify the effectiveness of Ponder-RGBD, we also apply it to a much stronger baseline, H3DNet. As shown in Tab. II, our method surpasses H3DNet by +2.8 and +1.2 for mAP@50 and mAP@25, respectively.

Semantic Segmentation. 3D semantic segmentation is another fundamental scene understanding task. We select one of the top-performing backbones, MinkUNet [42], for fine-tuning. MinkUNet leverages 3D sparse convolution to extract effective 3D scene features. We report the fine-tuning results on the ScanNet dataset with the mean IoU on the validation set as the evaluation metric. Tab. III shows the quantitative results of Ponder-RGBD with MinkUNet. The results demonstrate that Ponder-RGBD is effective in improving the semantic segmentation performance, achieving a significant improvement of 1.3 mIoU.
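The mean IoU metric can be computed from a confusion matrix, as in the following sketch. It averages over classes that appear in either the prediction or the ground truth and does not handle an "ignore" label; those conventions and the function name are assumptions for illustration and may differ from the official ScanNet evaluation script.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred, target: integer label arrays of the same length, values in
    [0, num_classes).
    """
    pred, target = np.asarray(pred), np.asarray(target)
    # Row index = ground-truth class, column index = predicted class.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    return float((inter[present] / union[present]).mean())
```

Scores such as those in Tab. III are this quantity multiplied by 100.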

Scene Reconstruction. The 3D scene reconstruction task aims to recover the scene geometry, e.g., a mesh, from the point cloud input. We choose ConvONet [85] as the baseline model, whose architecture is widely adopted in [86, 87, 64]. Following the same setting as ConvONet, we conduct experiments on the Synthetic Indoor Scene Dataset (SISD) [85], which is a synthetic dataset containing 5000 scenes with multiple ShapeNet [88] objects. To make a fair comparison with IAE [34], we use the same VoteNet-style PointNet++ as the encoder of ConvONet, which down-samples the original point cloud to 1024 points. Following [85], we use Volumetric IoU, Normal Consistency (NC), and F-Score [89] with a threshold value of 1% as the evaluation metrics. The results are shown in Tab. IV. Compared to the baseline ConvONet model with PointNet++, IAE is not able to boost the reconstruction results, while the proposed approach improves the reconstruction quality (+2.4% for IoU). The results show the effectiveness of Ponder-RGBD for the 3D scene reconstruction task.

TABLE III: Indoor Ponder-RGBD 3D segmentation mIoU on ScanNet validation dataset.

| Method | Val mIoU |
|---|---|
| PointNet++ [90] | 53.5 |
| KPConv [91] | 69.2 |
| SparseConvNet [92] | 69.3 |
| PT [93] | 70.6 |
| MinkUNet [42] | 72.2 |
| + Ponder-RGBD (Ours) | 73.5 (↑1.3) |

TABLE IV: Indoor Ponder-RGBD 3D scene reconstruction IoU, NC, and F-Score on SISD dataset with PointNet++ backbone. Ponder-RGBD can boost the scene reconstruction performance.

| Method | Encoder | IoU↑ | NC↑ | F-Score↑ |
|---|---|---|---|---|
| IAE [34] | PointNet++ | 75.7 | 88.7 | 91.0 |
| ConvONet [85] | PointNet++ | 77.8 | 88.7 | 90.6 |
| + Ponder-RGBD (Ours) | PointNet++ | 80.2 (↑2.4) | 89.3 (↑0.6) | 92.0 (↑1.4) |

Image Synthesis From Point Clouds. We also validate the effectiveness of our method on another low-level task, image synthesis from point clouds. We use Point-NeRF [75] as the baseline. Point-NeRF uses neural 3D point clouds with associated neural features to render images. It can be used both in a generalizable setting across various scenes and in a single-scene fitting setting. In our experiments, we mainly focus on the generalizable setting of Point-NeRF. We replace the 2D image features of Point-NeRF with point features extracted by a DGCNN network. Following the same setting as Point-NeRF, we use DTU [94] as the evaluation dataset. The DTU dataset is a multi-view stereo dataset containing 80 scenes with paired images and camera poses. We transfer both the DGCNN encoder and the color decoder as the weight initialization of Point-NeRF. We use PSNR as the metric for synthesized image quality evaluation.
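For reference, PSNR between a synthesized image and its reference can be computed as in this minimal sketch, which assumes pixel values scaled to the range [0, `max_val`]; the function name is hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Higher values indicate a closer match, and a uniform error of 0.1 on a [0, 1] image corresponds to 20 dB.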

The results are shown in Fig. 6. By leveraging the pre-trained weights of our method, the image synthesis model is able to converge faster with fewer training steps and achieve better final image quality than training from scratch.

Figure 5: Rendered images by Ponder-RGBD on the ScanNet validation set. The projected point clouds are visualized in the first column. Even though input point clouds are very sparse, our model is capable of rendering color and depth images similar to the reference images.

Ponder-RGBD’s rendered color and depth images are shown in Fig. 5. As shown in the figure, even though the input point cloud is pretty sparse, our method can still render color and depth images similar to the reference images.

Figure 6: Comparison of image synthesis from point clouds. Compared with training from scratch, our Ponder-RGBD model is able to converge faster and achieve better image synthesis results.

4.1.4 Ablation Study

Influence of Rendering Targets. The rendering part of our method contains two targets: the RGB color image and the depth image. We study the influence of each target on the transfer task of 3D detection. The results are presented in Tab. V. Combining depth and color images for reconstruction yields the best detection results. In addition, depth reconstruction alone performs better than color reconstruction alone for 3D detection.

Influence of Mask Ratio. To augment point cloud data, we employ random masking as one of the augmentation methods, which divides the input point cloud into 2048 groups of 64 points. In this ablation study, we evaluate the performance of our method with different mask ratios, ranging from 0% to 90%, on the ScanNet and SUN RGB-D datasets, and report the results in Tab. VI. Notably, we find that even when no dividing and masking strategy is applied (0%), our method achieves a competitive mAP@50 of 40.7 and 37.3 on ScanNet and SUN RGB-D, respectively. Our method achieves the best performance on ScanNet with a mask ratio of 75%, reaching a mAP@50 of 41.7. Overall, these results suggest that our method is robust to the mask ratio hyper-parameter and can still achieve competitive performance without any mask operation.
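A group-wise random masking step of this kind can be sketched as follows. The text only states the number and size of the groups, so the details here are assumptions: group centers are drawn uniformly at random (farthest point sampling is a common alternative), each group is the set of nearest neighbours of its center, and masked groups are simply dropped from the input. The dense distance matrix makes this sketch suitable for small demonstrations only.

```python
import numpy as np

def random_group_mask(points, num_groups, group_size, mask_ratio, rng=None):
    """Return a boolean keep-mask over points after masking whole groups.

    points: (n, 3) coordinates. A fraction mask_ratio of the groups is
    masked, and every point belonging to a masked group is dropped.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    centers = points[rng.permutation(n)[:num_groups]]
    # Squared distances from every center to every point: (num_groups, n).
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    groups = np.argsort(d2, axis=1)[:, :group_size]
    num_masked = int(round(mask_ratio * num_groups))
    masked_groups = rng.permutation(num_groups)[:num_masked]
    keep = np.ones(n, dtype=bool)
    keep[groups[masked_groups].ravel()] = False
    return keep
```

Because groups can overlap, the fraction of points actually removed is usually smaller than `mask_ratio`.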

Influence of 3D Feature Volume Resolution. In our method, Ponder-RGBD constructs a 3D feature volume at resolutions of [16, 32, 64], which is inspired by recent progress in multi-resolution 3D reconstruction. However, building such a high-resolution feature volume can consume significant GPU memory. To investigate the effect of feature volume resolution, we conduct experiments with different resolutions and report the results in Tab. VII. From the results, we observe that even with a smaller feature volume resolution of 16, Ponder-RGBD can still achieve competitive performance on downstream tasks.

TABLE V: Indoor Ponder-RGBD ablation study for supervision type. 3D detection AP50 on ScanNet and SUN RGB-D validation set. Combining color supervision and depth supervision can lead to better detection performance than using a single type of supervision.

| Supervision | ScanNet mAP@50 | SUN RGB-D mAP@50 |
| --- | --- | --- |
| VoteNet | 33.5 | 32.9 |
| + Depth | 40.9 (↑7.4) | 36.1 (↑3.2) |
| + Color | 40.5 (↑7.0) | 35.8 (↑2.9) |
| + Depth + Color | 41.0 (↑7.5) | 36.6 (↑3.7) |

TABLE VI: Ablation study for mask ratio. 3D detection 𝐴𝑃50 on ScanNet and SUN RGB-D validation set.

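| Mask Ratio | ScanNet mAP@50 | SUN RGB-D mAP@50 |
| --- | --- | --- |
| VoteNet | 33.5 | 32.9 |
| 0% | 40.7 (↑7.2) | 37.3 (↑4.4) |
| 25% | 40.7 (↑7.2) | 36.2 (↑3.3) |
| 50% | 40.3 (↑6.8) | 36.9 (↑4.0) |
| 75% | 41.7 (↑8.2) | 37.0 (↑4.1) |
| 90% | 41.0 (↑7.5) | 36.6 (↑3.7) |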

TABLE VII: Ablation study for feature volume resolution. 3D detection 𝐴𝑃50 on ScanNet and SUN RGB-D validation set.

| Resolution | ScanNet mAP@50 | SUN RGB-D mAP@50 |
| --- | --- | --- |
| VoteNet | 33.5 | 32.9 |
| 16 | 40.7 (↑7.2) | 36.6 (↑3.7) |
| 16 + 32 + 64 | 41.0 (↑7.5) | 36.6 (↑3.7) |

TABLE VIII: Ablation study for view number. 3D detection 𝐴𝑃50 on ScanNet and SUN RGB-D validation set. Using multi-view supervision for pre-training can achieve better performance.

| #View | ScanNet mAP@50 | SUN RGB-D mAP@50 |
| --- | --- | --- |
| VoteNet | 33.5 | 32.9 |
| 1 | 40.1 (↑6.6) | 35.4 (↑2.5) |
| 3 | 40.8 (↑7.3) | 36.0 (↑3.1) |
| 5 | 41.0 (↑7.5) | 36.6 (↑3.7) |

Number of Input RGB-D Views Our method takes N RGB-D images as input, where N is the number of input views. We study the influence of N on 3D detection, as shown in Tab. VIII. We vary the number of input views while keeping the number of scenes per batch fixed at 8. Using multi-view supervision helps to reduce single-view ambiguity; similar observations have been made for the multi-view reconstruction task [95]. Compared with a single view, five views achieve higher accuracy, improving mAP@50 by 0.9 and 1.2 points on ScanNet and SUN RGB-D, respectively.

Figure 7:Reconstructed surface by Ponder-RGBD. Our pre-training method can be easily integrated into the task of 3D reconstruction. Despite the sparsity of the input point cloud (only 2% points are used), our method can still recover precise geometric details.

4.1.5Other applications

The model pre-trained with our Ponder-RGBD pipeline can also be used directly for surface reconstruction from sparse point clouds. Specifically, after learning the neural scene representation, we query SDF values in 3D space and use Marching Cubes [96] to extract the surface. We show the reconstruction results in Fig. 7. Even though the input is a sparse point cloud from a complex scene, our method is capable of recovering high-fidelity meshes.
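
To make the "query the SDF, then extract the surface" step more concrete, here is a minimal sketch of the first stage of that process: sampling an SDF on a regular grid and finding the cells whose corners change sign. These are the cells a Marching Cubes implementation would go on to triangulate. The analytic `sphere_sdf` stands in for the learned SDF network, and the grid bounds and resolution are illustrative assumptions; the triangulation itself is omitted.

```python
import itertools
import math


def sphere_sdf(x, y, z, radius=0.5):
    """Stand-in SDF (negative inside, positive outside); not the learned network."""
    return math.sqrt(x * x + y * y + z * z) - radius


def surface_cells(sdf, resolution, lo=-1.0, hi=1.0):
    """Return grid cells (i, j, k) whose 8 corners straddle the zero level set."""
    step = (hi - lo) / resolution
    n = resolution + 1
    # Query the SDF once per grid corner.
    values = {
        (i, j, k): sdf(lo + i * step, lo + j * step, lo + k * step)
        for i, j, k in itertools.product(range(n), repeat=3)
    }
    offsets = list(itertools.product((0, 1), repeat=3))
    cells = []
    for i, j, k in itertools.product(range(resolution), repeat=3):
        corners = [values[(i + di, j + dj, k + dk)] for di, dj, dk in offsets]
        if min(corners) < 0.0 < max(corners):
            cells.append((i, j, k))
    return cells


if __name__ == "__main__":
    cells = surface_cells(sphere_sdf, resolution=8)
    print(f"{len(cells)} of {8 ** 3} cells intersect the surface")
```

In practice one would hand the sampled grid to an existing Marching Cubes implementation; the point of the sketch is only that surface extraction needs nothing from the model beyond point-wise SDF queries.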

4.2Indoor Scene Point Clouds as Inputs

TABLE IX:Indoor semantic segmentation results. Our method builds on SparseUNet [42], and is evaluated on ScanNet [78], ScanNet200 [97], and S3DIS [98] benchmarks. Compared to other pre-training approaches, PonderV2 has significant finetuning improvements across all the benchmarks with shared pre-trained weights.

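| Method | #Params. | ScanNet Val mIoU | ScanNet Test mIoU | ScanNet200 Val mIoU | ScanNet200 Test mIoU | S3DIS Area5 mIoU | S3DIS 6-fold mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| StratifiedFormer [99] | 18.8M | 74.3 | 73.7 | - | - | 72.0 | 78.1 |
| PointNeXt [100] | 41.6M | 71.5 | 71.2 | - | - | 70.5 | 77.2 |
| PTv1 [93] | 11.4M | 70.6 | - | 27.8 | - | 70.4 | 76.5 |
| PTv2 [101] | 12.8M | 75.4 | 75.2 | 30.2 | - | 71.6 | 77.9 |
| SparseUNet [42] | 39.2M | 72.2 | 73.6 | 25.0 | 25.3 | 65.4 | 65.4 |
| + PC [25] | 39.2M | 74.1 | - | 26.2 | - | 70.3 | 76.9 |
| + CSC [26] | 39.2M | 73.8 | - | 26.4 | 24.9 | 72.2 | - |
| + MSC [45] | 39.2M | 75.5 | - | 28.8 | - | 70.1 | 77.2 |
| + PPT [102] | 41.0M | 76.4 | 76.6 | 31.9 | 33.2 | 72.7 | 78.1 |
| PonderV2 (Ours) | 41.0M | 77.0 (↑4.8) | 78.5 (↑4.9) | 32.3 (↑7.3) | 34.6 (↑9.3) | 73.2 (↑7.8) | 79.9 (↑14.5) |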

TABLE X:Indoor S3DIS semantic segmentation 6-fold cross-validation results. PonderV2 achieves the best average performance on all metrics including mIoU, mAcc and allAcc.

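| Metric | Area1 PPT | Area1 PonderV2 | Area2 PPT | Area2 PonderV2 | Area3 PPT | Area3 PonderV2 | Area4 PPT | Area4 PonderV2 | Area5 PPT | Area5 PonderV2 | Area6 PPT | Area6 PonderV2 | Average PPT | Average PonderV2 | Average Scratch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mIoU | 83.0 | 84.1 | 65.4 | 71.7 | 87.1 | 85.1 | 74.1 | 73.4 | 72.7 | 73.2 | 86.4 | 87.4 | 78.1 | 79.9 (↑1.8) | 65.4 |
| mAcc | 90.3 | 90.8 | 75.6 | 79.9 | 91.8 | 90.9 | 84.0 | 81.9 | 78.2 | 79.0 | 92.5 | 92.8 | 85.4 | 86.5 (↑1.1) | - |
| allAcc | 93.5 | 93.7 | 88.3 | 90.1 | 94.6 | 94.2 | 90.8 | 90.8 | 91.5 | 92.2 | 94.5 | 94.8 | 92.2 | 92.5 (↑0.3) | - |

4.2.1Experimental Setup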

In this setting, we pre-train a unified backbone that can be applied to various downstream tasks. Its input is the whole-scene point cloud, so that the upstream and downstream models share the same input and encoder stage. We choose SparseUNet [42], an optimized implementation of MinkUNet [42] built on SpConv [103], as $f_e^{(s)}$ for its efficiency; its output features $\hat{\mathcal{F}}$ have 96 channels. We jointly pre-train our weights on three widely recognized indoor datasets: ScanNet [78], S3DIS [104] and Structured3D [105]. ScanNet and S3DIS are the most commonly used real-world datasets in 3D perception, while Structured3D is a synthetic RGB-D dataset. Given the limited data diversity available in indoor 3D vision, there is a non-negligible domain gap between datasets, so naive joint training may not boost performance. Point Prompt Training [102] (PPT) addresses this problem by giving each dataset its own batch norm layer. Considering its effectiveness and flexibility, we combine PPT with our universal pre-training paradigm and treat PPT as our baseline. Notably, PPT achieves state-of-the-art performance in downstream tasks with the same backbone we use, i.e., SparseUNet.
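
The idea of dataset-specific batch norm can be illustrated with a small sketch. It keeps separate running statistics per dataset, so that a batch from one dataset never shifts the statistics used for another. The class `PerDatasetBatchNorm`, its scalar features, and the missing learnable affine parameters are simplifying assumptions for illustration; this is not PPT's actual implementation.

```python
import math


class PerDatasetBatchNorm:
    """Toy 1-D batch norm with one set of running statistics per dataset."""

    def __init__(self, dataset_names, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running = {name: {"mean": 0.0, "var": 1.0} for name in dataset_names}

    def forward(self, values, dataset, training=True):
        stats = self.running[dataset]  # KeyError for an unknown dataset
        if training:
            # Normalize with batch statistics and update this dataset only.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            m = self.momentum
            stats["mean"] = (1 - m) * stats["mean"] + m * mean
            stats["var"] = (1 - m) * stats["var"] + m * var
        else:
            mean, var = stats["mean"], stats["var"]
        return [(v - mean) / math.sqrt(var + self.eps) for v in values]


if __name__ == "__main__":
    norm = PerDatasetBatchNorm(["ScanNet", "S3DIS", "Structured3D"])
    norm.forward([1.0, 2.0, 3.0], dataset="ScanNet")
    print(norm.running["ScanNet"], norm.running["S3DIS"])
```

All other layers would stay shared across datasets; only the normalization statistics are routed by dataset name.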

Following the pre-training phase, we discard the rendering decoder and load the encoder backbone’s weights for use in downstream tasks, either with or without additional task-specific heads. Subsequently, the network undergoes fine-tuning and evaluation on each specific downstream task. We mainly evaluate the mean intersection-over-union (mIoU) metric for semantic segmentation and mean average precision (mAP) for instance segmentation as well as object detection tasks.
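
For reference, the mIoU metric mentioned above can be computed as in the following minimal sketch. The function name `mean_iou` and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions; benchmark toolkits may handle absent classes and ignore labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (assumption)
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2]
    target = [0, 1, 1, 1, 2]
    print(f"mIoU = {mean_iou(pred, target, num_classes=3):.4f}")
```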

4.2.2Implementation and Training Details

For the unified-backbone version, we base our indoor experiments on Pointcept [106], a powerful and flexible codebase for point cloud perception research. For a fair comparison, all hyper-parameters are the same as those of PPT trained from scratch. The input has 6 channels: 3 for color and 3 for surface normals. The grid size 𝒈 is 0.023 meters. We also apply common transformations including random dropout (at a mask ratio of 0.8), rotation, scaling, flipping, etc.

Although a lightweight decoder reduces rendering quality, it may boost downstream performance, so we use a tiny dense 3D UNet with 128 channels after densification. The dense feature volume 𝒱 has size 128×128×32 with 128 channels. $\phi_{\mathrm{SDF}}$ is a shallow 3-layer MLP, while $\phi_{\mathrm{RGB}}$ and $\phi_{\mathrm{SEMANTIC}}$ are both 1-layer MLPs. All shallow MLPs have 128 hidden channels. For semantic supervision, we use the text encoder of a "ViT-B/16" CLIP model, whose output semantic features have 512 channels.

For each scene-level input point cloud, we first sample 5 RGB-D frames for supervision, and then sample 128 rays from each frame. Consequently, 128×5=640 pixel values per point cloud are supervised in each iteration. The weight coefficients are $\lambda_C = 1.0$ and $\lambda_D = 0.1$.
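
The sketch below shows how the two weight coefficients could combine a color term and a depth term over a batch of sampled rays. The use of a mean absolute error (L1) for both terms, the function names, and the omission of any further regularization terms are assumptions for illustration; only the weights 1.0 and 0.1 and the 5×128 ray budget come from the text.

```python
FRAMES_PER_SCENE = 5
RAYS_PER_FRAME = 128
RAYS_PER_SCENE = FRAMES_PER_SCENE * RAYS_PER_FRAME  # 640 supervised pixels


def l1(pred, target):
    """Mean absolute error between two equally long sequences of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def rendering_loss(pred_rgb, gt_rgb, pred_depth, gt_depth,
                   lambda_c=1.0, lambda_d=0.1):
    """Weighted sum of a color loss and a depth loss over sampled rays.

    pred_rgb / gt_rgb: lists of (r, g, b) tuples, one per ray.
    pred_depth / gt_depth: lists of floats, one per ray.
    """
    flat_pred = [c for rgb in pred_rgb for c in rgb]
    flat_gt = [c for rgb in gt_rgb for c in rgb]
    color_loss = l1(flat_pred, flat_gt)   # L1 is an assumption
    depth_loss = l1(pred_depth, gt_depth)  # L1 is an assumption
    return lambda_c * color_loss + lambda_d * depth_loss


if __name__ == "__main__":
    loss = rendering_loss([(0.5, 0.5, 0.5)], [(1.0, 0.5, 0.0)], [2.0], [3.0])
    print(f"rays per scene: {RAYS_PER_SCENE}, loss: {loss:.4f}")
```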

We train the proposed pipeline for 200 epochs using an SGD optimizer with a weight decay of 1e-4 and a momentum of 0.9. The learning rate is initialized as 8e-4 with a one-cycle scheduler [107]. For the rendering process, we randomly choose 5 frames for each point cloud, 128 rays for each image, and sample 128 points for each ray. The batch size of point clouds is set to 64. Models are trained on 8 NVIDIA A100 GPUs.
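
For readers unfamiliar with one-cycle scheduling, the following sketch shows the general shape of such a schedule: a cosine warm-up to a peak learning rate followed by a cosine decay to a much smaller value. Treating 8e-4 as the peak rate, and the values of `pct_start`, `div_factor` and `final_div_factor`, are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=8e-4, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a cosine one-cycle schedule (sketch)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        start, end = max_lr, min_lr
    # Cosine interpolation: weight goes from 1 (at start) to 0 (at end).
    weight = (1.0 + math.cos(math.pi * progress)) / 2.0
    return end + (start - end) * weight


if __name__ == "__main__":
    for step in (0, 15, 30, 65, 100):
        print(f"step {step:3d}: lr = {one_cycle_lr(step, 100):.3e}")
```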

4.2.3Comparison Experiments

Semantic Segmentation We conduct indoor semantic segmentation experiments on ScanNet, S3DIS and Structured3D. We take PPT, the SOTA method using SparseUNet [42], as our baseline, and report the performance of the from-scratch and finetuned models in Tab. IX. The S3DIS dataset contains 6 areas, of which Area 5 is usually taken as the validation set. It is also common to evaluate 6-fold performance on it, so we report the detailed 6-fold results in Tab. X. Note that to avoid information leakage, we pre-train a new model on only ScanNet and Structured3D before finetuning on each area of the S3DIS dataset. The semantic segmentation experiments all show significant gains from our proposed paradigm.

TABLE XI:Indoor instance segmentation results. With the same pre-trained weights, we further fine-tune PonderV2 on ScanNet and ScanNet200 instance segmentation driven by PointGroup [108]. We compare mAP@25, mAP@50, and mAP results with previous pre-training methods, and our method shows superior results across benchmarks.

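| Method | #Params. | ScanNet mAP@25 | ScanNet mAP@50 | ScanNet mAP | ScanNet200 mAP@25 | ScanNet200 mAP@50 | ScanNet200 mAP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PointGroup [108] | 39.2M | 72.8 | 56.9 | 36.0 | 32.2 | 24.5 | 15.8 |
| + PC [25] | 39.2M | - | 58.0 | - | - | 24.9 | - |
| + CSC [26] | 39.2M | - | 59.4 | - | - | 25.2 | - |
| + LGround [97] | 39.2M | - | - | - | - | 26.1 | - |
| + MSC [45] | 39.2M | 74.7 | 59.6 | 39.3 | 34.3 | 26.8 | 17.3 |
| + PPT [102] | 41.0M | 76.9 | 62.0 | 40.7 | 36.8 | 29.4 | 19.4 |
| PonderV2 (Ours) | 41.0M | 77.0 (↑4.2) | 62.6 (↑5.7) | 40.9 (↑4.9) | 37.6 (↑5.4) | 30.5 (↑6.0) | 20.1 (↑4.3) |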

Instance Segmentation 3D instance segmentation is another fundamental perception task. We benchmark our finetuning results on ScanNet, ScanNet200 and S3DIS (Area 5), as shown in Tab. XI. The results show that our pre-training approach also improves instance segmentation.

TABLE XII:Indoor data efficient results. We follow the ScanNet Data Efficient benchmark [26] and compare the validation results of PonderV2 with previous pre-training methods. All methods are trained with SparseUNet. Pct. in limited reconstructions setting denotes the percentage of scene reconstruction that could be used for training. #Pts. in limited annotations setting denotes the number of points per scene that are annotated for training.

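Limited Reconstructions

| Pct. | Scratch | CSC [26] | MSC [45] | PPT | PonderV2 |
| --- | --- | --- | --- | --- | --- |
| 1% | 26.0 | 28.9 | 29.2 | 31.3 | 34.6 (↑8.6) |
| 5% | 47.8 | 49.8 | 59.4 | 52.2 | 56.5 (↑8.7) |
| 10% | 56.7 | 59.4 | 61.0 | 62.8 | 66.0 (↑9.3) |
| 20% | 62.9 | 64.6 | 64.9 | 66.4 | 68.2 (↑5.3) |
| 100% | 72.2 | 73.8 | 75.3 | 76.4 | 77.0 (↑4.8) |

Limited Annotations

| #Pts. | Scratch | CSC [26] | MSC [45] | PPT | PonderV2 |
| --- | --- | --- | --- | --- | --- |
| 20 | 41.9 | 55.5 | 60.1 | 60.6 | 67.0 (↑25.1) |
| 50 | 53.9 | 60.5 | 66.8 | 67.5 | 72.2 (↑18.3) |
| 100 | 62.2 | 65.9 | 69.7 | 70.8 | 74.3 (↑12.1) |
| 200 | 65.5 | 68.2 | 70.7 | 72.2 | 74.8 (↑9.3) |
| Full | 72.2 | 73.8 | 75.3 | 76.4 | 77.0 (↑4.8) |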

Data Efficiency on ScanNet The ScanNet benchmark also contains data efficiency settings of limited annotations (LA) and limited reconstructions (LR). In the LA setting, the models are allowed to see only a small ratio of labels while in the LR setting, the models can only see a small number of reconstructions (scenes). Again, to prevent information leaking, we pre-train our model on only S3DIS and Structured3D before finetuning. Results in Tab. XII indicate that our approach is more data efficient compared to the baseline.

TABLE XIII:Object reconstruction results on CO3D dataset.

| Method | F1-Score |
| --- | --- |
| MCC [109] | 63.5 |
| + SparseUNet [42] | 63.9 |
| PonderV2 | 65.6 (↑2.1) |

Object Reconstruction Besides high-level perception tasks, we also conduct object reconstruction experiments to see whether PonderV2 works for low-level tasks. We take MCC [109] as our baseline and evaluate on a subset of the CO3D [110] dataset for fast validation. CO3D is a 3D object dataset that can be used for object-level reconstruction. Specifically, we choose 1573 train samples and 224 test samples from 10 categories: parkingmeter, baseball glove, toytrain, donut, skateboard, hotdog, frisbee, tv, sandwich and toybus. We train models for 100 epochs with a learning rate of 1e-4 and an effective batch size of 64. Predictions are evaluated by thresholding the predicted occupancy at 0.1, as in the original MCC paper, and the evaluation metric is the F1-Score on the occupancies. For a fair comparison, we first replace MCC's encoder with our SparseUNet [42]. We then average-pool the output volume into a 2D feature map and use MCC's patch embedding module to match the shape of the decoder's input. Moreover, we adjust the shrink threshold from the default 10.0 to 3.0, which increases the number of valid points after grid sampling. After this modification, the from-scratch baseline is slightly better than the original MCC, as shown in Tab. XIII. When finetuned from our pre-trained weights, the model gains nearly 2 more points of F1-Score. Note that the loaded weights are the same as in the previous high-level tasks, i.e., they are trained on the scene-level ScanNet, Structured3D, and S3DIS datasets. We skip loading the weights of the first and final layers before finetuning, since MCC does not take normals as input and requires a different number of output channels. The results indicate not only that our approach works well on low-level tasks, but also that it has the potential to transfer scene-level knowledge to the object level.
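
A simplified view of an F1-Score on thresholded occupancies is sketched below. It treats each query point independently, marks it as occupied when the predicted value exceeds 0.1, and compares against a binary ground truth. The function name and this per-point formulation are assumptions; the exact MCC evaluation protocol may differ in its details.

```python
def occupancy_f1(pred_occ, gt_occ, threshold=0.1):
    """F1-Score of thresholded predicted occupancies against binary labels."""
    tp = fp = fn = 0
    for p, g in zip(pred_occ, gt_occ):
        predicted = p > threshold
        if predicted and g:
            tp += 1
        elif predicted and not g:
            fp += 1
        elif g:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = [0.9, 0.2, 0.05, 0.3]
    gt = [True, True, True, False]
    print(f"F1 = {occupancy_f1(pred, gt):.4f}")
```

A low threshold such as 0.1 favors recall: points with weak occupancy predictions still count as occupied.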

5Experiments for Outdoor Scenarios

As shown in Fig. 8, PonderV2 achieves significant improvements across various 3D outdoor tasks with different input modalities, which further demonstrates the universal effectiveness of the proposed methodology. In this section, we detail PonderV2's outdoor experiments.

5.1Experimental Setup

Figure 8:Effect of our pre-training for 3D outdoor detection and segmentation, where C, L, and M denote camera, LiDAR, and fusion modality, respectively.

We conduct the experiments on the nuScenes [111] dataset, a challenging dataset for autonomous driving. It consists of 700 scenes for training, 150 scenes for validation, and 150 scenes for testing. Each scene is captured by six different cameras, providing surround-view images, and is accompanied by a LiDAR point cloud. The dataset comes with diverse annotations, supporting tasks like 3D object detection and 3D semantic segmentation. For detection evaluation, we employ the nuScenes detection score (NDS) and mean average precision (mAP), and for segmentation assessment, we use mean intersection-over-union (mIoU).
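
As a reminder of how NDS relates to mAP, the nuScenes detection score combines mAP with five mean true-positive error metrics (translation, scale, orientation, velocity and attribute), each clipped to at most 1 and converted to a score. The sketch below follows that definition; the function name and the numeric inputs are hypothetical.

```python
TP_METRICS = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (5 * mAP + sum(1 - min(1, err) for the five TP errors)) / 10."""
    missing = [name for name in TP_METRICS if name not in tp_errors]
    if missing:
        raise ValueError(f"missing TP metrics: {missing}")
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_METRICS]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


if __name__ == "__main__":
    # Hypothetical numbers, not results from the paper.
    errors = {"mATE": 0.3, "mASE": 0.25, "mAOE": 1.5, "mAVE": 0.4, "mAAE": 0.2}
    print(f"NDS = {nuscenes_detection_score(0.4, errors):.3f}")
```

Because mAP carries half of the weight, a method can trade a small mAP loss for better box quality and still gain NDS, which is worth keeping in mind when reading the comparisons below.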

5.2Implementation and Training Details

We base our code on the MMDetection3D [112] toolkit and train all models on 4 NVIDIA A100 GPUs. The input image size is 1600×900 pixels, and the voxel size 𝒈 for point cloud voxelization is [0.075, 0.075, 0.2]. During pre-training, we apply several data augmentation strategies, such as random scaling and rotation. Additionally, we partially mask the inputs and extract features only from the visible regions. The masking size and ratio are set to 32 and 0.3 for images, and to 8 and 0.8 for points. For the segmentation task, we use SparseUNet as our sparse backbone $f_e^{(s)}$; for the detection task, we use VoxelNet [113], which is similar to the encoder part of SparseUNet, as our backbone. In the multi-image setting, we use ConvNeXt-small [114] as our feature extractor $f_e^{(2d)}$. A uniform voxel representation 𝒱 of shape 180×180×5 is constructed. Here $f_d^{(d)}$ is a convolution with kernel size 3 that serves as a feature projection layer, reducing the feature dimension of 𝒱 to 32. For the rendering decoders, we use a 6-layer MLP for $\phi_{\mathrm{SDF}}$ and a 4-layer MLP for $\phi_{\mathrm{RGB}}$. In the rendering phase, 512 rays per image view and 96 points per ray are randomly sampled. We set both loss weights, $\lambda_{\mathrm{RGB}}$ and $\lambda_{\mathrm{depth}}$, to 10. The model is trained for 12 epochs using the AdamW optimizer with initial learning rates of 2e-5 and 1e-4 for the point and image modalities, respectively. In the ablation studies, unless explicitly stated, fine-tuning is conducted for 12 epochs on 50% of the image data and for 20 epochs on 20% of the point data, without the CBGS [115] strategy.
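
To illustrate what the voxel size means in practice, the sketch below maps a LiDAR point to its voxel index. The voxel size comes from the text; the point cloud range `PC_RANGE` is an assumption (a range commonly used with this voxel size on nuScenes) and is not stated in this section. Under that assumption the full-resolution grid is 1440×1440×40, which an 8× downsampling would reduce to the 180×180×5 shape of 𝒱.

```python
import math

VOXEL_SIZE = (0.075, 0.075, 0.2)  # meters, from the text
# Assumed range (x_min, y_min, z_min, x_max, y_max, z_max); not from the paper.
PC_RANGE = (-54.0, -54.0, -5.0, 54.0, 54.0, 3.0)


def grid_shape():
    """Number of voxels along x, y and z for the assumed range."""
    return tuple(
        round((PC_RANGE[i + 3] - PC_RANGE[i]) / VOXEL_SIZE[i]) for i in range(3)
    )


def voxel_index(point):
    """Return the (ix, iy, iz) voxel of an (x, y, z) point, or None if outside."""
    index = []
    for i in range(3):
        if not PC_RANGE[i] <= point[i] < PC_RANGE[i + 3]:
            return None
        index.append(math.floor((point[i] - PC_RANGE[i]) / VOXEL_SIZE[i]))
    return tuple(index)


if __name__ == "__main__":
    print("grid shape:", grid_shape())
    print("voxel of (0.01, 0.01, 0.1):", voxel_index((0.01, 0.01, 0.1)))
```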

5.3Outdoor Comparison Experiments

3D Object Detection In Table XIV, we compare PonderV2 with previous detection approaches on the nuScenes validation set. We adopt UVTR [57] as our baseline for the point modality (UVTR-L), camera modality (UVTR-C), camera-sweep modality (UVTR-CS) and fusion modality (UVTR-M). Benefiting from the effective pre-training, PonderV2 consistently improves the baselines UVTR-L, UVTR-C, and UVTR-M by 2.9, 2.4, and 3.0 NDS, respectively. When taking multi-frame cameras as inputs, PonderV2-CS brings 1.4 NDS and 3.6 mAP gains over UVTR-CS. Our pre-training technique also achieves 1.7 NDS and 2.1 mAP improvements over the monocular-based baseline FCOS3D [61]. Without any test-time augmentation or model ensemble, our single-modal and multi-modal methods, PonderV2-L, PonderV2-C, and PonderV2-M, achieve impressive NDS of 70.6, 47.4, and 73.2, respectively, surpassing existing state-of-the-art methods.

3D Semantic Segmentation In Table XV, we compare PonderV2 with previous point cloud semantic segmentation approaches on the nuScenes Lidar-Seg dataset. Benefiting from the effective pre-training, PonderV2 improves the baseline by 6.1 mIoU, achieving state-of-the-art performance on the validation set. Meanwhile, PonderV2 achieves an impressive mIoU of 81.1 on the test set, which is comparable with existing state-of-the-art methods.

5.4Comparisons with Pre-training Methods

TABLE XIV: Comparisons of different methods with a single model on the nuScenes val set. We compare with classic methods on different modalities without test-time augmentation. †: denotes our reproduced results based on MMDetection3D [112]. L, C, CS, and M indicate the LiDAR, Camera, Camera Sweep, and Multi-modality input, respectively.

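| Methods | Present at | Modality | NDS↑ | mAP↑ |
| --- | --- | --- | --- | --- |
| PVT-SSD [116] | CVPR’23 | L | 65.0 | 53.6 |
| CenterPoint [117] | CVPR’21 | L | 66.8 | 59.6 |
| FSDv1 [118] | NeurIPS’22 | L | 68.7 | 62.5 |
| VoxelNeXt [119] | CVPR’23 | L | 68.7 | 63.5 |
| LargeKernel3D [120] | CVPR’23 | L | 69.1 | 63.3 |
| TransFusion-L [121] | CVPR’22 | L | 70.1 | 65.1 |
| CMT-L [58] | ICCV’23 | L | 68.6 | 62.1 |
| UVTR-L [57] | NeurIPS’22 | L | 67.7 | 60.9 |
| + PonderV2 (Ours) | - | L | 70.6 | 65.0 |
| BEVFormer-S [122] | ECCV’22 | C | 44.8 | 37.5 |
| SpatialDETR [123] | ECCV’22 | C | 42.5 | 35.1 |
| PETR [54] | ECCV’22 | C | 44.2 | 37.0 |
| Ego3RT [124] | ECCV’22 | C | 45.0 | 37.5 |
| 3DPPE [125] | ICCV’23 | C | 45.8 | 39.1 |
| CMT-C [58] | ICCV’23 | C | 46.0 | 40.6 |
| FCOS3D† [61] | ICCVW’21 | C | 38.4 | 31.1 |
| + PonderV2 (Ours) | - | C | 40.1 | 33.2 |
| UVTR-C [57] | NeurIPS’22 | C | 45.0 | 37.2 |
| + PonderV2 (Ours) | - | C | 47.4 | 41.5 |
| UVTR-CS [57] | NeurIPS’22 | C | 48.8 | 39.2 |
| + PonderV2 (Ours) | - | C | 50.2 | 42.8 |
| FUTR3D [126] | arXiv’22 | C+L | 68.3 | 64.5 |
| PointPainting [127] | CVPR’20 | C+L | 69.6 | 65.8 |
| MVP [128] | NeurIPS’21 | C+L | 70.8 | 67.1 |
| TransFusion [121] | CVPR’22 | C+L | 71.3 | 67.5 |
| AutoAlignV2 [129] | ECCV’22 | C+L | 71.2 | 67.1 |
| BEVFusion [56] | NeurIPS’22 | C+L | 71.0 | 67.9 |
| BEVFusion [130] | ICRA’23 | C+L | 71.4 | 68.5 |
| DeepInteraction [131] | NeurIPS’22 | C+L | 72.6 | 69.9 |
| CMT-M [58] | ICCV’23 | C+L | 72.9 | 70.3 |
| UVTR-M [57] | NeurIPS’22 | C+L | 70.2 | 65.4 |
| + PonderV2 (Ours) | - | C+L | 73.2 | 69.9 |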

TABLE XV:Comparisons of different methods with a single model on the nuScenes segmentation dataset.

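| Methods | Backbone | Val mIoU | Test mIoU |
| --- | --- | --- | --- |
| SPVNAS [132] | Sparse CNN | - | 77.4 |
| Cylinder3D [133] | Sparse CNN | 76.1 | 77.2 |
| SphereFormer [134] | Transformer | 78.4 | 81.9 |
| SparseUNet [42] | Sparse CNN | 73.3 | - |
| + PonderV2 (Ours) | Sparse CNN | 79.4 | 81.1 |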

TABLE XVI:Comparison with different camera-based pre-training.

| Methods | NDS | mAP |
| --- | --- | --- |
| UVTR-C (Baseline) | 25.2 | 23.0 |
| + Depth Estimator | 26.9 (↑1.7) | 25.1 (↑2.1) |
| + Detector | 29.4 (↑4.2) | 27.7 (↑4.7) |
| + 3D Detector | 31.7 (↑6.5) | 29.0 (↑6.0) |
| + PonderV2 | 32.9 (↑7.7) | 32.6 (↑9.6) |

TABLE XVII:Comparison with different point-based pre-training.

| Methods | NDS | mAP |
| --- | --- | --- |
| UVTR-L (Baseline) | 46.7 | 39.0 |
| + Occupancy-based | 48.2 (↑1.5) | 41.2 (↑2.2) |
| + MAE-based | 48.8 (↑2.1) | 42.6 (↑3.6) |
| + Contrast-based | 49.2 (↑2.5) | 48.8 (↑9.8) |
| + PonderV2 | 55.8 (↑9.1) | 48.1 (↑9.1) |

Camera-based Pre-training In Table XVI, we conduct comparisons between PonderV2 and several camera-based pre-training approaches: 1) Depth Estimator: we follow [60] to inject 3D priors into 2D learned features via depth estimation; 2) Detector: the image encoder is initialized using pre-trained weights from MaskRCNN [135] on the nuImages dataset [111]; 3) 3D Detector: we use the weights from the widely used monocular 3D detector [61] for model initialization, which relies on 3D labels for supervision. PonderV2 demonstrates superior knowledge transfer capabilities compared to previous unsupervised or supervised pre-training methods, showing the effectiveness of our rendering-based pretext task.

Point-based Pre-training For the point modality, we also present comparisons with recently proposed self-supervised methods in Table XVII: 1) Occupancy-based: we implement ALSO [47] in our framework to train the point encoder; 2) MAE-based: we adopt the leading method [46], which reconstructs masked point clouds using the chamfer distance; 3) Contrast-based: we use [136], which employs pixel-to-point contrastive learning to integrate 2D knowledge into 3D points. Among these methods, PonderV2 achieves the best NDS. While PonderV2 has a slightly lower mAP than the contrast-based method, it avoids the complex positive-negative sample assignment required in contrastive learning.

5.5Effectiveness on Various Backbones

Different View Transformations In Table XVIII, we investigate different view transformation strategies for converting 2D features into 3D space, including BEVDet [137], BEVDepth [138], and BEVFormer [122]. Improvements ranging from 5.2 to 6.3 NDS are observed across the different transformation techniques, demonstrating the strong generalization ability of the proposed method.

Different Modalities Unlike most previous pre-training methods, our framework can be seamlessly applied to various modalities. To verify the effectiveness of our approach, we set UVTR as our baseline, which contains detectors with point, camera, and fusion modalities. Table XIX shows the impact of PonderV2 on different modalities. PonderV2 consistently improves the UVTR-L, UVTR-C, and UVTR-M by 9.1, 7.7, and 6.9 NDS, respectively.

Scaling up Backbones To test PonderV2 across different backbone scales, we adopt an off-the-shelf model, ConvNeXt, and its variants with different numbers of learnable parameters. As shown in Table XX, one can observe that with our PonderV2 pre-training, all baselines are improved by large margins of +6.0∼7.7 NDS and +8.2∼10.3 mAP. The steady gains suggest that PonderV2 has the potential to boost various state-of-the-art networks.

TABLE XVIII:Pre-training effectiveness on different view transformation strategies.

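| Methods | View Transform | NDS | mAP |
| --- | --- | --- | --- |
| BEVDet | Pooling | 27.1 | 24.6 |
| + PonderV2 | Pooling | 32.7 (↑5.6) | 32.8 (↑8.2) |
| BEVDepth | Pooling & Depth | 28.9 | 28.1 |
| + PonderV2 | Pooling & Depth | 34.1 (↑5.2) | 33.9 (↑5.8) |
| BEVFormer | Transformer | 26.8 | 24.5 |
| + PonderV2 | Transformer | 33.1 (↑6.3) | 31.9 (↑7.4) |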

TABLE XIX:Pre-training effectiveness on different input modalities.

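| Methods | Modality | NDS | mAP |
| --- | --- | --- | --- |
| UVTR-L | LiDAR | 46.7 | 39.0 |
| + PonderV2 | LiDAR | 55.8 (↑9.1) | 48.1 (↑9.1) |
| UVTR-C | Camera | 25.2 | 23.0 |
| + PonderV2 | Camera | 32.9 (↑7.7) | 32.6 (↑9.6) |
| UVTR-M | LiDAR-Camera | 49.9 | 52.7 |
| + PonderV2 | LiDAR-Camera | 56.8 (↑6.9) | 57.0 (↑4.3) |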

TABLE XX:Pre-training effectiveness on different backbone scales.

| Methods (NDS / mAP) | ConvNeXt-S | ConvNeXt-B | ConvNeXt-L |
| --- | --- | --- | --- |
| UVTR-C | 25.2 / 23.0 | 26.9 / 24.4 | 29.1 / 27.7 |
| + PonderV2 | 32.9 (↑7.7) / 32.6 (↑9.6) | 34.1 (↑7.2) / 34.7 (↑10.3) | 35.1 (↑6.0) / 35.9 (↑8.2) |

5.6Ablation Studies

TABLE XXI: Ablation study of the mask ratio.

| ratio | NDS | mAP |
| --- | --- | --- |
| 0.1 | 31.9 | 32.4 |
| 0.3 | 32.9 | 32.6 |
| 0.5 | 32.3 | 32.6 |
| 0.7 | 32.1 | 32.4 |

TABLE XXII: Ablation study of the decoder depth.

| layers | NDS | mAP |
| --- | --- | --- |
| (2, 2) | 31.3 | 31.3 |
| (4, 3) | 31.9 | 31.6 |
| (5, 4) | 32.1 | 32.7 |
| (6, 4) | 32.9 | 32.6 |

TABLE XXIII: Ablation study of the decoder width.

| dim | NDS | mAP |
| --- | --- | --- |
| 32 | 32.9 | 32.6 |
| 64 | 32.5 | 32.8 |
| 128 | 32.9 | 32.6 |
| 256 | 32.4 | 32.9 |

TABLE XXIV: Ablation study of the rendering technique.

| Methods | NDS | mAP |
| --- | --- | --- |
| Baseline | 25.2 | 23.0 |
| UniSurf [66] | 32.5 | 32.1 |
| VolSDF [139] | 32.8 | 32.4 |
| NeuS [65] | 32.9 (↑7.7) | 32.6 (↑9.6) |

Figure 9:Illustration of the rendering results. The ground truth RGB and projected point clouds, rendered RGB, and rendered depth are shown on the left, middle, and right, respectively.

Figure 10:Illustration of the detection results. The predictions are shown on multi-view images and bird’s eye view with LiDAR points.

Masking Ratio Table XXI shows the influence of the masking ratio for the camera modality. We find that a masking ratio of 0.3, which is lower than the ratios used in previous MAE-based methods, is optimal for our method. This discrepancy could be attributed to the difficulty of rendering the original image from the volume representation, which is harder than image-to-image reconstruction. For the point modality, we adopt a mask ratio of 0.8, as suggested in [46], considering the spatial redundancy inherent in point clouds.

Rendering Design Our examinations in Tables XXII, XXIII, and XXIV illustrate the flexible design of our differentiable rendering. In Table XXII, we vary the depth ($D_{\mathrm{SDF}}$, $D_{\mathrm{RGB}}$) of the SDF and RGB decoders, revealing the importance of sufficient decoder depth for downstream detection tasks. A likely reason is that deeper decoders can better integrate geometry and appearance cues during pre-training. Conversely, as reflected in Table XXIII, the width of the decoder has relatively little impact on performance, so the default dimension is set to 32 for efficiency. Additionally, we explore the effect of various rendering techniques in Table XXIV, which differ in how they sample and accumulate points along each ray. Using NeuS [65] for rendering yields 0.4 and 0.1 NDS improvements over UniSurf [66] and VolSDF [139], respectively, which shows that the learned representation can benefit from well-designed rendering methods and from advances in neural rendering.
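
To give a feel for what "ray point sampling and accumulation" means in the NeuS case, the sketch below computes NeuS-style rendering weights from SDF values sampled along a single ray, then uses them to render a depth. The opacity of each interval is derived from the drop in a sigmoid of the SDF between consecutive samples. The fixed sharpness `s` (learned in NeuS), the small epsilon, the midpoint depth and the function names are assumptions for illustration; this is not the paper's implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def neus_weights(sdf_values, s=64.0, eps=1e-12):
    """Per-interval rendering weights from SDF samples ordered near to far."""
    weights = []
    transmittance = 1.0
    for cur, nxt in zip(sdf_values, sdf_values[1:]):
        phi_cur, phi_nxt = sigmoid(s * cur), sigmoid(s * nxt)
        # Discrete opacity: relative drop of the sigmoid, clipped at zero.
        alpha = max((phi_cur - phi_nxt) / (phi_cur + eps), 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_depth(sdf_values, depths, s=64.0):
    """Weighted sum of interval midpoints along the ray."""
    weights = neus_weights(sdf_values, s)
    mids = [(a + b) / 2.0 for a, b in zip(depths, depths[1:])]
    return sum(w * m for w, m in zip(weights, mids))


if __name__ == "__main__":
    depths = [0.25 * i for i in range(9)]  # samples at t = 0.0 ... 2.0
    sdf = [1.0 - t for t in depths]        # toy ray hitting a surface at t = 1
    print("weights:", [round(w, 4) for w in neus_weights(sdf)])
    print("rendered depth:", round(render_depth(sdf, depths), 4))
```

On this toy ray, almost all of the weight falls on the two intervals adjacent to the zero crossing, so the rendered depth lands close to the true surface distance.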

5.7Qualitative Results

Figure 9 provides some qualitative results of the rendered image and depth map. Our approach has the capability to estimate the depth of small objects, such as cars at a distance. This quality in the pre-training process indicates that our method could encode intricate and continuous geometric representations, which would benefit the downstream tasks. In Figure 10, we present 3D detection results in camera space and BEV (Bird’s Eye View) space with LiDAR point clouds. Our model can predict accurate bounding boxes for nearby objects and also shows the capability of detecting objects from far distances.

6Conclusions, Limitations and Future Work

In this paper, we propose a novel universal pre-training paradigm for 3D representation learning, which uses differentiable neural rendering as the pretext task. Our method, named PonderV2, significantly boosts over 9 downstream indoor/outdoor tasks, including both high-level perception and low-level reconstruction tasks. We have also achieved SOTA performance on over 11 benchmarks. Extensive experiments have shown the flexibility and effectiveness of the proposed method.

Despite the encouraging results, this paper is still an initial step toward 3D foundation models. We only show the effectiveness of our pre-training paradigm on a lightweight and efficient backbone, i.e., SparseUNet. It is worth scaling up both the dataset and the backbone size to probe the limits of PonderV2 and judge whether, or to what extent, it can lead to a 3D foundation model. Besides, it would be interesting to test PonderV2 thoroughly on more downstream tasks, such as reconstruction and robotic control, which may further expand the application scope of 3D representation pre-training. Moreover, neural rendering is a bridge between the 2D and 3D worlds. Thus, combining a 2D foundation model with 3D pre-training through techniques such as semantic rendering is valuable. We hope our work can help the building of 3D foundation models in the future.

