Deep Learning on Object-centric 3D Neural Fields
Accepted at TPAMI
Abstract
In recent years, Neural Fields (NFs) have emerged as an effective tool for encoding diverse continuous signals such as images, videos, audio, and 3D shapes. When applied to 3D data, NFs offer a solution to the fragmentation and limitations associated with prevalent discrete representations. However, given that NFs are essentially neural networks, it remains unclear whether and how they can be seamlessly integrated into deep learning pipelines for solving downstream tasks. This paper addresses this research problem and introduces \({\tt nf2vec}\), a framework capable of generating a compact latent representation for an input NF in a single inference pass. We demonstrate that \({\tt nf2vec}\) effectively embeds 3D objects represented by the input NFs and showcase how the resulting embeddings can be employed in deep learning pipelines to successfully address various tasks, all while processing exclusively NFs. We test this framework on several NFs used to represent 3D surfaces, such as unsigned/signed distance and occupancy fields. Moreover, we demonstrate the effectiveness of our approach with more complex NFs that encode both the geometry and the appearance of 3D objects, such as neural radiance fields (NeRFs).
Method
Our framework, dubbed \({\tt nf2vec}\), is composed of an encoder and a decoder. The encoder takes as input the weights of an NF and produces a compact embedding that encodes all the relevant information of the input NF. A first challenge in designing an encoder for NFs lies in defining how it should ingest the weights, since naively processing all of them would require a huge amount of memory. The \({\tt nf2vec}\) encoder features a simple architecture, consisting of a series of linear layers with batch normalization and ReLU non-linearities followed by a final max pooling. At each stage, the input weight matrix is transformed by one linear layer that applies the same weights to each row of the matrix. The final max pooling compresses all the rows into a single one, yielding the desired embedding.
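As a concrete illustration, the following PyTorch sketch shows how such a row-wise encoder could look; the class name `NFEncoder`, the layer sizes, and the way NF weights are stacked into a matrix of rows are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class NFEncoder(nn.Module):
    """Row-wise encoder sketch: each linear layer is applied independently
    to every row of the stacked NF weight matrix, then max pooling over
    rows yields a single fixed-size embedding (sizes are illustrative)."""

    def __init__(self, in_dim: int, hidden_dims=(512, 512, 1024), embed_dim: int = 1024):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.ReLU(inplace=True)]
            prev = h
        layers.append(nn.Linear(prev, embed_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, nf_weights: torch.Tensor) -> torch.Tensor:
        # nf_weights: (B, R, in_dim) -- R rows obtained by stacking the NF weights
        B, R, D = nf_weights.shape
        x = self.mlp(nf_weights.reshape(B * R, D)).reshape(B, R, -1)
        return x.max(dim=1).values  # (B, embed_dim): max pool over the rows
```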
To guide the \({\tt nf2vec}\) encoder towards producing meaningful embeddings, we first note that we are not interested in encoding the values of the input weights themselves, but rather in capturing information about the 3D object represented by the input NF. For this reason, we supervise the decoder to replicate the function approximated by the input NF rather than directly reproducing its weights, as would be the case in a standard auto-encoder formulation. In particular, during training, we adopt an implicit decoder which takes as input the embedding produced by the encoder and decodes the input NF from it. After the overall framework has been trained end to end, the frozen encoder can be used to compute embeddings of unseen NFs with a single forward pass, whereas the implicit decoder can be used, if needed, to reconstruct the discrete representation given an embedding.
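A minimal sketch of this training scheme is shown below, assuming an MLP implicit decoder conditioned on the embedding and on a 3D query point; the names `ImplicitDecoder` and `training_step`, as well as the layer sizes, are illustrative.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Sketch of the implicit decoder: it conditions on the nf2vec embedding
    and on 3D query points, and regresses the field value (e.g. occupancy,
    signed/unsigned distance) at each point."""

    def __init__(self, embed_dim: int = 1024, hidden: int = 512, out_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, embedding: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # embedding: (B, embed_dim), coords: (B, N, 3) -> field values: (B, N, out_dim)
        cond = embedding.unsqueeze(1).expand(-1, coords.shape[1], -1)
        return self.net(torch.cat([cond, coords], dim=-1))

def training_step(encoder, decoder, nf_weights, coords, target_field,
                  loss_fn=nn.functional.mse_loss):
    # The decoder is supervised to replicate the field approximated by the
    # input NF at sampled query points; target_field: (B, N, 1).
    embedding = encoder(nf_weights)    # one embedding per input NF
    pred = decoder(embedding, coords)  # field values at the query points
    return loss_fn(pred, target_field)
```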
Reconstruction
We compare 3D shapes and NeRFs reconstructed from NFs unseen during training with those reconstructed by the \({\tt nf2vec}\) decoder starting from the latent codes yielded by the encoder. Although our embedding is dramatically more compact than the original NF, the discrete data reconstructed from it closely resembles that obtained from the original input NF.
Interpolation
We linearly interpolate between two object embeddings produced by \({\tt nf2vec}\). Results highlight that the learned latent spaces enable smooth interpolations between shapes represented as NFs.
Additionally, given two input NeRFs, we render images from networks obtained by interpolating their weights and compare these results with those obtained by interpolating their \({\tt nf2vec}\) embeddings. Notably, renderings obtained by averaging the weights of the two NeRFs appear blurred and lack 3D structure, whereas renderings decoded from interpolated \({\tt nf2vec}\) embeddings preserve details and maintain 3D consistency.
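The latent-space interpolation itself amounts to decoding convex combinations of two embeddings; a minimal sketch, assuming `encoder`/`decoder` interfaces matching the sketches above, is:

```python
import torch

def interpolate_embeddings(encoder, decoder, nf_a, nf_b, coords, steps: int = 5):
    """Decode shapes along the segment between two nf2vec embeddings
    (illustrative sketch; `coords` are the query points used for decoding)."""
    with torch.no_grad():
        z_a, z_b = encoder(nf_a), encoder(nf_b)
        fields = []
        for t in torch.linspace(0.0, 1.0, steps):
            z_t = (1.0 - t) * z_a + t * z_b  # linear interpolation in latent space
            fields.append(decoder(z_t, coords))
    return fields
```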
Retrieval
We perform shape retrieval by computing the Euclidean distance between the \({\tt nf2vec}\) embedding of a query and those of unseen point clouds from the test set. The retrieved shapes not only belong to the same class as the query but also exhibit similar coarse structures.
Performing the same experiment on NeRFs retrieves neighbors that are similar to the query in both geometry and color.
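Retrieval in embedding space reduces to a nearest-neighbor search; a minimal sketch, assuming a precomputed gallery of \({\tt nf2vec}\) embeddings, is:

```python
import torch

def retrieve_nearest(query_embedding, gallery_embeddings, k: int = 5):
    """Rank gallery NFs by the Euclidean distance of their nf2vec embeddings
    to the query embedding (illustrative k-NN retrieval sketch)."""
    # query_embedding: (embed_dim,), gallery_embeddings: (M, embed_dim)
    dists = torch.cdist(query_embedding.unsqueeze(0), gallery_embeddings).squeeze(0)
    return torch.topk(dists, k, largest=False).indices  # indices of the k nearest NFs
```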
Part segmentation
Part segmentation aims to predict a semantic (i.e., part) label for each point of a given cloud. We tackle this problem by training a decoder similar to the one used to train our framework. This decoder is fed with the \({\tt nf2vec}\) embedding of the NF representing the input cloud, concatenated with the coordinates of a 3D query point, and is trained to predict the label of that point. Notice how the \({\tt nf2vec}\) embeddings, although task-agnostic, enable a local discriminative task such as part segmentation.
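A minimal sketch of such a segmentation decoder is given below; the class name, layer sizes, and number of part labels are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PartSegmentationDecoder(nn.Module):
    """Sketch of a segmentation head: the nf2vec embedding is concatenated
    with each 3D query point and mapped to per-part logits (sizes assumed)."""

    def __init__(self, embed_dim: int = 1024, hidden: int = 512, num_parts: int = 50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_parts),
        )

    def forward(self, embedding: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # embedding: (B, embed_dim), points: (B, N, 3) -> logits: (B, N, num_parts)
        cond = embedding.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([cond, points], dim=-1))
```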
Generation
We employ a Latent-GAN to generate, from random noise, embeddings resembling those produced by \({\tt nf2vec}\). Generated embeddings can be decoded into discrete representations using the implicit decoder obtained during \({\tt nf2vec}\) training. As our framework is agnostic to the original discrete representation of the shapes used to learn the NFs, we can train Latent-GANs on embeddings representing point clouds or meshes with the exact same protocol and architecture. Furthermore, by generating embeddings representing NFs, our method allows point cloud sampling at any arbitrary resolution, whereas SP-GAN requires a separate training for each desired resolution.
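A minimal sketch of a Latent-GAN generator operating on \({\tt nf2vec}\) embeddings is shown below (the discriminator is an analogous MLP on embeddings); the noise and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentGenerator(nn.Module):
    """Minimal generator sketch for a Latent-GAN on nf2vec embeddings:
    it maps Gaussian noise to an embedding-sized vector."""

    def __init__(self, noise_dim: int = 128, embed_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

# Sampling sketch: generated embeddings are decoded by the frozen implicit
# decoder from nf2vec training at any desired resolution, e.g.:
# noise = torch.randn(16, 128)
# fake_embeddings = generator(noise)
# shapes = decoder(fake_embeddings, query_coords)
```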
Learning a mapping between \({\tt nf2vec}\) embedding spaces
We develop a transfer function specifically designed to operate on \({\tt nf2vec}\) embeddings as both input and output. This transfer function can be realized by a simple MLP that maps the input embedding into the output one and is trained with a standard MSE loss. We explore two tasks: first, we address point cloud completion by learning a mapping from \({\tt nf2vec}\) embeddings of NFs representing incomplete clouds to embeddings associated with complete clouds. Then, we tackle surface reconstruction by training the transfer function to map \({\tt nf2vec}\) embeddings representing point clouds into embeddings that can be decoded into meshes. Indeed, by processing exclusively NF embeddings, we obtain output shapes that are highly compatible with the input ones while preserving their distinctive details, like the pointy wings of an airplane or the flap of a car.
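A minimal sketch of such a transfer function, with assumed embedding and hidden sizes, is:

```python
import torch
import torch.nn as nn

class TransferFunction(nn.Module):
    """Sketch of the embedding-to-embedding transfer function: a plain MLP
    trained with an MSE loss between predicted and target nf2vec embeddings."""

    def __init__(self, embed_dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, src_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(src_embedding)

# Training sketch, e.g. for point cloud completion:
# loss = nn.functional.mse_loss(transfer(incomplete_embeddings), complete_embeddings)
```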
We also show that the same methodology allows learning a transfer function that maps \({\tt nf2vec}\) embeddings of point clouds to \({\tt nf2vec}\) embeddings of NeRFs. The generated NeRFs preserve the geometry of the input shapes and exhibit plausible, diverse colors associated with different object parts.
Cite us
```bibtex
@article{ramirez2023nf2vec,
  title   = {Deep Learning on Object-centric 3D Neural Fields},
  author  = {Zama Ramirez, Pierluigi and De Luigi, Luca and Sirocchi, Daniele and Cardace, Adriano and Spezialetti, Riccardo and Ballerini, Francesco and Salti, Samuele and Di Stefano, Luigi},
  journal = {IEEE Transactions on Pattern Analysis \& Machine Intelligence},
  year    = {2024}
}
```