Deep Learning on Object-centric 3D Neural Fields
Accepted at TPAMI

* denotes equal contribution
Department of Computer Science and Engineering (DISI)
University of Bologna, Italy

Abstract

Overview of our framework

In recent years, Neural Fields (NFs) have emerged as an effective tool for encoding diverse continuous signals such as images, videos, audio, and 3D shapes. When applied to 3D data, NFs offer a solution to the fragmentation and limitations associated with prevalent discrete representations. However, given that NFs are essentially neural networks, it remains unclear whether and how they can be seamlessly integrated into deep learning pipelines for solving downstream tasks. This paper addresses this research problem and introduces \({\tt nf2vec}\), a framework capable of generating a compact latent representation for an input NF in a single inference pass. We demonstrate that \({\tt nf2vec}\) effectively embeds 3D objects represented by the input NFs and showcase how the resulting embeddings can be employed in deep learning pipelines to successfully address various tasks, all while processing exclusively NFs. We test this framework on several NFs used to represent 3D surfaces, such as unsigned/signed distance and occupancy fields. Moreover, we demonstrate the effectiveness of our approach with more complex NFs that encompass both geometry and appearance of 3D objects such as neural radiance fields.


Method

Our framework, dubbed \({\tt nf2vec}\), is composed of an encoder and a decoder. The encoder takes as input the weights of an NF and produces a compact embedding that encodes all the relevant information of the input NF. A first challenge in designing an encoder for NFs is defining how it should ingest the weights, since naively processing all of them would require a huge amount of memory. The \({\tt nf2vec}\) encoder adopts a simple architecture, consisting of a series of linear layers with batch normalization and ReLU non-linearities, followed by a final max pooling. At each stage, the input weight matrix is transformed by a linear layer that applies the same weights to each row of the matrix. The final max pooling compresses all the rows into a single one, yielding the desired embedding.

nf2vec encoder
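
As a concrete illustration, below is a minimal PyTorch sketch of an encoder in this spirit. The class name NFEncoder, the hidden and embedding sizes, and the way each NF is flattened into a stack of per-neuron rows are our own assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class NFEncoder(nn.Module):
    """Sketch of a row-wise encoder for NF weights (sizes are assumptions)."""

    def __init__(self, row_dim, hidden_dims=(512, 512, 1024), embed_dim=1024):
        super().__init__()
        layers, in_dim = [], row_dim
        for h in hidden_dims:
            # The same linear layer is applied independently to every row of
            # the stacked weight matrix (weights shared across rows).
            layers += [nn.Linear(in_dim, h), nn.BatchNorm1d(h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, embed_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, rows):
        # rows: (batch, num_rows, row_dim) -- each NF contributes a matrix
        # whose rows stack its (padded) per-neuron weight vectors.
        b, n, d = rows.shape
        x = self.mlp(rows.reshape(b * n, d)).reshape(b, n, -1)
        # Max pooling over the rows collapses the matrix into one embedding.
        return x.max(dim=1).values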

To guide the \({\tt nf2vec}\) encoder towards meaningful embeddings, we first note that we are not interested in encoding the values of the input weights, but rather in storing information about the 3D object represented by the input NF. For this reason, we supervise the decoder to replicate the function approximated by the input NF instead of directly reproducing its weights, as would be the case in a standard auto-encoder formulation. In particular, during training, we adopt an implicit decoder which takes as input the embedding produced by the encoder and decodes the field represented by the input NF from it. After the overall framework has been trained end to end, the frozen encoder can be used to compute embeddings of unseen NFs with a single forward pass, whereas the implicit decoder can be used, if needed, to reconstruct the discrete representation given an embedding.

nf2vec framework
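
The following sketch illustrates this supervision under our own assumptions: an implicit decoder conditioned on the embedding and a 3D query, and a training step that regresses the value predicted by the input NF at random queries. The names ImplicitDecoder and training_step, the layer sizes, and the use of an MSE loss are illustrative placeholders; the actual loss depends on the type of field (e.g., a binary cross-entropy for occupancy).

import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Sketch: predicts the field value at a 3D query, conditioned on an embedding."""

    def __init__(self, embed_dim=1024, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, embedding, queries):
        # embedding: (batch, embed_dim), queries: (batch, num_queries, 3)
        e = embedding.unsqueeze(1).expand(-1, queries.shape[1], -1)
        return self.net(torch.cat([e, queries], dim=-1)).squeeze(-1)

def training_step(encoder, decoder, nf_rows, queries, nf_values, optimizer):
    # The decoder is supervised to reproduce the value of the input NF at the
    # query points, not the NF weights themselves (MSE here is a placeholder).
    optimizer.zero_grad()
    embedding = encoder(nf_rows)          # (batch, embed_dim)
    pred = decoder(embedding, queries)    # (batch, num_queries)
    loss = nn.functional.mse_loss(pred, nf_values)
    loss.backward()
    optimizer.step()
    return loss.item()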

Reconstruction

We compare 3D shapes and NeRFs reconstructed from NFs unseen during training with those reconstructed by the \({\tt nf2vec}\) decoder starting from the latent codes yielded by the encoder. Though our embeddings are dramatically more compact than the original NFs, the reconstructed discrete data closely resemble those obtained from the input NFs.

Shape reconstruction NeRF reconstruction

Interpolation

We linearly interpolate between two object embeddings produced by \({\tt nf2vec}\). Results highlight that the learned latent spaces enable smooth interpolations between shapes represented as NFs.

Shape interpolation
NeRF interpolation
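
A minimal sketch of this experiment, assuming the decoder interface introduced above and embeddings of shape (embed_dim,), is given below; the function name and step count are illustrative.

import torch

def interpolate_embeddings(decoder, emb_a, emb_b, queries, steps=5):
    """Decode fields from linear interpolations of two nf2vec embeddings (sketch)."""
    # emb_a, emb_b: (embed_dim,), queries: (1, num_queries, 3)
    fields = []
    for t in torch.linspace(0.0, 1.0, steps):
        emb = (1.0 - t) * emb_a + t * emb_b   # simple lerp in latent space
        fields.append(decoder(emb.unsqueeze(0), queries))
    return fields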

Additionally, given two input NeRFs, we render images from networks obtained by interpolating their weights. We compare these results with those obtained from the interpolation of \({\tt nf2vec}\) embeddings. Notably, renderings obtained by averaging the weights of the two NeRFs appear blurred and lack 3D structure, whereas renderings produced by \({\tt nf2vec}\) preserve details and maintain 3D consistency.

NeRF weights interpolation

Retrieval

We perform shape retrieval by computing the Euclidean distance between \({\tt nf2vec}\) embeddings of NFs representing unseen point clouds from the test set. The retrieved shapes not only belong to the same class as the query but also exhibit similar coarse structures.

Point cloud retrieval
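
Retrieval reduces to a nearest-neighbour search in the embedding space, as in the short sketch below (the function name and the choice of k are ours).

import torch

def retrieve(query_emb, gallery_embs, k=5):
    """Return indices of the k nearest gallery embeddings (Euclidean distance)."""
    # query_emb: (embed_dim,), gallery_embs: (num_shapes, embed_dim)
    dists = torch.cdist(query_emb.unsqueeze(0), gallery_embs).squeeze(0)
    return torch.topk(dists, k, largest=False).indices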

Performing the same experiment on NeRFs retrieves neighbors that are similar to the query in both geometry and color.

NeRF retrieval

Part segmentation

Part segmentation aims to predict a semantic (i.e., part) label for each point of a given cloud. We tackle this problem by training a decoder similar to the one used to train our framework. This decoder is fed with the \({\tt nf2vec}\) embedding of the NF representing the input cloud, concatenated with the coordinates of a 3D query, and it is trained to predict the label of the query point. Notice how the \({\tt nf2vec}\) embeddings, although task-agnostic, make it possible to perform a local discriminative task such as part segmentation.

Part segmentation
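
A sketch of such a decoder is shown below, assuming per-point part logits trained with a standard cross-entropy; the class name PartSegDecoder, the layer sizes, and the number of parts are illustrative assumptions.

import torch
import torch.nn as nn

class PartSegDecoder(nn.Module):
    """Sketch: per-point part logits from an nf2vec embedding and 3D queries."""

    def __init__(self, embed_dim=1024, hidden=512, num_parts=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_parts),    # one logit per part label
        )

    def forward(self, embedding, points):
        # embedding: (batch, embed_dim), points: (batch, num_points, 3)
        e = embedding.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([e, points], dim=-1))  # (batch, num_points, num_parts)

# Training minimizes a cross-entropy over the part label of each point:
# loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())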

Generation

We employ a Latent-GAN to generate embeddings resembling those produced by \({\tt nf2vec}\) from random noise. Generated embeddings can be decoded into discrete representations by the implicit decoder obtained from \({\tt nf2vec}\) training. As our framework is agnostic to the original discrete representation of the shapes used to learn the NFs, we can train Latent-GANs on embeddings representing point clouds or meshes with the very same protocol and architecture. Furthermore, by generating embeddings representing NFs, our method allows sampling point clouds at any arbitrary resolution, whereas SP-GAN requires a new training for each desired resolution.

Shape generation
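
Below is a minimal Latent-GAN sketch under our own assumptions (MLP generator and discriminator, noise and embedding sizes chosen for illustration); the training loop and the specific GAN objective are omitted.

import torch
import torch.nn as nn

class LatentGenerator(nn.Module):
    """Sketch: maps random noise to embeddings mimicking nf2vec ones."""

    def __init__(self, noise_dim=128, embed_dim=1024, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, z):
        return self.net(z)

class LatentDiscriminator(nn.Module):
    """Sketch: scores whether an embedding looks like a real nf2vec embedding."""

    def __init__(self, embed_dim=1024, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb):
        return self.net(emb)

# Sampling: a generated embedding is decoded with the frozen implicit decoder,
# querying as many 3D points as desired (hence arbitrary output resolution):
# emb = generator(torch.randn(1, 128)); field = decoder(emb, queries)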

When applied to NeRFs, our generation method produces renderings that have a good level of realism and diversity. Notably, the 3D consistency of images obtained from different viewpoints is preserved.

NeRF generation

Learning a mapping between \({\tt nf2vec}\) embedding spaces

We develop a transfer function specifically designed to operate on \({\tt nf2vec}\) embeddings as both input and output data. This transfer function can be realized as a simple MLP that maps the input embedding into the output one and is trained with a standard MSE loss. We explore two tasks: first, we address point cloud completion by learning a mapping from \({\tt nf2vec}\) embeddings of NFs that represent incomplete clouds to embeddings associated with complete clouds. Then, we tackle surface reconstruction by training the transfer function to map \({\tt nf2vec}\) embeddings representing point clouds into embeddings that can be decoded into meshes. Indeed, by processing exclusively NF embeddings, we obtain output shapes that are highly compatible with the input ones while preserving their distinctive details, like the pointy wings of an airplane or the flap of a car.

Shape mapping
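
A sketch of such a transfer function is given below; the class name TransferFunction and the hidden sizes are our own assumptions, and the training comment only illustrates the MSE supervision described above.

import torch
import torch.nn as nn

class TransferFunction(nn.Module):
    """Sketch: maps an nf2vec embedding to another nf2vec embedding."""

    def __init__(self, embed_dim=1024, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, emb_in):
        return self.net(emb_in)

# Training pairs input embeddings (e.g., of incomplete clouds or point clouds)
# with target embeddings (complete clouds, meshes) and minimizes MSE:
# loss = nn.functional.mse_loss(transfer(emb_in), emb_target)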

We also show that the same methodology allows us to learn a transfer function mapping \({\tt nf2vec}\) embeddings of point clouds to \({\tt nf2vec}\) embeddings of NeRFs. The generated NeRFs preserve the geometry of the input shapes and exhibit plausible, diverse colors associated with different object parts.

NeRF mapping

Cite us

@article{ramirez2023nf2vec,
    title = {Deep Learning on Object-centric 3D Neural Fields},
    author = {Zama Ramirez, Pierluigi 
              and De Luigi, Luca 
              and Sirocchi, Daniele 
              and Cardace, Adriano 
              and Spezialetti, Riccardo 
              and Ballerini, Francesco 
              and Salti, Samuele 
              and Di Stefano, Luigi},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2024}
}