We introduce a novel architecture for neural disparity refinement aimed at facilitating the deployment of 3D computer vision on cheap and widespread consumer devices, such as mobile phones. Our approach relies on a continuous formulation that enables the estimation of a refined disparity map at any arbitrary output resolution. Thereby, it can effectively handle the unbalanced camera setups typical of today's mobile phones, which feature both high- and low-resolution RGB sensors within the same device. Moreover, our neural network can seamlessly process the output of a variety of stereo methods and, by refining the disparity maps computed by a traditional matching algorithm such as SGM, it achieves unrivaled zero-shot generalization performance compared to state-of-the-art end-to-end stereo models.
@inproceedings{aleotti2021neural,
  title     = {Neural Disparity Refinement for Arbitrary Resolution Stereo},
  author    = {Aleotti, Filippo and Tosi, Fabio and Zama Ramirez, Pierluigi and Poggi, Matteo and Salti, Samuele and Di Stefano, Luigi and Mattoccia, Stefano},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2021}
}
Neural Disparity Refinement: architecture overview. Given a rectified stereo pair captured with either a balanced or an unbalanced (red dotted lines) stereo setup, our goal is to estimate a refined disparity map at any arbitrary spatial resolution, starting from noisy disparities pre-computed by any existing stereo black box. We first extract deep, high-dimensional features using two separate convolutional branches, which are combined by a decoder. Then, at each continuous 2D location in the image domain, we interpolate features across the levels of the decoder and feed them into a disparity estimation module realized through two MLPs, which predict an integer disparity value and a sub-pixel offset, respectively.
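To make the pipeline above concrete, below is a minimal PyTorch sketch of the idea, not the official implementation: two convolutional branches for the image and the noisy disparity, a small decoder, continuous feature sampling via grid_sample, and two MLP heads for the integer disparity and the sub-pixel offset. Layer widths, depths, and the number of disparity bins are illustrative assumptions.

```python
# Minimal sketch of the refinement architecture described above (not the
# authors' code). Widths, depths and max_disp are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
    )

class NeuralDisparityRefinement(nn.Module):
    def __init__(self, feat=32, max_disp=192):
        super().__init__()
        # Two separate convolutional branches: one for the RGB image,
        # one for the noisy input disparity map.
        self.rgb_branch = nn.Sequential(conv_block(3, feat), conv_block(feat, feat, 2))
        self.disp_branch = nn.Sequential(conv_block(1, feat), conv_block(feat, feat, 2))
        # Decoder combining the two branches; both levels are later
        # sampled at continuous query locations.
        self.dec1 = conv_block(2 * feat, feat)
        self.dec0 = conv_block(feat, feat)
        # Two MLP heads: logits over integer disparity bins + sub-pixel offset.
        self.mlp_int = nn.Sequential(nn.Linear(2 * feat, 128), nn.ReLU(), nn.Linear(128, max_disp))
        self.mlp_off = nn.Sequential(nn.Linear(2 * feat, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, rgb, noisy_disp, coords):
        # coords: (B, N, 2) continuous (x, y) query locations in [-1, 1],
        # decoupled from the input resolution -> arbitrary-resolution output.
        f = torch.cat([self.rgb_branch(rgb), self.disp_branch(noisy_disp)], dim=1)
        lvl1 = self.dec1(f)                                    # coarse level
        lvl0 = self.dec0(F.interpolate(lvl1, scale_factor=2))  # fine level
        grid = coords.unsqueeze(2)                             # (B, N, 1, 2)
        samples = [
            F.grid_sample(lvl, grid, align_corners=True).squeeze(-1).transpose(1, 2)
            for lvl in (lvl0, lvl1)
        ]                                                      # each (B, N, feat)
        q = torch.cat(samples, dim=-1)
        logits = self.mlp_int(q)                               # (B, N, max_disp)
        offset = torch.sigmoid(self.mlp_off(q)).squeeze(-1)    # sub-pixel in [0, 1)
        return logits.argmax(-1).float() + offset              # refined disparity
```

At inference, querying a dense grid of coordinates at any target size yields a refined map at that resolution, which is what decouples the output from the input resolution.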
Results on the SceneFlow testing set. From left to right: the RGB input image, the noisy input disparity map computed by SGM (rows 1-2), AD-Census (rows 3-4) and C-CNN (rows 5-6), and the corresponding refined disparity estimated by our network.
Generalization results on Middlebury 2014 of our network (pre-trained on SceneFlow). From left to right, the RGB input image, the noisy input disparity map computed by SGM and the refined disparity estimated by our network.
Generalization results on KITTI 2015 of our network (pre-trained on SceneFlow). From left to right, the RGB input image, the noisy input disparity map computed by SGM and the refined disparity estimated by our network.
Our network can also handle inputs at different resolutions. The top row depicts the input image at 3840 × 2160 and the disparity maps, D, computed by SGM when the right image is 480 × 270 or 320 × 180 (downsampling factors k = 8 and 12, respectively). The bottom row shows the ground truth and the disparity estimated by our network at 3840 × 2160.
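As a hedged sketch of this unbalanced setting (using OpenCV's SGBM as a stand-in for SGM; file names, sizes and matcher parameters are illustrative), one can run the matcher at the resolution of the low-resolution right view and rescale the resulting disparities into full-resolution pixel units before refinement:

```python
# Sketch: classical matching in an unbalanced stereo setup.
import cv2
import numpy as np

left = cv2.imread("left_4k.png", cv2.IMREAD_GRAYSCALE)     # 3840 x 2160
right = cv2.imread("right_low.png", cv2.IMREAD_GRAYSCALE)  # e.g. 480 x 270 (k = 8)

k = left.shape[1] // right.shape[1]  # downsampling factor between the two views
left_small = cv2.resize(left, (right.shape[1], right.shape[0]),
                        interpolation=cv2.INTER_AREA)

sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)
disp_small = sgbm.compute(left_small, right).astype(np.float32) / 16.0  # fixed-point

# Disparities measured at low resolution must be multiplied by k to be
# expressed in full-resolution pixel units; this noisy, coarse map is the
# input that the refinement network then upsamples and cleans.
disp_full_units = disp_small * k
```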
Thanks to our continuous formulation, we can estimate a disparity map at any arbitrary resolution. The figure compares our output against standard nearest-neighbor interpolation.
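A short sketch of the two routes being compared, reusing the hypothetical NeuralDisparityRefinement module from the earlier snippet: querying the continuous model on a dense coordinate grid at the target resolution versus nearest-neighbor upsampling of the low-resolution map.

```python
# Continuous querying vs. nearest-neighbor upsampling (illustrative sizes).
import torch
import torch.nn.functional as F

def make_grid(h, w, device="cpu"):
    # Dense (x, y) query coordinates in [-1, 1] for an arbitrary target size.
    ys = torch.linspace(-1, 1, h, device=device)
    xs = torch.linspace(-1, 1, w, device=device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).view(1, -1, 2)  # (1, H*W, 2)

H, W = 540, 960  # any target resolution, decoupled from the input size
model = NeuralDisparityRefinement()
rgb = torch.rand(1, 3, 270, 480)    # low-resolution inputs for the sketch
noisy = torch.rand(1, 1, 270, 480)

# Continuous formulation: query the network directly at the target grid.
refined = model(rgb, noisy, make_grid(H, W)).view(1, 1, H, W)

# Baseline shown in the figure: nearest-neighbor interpolation of the map.
nn_up = F.interpolate(noisy, size=(H, W), mode="nearest")
```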