The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Saurabh Saxena Charles Herrmann Junhwa Hur Abhishek Kar Mohammad Norouzi Deqing Sun David J. Fleet


Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel at estimating optical flow and monocular depth, surprisingly without the task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy, incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality and to impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26% on the KITTI optical flow benchmark, about 25% better than the best published method.

SOTA results on optical flow and monocular depth estimation

Our model achieves state-of-the-art results on the public test benchmark for optical flow estimation on KITTI and for monocular depth estimation on NYU.
Table 1: Flow pre-training results
Table 2: Flow fine-tuning results
Zero-shot optical flow estimation results on Sintel and KITTI are shown in Table 1. We provide a new RAFT baseline trained on our proposed pre-training mixture, which is substantially more accurate than the original RAFT. Our diffusion model outperforms even this much stronger baseline and achieves state-of-the-art zero-shot results on Sintel and KITTI. Table 2 shows results on the official test benchmarks for Sintel and KITTI for a model fine-tuned on a standard fine-tuning mixture. Our model achieves an Fl-all outlier rate of 3.26% on the public KITTI test benchmark, ~25% lower than the best published method.
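For reference, the Fl-all figure quoted above is an outlier percentage. The following is a minimal sketch, assuming the usual KITTI 2015 convention in which a pixel counts as an outlier when its endpoint error exceeds both 3 px and 5% of the ground-truth flow magnitude. The function name `fl_all` and the array layout are illustrative and not taken from the authors' code.

```python
import numpy as np

def fl_all(flow_pred, flow_gt, valid):
    """Percentage of valid pixels that are flow outliers.

    flow_pred, flow_gt: float arrays of shape (H, W, 2).
    valid: bool array of shape (H, W) marking pixels with ground truth.
    """
    # Per-pixel endpoint error and ground-truth flow magnitude.
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    # Assumed KITTI 2015 rule: error above 3 px AND above 5% of the magnitude.
    outlier = (epe > 3.0) & (epe > 0.05 * mag)
    return 100.0 * float(outlier[valid].mean())
```

Because both thresholds must be exceeded, a 4 px error on a 100 px motion is not counted as an outlier, while the same error on a static pixel is.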
Table 3: Monocular depth estimation results
On the task of monocular depth estimation, our method achieves a state-of-the-art absolute relative error of 0.074 on the indoor NYU dataset and a competitive absolute relative error of 0.055 on the KITTI dataset.
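The absolute relative error reported here is, in its standard form, the mean of |d_pred - d_gt| / d_gt over pixels with valid ground truth. A small sketch under that assumption follows; the name `abs_rel` and the validity mask argument are illustrative choices.

```python
import numpy as np

def abs_rel(depth_pred, depth_gt, valid):
    """Mean absolute relative depth error over valid pixels.

    depth_pred, depth_gt: float arrays of shape (H, W).
    valid: bool array of shape (H, W); ground truth must be > 0 where True.
    """
    pred = depth_pred[valid]
    gt = depth_gt[valid]
    return float(np.mean(np.abs(pred - gt) / gt))
```

Masking before dividing keeps pixels with missing (zero) ground-truth depth out of the average.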

Synthetic training data

AutoFlow (AF) has recently emerged as an effective dataset for optical flow pre-training. Interestingly, we find that diffusion models trained with AutoFlow alone tend to provide very coarse flow estimates and can hallucinate shapes. The addition of FlyingThings (FT), Kubric (KU), and TartanAir (TA) removes the AF-induced bias toward polygonal-shaped regions and significantly improves flow quality on fine detail, e.g., trees, thin structures, and motion boundaries.
Table 4: Pre-training datasets ablation for zero-shot flow estimation.
Table 5: Pre-training datasets ablation for fine-tuning performance on NYU depth v2.
Tables 4 and 5 show that adding synthetic data during pre-training substantially improves results for both zero-shot optical flow estimation and fine-tuning performance on monocular depth estimation.

Modified diffusion training for sparse label maps

Denoising diffusion training with infilling and step-unrolling
Given the presence of holes (missing labels) in ground-truth flow and depth data, the usual diffusion model training approach of adding Gaussian noise to the ground-truth label does not work, due to a training/inference distribution gap. To mitigate this, we first infill missing values using interpolation. Then, we add noise to the label map and train a neural network to model the conditional distribution of the noise given the RGB image(s), noisy label, and time step. One can optionally unroll the denoising step(s) during training (with stop gradient) to bridge the distribution gap between training and inference for the noisy label y_t.
Table 6: Ablation for infilling and step-unrolled training.
As the results in Table 6 show, simple nearest neighbor interpolation and step-unrolling effectively mitigate this problem.
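The infilling and step-unrolling recipe described above can be sketched roughly as follows. This is one possible reading, not the authors' implementation: `denoiser` is a hypothetical callable that predicts noise, `gamma_t` is an assumed cumulative signal level in (0, 1), the brute-force nearest-neighbor fill is only meant for small arrays, and the masked L1 loss is an assumption. NumPy has no autodiff, so the stop gradient is only indicated in a comment.

```python
import numpy as np

def infill_nearest(label, valid):
    """Fill invalid pixels with the value of the nearest valid pixel.

    Brute force; assumes at least one valid pixel. label is (H, W) or (H, W, C).
    """
    ys, xs = np.nonzero(valid)
    filled = label.copy()
    for y, x in zip(*np.nonzero(~valid)):
        j = np.argmin((ys - y) ** 2 + (xs - x) ** 2)
        filled[y, x] = label[ys[j], xs[j]]
    return filled

def unrolled_training_pair(denoiser, image, label, valid, gamma_t, rng, unroll=True):
    """Build the noisy input y_t and its noise target for one training step."""
    y0 = infill_nearest(label, valid)
    eps = rng.standard_normal(y0.shape)
    y_t = np.sqrt(gamma_t) * y0 + np.sqrt(1.0 - gamma_t) * eps
    if unroll:
        # In an autodiff framework this forward pass would sit under stop-gradient.
        eps_hat = denoiser(image, y_t, gamma_t)
        # The model's own estimate of the clean label replaces the infilled one.
        y0_hat = (y_t - np.sqrt(1.0 - gamma_t) * eps_hat) / np.sqrt(gamma_t)
        y_t = np.sqrt(gamma_t) * y0_hat + np.sqrt(1.0 - gamma_t) * eps
        # Recompute the target so that y_t still decomposes around y0.
        eps = (y_t - np.sqrt(gamma_t) * y0) / np.sqrt(1.0 - gamma_t)
    return y_t, eps

def masked_l1(eps_pred, eps_target, valid):
    """L1 loss restricted to pixels that have real ground truth (assumed loss)."""
    return float(np.abs(eps_pred - eps_target)[valid].mean())
```

With unrolling, the network sees a y_t built from its own prediction, which is closer to what it encounters at inference time, while the loss is still evaluated only where real labels exist.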

Multimodal predictions

One strength of diffusion models is their ability to capture complex multimodal distributions. This can be effective in representing uncertainty, for example, in the case of transparent, translucent, or reflective surfaces. The figure above shows multiple samples on the NYU, KITTI, and Sintel datasets, showing that our model captures multimodality and provides plausible samples when ambiguities exist.
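One simple way to turn repeated sampling into an uncertainty map is to draw several predictions for the same input and summarize them per pixel. The sketch below assumes a hypothetical `sample_fn(image, rng)` that runs the full reverse diffusion process and returns one depth or flow map; it is a stand-in, not an API of the released model.

```python
import numpy as np

def monte_carlo_summary(sample_fn, image, num_samples, rng):
    """Draw several samples and return per-pixel mean, std, and the raw stack.

    sample_fn: hypothetical sampler mapping (image, rng) to one prediction.
    """
    samples = np.stack([sample_fn(image, rng) for _ in range(num_samples)])
    mean = samples.mean(axis=0)
    # High per-pixel std flags regions where the samples disagree.
    std = samples.std(axis=0)
    return mean, std, samples
```

For genuinely multimodal pixels the mean can fall between modes, so keeping the raw samples around for inspection may be more informative than the mean alone.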

Zero-shot coarse to fine refinement

For our pre-trained model, refinement corrects erroneous flow and adds detail to flow that is already correct, as shown in the figure above.

Zero-shot depth completion and text to 3D

Another interesting property of diffusion models, arising from their iterative refinement nature, is the ability to perform conditional inference zero-shot. We leverage this to build a simple text-to-3D scene generation pipeline by combining our model with existing text-to-image (Imagen) and text-conditioned image completion (Imagen Editor) models, as shown in the figure above. Below are 3D point clouds of some scenes generated from the respective text prompts (subsampled 10x for fast visualization).

3D Caption: A kitchen
3D Caption: A bedroom
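To make the step from an RGB-D image to a point cloud concrete, here is a minimal sketch that assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy` are assumed inputs; the page does not state the intrinsics used). Keeping every 10th point is one reading of "subsampled 10x".

```python
import numpy as np

def depth_to_points(depth, rgb, fx, fy, cx, cy, stride=10):
    """Back-project a depth map to a colored point cloud and subsample it.

    depth: (H, W) depth along the optical axis; rgb: (H, W, 3) colors.
    Returns (points, colors), each of shape (M, 3).
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v: row index, u: column index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    keep = depth.reshape(-1) > 0  # drop pixels without a depth value
    return points[keep][::stride], colors[keep][::stride]
```

The returned arrays can be handed to any point-cloud viewer; a larger `stride` trades detail for faster rendering.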

Below are more text to RGB-D results from our proposed system.

Caption: A living room
Caption: A library
Caption: A meeting room
Caption: A movie theatre


    title={The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation},
    author={Saurabh Saxena and Charles Herrmann and Junhwa Hur and Abhishek Kar and Mohammad Norouzi and Deqing Sun and David J. Fleet},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},