Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model

Google DeepMind · Google Research

Overview

Zero-shot metric depth estimation is challenging because of the large variation in RGB and depth between indoor and outdoor scenes, and because of the depth-scale ambiguity that arises from unknown camera intrinsics. We present DMD (Diffusion for Metric Depth), a state-of-the-art diffusion model for monocular metric depth estimation. We introduce several innovations: a log-scale depth parameterization that enables joint modeling of indoor and outdoor scenes, conditioning on the field of view (FOV) to resolve scale ambiguity, and synthetic FOV augmentation during training to generalize beyond the limited camera intrinsics of the training datasets. Using only a small number of denoising steps, our method reduces relative error (REL) over the current state of the art by 25% on zero-shot indoor datasets and by 33% on zero-shot outdoor datasets.

Method

  1. Joint indoor-outdoor modeling: Instead of linearly mapping depth to [-1, 1], we parameterize depth on a log scale to more equitably allocate representation capacity between indoor and outdoor scenes (see the encoding sketch after this list).
  2. Handling diverse camera intrinsics: To avoid overfitting to the camera intrinsics seen during training, we augment the training data by cropping and uncropping (with noise padding) images to simulate diverse fields of view (FOV). We further condition the model on the vertical FOV, which we find is critical for disambiguating depth scale (see the augmentation sketch after this list).
  3. Diverse training data: We fine-tune on a diverse data mixture, which dramatically improves performance over fine-tuning on the NYU and KITTI datasets alone.
  4. Inference latency: We use the \( v \)-parameterization of diffusion instead of the more common \( \epsilon \)-parameterization, which enables inference with as few as one denoising step (versus the 64 or more steps typically needed with the \( \epsilon \)-parameterization); see the sketch after this list.
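
A minimal sketch of the log-scale depth encoding in NumPy. The depth bounds below are hypothetical placeholders chosen to span indoor and outdoor scenes, not necessarily the values used in the paper.

```python
import numpy as np

# Hypothetical depth bounds spanning indoor and outdoor scenes (meters);
# the bounds used in the paper may differ.
D_MIN, D_MAX = 0.5, 80.0

def encode_log_depth(depth_m):
    """Map metric depth to [-1, 1] on a log scale, so small indoor depths
    and large outdoor depths receive comparable representation capacity."""
    lo, hi = np.log(D_MIN), np.log(D_MAX)
    log_d = np.log(np.clip(depth_m, D_MIN, D_MAX))
    return 2.0 * (log_d - lo) / (hi - lo) - 1.0

def decode_log_depth(x):
    """Invert encode_log_depth: map [-1, 1] back to metric depth."""
    lo, hi = np.log(D_MIN), np.log(D_MAX)
    return np.exp(0.5 * (x + 1.0) * (hi - lo) + lo)
```

To see why this helps: with these bounds, a linear map would give the 0.5-10 m range typical of indoor scenes only about 12% of the [-1, 1] interval, while the log map gives it about 59%.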
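
Next, a sketch of the FOV augmentation, assuming a pinhole camera where the vertical FOV is \( 2 \arctan(h / 2 f_y) \) for image height \( h \) and focal length \( f_y \) (both in pixels). The function names and noise-padding details are illustrative rather than the paper's exact implementation; depth maps and loss masks for padded rows would be handled analogously.

```python
import numpy as np

def vertical_fov(height_px, focal_y_px):
    """Vertical field of view in radians for a pinhole camera."""
    return 2.0 * np.arctan(height_px / (2.0 * focal_y_px))

def augment_fov(image, focal_y_px, target_fov):
    """Crop (narrower FOV) or noise-pad (wider FOV) an HxWxC image with
    values in [0, 1] so that it subtends target_fov radians vertically.
    Returns the augmented image and its new vertical FOV."""
    h = image.shape[0]
    new_h = int(round(2.0 * focal_y_px * np.tan(target_fov / 2.0)))
    if new_h <= h:
        # Centre-crop rows: a shorter image at the same focal length
        # corresponds to a narrower vertical FOV.
        top = (h - new_h) // 2
        image = image[top:top + new_h]
    else:
        # Pad rows with random noise to simulate a wider FOV without
        # hallucinating scene content.
        pad_top = (new_h - h) // 2
        pad_bot = new_h - h - pad_top
        noise = lambda n: np.random.rand(n, *image.shape[1:]).astype(image.dtype)
        image = np.concatenate([noise(pad_top), image, noise(pad_bot)], axis=0)
    return image, vertical_fov(new_h, focal_y_px)
```

The returned FOV is what the model would be conditioned on, so the conditioning signal stays consistent with the augmented image.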
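
Finally, a minimal sketch of the \( v \)-parameterization of Salimans & Ho (2022), assuming a variance-preserving schedule with \( \alpha_t^2 + \sigma_t^2 = 1 \) and noisy sample \( z_t = \alpha_t x_0 + \sigma_t \epsilon \). Predicting \( v \) lets the network recover the clean sample in closed form at any noise level, which is what makes very few denoising steps viable.

```python
def v_target(x0, eps, alpha_t, sigma_t):
    """Training target for v-parameterization: v = alpha_t * eps - sigma_t * x0."""
    return alpha_t * eps - sigma_t * x0

def x0_from_v(z_t, v_pred, alpha_t, sigma_t):
    """Closed-form clean-sample estimate from a v prediction. Given
    z_t = alpha_t * x0 + sigma_t * eps and alpha_t^2 + sigma_t^2 = 1,
    alpha_t * z_t - sigma_t * v equals x0 exactly."""
    return alpha_t * z_t - sigma_t * v_pred
```

By contrast, recovering \( x_0 \) from an \( \epsilon \) prediction requires dividing by \( \alpha_t \), which approaches zero at high noise levels and makes one-step prediction unstable.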

Results

Quantitative comparison of DMD with the current state of the art for zero-shot metric depth estimation on indoor (first table) and outdoor (second table) scenes. Our method improves performance by a large margin in both domains.

Zero-shot results on indoor datasets.

Zero-shot results on outdoor datasets.

Citation

@misc{saxena2023zeroshot,
      title={Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model},
      author={Saurabh Saxena and Junhwa Hur and Charles Herrmann and Deqing Sun and David J. Fleet},
      year={2023},
      eprint={2312.13252},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}