Fully Self-Supervised Depth Estimation from Defocus Clue

1Shanghai AI Laboratory, 2University of Illinois Urbana-Champaign, 3Northwestern Polytechnical University
*Equal contribution, Corresponding author
An overview of the proposed framework. The framework consists of a neural model, DAIF-Net, and an optical model.


Depth-from-defocus (DFD), which models the relationship between depth and the defocus pattern in images, has demonstrated promising performance in depth estimation. Recently, several self-supervised works have tried to overcome the difficulty of acquiring accurate depth ground-truth. However, they depend on all-in-focus (AIF) images, which cannot be captured in real-world scenarios. This limitation hinders the application of DFD methods.

To tackle this issue, we propose a fully self-supervised framework that estimates depth purely from a sparse focal stack. We show that our framework circumvents the need for depth and AIF image ground-truth and achieves superior predictions, thus closing the gap between the theoretical success of DFD works and their application in the real world.

In particular, we propose (i) a more realistic setting for DFD tasks, where neither depth nor AIF image ground-truth is available; and (ii) a novel self-supervision framework that provides reliable predictions of the depth and AIF image under this challenging setting.

The proposed framework uses a neural model to predict the depth and AIF image, and an optical model to validate and refine the predictions. We verify our framework on three benchmark datasets with rendered and real focal stacks. Qualitative and quantitative evaluations show that our method provides a strong baseline for self-supervised DFD tasks.

Proposed Framework


To predict the depth and AIF image from the focal stack, we propose the DepthAIF-Net (DAIF-Net). The architecture takes a focal stack of arbitrary size and estimates the depth map and AIF image. The parameters of the encoders and bottlenecks are shared across all branches. We apply global pooling and fuse the branches by taking the element-wise maximum of their features.
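The shared-parameter, max-fusion design can be illustrated with a minimal sketch. The `encode` function here is a hypothetical toy encoder (a single linear layer plus ReLU), not the actual DAIF-Net layers; the point is that one set of weights processes every focal slice, and an element-wise maximum over the stack axis makes the fusion invariant to stack size and ordering:

```python
import numpy as np

def encode(image, weights):
    """Hypothetical shared encoder: the same weights are applied to
    every image in the focal stack (here, one linear layer + ReLU)."""
    return np.maximum(image @ weights, 0.0)

def fuse_stack(stack, weights):
    """Encode each focal slice with shared parameters, then fuse the
    branches by taking the element-wise maximum over the stack axis.
    Because max is permutation-invariant and size-agnostic, the stack
    may contain an arbitrary number of images."""
    features = np.stack([encode(img, weights) for img in stack])
    return features.max(axis=0)  # fused (H, W, C) feature map

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 8))                    # toy encoder weights
stack5 = [rng.normal(size=(4, 4, 3)) for _ in range(5)]
print(fuse_stack(stack5, weights).shape)             # (4, 4, 8)
print(fuse_stack(stack5[:3], weights).shape)         # (4, 4, 8) -- any stack size
```

Since the fused feature is an element-wise maximum, it dominates every individual branch, which is what lets a single decoder consume stacks of varying length.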

The DAIF-Net architecture.

Defocus Map Generation

To quantitatively measure the defocus blur in an image, we introduce the defocus map. Given an optical system, the defocus map can be calculated from the depth map once we establish the relationship between depth and defocus. As illustrated in the figure, when a point light source is out of focus, the light rays converge either in front of or behind the image plane and form a blurry circle on the image plane. The circle of confusion (CoC) measures the diameter of this blurry circle. If the point light source is in focus, it forms an infinitely small point on the image plane, making it the sharpest projection with the minimum CoC. Therefore, the CoC describes the level of blurriness, in other words, the amount of defocus. We adopt the thin-lens equation to calculate the CoC.

Illustration of the thin-lens equation.
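Under the thin-lens model, the CoC diameter for a point at distance `depth`, imaged by a lens of focal length `f` focused at `focus_dist`, follows the standard relation below (a sketch of the textbook formula, not the paper's exact implementation; the parameter names are ours):

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """CoC diameter from the thin-lens model, all lengths in meters.
    CoC = A * |depth - focus_dist| / depth * f / (focus_dist - f),
    where A = f / N is the aperture diameter. A point exactly on the
    focus plane yields CoC = 0 (the sharpest possible projection)."""
    aperture = focal_len / f_number
    return (aperture * abs(depth - focus_dist) / depth
            * focal_len / (focus_dist - focal_len))

# In-focus point -> zero CoC; defocus grows away from the focus plane.
print(circle_of_confusion(2.0, 2.0, 0.05, 2.0))   # 0.0
```

Evaluating this function over a depth map, with the camera parameters fixed, yields the defocus map described above.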

Focal Stack Reconstruction

Given the defocus map and the AIF image, we can explicitly model the generation process of the defocus image. Taking advantage of this deterministic relationship, our predicted depth and AIF image can be supervised by reconstructing the input focal stack. To render a defocus image, we convolve the AIF image with the point spread function (PSF). The PSF describes the pattern of a point light source transmitting to the image plane through the camera lens. In practice, following previous works, we approximate the disc-shaped PSF with a Gaussian kernel when computing the defocus blur.
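The rendering step can be sketched as a spatially varying Gaussian blur. A common and cheap approximation (used here for illustration; the paper's actual PSF convolution layers may differ) is to quantize the defocus map into a few blur levels, blur the whole AIF image once per level, and gather each pixel from its matching layer:

```python
import numpy as np

def gaussian_kernel(sigma, radius=4):
    """Normalized 2-D Gaussian approximating the disc-shaped PSF."""
    ax = np.arange(-radius, radius + 1)
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    """Brute-force 2-D convolution with a Gaussian PSF (edge padding)."""
    if sigma <= 0:
        return img.copy()
    k = gaussian_kernel(sigma)
    r = k.shape[0] // 2
    padded = np.pad(img, r, mode="edge")
    out = np.zeros_like(img)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += k[dy + r, dx + r] * padded[
                r + dy : r + dy + img.shape[0],
                r + dx : r + dx + img.shape[1]]
    return out

def render_defocus(aif, defocus_map, n_bins=4):
    """Quantize the defocus map into n_bins sigma levels, blur the AIF
    image once per level, then pick each pixel from its nearest level."""
    sigmas = np.linspace(defocus_map.min(), defocus_map.max(), n_bins)
    layers = [blur(aif, s) for s in sigmas]
    idx = np.abs(defocus_map[..., None] - sigmas).argmin(-1)
    return np.choose(idx, layers)

aif = np.ones((8, 8))                        # toy grayscale AIF image
dmap = np.zeros((8, 8)); dmap[4:, :] = 1.5   # lower half is defocused
defocused = render_defocus(aif, dmap)
```

Reconstructing every slice of the input focal stack this way, each with its own focus distance, provides the self-supervision signal without any depth or AIF ground-truth.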

Focal Stack Dataset

Synthetic Dataset

DefocusNet Dataset is a synthetic dataset rendered with Blender. It consists of random objects distributed in front of a wall. A virtual camera takes five images of the scene with varying focus distances, forming a focal stack. The original dataset is widely used by supervised methods. However, the focus distances of its focal stacks are overly concentrated, causing indistinguishable defocus blur. Therefore, to experiment in a comparable setting, we regenerate the dataset with a more widely distributed set of focus distances using the code provided by the dataset authors.

Real Image with Synthetic Defocus Blur

To acquire sufficient realistic defocus images for model training and evaluation, we render focal stack datasets from RGB-D datasets using the thin-lens equation and PSF convolution layers. The RGB-D dataset we use is NYUv2, which, with synthetic defocus blur, is also commonly used in DFD tasks. The dataset consists of 1449 pairs of aligned RGB images and depth maps. We train our model on the 795 training pairs and evaluate on the 654 testing pairs.

Real Focal Stack Dataset

Mobile Depth is a real-world DFD dataset captured by a Samsung Galaxy S3 mobile phone. It consists of 11 aligned focal stacks with 14 to 33 images per stack. Since neither depth ground-truth nor camera parameters are provided, we only perform qualitative evaluation and comparison on this dataset with no further finetuning.


Results on Synthetic Data

We split the 1000 focal stacks into 500 training and 500 testing focal stacks. For a fair comparison, we trained our method, along with open-source state-of-the-art DFD methods, on our new training set. As expected, our method does not perform as well as the supervised methods: the DefocusNet dataset is texture-less, and our self-supervised framework is less sensitive to backgrounds, where the defocus change is less obvious. Supervised methods do not suffer from this issue because they always have access to the depth ground-truth. Meanwhile, we observe that our method is on par with the supervised methods when counting only results for depths less than 0.5 m, indicating that our self-supervised method is more accurate at close range.

Quantitative results for synthetic data

Evaluation results on DefocusNet test set. Regular means all results are considered; <0.5 m only counts results for depth less than 0.5 meters.
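The "Regular" versus "<0.5 m" split amounts to evaluating the same metrics under a range mask on the ground-truth depth. A sketch of such a masked evaluation, using the standard AbsRel and RMSE definitions (the helper and its signature are ours, not taken from the paper's code):

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=None):
    """AbsRel and RMSE over an optional depth-range mask.
    max_depth=None reproduces the 'Regular' setting (all pixels);
    max_depth=0.5 counts only pixels whose ground truth is < 0.5 m."""
    mask = np.ones_like(gt, dtype=bool) if max_depth is None else gt < max_depth
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)      # mean |pred - gt| / gt
    rmse = np.sqrt(np.mean((p - g) ** 2))     # root-mean-square error
    return abs_rel, rmse

gt = np.array([0.4, 0.6])
pred = np.array([0.5, 0.6])
print(depth_metrics(pred, gt, max_depth=0.5))  # only the 0.4 m pixel counts
```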

Qualitative results for synthetic data

Example outputs of our framework compared with state-of-the-art supervised works. The outputs are produced from input focal stacks of 5 images. In the depth maps, lighter colors indicate farther distances.

Results on Real Image with Synthetic Defocus Blur

We present the results of our framework trained on the NYUv2 dataset with synthetic focal stacks. We evaluate our models on the sparse testing focal stacks and compare our results to other DFD methods. The table shows that in scenes with complex textures, our method is on par with the state-of-the-art on the majority of the metrics.

Quantitative results on NYUv2

Evaluation results on NYUv2 test set. Self-sup w/ AIF means that the method is self-supervised but utilizes AIF ground-truth. Results show that our method is on par with the state-of-the-art on NYUv2 dataset for the depth-from-defocus task.

Qualitative results on NYUv2

Some examples of the framework outputs. The outputs are produced from the input focal stacks with 5 images. For the depth map, lighter colors indicate farther distances.

Results on Real Focal Stack Dataset

To evaluate our model on real focal stacks, we perform qualitative experiments on the Mobile Depth dataset. We compare our model with MobileDFF, AiFDepthNet, DDF-DFV/FV, and AiFDepthNet finetuned using AIF ground-truth. While most of the deep methods give reasonable depth estimates, our framework has the advantage of being fully self-supervised.

Qualitative results on Mobile Depth

Qualitative depth estimation and AIF prediction results on the Mobile Depth dataset. Warmer colors indicate larger depths. Note that AiFDepthNet* is the model finetuned using AIF information.


@article{si2023fully,
      title={Fully Self-Supervised Depth Estimation from Defocus Clue},
      author={Si, Haozhe and Zhao, Bin and Wang, Dong and Gao, Yupeng and Chen, Mulin and Wang, Zhigang and Li, Xuelong},
      journal={arXiv preprint arXiv:2303.10752},
      year={2023}
}