How to Estimate Depth from a Single Image | by Jacob Marks, Ph.D. | Jan, 2024

Run and evaluate monocular depth estimation models with Hugging Face and FiftyOne

Monocular depth heat maps generated with Marigold on NYU Depth v2 images. Image courtesy of the author.

Humans view the world through two eyes. One of the major benefits of this binocular vision is the ability to perceive depth: how near or far objects are. The human brain infers object depths by comparing the images captured by the left and right eyes at the same time and interpreting the disparities. This process is known as stereopsis.

Just as depth perception plays a crucial role in human vision and navigation, the ability to estimate depth is essential for a wide range of computer vision applications, from autonomous driving to robotics, and even augmented reality. Yet a slew of practical considerations, from spatial limitations to budgetary constraints, often restrict these applications to a single camera.

Monocular depth estimation (MDE) is the task of predicting the depth of a scene from a single image. Depth computation from a single image is inherently ambiguous, as there are multiple ways to project the same 3D scene onto the 2D plane of an image. As a result, MDE is a challenging task that requires (either explicitly or implicitly) factoring in many cues such as object size, occlusion, and perspective.

In this post, we'll illustrate how to load and visualize depth map data, run monocular depth estimation models, and evaluate depth predictions. We will do so using data from the SUN RGB-D dataset.

Specifically, we'll cover the following:

- loading the data and visualizing ground truth depth maps
- running monocular depth estimation models (DPT and Marigold) on our RGB images
- evaluating the predicted depth maps with common metrics

We will use the Hugging Face transformers and diffusers libraries for inference, FiftyOne for data management and visualization, and scikit-image for evaluation metrics. All of these libraries are open source and free to use. Disclaimer: I work at Voxel51, the lead maintainers of one of these libraries (FiftyOne).

Before we get started, make sure you have all of the necessary libraries installed:

pip install -U torch fiftyone diffusers transformers scikit-image

Then we'll import the modules we'll be using throughout the post:

from glob import glob
import numpy as np
from PIL import Image
import torch

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
from fiftyone import ViewField as F

The SUN RGB-D dataset contains 10,335 RGB-D images, each of which has a corresponding RGB image, depth image, and camera intrinsics. It contains images from the NYU Depth v2, Berkeley B3DO, and SUN3D datasets. SUN RGB-D is one of the most popular datasets for monocular depth estimation and semantic segmentation tasks!

💡For this walkthrough, we will only use the NYU Depth v2 portions. NYU Depth v2 is permissively licensed for commercial use (MIT), and can be downloaded from Hugging Face directly.

Downloading the Raw Data

First, download the SUN RGB-D dataset from here and unzip it, or use the following command to download it directly:

curl -o sunrgbd.zip https://rgbd.cs.princeton.edu/data/SUNRGBD.zip

And then unzip it:

unzip sunrgbd.zip

If you want to use the dataset for other tasks, you can fully convert the annotations and load them into your fiftyone.Dataset. However, for this tutorial we will only be using the RGB images and the depth images (stored in the depth_bfx sub-directories).

Creating the Dataset

Because we are just interested in getting the point across, we'll restrict ourselves to the first 20 samples, which are all from the NYU Depth v2 portion of the dataset:

## create, name, and persist the dataset
dataset = fo.Dataset(name="SUNRGBD-20", persistent=True)

## pick out the first 20 scenes
scene_dirs = glob("SUNRGBD/kv1/NYUdata/*")[:20]

samples = []

for scene_dir in scene_dirs:
    ## Get image file path from scene directory
    image_path = glob(f"{scene_dir}/image/*")[0]

    ## Get depth map file path from scene directory
    depth_path = glob(f"{scene_dir}/depth_bfx/*")[0]

    depth_map = np.array(Image.open(depth_path))
    depth_map = (depth_map * 255 / np.max(depth_map)).astype("uint8")

    ## Create sample
    sample = fo.Sample(
        filepath=image_path,
        gt_depth=fo.Heatmap(map=depth_map),
    )

    samples.append(sample)

## Add samples to dataset
dataset.add_samples(samples);

Here we are storing the depth maps as heatmaps. Everything is represented in terms of normalized, relative distances, where 255 represents the maximum distance in the scene and 0 represents the minimum distance in the scene. This is a common way to represent depth maps, although it is far from the only way to do so. If we were interested in absolute distances, we could store sample-wise parameters for the minimum and maximum distances in the scene, and use these to reconstruct the absolute distances from the relative distances.
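For illustration, here is a minimal sketch of that idea, assuming the same normalization as in the loading loop above; the max_depth field and variable names are assumptions for this example, not part of the original loading code:

## hypothetical sketch: keep the scene's maximum raw depth alongside the normalized map
raw_depth = np.array(Image.open(depth_path)).astype("float32")  ## depth_path as in the loop above

sample = fo.Sample(
    filepath=image_path,
    gt_depth=fo.Heatmap(map=(raw_depth * 255 / raw_depth.max()).astype("uint8")),
    max_depth=float(raw_depth.max()),  ## hypothetical field, in raw sensor units
)

## later: approximately undo the normalization to recover absolute distances
absolute_depth = sample["gt_depth"].map.astype("float32") * sample["max_depth"] / 255.0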

Visualizing Ground Truth Data

With heatmaps stored on our samples, we can visualize the ground truth data:

session = fo.launch_app(dataset, auto=False)
## then open tab to localhost:5151 in browser
Ground truth depth maps for samples from the SUN RGB-D dataset. Image courtesy of the author.

When working with depth maps, the color scheme and opacity of the heatmap are important. I am colorblind, so I find that the viridis colormap with the opacity turned all the way up works best for me.

Visibility settings for heatmaps. Image courtesy of the author.

Ground Truth?

Inspecting these RGB images and depth maps, we can see that there are some inaccuracies in the ground truth depth maps. For example, in this image, the dark rift through the center of the image is actually the farthest part of the scene, but the ground truth depth map shows it as the closest part of the scene:

Issue in the ground truth depth data for a sample from the SUN RGB-D dataset. Image courtesy of the author.

This is one of the key challenges for MDE tasks: ground truth data is hard to come by, and is often noisy! It is essential to be aware of this while evaluating your MDE models.

Now that we have our dataset loaded in, we can run monocular depth estimation models on our RGB images!

For a long time, the state-of-the-art models for monocular depth estimation, such as DORN and DenseDepth, were built with convolutional neural networks. Recently, however, both transformer-based models such as DPT and GLPN, and diffusion-based models like Marigold, have achieved remarkable results!

In this section, we'll show you how to generate MDE depth map predictions with both DPT and Marigold. In both cases, you can optionally run the model locally with the respective Hugging Face library, or run it remotely with Replicate.

To run via Replicate, install the Python client:

pip install replicate

And export your Replicate API token:

export REPLICATE_API_TOKEN=r8_<your_token_here>

💡 With Replicate, it may take a minute for the model to load into memory on the server (the cold-start problem), but once it does, each prediction should only take a few seconds. Depending on your local compute resources, running on the server may give you a massive speedup compared to running locally, especially for Marigold and other diffusion-based depth estimation approaches.

Monocular Depth Estimation with DPT

The first model we'll run is a dense prediction transformer (DPT). DPT models have found application in both MDE and semantic segmentation, tasks that require "dense", pixel-level predictions.

The checkpoint below uses MiDaS, which returns the inverse depth map, so we have to invert it back to get a comparable depth map.

To run locally with transformers, first we load the model and image processor:

from transformers import AutoImageProcessor, AutoModelForDepthEstimation

## swap for "Intel/dpt-large" if you would like
pretrained = "Intel/dpt-hybrid-midas"

image_processor = AutoImageProcessor.from_pretrained(pretrained)
dpt_model = AutoModelForDepthEstimation.from_pretrained(pretrained)

Next, we encapsulate the code for inference on a sample, including pre- and post-processing:

def apply_dpt_model(sample, model, label_field):
    image = Image.open(sample.filepath)
    inputs = image_processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        predicted_depth = outputs.predicted_depth

    prediction = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )

    output = prediction.squeeze().cpu().numpy()
    ## flip b/c MiDaS returns inverse depth
    formatted = (255 - output * 255 / np.max(output)).astype("uint8")

    sample[label_field] = fo.Heatmap(map=formatted)
    sample.save()

Here, we are storing predictions in a label_field field on our samples, represented with a heatmap just like the ground truth labels.

Note that in the apply_dpt_model() function, between the model's forward pass and the heatmap generation, we make a call to torch.nn.functional.interpolate(). This is because the model's forward pass is run on a downsampled version of the image, and we want to return a heatmap that is the same size as the original image.

Why do we need to do this? If we just want to *look* at the heatmaps, this would not matter. But if we want to compare the ground truth depth maps to the model's predictions on a per-pixel basis, we need to make sure they are the same size.

All that is left to do is iterate through the dataset:

for sample in dataset.iter_samples(autosave=True, progress=True):
    apply_dpt_model(sample, dpt_model, "dpt")

session = fo.launch_app(dataset)

Relative depth maps predicted by a hybrid MiDaS DPT model on SUN RGB-D sample images. Image courtesy of the author.

To run with Replicate, you can use this model. Here is what the API looks like:

import replicate

## example application to the first sample
rgb_fp = dataset.first().filepath

output = replicate.run(
    "cjwbw/midas:a6ba5798f04f80d3b314de0f0a62277f21ab3503c60c84d4817de83c5edfdae0",
    input={
        "model_type": "dpt_beit_large_512",
        "image": open(rgb_fp, "rb")
    }
)
print(output)

Monocular Depth Estimation with Marigold

Stemming from their massive success in text-to-image contexts, diffusion models are being applied to an ever-broadening range of problems. Marigold "repurposes" diffusion-based image generation models for monocular depth estimation.

To run Marigold locally, you will need to clone the git repository:

git clone https://github.com/prs-eth/Marigold.git

This repository introduces a new diffusers pipeline, MarigoldPipeline, which makes applying Marigold easy:

## load the model
from Marigold.marigold import MarigoldPipeline
pipe = MarigoldPipeline.from_pretrained("Bingxin/Marigold")

## apply to the first sample, as an example
rgb_image = Image.open(dataset.first().filepath)
output = pipe(rgb_image)
depth_image = output['depth_colored']

Post-processing of the output depth image is then needed.
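As a minimal sketch of that post-processing, assuming the pipeline output also exposes a floating-point depth array (the depth_np key and the marigold_local field below are assumptions; check the Marigold repository for the exact output fields in your version):

## hypothetical post-processing sketch, not from the original pipeline docs
depth_np = output['depth_np']  ## assumed: float depth array in [0, 1]

## rescale to 0-255; depending on the checkpoint's convention you may also need to
## flip the map (255 - formatted) so that 255 corresponds to the farthest point
formatted = (depth_np * 255 / np.max(depth_np)).astype("uint8")

sample = dataset.first()
sample["marigold_local"] = fo.Heatmap(map=formatted)  ## hypothetical field name
sample.save()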

To instead run via Replicate, we can create an apply_marigold_model() function in analogy with the DPT case above and iterate over the samples in our dataset:

import replicate
import requests
import io

def marigold_model(rgb_image):
    output = replicate.run(
        "adirik/marigold:1a363593bc4882684fc58042d19db5e13a810e44e02f8d4c32afd1eb30464818",
        input={
            "image": rgb_image
        }
    )
    ## get the black and white depth map
    response = requests.get(output[1]).content
    return response

def apply_marigold_model(sample, model, label_field):
    rgb_image = open(sample.filepath, "rb")
    response = model(rgb_image)
    depth_image = np.array(Image.open(io.BytesIO(response)))[:, :, 0]  ## all channels are the same
    formatted = (255 - depth_image).astype("uint8")
    sample[label_field] = fo.Heatmap(map=formatted)
    sample.save()

for sample in dataset.iter_samples(autosave=True, progress=True):
    apply_marigold_model(sample, marigold_model, "marigold")

session = fo.launch_app(dataset)

Relative depth maps predicted with a Marigold endpoint on SUN RGB-D sample images. Image courtesy of the author.

Now that we have predictions from multiple models, let's evaluate them! We will leverage scikit-image to apply three simple metrics commonly used for monocular depth estimation: root mean squared error (RMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM).

💡Higher PSNR and SSIM scores indicate better predictions, while lower RMSE scores indicate better predictions.
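As a quick illustration (not part of the original walkthrough) of why RMSE and PSNR move in opposite directions: for 8-bit maps, PSNR is just 10 * log10(255**2 / MSE). The tiny synthetic example below, with made-up random maps, is only a sanity check of that relationship:

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, mean_squared_error

rng = np.random.default_rng(0)
gt = rng.integers(0, 256, (8, 8), dtype=np.uint8)    ## toy "ground truth" map
pred = rng.integers(0, 256, (8, 8), dtype=np.uint8)  ## toy "prediction"

mse = mean_squared_error(gt, pred)
print(peak_signal_noise_ratio(gt, pred, data_range=255))
print(10 * np.log10(255 ** 2 / mse))  ## identical, up to floating point error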

Note that the specific values I arrive at are a consequence of the specific pre- and post-processing steps I performed along the way. What matters is the relative performance!

We will define the evaluation routine:

from skimage.metrics import peak_signal_noise_ratio, mean_squared_error, structural_similarity

def rmse(gt, pred):
    """Compute root mean squared error between ground truth and prediction"""
    return np.sqrt(mean_squared_error(gt, pred))

def evaluate_depth(dataset, prediction_field, gt_field):
    """Run 3 evaluation metrics for all samples for `prediction_field`
    with respect to `gt_field`"""
    for sample in dataset.iter_samples(autosave=True, progress=True):
        gt_map = sample[gt_field].map
        pred = sample[prediction_field]
        pred_map = pred.map
        pred["rmse"] = rmse(gt_map, pred_map)
        pred["psnr"] = peak_signal_noise_ratio(gt_map, pred_map)
        pred["ssim"] = structural_similarity(gt_map, pred_map)
        sample[prediction_field] = pred

    ## add dynamic fields to the dataset so we can view them in the App
    dataset.add_dynamic_sample_fields()

And then apply the evaluation to the predictions from both models:

evaluate_depth(dataset, "dpt", "gt_depth")
evaluate_depth(dataset, "marigold", "gt_depth")

Computing average performance for a certain model/metric is as simple as calling the dataset's mean() method on that field:

print("Imply Error Metrics")
for mannequin in ["dpt", "marigold"]:
print("-"*50)
for metric in ["rmse", "psnr", "ssim"]:
mean_metric_value = dataset.imply(f"{mannequin}.{metric}")
print(f"Imply {metric} for {mannequin}: {mean_metric_value}")
Imply Error Metrics
--------------------------------------------------
Imply rmse for dpt: 49.8915828817003
Imply psnr for dpt: 14.805904629602551
Imply ssim for dpt: 0.8398022368184576
--------------------------------------------------
Imply rmse for marigold: 104.0061165272178
Imply psnr for marigold: 7.93015537185192
Imply ssim for marigold: 0.42766803372861134

All of the metrics seem to agree that DPT outperforms Marigold. However, it is important to note that these metrics are not perfect. For example, RMSE is very sensitive to outliers, and SSIM is not very sensitive to small errors. For a more thorough evaluation, we can filter by these metrics in the app in order to visualize what the model is doing well and what it is doing poorly, or where the metrics are failing to capture the model's performance.
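For instance, here is a minimal sketch of how one might surface the worst predictions using the fields computed above (the 0.25 threshold is an arbitrary assumption):

## sort samples so the highest-error DPT predictions show up first in the App
worst_dpt_view = dataset.sort_by("dpt.rmse", reverse=True)
session.view = worst_dpt_view

## or filter to samples where the two models disagree strongly on SSIM
disagreement_view = dataset.match(
    (F("dpt.ssim") - F("marigold.ssim")).abs() > 0.25  ## threshold is arbitrary
)
session.view = disagreement_view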

Finally, toggling masks on and off is a great way to visualize the differences between the ground truth and the models' predictions:

Visual comparison of heatmaps predicted by the two MDE models and the ground truth. Image courtesy of the author.

To recap, we learned how to run monocular depth estimation models on our data, how to evaluate the predictions using common metrics, and how to visualize the results. We also learned that monocular depth estimation is a notoriously difficult task.

Data quality and quantity are severely limiting factors; models often struggle to generalize to new environments; and metrics are not always good indicators of model performance. The specific numeric values quantifying model performance can depend greatly on your processing pipeline. And even your qualitative assessment of predicted depth maps can be heavily influenced by your color schemes and opacity scales.

If there is one thing you take away from this post, I hope it is this: it is mission-critical that you look at the depth maps themselves, and not just the metrics!
