Depth Anything v2 — state-of-the-art monocular depth

Per-pixel depth from a single image — no stereo camera needed.

Estimate scene depth from any photo. No stereo rig. No LiDAR. No calibration. mSightFlow hosts Depth Anything v2 — the strongest open monocular depth model — as a REST endpoint with a magma-colormap visualisation by default and raw float32 on Pro tier.

Model
Depth Anything v2 (ViT-B)
Inputs
JPG/PNG ≤ 25 MB (single image)
Outputs
per-pixel depth map (PNG, magma colormap)
Latency
~400 ms p50
Free quota
300 calls / month

Monocular depth — depth from a single image, no stereo — was hard until Depth Anything v2. Trained on 62 million unlabelled images plus 595K labelled pairs, it generalises across indoor, outdoor, macro, drone, and aerial views with a quality that two years ago required a stereo rig or LiDAR.

mSightFlow exposes Depth Anything v2 ViT-B as a REST endpoint. It returns a colourised PNG by default; on the Pro tier you also get raw float32 depth to feed into your own maths. Pair it with SAM or detection to get per-object depth statistics.

When monocular depth is the right tool

AR & composition

Depth-aware background blur, focus pulling, simulated bokeh, parallax effects, AR object placement.
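As a sketch of depth-aware blur: once you've decoded the depth map into a float array in [0, 1], you can blend a blurred copy of the image in, weighted by each pixel's distance from a chosen focus depth. The helper below is an illustration, not part of the API; it assumes smaller values mean nearer (invert if your map uses the opposite convention).

```python
import numpy as np
from PIL import Image, ImageFilter

def depth_blur(image: Image.Image, depth: np.ndarray, focus: float = 0.2,
               radius: int = 8) -> Image.Image:
    """Simulated bokeh: blend a Gaussian-blurred copy of `image`,
    weighted by how far each pixel's depth is from `focus`.
    `depth` is a float array in [0, 1] with the same height/width."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    # Per-pixel blur weight: 0 at the focus plane, 1 far from it.
    w = np.clip(np.abs(depth - focus) * 2.0, 0.0, 1.0)[..., None]
    sharp = np.asarray(image, dtype=np.float32)
    soft = np.asarray(blurred, dtype=np.float32)
    return Image.fromarray((sharp * (1 - w) + soft * w).astype(np.uint8))
```

Moving `focus` per frame gives a cheap focus-pull effect; for parallax you'd shift pixels by depth instead of blurring them.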

Defect sizing

Combine a SAM mask with the depth map to estimate physical extent of an annotated defect relative to a known reference.
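A minimal sizing sketch, assuming the defect and the reference object lie in roughly the same plane (so one pixels-to-mm scale applies to both; a depth-aware version would correct each region by its relative depth). The helper name and masks are hypothetical:

```python
import numpy as np

def defect_extent_mm(defect_mask: np.ndarray, ref_mask: np.ndarray,
                     ref_width_mm: float) -> tuple[float, float]:
    """Estimate the defect's bounding-box width/height in mm, scaled
    from a reference object of known physical width in the same image."""
    def bbox_wh(mask):
        ys, xs = np.nonzero(mask)
        return xs.max() - xs.min() + 1, ys.max() - ys.min() + 1
    ref_w_px, _ = bbox_wh(ref_mask)
    mm_per_px = ref_width_mm / ref_w_px        # same-plane assumption
    w_px, h_px = bbox_wh(defect_mask)
    return w_px * mm_per_px, h_px * mm_per_px
```

Both masks would come from SAM calls (see the combined example below the code section); the known-width reference is what anchors relative depth to a physical scale.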

Robotics-adjacent

Single-camera obstacle priority, distance-to-target estimation, scene parsing where stereo isn't available.

Relative vs metric depth — which one?

                          Relative depth (default)      Metric depth (roadmapped)
Output unit               0.0–1.0 (normalised)          metres
Needs calibration?        No                            Camera intrinsics required
Cross-image comparison?   No                            Yes
Good for                  AR, sizing relative to        Measurement, robotics,
                          reference, scene priority     mapping
Available today?          ✅ Yes                        Pro tier — request access

Code — Python, Node, cURL

Python
import os, base64, requests
from pathlib import Path

resp = requests.post(
    "https://api.msightflow.ai/v1/depth",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files={"image": Path("scene.jpg").read_bytes()},
)
result = resp.json()

# Save the colourised depth map
Path("depth.png").write_bytes(base64.b64decode(result["depth_map"]))

# On Pro tier, raw float32 depth is also returned
# import numpy as np
# arr = np.frombuffer(base64.b64decode(result["raw_depth"]), dtype=np.float32)
Node.js
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("scene.jpg"));

const resp = await fetch("https://api.msightflow.ai/v1/depth", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
  body: form,
});
const { depth_map } = await resp.json();
fs.writeFileSync("depth.png", Buffer.from(depth_map, "base64"));
cURL
curl -X POST https://api.msightflow.ai/v1/depth \
  -H "Authorization: Bearer $MSF_API_KEY" \
  -F "image=@scene.jpg" \
  --output depth-response.json
Depth + SAM = per-object depth
# Depth + SAM = per-object depth statistics.
import base64, io, os
import numpy as np
import requests
from PIL import Image

api = "https://api.msightflow.ai/v1"
hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
img = open("scene.jpg", "rb").read()  # bytes, so both requests can reuse it

# 1. Get depth map (decoded to greyscale, 0–255)
depth_resp = requests.post(api + "/depth", headers=hdr, files={"image": img}).json()
depth = np.array(Image.open(io.BytesIO(base64.b64decode(depth_resp["depth_map"]))).convert("L"))

# 2. Get object mask via SAM
seg_resp = requests.post(api + "/interactive_segment", headers=hdr,
    files={"image": img}, data={"points": '[{"x":320,"y":240,"label":1}]'}).json()
mask = np.array(Image.open(io.BytesIO(base64.b64decode(seg_resp["mask"]))).convert("L")) > 127

# 3. Object depth statistics
obj_depth = depth[mask]
print(f"object median depth: {np.median(obj_depth):.1f}, std: {np.std(obj_depth):.1f}")

Pricing — same as every other endpoint

Free

$0

  • 300 API calls / month
  • Colourised PNG output
  • All inference endpoints
  • No credit card
Start free

Pro

$29/mo

  • Unlimited calls
  • Raw float32 depth output
  • Higher per-provider quotas
Go Pro

Related features

FAQ

What's the difference between relative and metric depth?

Depth Anything v2 (the model we host) returns relative depth — depths are accurate relative to each other within the same image, but not in metres. For metric depth you need either a calibrated stereo rig, a known reference object in the scene, or a metric-finetuned model (roadmapped for Pro tier).
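If one point in the scene has a known distance, you can rescale the relative map so that point reads in metres. This is a deliberately crude single-scale sketch (the helper is hypothetical): relative depth is only defined up to an affine transform, so for anything precise you want multiple anchors or the metric-finetuned model.

```python
import numpy as np

def scale_to_metres(rel_depth: np.ndarray, anchor_xy: tuple[int, int],
                    anchor_metres: float) -> np.ndarray:
    """Rescale a relative depth map so the pixel at `anchor_xy` (x, y)
    reads `anchor_metres`. A single-scale approximation only."""
    x, y = anchor_xy
    return rel_depth * (anchor_metres / rel_depth[y, x])
```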

Why monocular instead of stereo?

Monocular works with any single-frame image — phone photos, drone footage, archived video, historical photos. No special hardware required. Stereo and structured-light depth are higher-accuracy but require capture-side investment. Many real applications (AR effects, defect sizing relative to a reference object, scene priority) work fine with relative depth.

What input resolution should I use?

Depth Anything v2 internally resizes to 518 px on the long side, so larger inputs don't add accuracy and they slow down the call. Send images at the resolution you care about (we resize the output back to match your input). For batch processing of large image archives, downsize to ~1024 px to save bandwidth.

Can I use this for SLAM or 3D reconstruction?

Not directly. SLAM needs camera-pose tracking, which mSightFlow doesn't do. Depth Anything v2 gives you depth per frame; combining frames into a coherent 3D model is a separate workflow (try Open3D or COLMAP downstream).

How accurate is it?

On standard benchmarks Depth Anything v2 ViT-B is state-of-the-art for zero-shot monocular depth. It handles indoor scenes, outdoor scenes, and macro shots well. Transparent or highly reflective surfaces remain the hardest case — expect noise on glass and polished metal.

One image. Full depth. No rig.

300 free API calls / month. Depth Anything v2. No credit card.