Depth Anything v2 — state-of-the-art monocular depth

Per-pixel depth from a single image — no stereo camera needed.

Estimate scene depth from any photo. No stereo rig. No LiDAR. No calibration. mSightFlow hosts Depth Anything v2 — the strongest open monocular depth model — as a REST endpoint with a magma-colormap visualisation by default and raw float32 on Pro tier.

Model
Depth Anything v2 (ViT-B)
Inputs
JPG/PNG ≤ 25 MB (single image)
Outputs
per-pixel depth map (PNG, magma colormap)
Latency
~400 ms p50
Free quota
300 calls / month

Monocular depth — depth from a single image, no stereo — was hard until Depth Anything v2. Trained on 62 million unlabelled images plus 595K labelled pairs, it generalises across indoor, outdoor, macro, drone, and aerial views with a quality that two years ago required a stereo rig or LiDAR.

mSightFlow exposes Depth Anything v2 ViT-B as a REST endpoint. It returns a colourised PNG by default; on the Pro tier you also get raw float32 depth to feed into your own maths. Pair it with SAM or detection to get per-object depth statistics.

When monocular depth is the right tool

AR & composition

Depth-aware background blur, focus pulling, simulated bokeh, parallax effects, AR object placement.
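As a sketch of depth-aware blur: once you've decoded the depth map into a float array in [0, 1], you can blend a blurred copy of the image in, weighted by each pixel's distance from a chosen focus depth. The helper below is an illustration, not part of the API; it assumes smaller values mean nearer (invert if your map uses the opposite convention).

```python
import numpy as np
from PIL import Image, ImageFilter

def depth_blur(image: Image.Image, depth: np.ndarray, focus: float = 0.2,
               radius: int = 8) -> Image.Image:
    """Simulated bokeh: blend a Gaussian-blurred copy of `image`,
    weighted by how far each pixel's depth is from `focus`.
    `depth` is a float array in [0, 1] with the same height/width."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    # Per-pixel blur weight: 0 at the focus plane, 1 far from it.
    w = np.clip(np.abs(depth - focus) * 2.0, 0.0, 1.0)[..., None]
    sharp = np.asarray(image, dtype=np.float32)
    soft = np.asarray(blurred, dtype=np.float32)
    return Image.fromarray((sharp * (1 - w) + soft * w).astype(np.uint8))
```

Moving `focus` per frame gives a cheap focus-pull effect; for parallax you'd shift pixels by depth instead of blurring them.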

Defect sizing

Combine a SAM mask with the depth map to estimate physical extent of an annotated defect relative to a known reference.
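A minimal sizing sketch, assuming the defect and the reference object lie in roughly the same plane (so one pixels-to-mm scale applies to both; a depth-aware version would correct each region by its relative depth). The helper name and masks are hypothetical:

```python
import numpy as np

def defect_extent_mm(defect_mask: np.ndarray, ref_mask: np.ndarray,
                     ref_width_mm: float) -> tuple[float, float]:
    """Estimate the defect's bounding-box width/height in mm, scaled
    from a reference object of known physical width in the same image."""
    def bbox_wh(mask):
        ys, xs = np.nonzero(mask)
        return xs.max() - xs.min() + 1, ys.max() - ys.min() + 1
    ref_w_px, _ = bbox_wh(ref_mask)
    mm_per_px = ref_width_mm / ref_w_px        # same-plane assumption
    w_px, h_px = bbox_wh(defect_mask)
    return w_px * mm_per_px, h_px * mm_per_px
```

Both masks would come from SAM calls (see the combined example below the code section); the known-width reference is what anchors relative depth to a physical scale.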

Robotics-adjacent

Single-camera obstacle priority, distance-to-target estimation, scene parsing where stereo isn't available.

Relative vs metric depth — which one?

                          Relative depth (default)      Metric depth (roadmapped)
Output unit               0.0–1.0 (normalised)          metres
Needs calibration?        No                            Camera intrinsics required
Cross-image comparison?   No                            Yes
Good for                  AR, sizing relative to        Measurement, robotics,
                          reference, scene priority     mapping
Available today?          ✅ Yes                        Pro tier — request access

Code — Python, Node, cURL

Python
import os, base64, requests
from pathlib import Path

resp = requests.post(
    "https://api.msightflow.ai/v1/depth",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files={"image": Path("scene.jpg").read_bytes()},
)
result = resp.json()

# Save the colourised depth map
Path("depth.png").write_bytes(base64.b64decode(result["depth_map"]))

# On Pro tier, raw float32 depth is also returned
# import numpy as np
# arr = np.frombuffer(base64.b64decode(result["raw_depth"]), dtype=np.float32)
Node.js
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("scene.jpg"));

const resp = await fetch("https://api.msightflow.ai/v1/depth", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
  body: form,
});
const { depth_map } = await resp.json();
fs.writeFileSync("depth.png", Buffer.from(depth_map, "base64"));
cURL
curl -X POST https://api.msightflow.ai/v1/depth \
  -H "Authorization: Bearer $MSF_API_KEY" \
  -F "image=@scene.jpg" \
  --output depth-response.json
Depth + SAM = per-object depth
# Depth + SAM = per-object depth statistics.
import base64, io, os
import numpy as np
import requests
from PIL import Image

api = "https://api.msightflow.ai/v1"
hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
img = open("scene.jpg", "rb").read()  # bytes, so both requests can reuse it

# 1. Get depth map (decoded to greyscale, 0–255)
depth_resp = requests.post(api + "/depth", headers=hdr, files={"image": img}).json()
depth = np.array(Image.open(io.BytesIO(base64.b64decode(depth_resp["depth_map"]))).convert("L"))

# 2. Get object mask via SAM
seg_resp = requests.post(api + "/interactive_segment", headers=hdr,
    files={"image": img}, data={"points": '[{"x":320,"y":240,"label":1}]'}).json()
mask = np.array(Image.open(io.BytesIO(base64.b64decode(seg_resp["mask"]))).convert("L")) > 127

# 3. Object depth statistics
obj_depth = depth[mask]
print(f"object median depth: {np.median(obj_depth):.1f}, std: {np.std(obj_depth):.1f}")

Pricing — same as every other endpoint

Free

$0

  • 300 API calls / month
  • Colourised PNG output
  • All inference endpoints
  • No credit card
Start free

Pro

$29/mo

  • Unlimited calls
  • Raw float32 depth output
  • Higher per-provider quotas
Go Pro

Related features

FAQ

What's the difference between relative and metric depth?

Depth Anything v2 (the model we host) returns relative depth — depths are accurate relative to each other within the same image, but not in metres. For metric depth you need either a calibrated stereo rig, a known reference object in the scene, or a metric-finetuned model (roadmapped for Pro tier).
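If one point in the scene has a known distance, you can rescale the relative map so that point reads in metres. This is a deliberately crude single-scale sketch (the helper is hypothetical): relative depth is only defined up to an affine transform, so for anything precise you want multiple anchors or the metric-finetuned model.

```python
import numpy as np

def scale_to_metres(rel_depth: np.ndarray, anchor_xy: tuple[int, int],
                    anchor_metres: float) -> np.ndarray:
    """Rescale a relative depth map so the pixel at `anchor_xy` (x, y)
    reads `anchor_metres`. A single-scale approximation only."""
    x, y = anchor_xy
    return rel_depth * (anchor_metres / rel_depth[y, x])
```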

Why monocular instead of stereo?

Monocular works with any single-frame image — phone photos, drone footage, archived video, historical photos. No special hardware required. Stereo and structured-light depth are higher-accuracy but require capture-side investment. Many real applications (AR effects, defect sizing relative to a reference object, scene priority) work fine with relative depth.

What input resolution should I use?

Depth Anything v2 internally resizes to 518 px on the long side, so larger inputs don't add accuracy and they slow down the call. Send images at the resolution you care about (we resize the output back to match your input). For batch processing of large image archives, downsize to ~1024 px to save bandwidth.

Can I use this for SLAM or 3D reconstruction?

Not directly. SLAM needs camera-pose tracking, which mSightFlow doesn't do. Depth Anything v2 gives you depth per frame; combining frames into a coherent 3D model is a separate workflow (try Open3D or COLMAP downstream).

How accurate is it?

On standard benchmarks Depth Anything v2 ViT-B is state-of-the-art for zero-shot monocular depth. It handles indoor scenes, outdoor scenes, and macro shots well. Transparent or highly reflective surfaces remain the hardest case — expect noise on glass and polished metal.

One image. Full depth. No rig.

300 free API calls / month. Depth Anything v2. No credit card.