Zero-shot object detection — describe what to find in plain English.
Stop labelling for every class you might ever need. Pass a text prompt — “a person wearing a hard hat”, “rust on a metal surface”, “small drone in sky” — and get back bounding boxes and confidence scores. No training, no datasets, no model selection.
- Model: Grounding DINO (Tiny)
- Inputs: JPG/PNG ≤ 25 MB + text prompt
- Outputs: bounding boxes · labels · confidence
- Latency: ~350 ms p50
- Free quota: 300 calls / month
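For orientation, a representative response body. The field names (detections, label, box, confidence) come from the code samples below; the values here are illustrative:

example_response = {
    "detections": [
        # box is [x1, y1, x2, y2] in pixels (inferred from the detect-and-segment
        # snippet below, which computes box centres from these four values)
        {"label": "hard hat", "box": [412, 88, 518, 190], "confidence": 0.87},
        {"label": "person", "box": [380, 60, 566, 642], "confidence": 0.81},
    ]
}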
Grounding DINO is an open-vocabulary detector — trained on image-text pairs, it can detect objects described by text prompts it has never seen during training. That breaks the traditional detection workflow (collect labels for every class, train, fail on a new class, collect more labels). With Grounding DINO, the prompt is the class.
mSightFlow hosts the “Tiny” checkpoint (good balance of speed and accuracy) as a REST endpoint with optional visualisation overlay, confidence-threshold tuning, and seamless pairing with SAM segmentation for end-to-end zero-shot detect-and-segment.
When zero-shot detection is the right tool
Rapid prototyping
Validate whether CV works on your problem before committing to a labelling project. Two minutes from idea to numbers.
Dataset bootstrapping
Pre-label a new class with prompts, then refine with auto-labelling and active learning. A minimal sketch follows this list.
Long-tail classes
Rare objects that don't have public datasets — niche tooling, small wildlife, custom industrial defects. The prompt covers them.
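For the bootstrapping case, a minimal sketch: pre-label a folder of images with a single prompt and write one JSON line per image. The forklift class, the unlabelled/ folder, and the prelabels.jsonl format are all illustrative, not product conventions:

import os, json, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

# Pre-label every image in a folder with one prompt; the low threshold
# favours recall, since a later human or active-learning pass refines it.
with open("prelabels.jsonl", "w") as out:
    for img in sorted(Path("unlabelled").glob("*.jpg")):
        resp = requests.post(
            "https://api.msightflow.ai/v1/detect/zero-shot",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"image": img.read_bytes()},
            data={"prompt": "forklift", "confidence_threshold": "0.15"},
        )
        out.write(json.dumps({"image": img.name,
                              "detections": resp.json()["detections"]}) + "\n")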
How it works — three steps
- 01 · Write a text prompt
  Comma-separated nouns: person, hard hat, safety vest. Or phrasal: small drone in sky. Up to 20 classes per call.
- 02 · POST image + prompt
  POST /v1/detect/zero-shot with your image, prompt, and optional confidence threshold. ~350 ms p50.
- 03 · Get bounding boxes
  Per-detection box, label (matched to your prompt), and confidence. Pipe into SAM, auto-label, or your downstream code.
Code — Python, Node, cURL
import os, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

resp = requests.post(
    "https://api.msightflow.ai/v1/detect/zero-shot",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"image": Path("factory_floor.jpg").read_bytes()},
    data={
        "prompt": "person, hard hat, safety vest",
        "confidence_threshold": "0.25",
        "return_overlay": "true",  # also return a visualisation overlay
    },
)
resp.raise_for_status()

for d in resp.json()["detections"]:
    print(f"{d['label']:>14} @ {d['box']} ({d['confidence']:.2f})")
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("factory_floor.jpg"));
form.append("prompt", "person, hard hat, safety vest");
form.append("confidence_threshold", "0.25");

const resp = await fetch("https://api.msightflow.ai/v1/detect/zero-shot", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MSF_API_KEY}`,
    ...form.getHeaders(), // multipart boundary set by form-data
  },
  body: form,
});

const { detections } = await resp.json();
console.log(`Found ${detections.length} objects`);
curl -X POST https://api.msightflow.ai/v1/detect/zero-shot \
-H "Authorization: Bearer $MSF_API_KEY" \
-F "image=@factory_floor.jpg" \
-F "prompt=person, hard hat, safety vest" \
-F "confidence_threshold=0.25"
# Zero-shot detection → SAM segmentation: end-to-end zero-shot pipeline.
# (Reuses api_key and the imports from the Python example above.)
image_bytes = Path("weld_photo.jpg").read_bytes()  # any JPG/PNG ≤ 25 MB

det = requests.post(
    "https://api.msightflow.ai/v1/detect/zero-shot",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"image": image_bytes},
    data={"prompt": "cracked weld"},
).json()

for d in det["detections"]:
    # Box centre → positive point prompt for SAM.
    cx, cy = (d["box"][0] + d["box"][2]) / 2, (d["box"][1] + d["box"][3]) / 2
    seg = requests.post(
        "https://api.msightflow.ai/v1/segment/interactive",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_bytes},
        data={"points": f'[{{"x":{cx},"y":{cy},"label":1}}]'},
    ).json()
    # seg["mask"] now contains a pixel-perfect mask for the crack
Prompt-writing tips
- Use nouns, not verbs. 'helmet' works; 'wearing a helmet' works less well.
- Comma-separate distinct classes. Each comma starts a new class. Don't comma-separate adjectives.
- Add visual modifiers when colour or size matters: 'red car', 'small drone', 'thin crack'.
- Avoid abstract nouns. 'safety' won't detect anything; 'safety helmet' will.
- Lower the threshold for recall, raise it for precision. Default 0.25 is balanced for general scenes.
- For unusual classes, A/B-test prompt phrasings: 'weld bead' vs 'welding seam' can give different recall.
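One way to run that A/B test, sketched below. The detection_count helper and the samples/ folder are illustrative, not part of the API:

import os, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

def detection_count(image_path: Path, prompt: str) -> int:
    """Count how many boxes one phrasing yields on one image."""
    resp = requests.post(
        "https://api.msightflow.ai/v1/detect/zero-shot",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_path.read_bytes()},
        data={"prompt": prompt, "confidence_threshold": "0.25"},
    )
    return len(resp.json()["detections"])

for phrasing in ("weld bead", "welding seam"):
    total = sum(detection_count(p, phrasing) for p in Path("samples").glob("*.jpg"))
    print(f"{phrasing!r}: {total} detections across the sample set")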
Pricing — same as every other endpoint
Standard
$10/mo
- 5,400 API calls / month
- 500 exports / month
- Batch up to 10 images / call
Full feature matrix on the pricing page.
Related features
SAM segmentation
Pair Grounding DINO bboxes with SAM masks for end-to-end zero-shot segmentation.
Auto-labelling
Dispatch detect + segment + classify + pose in one call to bootstrap an annotation pass.
Trained YOLO detection
When the classes are stable and you want lower latency, fine-tuned YOLO beats zero-shot.
FAQ
How is zero-shot detection different from a fine-tuned YOLO model?
A fine-tuned detector only knows the classes in its training set. Grounding DINO is trained on text-image pairs and can detect any class you describe in natural language — including classes that don't have public datasets. Trade-off: a fine-tuned model is faster and usually more accurate on the classes it was trained for. Use zero-shot for rapid prototyping, rare classes, and dataset bootstrapping.
What prompt syntax works best?
Comma-separated noun phrases are the most reliable: 'person, helmet, safety vest'. Add modifiers when needed: 'red car', 'small drone in sky', 'cracked weld'. Avoid full sentences. Each comma starts a new class.
How many classes can I detect in one call?
Up to 20 comma-separated class queries per call. Beyond that, accuracy starts to drop because the model has to fit every prompt into its text encoder. For more classes, split into multiple calls and merge the results.
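A sketch of that split, assuming api_key and image_bytes as defined in the Python examples above; the chunk size of 20 matches the limit stated here:

# Split a long class list into prompts of at most 20 classes each,
# then merge the detections client-side.
classes: list[str] = [...]  # placeholder for your full class list
all_detections = []
for i in range(0, len(classes), 20):
    resp = requests.post(
        "https://api.msightflow.ai/v1/detect/zero-shot",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_bytes},
        data={"prompt": ", ".join(classes[i:i + 20])},
    )
    all_detections.extend(resp.json()["detections"])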
Can I combine this with SAM for segmentation?
Yes — that's the canonical zero-shot dataset-bootstrapping flow. Grounding DINO returns bboxes; pass each bbox center to /v1/segment/interactive as a point prompt, and SAM returns a pixel-perfect mask. End-to-end zero-shot segmentation with no training.
What's the confidence threshold?
Default 0.25. Lower it (e.g. 0.15) for rare-class recall; raise it (e.g. 0.4) for cleaner precision. Returned confidences are calibrated separately for text queries vs box quality, so the same threshold may behave differently across prompts.
Prompt. Detect. Ship.
300 free API calls / month. Grounding DINO. Any class. No training.