Zero-shot object detection — describe what to find in plain English.
Stop labelling for every class you might ever need. Pass a text prompt — “a person wearing a hard hat”, “rust on a metal surface”, “small drone in sky” — and get back bounding boxes and confidence scores. No training, no datasets, no model selection.
- Model: Grounding DINO (Tiny)
- Inputs: JPG/PNG ≤ 25 MB + text prompt
- Outputs: bounding boxes · labels · confidence
- Latency: ~350 ms p50
- Free quota: 300 calls / month
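For orientation, a representative response body. The field names (detections, label, box, confidence) come from the code samples below; the values here are illustrative:

example_response = {
    "detections": [
        # box is [x1, y1, x2, y2] in pixels (inferred from the detect-and-segment
        # snippet below, which computes box centres from these four values)
        {"label": "hard hat", "box": [412, 88, 518, 190], "confidence": 0.87},
        {"label": "person", "box": [380, 60, 566, 642], "confidence": 0.81},
    ]
}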
Grounding DINO is an open-vocabulary detector — trained on image-text pairs, it can detect objects described by text prompts it has never seen during training. That breaks the traditional detection workflow (collect labels for every class, train, fail on a new class, collect more labels). With Grounding DINO, the prompt is the class.
mSightFlow hosts the “Tiny” checkpoint (good balance of speed and accuracy) as a REST endpoint with optional visualisation overlay, confidence-threshold tuning, and seamless pairing with SAM segmentation for end-to-end zero-shot detect-and-segment.
When zero-shot detection is the right tool
Rapid prototyping
Validate whether CV works on your problem before committing to a labelling project. Two minutes from idea to numbers.
Dataset bootstrapping
Pre-label a new class with prompts, then refine with auto-labelling and active learning. A minimal sketch follows this list.
Long-tail classes
Rare objects that don't have public datasets — niche tooling, small wildlife, custom industrial defects. The prompt covers them.
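For the bootstrapping case, a minimal sketch: pre-label a folder of images with a single prompt and write one JSON line per image. The forklift class, the unlabelled/ folder, and the prelabels.jsonl format are all illustrative, not product conventions:

import os, json, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

# Pre-label every image in a folder with one prompt; the low threshold
# favours recall, since a later human or active-learning pass refines it.
with open("prelabels.jsonl", "w") as out:
    for img in sorted(Path("unlabelled").glob("*.jpg")):
        resp = requests.post(
            "https://api.msightflow.ai/v1/detect/zero-shot",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"image": img.read_bytes()},
            data={"prompt": "forklift", "confidence_threshold": "0.15"},
        )
        out.write(json.dumps({"image": img.name,
                              "detections": resp.json()["detections"]}) + "\n")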
How it works — three steps
- 01 · Write a text prompt
  Comma-separated nouns: person, hard hat, safety vest. Or phrasal: small drone in sky. Up to 20 classes per call.
- 02 · POST image + prompt
  POST /v1/detect/zero-shot with your image, prompt, and optional confidence threshold. ~350 ms p50.
- 03 · Get bounding boxes
  Per-detection box, label (matched to your prompt), and confidence. Pipe into SAM, auto-label, or your downstream code.
Code — Python, Node, cURL
import os, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

resp = requests.post(
    "https://api.msightflow.ai/v1/detect/zero-shot",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"image": Path("factory_floor.jpg").read_bytes()},
    data={
        "prompt": "person, hard hat, safety vest",
        "confidence_threshold": "0.25",
        "return_overlay": "true",  # also return a visualisation overlay
    },
)
resp.raise_for_status()

for d in resp.json()["detections"]:
    print(f"{d['label']:>14} @ {d['box']} ({d['confidence']:.2f})")
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("factory_floor.jpg"));
form.append("prompt", "person, hard hat, safety vest");
form.append("confidence_threshold", "0.25");

const resp = await fetch("https://api.msightflow.ai/v1/detect/zero-shot", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MSF_API_KEY}`,
    ...form.getHeaders(), // multipart boundary set by form-data
  },
  body: form,
});

const { detections } = await resp.json();
console.log(`Found ${detections.length} objects`);
curl -X POST https://api.msightflow.ai/v1/detect/zero-shot \
-H "Authorization: Bearer $MSF_API_KEY" \
-F "image=@factory_floor.jpg" \
-F "prompt=person, hard hat, safety vest" \
-F "confidence_threshold=0.25"
# Zero-shot detection → SAM segmentation: end-to-end zero-shot pipeline.
# (Reuses api_key and the imports from the Python example above.)
image_bytes = Path("weld_photo.jpg").read_bytes()  # any JPG/PNG ≤ 25 MB

det = requests.post(
    "https://api.msightflow.ai/v1/detect/zero-shot",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"image": image_bytes},
    data={"prompt": "cracked weld"},
).json()

for d in det["detections"]:
    # Box centre → positive point prompt for SAM.
    cx, cy = (d["box"][0] + d["box"][2]) / 2, (d["box"][1] + d["box"][3]) / 2
    seg = requests.post(
        "https://api.msightflow.ai/v1/segment/interactive",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_bytes},
        data={"points": f'[{{"x":{cx},"y":{cy},"label":1}}]'},
    ).json()
    # seg["mask"] now contains a pixel-perfect mask for the crack
Prompt-writing tips
- Use nouns, not verbs. 'helmet' works; 'wearing a helmet' works less well.
- Comma-separate distinct classes. Each comma starts a new class. Don't comma-separate adjectives.
- Add visual modifiers when colour or size matters: 'red car', 'small drone', 'thin crack'.
- Avoid abstract nouns. 'safety' won't detect anything; 'safety helmet' will.
- Lower the threshold for recall, raise it for precision. Default 0.25 is balanced for general scenes.
- For unusual classes, A/B-test prompt phrasings: 'weld bead' vs 'welding seam' can give different recall.
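One way to run that A/B test, sketched below. The detection_count helper and the samples/ folder are illustrative, not part of the API:

import os, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

def detection_count(image_path: Path, prompt: str) -> int:
    """Count how many boxes one phrasing yields on one image."""
    resp = requests.post(
        "https://api.msightflow.ai/v1/detect/zero-shot",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_path.read_bytes()},
        data={"prompt": prompt, "confidence_threshold": "0.25"},
    )
    return len(resp.json()["detections"])

for phrasing in ("weld bead", "welding seam"):
    total = sum(detection_count(p, phrasing) for p in Path("samples").glob("*.jpg"))
    print(f"{phrasing!r}: {total} detections across the sample set")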
Pricing — same as every other endpoint
Standard
$10/mo
- 5,400 API calls / month
- 500 exports / month
- Batch up to 10 images / call
Full feature matrix on the pricing page.
Related features
SAM segmentation
Pair Grounding DINO bboxes with SAM masks for end-to-end zero-shot segmentation.
Auto-labelling
Dispatch detect + segment + classify + pose in one call to bootstrap an annotation pass.
Trained YOLO detection
When the classes are stable and you want lower latency, fine-tuned YOLO beats zero-shot.
FAQ
How is zero-shot detection different from a fine-tuned YOLO model?
A fine-tuned detector only knows the classes in its training set. Grounding DINO is trained on text-image pairs and can detect any class you describe in natural language — including classes that don't have public datasets. Trade-off: a fine-tuned model is faster and usually more accurate on the classes it was trained for. Use zero-shot for rapid prototyping, rare classes, and dataset bootstrapping.
What prompt syntax works best?
Comma-separated noun phrases are the most reliable: 'person, helmet, safety vest'. Add modifiers when needed: 'red car', 'small drone in sky', 'cracked weld'. Avoid full sentences. Each comma starts a new class.
How many classes can I detect in one call?
Up to 20 comma-separated class queries per call. Beyond that, accuracy starts to drop because the model has to fit every prompt into its text encoder. For more classes, split into multiple calls and merge the results.
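A sketch of that split, assuming api_key and image_bytes as defined in the Python examples above; the chunk size of 20 matches the limit stated here:

# Split a long class list into prompts of at most 20 classes each,
# then merge the detections client-side.
classes: list[str] = [...]  # placeholder for your full class list
all_detections = []
for i in range(0, len(classes), 20):
    resp = requests.post(
        "https://api.msightflow.ai/v1/detect/zero-shot",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_bytes},
        data={"prompt": ", ".join(classes[i:i + 20])},
    )
    all_detections.extend(resp.json()["detections"])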
Can I combine this with SAM for segmentation?
Yes — that's the canonical zero-shot dataset-bootstrapping flow. Grounding DINO returns bboxes; pass each bbox center to /v1/segment/interactive as a point prompt, and SAM returns a pixel-perfect mask. End-to-end zero-shot segmentation with no training.
What's the confidence threshold?
Default 0.25. Lower it (e.g. 0.15) for rare-class recall; raise it (e.g. 0.4) for cleaner precision. Returned confidences are calibrated separately for text queries vs box quality, so the same threshold may behave differently across prompts.
Prompt. Detect. Ship.
300 free API calls / month. Grounding DINO. Any class. No training.