Powered by Grounding DINO

Zero-shot object detection — describe what to find in plain English.

Stop labelling for every class you might ever need. Pass a text prompt — “a person wearing a hard hat”, “rust on a metal surface”, “small drone in sky” — and get back bounding boxes and confidence scores. No training, no datasets, no model selection.

Model: Grounding DINO (Tiny)
Inputs: JPG/PNG ≤ 25 MB + text prompt
Outputs: bounding boxes · labels · confidence
Latency: ~350 ms p50
Free quota: 300 calls / month

Grounding DINO is an open-vocabulary detector — trained on image-text pairs, it can detect objects described by text prompts it has never seen during training. That breaks the traditional detection workflow (collect labels for every class, train, fail on a new class, collect more labels). With Grounding DINO, the prompt is the class.

mSightFlow hosts the “Tiny” checkpoint (good balance of speed and accuracy) as a REST endpoint with optional visualisation overlay, confidence-threshold tuning, and seamless pairing with SAM segmentation for end-to-end zero-shot detect-and-segment.

When zero-shot detection is the right tool

Rapid prototyping

Validate whether CV works on your problem before committing to a labelling project. Two minutes from idea to numbers.

Dataset bootstrapping

Pre-label a new class with prompts, then refine with auto-labelling and active learning.

Long-tail classes

Rare objects that don't have public datasets — niche tooling, small wildlife, custom industrial defects. The prompt covers them.

How it works — three steps

  1. Write a text prompt

     Comma-separated nouns: person, hard hat, safety vest. Or phrasal: small drone in sky. Up to 20 classes per call.

  2. POST image + prompt

     /v1/detect/zero-shot with your image, prompt, and optional confidence threshold. ~350 ms p50.

  3. Get bounding boxes

     Per-detection box, label (matched to your prompt), and confidence. Pipe into SAM, auto-label, or your downstream code.

Code — Python, Node, cURL

Python
import os, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

resp = requests.post(
    "https://api.msightflow.ai/v1/detect/zero-shot",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"image": Path("factory_floor.jpg").read_bytes()},
    data={
        "prompt": "person, hard hat, safety vest",
        "confidence_threshold": "0.25",
        "return_overlay": "true",
    },
)
resp.raise_for_status()

# Each detection carries a label (matched to your prompt), a box, and a confidence score.
for d in resp.json()["detections"]:
    print(f"{d['label']:>14} @ {d['box']}  ({d['confidence']:.2f})")
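If you want to draw the boxes locally rather than rely on the hosted overlay, here is a minimal sketch using Pillow (an assumption on our side, not part of the API; box coordinates are taken as [x1, y1, x2, y2] pixels, as implied by the SAM example further down):

from PIL import Image, ImageDraw

img = Image.open("factory_floor.jpg").convert("RGB")
draw = ImageDraw.Draw(img)
for d in resp.json()["detections"]:
    x1, y1, x2, y2 = d["box"]  # assumed [x1, y1, x2, y2] pixel coordinates
    draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
    draw.text((x1, max(y1 - 12, 0)), f"{d['label']} {d['confidence']:.2f}", fill="red")
img.save("factory_floor_annotated.jpg")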
Node.js
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("factory_floor.jpg"));
form.append("prompt", "person, hard hat, safety vest");
form.append("confidence_threshold", "0.25");

const resp = await fetch("https://api.msightflow.ai/v1/detect/zero-shot", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
  body: form,
});
const { detections } = await resp.json();
console.log(`Found ${detections.length} objects`);
cURL
curl -X POST https://api.msightflow.ai/v1/detect/zero-shot \
  -H "Authorization: Bearer $MSF_API_KEY" \
  -F "image=@factory_floor.jpg" \
  -F "prompt=person, hard hat, safety vest" \
  -F "confidence_threshold=0.25"
Zero-shot → SAM (detect-and-segment)
# Zero-shot detection → SAM segmentation: end-to-end zero-shot pipeline.
# Reuses `api_key` from the Python example above; `image_bytes` holds the raw image file.
det = requests.post(
    "https://api.msightflow.ai/v1/detect/zero-shot",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"image": image_bytes},
    data={"prompt": "cracked weld"},
).json()

for d in det["detections"]:
    # Each box is [x1, y1, x2, y2]; use its centre as a positive point prompt for SAM.
    cx, cy = (d["box"][0] + d["box"][2]) / 2, (d["box"][1] + d["box"][3]) / 2
    seg = requests.post(
        "https://api.msightflow.ai/v1/segment/interactive",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_bytes},
        data={"points": f'[{{"x":{cx},"y":{cy},"label":1}}]'},
    ).json()
    # seg["mask"] now contains a pixel-perfect mask for the crack

Prompt-writing tips

  1. Use nouns, not verbs. helmet works; wearing a helmet works less well.
  2. Comma-separate distinct classes. Each comma starts a new class. Don't comma-separate adjectives.
  3. Add visual modifiers when colour or size matters. red car, small drone, thin crack.
  4. Avoid abstract nouns. safety won't detect anything. safety helmet will.
  5. Lower the threshold for recall, raise it for precision. Default 0.25 is balanced for general scenes.
  6. For unusual classes, A/B-test prompt phrasings. weld bead vs welding seam can give different recall; a minimal sketch follows this list.
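
A minimal sketch of tip 6, assuming the same /v1/detect/zero-shot endpoint and response fields as the examples above (the image file name is a placeholder):

import os, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]
image_bytes = Path("weld_closeup.jpg").read_bytes()  # placeholder test image

# Run the same image through two phrasings of the same rare class and compare.
for prompt in ("weld bead", "welding seam"):
    resp = requests.post(
        "https://api.msightflow.ai/v1/detect/zero-shot",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_bytes},
        data={"prompt": prompt, "confidence_threshold": "0.15"},  # low threshold favours recall
    )
    resp.raise_for_status()
    dets = resp.json()["detections"]
    top = max((d["confidence"] for d in dets), default=0.0)
    print(f"{prompt!r}: {len(dets)} detections, top confidence {top:.2f}")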

Pricing — same as every other endpoint

Free

$0

  • 300 API calls / month
  • 50 exports / month
  • All inference endpoints
  • No credit card
Start free

Pro

$29/mo

  • Unlimited calls
  • Unlimited exports
  • Higher per-provider quotas
Go Pro

Full feature matrix on the pricing page.

FAQ

How is zero-shot detection different from a fine-tuned YOLO model?

A fine-tuned detector only knows the classes in its training set. Grounding DINO is trained on text-image pairs and can detect any class you describe in natural language — including classes that don't have public datasets. Trade-off: a fine-tuned model is faster and usually more accurate on the classes it was trained for. Use zero-shot for rapid prototyping, rare classes, and dataset bootstrapping.

What prompt syntax works best?

Comma-separated noun phrases are the most reliable: 'person, helmet, safety vest'. Add modifiers when needed: 'red car', 'small drone in sky', 'cracked weld'. Avoid full sentences. Each comma starts a new class.

How many classes can I detect in one call?

Up to 20 comma-separated class queries per call. Beyond that, accuracy starts to drop because the model has to fit all prompts in its text encoder. For more classes, split into multiple calls.
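
A sketch of that splitting strategy, assuming the same endpoint and response fields as the code above (class names and the image file are placeholders):

import os, requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]
image_bytes = Path("warehouse.jpg").read_bytes()  # placeholder image

# Stand-in for a longer taxonomy; the chunking below handles any length.
classes = ["person", "hard hat", "safety vest", "forklift", "pallet",
           "fire extinguisher", "ladder", "cardboard box"]

all_detections = []
for i in range(0, len(classes), 20):  # at most 20 comma-separated classes per call
    resp = requests.post(
        "https://api.msightflow.ai/v1/detect/zero-shot",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": image_bytes},
        data={"prompt": ", ".join(classes[i:i + 20])},
    )
    resp.raise_for_status()
    all_detections.extend(resp.json()["detections"])

print(f"{len(all_detections)} detections across {(len(classes) + 19) // 20} calls")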

Can I combine this with SAM for segmentation?

Yes — that's the canonical zero-shot dataset-bootstrapping flow. Grounding DINO returns bboxes; pass each bbox center to /v1/segment/interactive as a point prompt, and SAM returns a pixel-perfect mask. End-to-end zero-shot segmentation with no training.

What's the confidence threshold?

Default 0.25. Lower it (e.g. 0.15) for rare-class recall; raise it (e.g. 0.4) for cleaner precision. Returned confidences are calibrated separately for text queries vs box quality, so the same threshold may behave differently across prompts.

Prompt. Detect. Ship.

300 free API calls / month. Grounding DINO. Any class. No training.