Segment Anything (SAM) — click once, get a perfect mask.
Meta's Segment Anything Model gives you pixel-perfect masks from a single point. mSightFlow hosts SAM ViT-Base as a REST endpoint — no GPU setup, no model download, no PyTorch install. Drop it into your annotation tool, dataset pipeline, or product UI in under 10 lines of code.
- Model: SAM ViT-Base (Meta AI)
- Inputs: JPG, PNG · ≤ 25 MB
- Outputs: binary mask (base64 PNG) · bbox · confidence
- Latency: ~200 ms p50
- Free quota: 300 calls / month
SAM (Segment Anything Model) was trained on the largest-ever segmentation dataset — 11 million images and 1.1 billion masks — with the goal of foundation-model segmentation: one model that can mask any object you point at, without per-class training. That breaks the usual segmentation pipeline (label thousands of class instances, train a model, watch it fail on a new class). With SAM, you point and you get the mask.
mSightFlow hosts SAM in two modes — point-prompt for interactive annotation and Everything mode for fully-automatic mask generation across an image — and includes the connective tissue that turns SAM's raw masks into COCO / YOLO polygons, auto-labelled datasets, or refined annotations.
When SAM is the right tool
Annotation tools
Replace the polygon-drawing flow with a one-click prompt. A ~3-hour image batch becomes a ~20-minute one. Pair with active learning to label only what your model is unsure about.
Dataset bootstrapping
Pair SAM with Grounding DINO to build a COCO-format dataset for a brand-new class with zero training data. Detect with text prompt, segment with SAM, export.
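A minimal sketch of that loop in Python. The zero-shot detection route (`/v1/detect/zero-shot`), its response shape, and the "shipping container" prompt are assumptions for illustration; check the zero-shot detection page for the real contract. The SAM call matches the code section below.

```python
import os, json
import requests
from pathlib import Path

API = "https://api.msightflow.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
image = Path("image.jpg").read_bytes()

# 1. Detect a brand-new class with a text prompt.
#    NOTE: this endpoint path and response shape are assumptions, not the documented API.
det = requests.post(
    f"{API}/v1/detect/zero-shot",
    headers=HEADERS,
    files={"image": image},
    data={"prompt": "shipping container"},
).json()

# 2. Prompt SAM with each detection's box centre to get a tight mask.
masks = []
for obj in det["objects"]:
    x1, y1, x2, y2 = obj["bbox"]
    point = [{"x": (x1 + x2) // 2, "y": (y1 + y2) // 2, "label": 1}]
    seg = requests.post(
        f"{API}/v1/segment/interactive",
        headers=HEADERS,
        files={"image": image},
        data={"points": json.dumps(point)},
    ).json()
    masks.append(seg)

print(f"{len(masks)} masks ready for /labeling/mask-to-polygon and COCO export")
```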
Background removal & cutouts
Point at the foreground subject; ship the resulting alpha matte to your e-commerce, AR, or editing pipeline. SAM masks are tighter than those from classical saliency-segmentation models.
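As a concrete sketch of the cutout flow: the endpoint and response fields match the code section below, while the point coordinates and file names are placeholders. The returned mask drops straight in as a Pillow alpha channel.

```python
import os, base64, io
import requests
from PIL import Image
from pathlib import Path

resp = requests.post(
    "https://api.msightflow.ai/v1/segment/interactive",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files={"image": Path("product.jpg").read_bytes()},
    data={"points": '[{"x": 512, "y": 384, "label": 1}]'},  # a point on the subject
)
mask_png = base64.b64decode(resp.json()["mask"])

# Use the binary mask as the alpha channel: background pixels become transparent.
image = Image.open("product.jpg").convert("RGBA")
alpha = Image.open(io.BytesIO(mask_png)).convert("L").resize(image.size)
image.putalpha(alpha)
image.save("cutout.png")
```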
Two modes: point-prompt vs Everything
Point-prompt mode
You know what you want. Click on it.
- One or more `(x, y, label)` points
- `label=1` foreground (include), `label=0` background (exclude)
- Returns the single mask containing your prompts
- Best for annotation UIs, AR cutouts, interactive product flows
Everything mode
Mask every object on a grid. No prompts.
- `points_per_side` parameter (4 coarse → 32 fine)
- Returns an array of distinct masks with bboxes
- Best for batch dataset annotation, change detection, object counting
- Slower (proportional to `points_per_side`²; see the sweep sketch below)
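A quick way to feel out that quadratic trade-off is to sweep `points_per_side` and watch mask count and latency grow. This sketch reuses the Everything-mode endpoint exactly as shown in the code section below; note that each iteration spends one API call from your quota.

```python
import os, time
import requests
from pathlib import Path

HEADERS = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
image = Path("image.jpg").read_bytes()

for pps in (4, 8, 16):  # each step roughly quadruples the prompt grid
    t0 = time.perf_counter()
    resp = requests.post(
        "https://api.msightflow.ai/v1/segment/everything",
        headers=HEADERS,
        files={"image": image},
        data={"points_per_side": str(pps)},
    )
    masks = resp.json()["masks"]
    print(f"points_per_side={pps}: {len(masks)} masks in {time.perf_counter() - t0:.1f}s")
```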
How it works — three steps
1. Send your image and a point. POST an image and one or more `(x, y, label)` points to `/v1/segment/interactive` with your bearer token. Free tier: 300 calls / month, no credit card.
2. Get a mask back. The response is a binary mask (base64 PNG), bounding box, and confidence score. Median latency on a 1024-px image is around 200 ms.
3. Save, refine, or convert. Add positive points to grow the mask or negative points to shrink it. Convert the mask to a COCO polygon with `/labeling/mask-to-polygon`, or pipe straight into dataset export.
Code — Python, Node, cURL
Drop into any HTTP-capable language. Python and Node SDKs are optional sugar over the REST endpoint.
```python
import os, base64
import requests
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

resp = requests.post(
    "https://api.msightflow.ai/v1/segment/interactive",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"image": Path("image.jpg").read_bytes()},
    data={
        "points": '[{"x": 320, "y": 240, "label": 1}]',  # 1 = foreground
        "return_overlay": "true",
    },
)
result = resp.json()

# Save the mask
Path("mask.png").write_bytes(base64.b64decode(result["mask"]))
print("bbox:", result["bbox"], "confidence:", result["confidence"])
```
```javascript
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("image.jpg"));
form.append("points", JSON.stringify([{ x: 320, y: 240, label: 1 }]));
form.append("return_overlay", "true");

const resp = await fetch("https://api.msightflow.ai/v1/segment/interactive", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
  body: form,
});
const result = await resp.json();
console.log("bbox:", result.bbox, "confidence:", result.confidence);
```
```bash
curl -X POST https://api.msightflow.ai/v1/segment/interactive \
  -H "Authorization: Bearer $MSF_API_KEY" \
  -F "image=@image.jpg" \
  -F 'points=[{"x":320,"y":240,"label":1}]' \
  -F "return_overlay=true"
```
# "Everything mode": auto-mask every object in the image.
resp = requests.post(
"https://api.msightflow.ai/v1/segment/everything",
headers={"Authorization": f"Bearer {api_key}"},
files={"image": Path("image.jpg").read_bytes()},
data={"points_per_side": "16"}, # 4 (coarse) - 32 (fine)
)
result = resp.json()
# result["masks"]: list of {mask: base64, bbox, confidence}
print(f"Found {len(result['masks'])} distinct objects")
Integrate SAM into your annotation tool
A typical click-to-annotate workflow on top of mSightFlow:
- User clicks on an object in your image canvas → grab the click coordinates relative to the image.
- POST `/v1/segment/interactive` with that point.
- Render the returned mask as a coloured overlay on the canvas (decode base64 PNG → ImageBitmap → drawImage).
- If the mask is wrong, capture a refinement click (positive or negative) and re-POST with the full point list; SAM resolves them jointly (see the sketch after this list).
- When the user accepts, POST `/labeling/mask-to-polygon` to convert to a polygon and save the annotation in your project.
- At export time, `/export` writes the annotations as COCO / YOLO / Pascal VOC with the polygons attached.
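On the server side, the refinement step is just the same request with a growing point list. A minimal sketch, with hard-coded clicks standing in for your canvas events:

```python
import os, base64, json
import requests
from pathlib import Path

HEADERS = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
image = Path("image.jpg").read_bytes()

def segment(points):
    resp = requests.post(
        "https://api.msightflow.ai/v1/segment/interactive",
        headers=HEADERS,
        files={"image": image},
        data={"points": json.dumps(points)},
    )
    return resp.json()

points = [{"x": 320, "y": 240, "label": 1}]  # initial click
result = segment(points)

# Mask bled into the background? Append a negative click and re-send the FULL
# list: SAM resolves all points jointly rather than editing the previous mask.
points.append({"x": 400, "y": 260, "label": 0})
result = segment(points)
Path("mask.png").write_bytes(base64.b64decode(result["mask"]))
```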
SAM vs semantic segmentation — which one?
| | SAM (this page) | Semantic / instance segmentation |
|---|---|---|
| Outputs class labels? | No — class-agnostic | Yes — labels from training set |
| Works on unseen classes? | ✅ Yes | ❌ Only trained classes |
| Interactive prompting? | ✅ Point, box, or mask prompts | ❌ No prompts |
| Auto-mask everything? | ✅ Everything mode | ✅ Instance segmentation |
| Latency | ~200 ms point-prompt; ~3-15 s Everything mode | ~50-200 ms per image |
| Best for | Annotation UIs, dataset bootstrapping, cutouts | Production inference with known classes |
In real workflows, the two complement each other: SAM for the mask geometry, Grounding DINO or a trained detector for the class label. mSightFlow's auto-labelling pipeline runs both for you.
Pricing — same as every other endpoint
SAM consumes one API call per /v1/segment/interactive request. Everything mode also costs one call regardless of how many masks it returns.
Standard
$10/mo
- 5,400 API calls / month
- 500 exports / month
- Batch up to 10 images / call
Full feature matrix on the pricing page.
Related features
Zero-shot detection
Grounding DINO. Type a text prompt → get bounding boxes. Pair with SAM to label new classes from scratch.
Auto-labelling
One-call dispatcher that runs detect + SAM + classify + pose to bootstrap an annotation pass.
Semantic segmentation
Class-aware masks from trained models. Use when class labels matter and inputs are in your domain.
FAQ
Which SAM model does mSightFlow use?
SAM ViT-Base — the smallest of Meta's three SAM checkpoints. It's a strong balance of accuracy and speed (~200 ms median on a single 1024-px image). ViT-Large and ViT-Huge are roadmapped for the Pro tier, for use cases where mask accuracy matters more than latency.
What's the difference between point-prompt and Everything mode?
Point-prompt segments the single object containing your click; Everything mode runs SAM on a regular grid of prompts (4-32 points per side) and returns every distinct mask in the image. Use point-prompt for click-to-annotate workflows; use Everything mode for automatic mask generation across a dataset.
Can I use multiple points?
Yes. Pass an array of points each tagged label=1 (foreground / include) or label=0 (background / exclude). The model resolves them jointly, which is the standard way to fix a mask that initially gets the wrong region.
Is the mask exportable to COCO or YOLO?
Yes. Use /labeling/mask-to-polygon to convert the binary mask to a polygon via cv2.findContours + Douglas-Peucker simplification. Polygons round-trip cleanly to COCO segmentation format and YOLO polygon format.
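If you'd rather convert locally, the same technique is a few lines of OpenCV (`cv2.approxPolyDP` is the Douglas-Peucker step). A sketch: the `0.002 × perimeter` tolerance is an illustrative default, not the endpoint's setting.

```python
import base64

import cv2
import numpy as np

# `result` is the JSON response from /v1/segment/interactive
mask_bytes = base64.b64decode(result["mask"])
mask = cv2.imdecode(np.frombuffer(mask_bytes, np.uint8), cv2.IMREAD_GRAYSCALE)

# Trace the mask outline, then simplify it with Douglas-Peucker.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)
epsilon = 0.002 * cv2.arcLength(largest, True)  # tolerance ≈ 0.2% of the perimeter
polygon = cv2.approxPolyDP(largest, epsilon, True).reshape(-1, 2)

# COCO stores segmentation as a flat [x1, y1, x2, y2, ...] list.
coco_segmentation = polygon.flatten().tolist()
```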
Is SAM the same as semantic segmentation?
No. SAM is class-agnostic — it gives you a mask for whatever you point at, with no class label attached. Semantic segmentation models output class labels (`person`, `car`, `road`) but require a class to be in their training set. Many real workflows combine them: SAM for the mask geometry, a classifier or zero-shot detection for the class.
Can SAM run on the edge?
Today, mSightFlow hosts SAM as a cloud REST endpoint. You can call it from any internet-connected device (Jetson, Raspberry Pi, ESP32-CAM). On-device SAM is feasible but slow on edge hardware without quantisation; an edge SDK is roadmapped.
Click. Mask. Ship.
300 free API calls / month. SAM ViT-Base. No credit card. No setup.