The labelling tool is the bottleneck — fix it.
From raw images to a labelled, augmented, exported dataset — and a closed-loop retraining cycle. SAM-assisted annotation, active learning, inter-annotator agreement, and COCO / YOLO / Pascal VOC export — all wired together so you build models instead of writing dataset-prep scripts.
- Annotation tooling
- SAM · auto-label · active learning
- QA built in
- IAA · class-balance · annotator stats
- Export formats
- COCO · YOLO · Pascal VOC
- Speed-up
- 3-5× fewer labels for same accuracy
- Free quota
- 300 calls + 50 exports / month
What slows you down today
Four pains every ML data engineer recognises. Each maps to a built-in mSightFlow feature.
Labelling cost dominates everything
A 10k-image bounding-box pass at 30 sec/image is 83 hours of annotator time. Auto-labelling + SAM-refine typically cuts that to 1-2 hours. Active learning means you label only what moves accuracy.
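The same arithmetic as a two-line sanity check (the 1.5 h figure is just the midpoint of the 1-2 h estimate above):
images, secs_per_image = 10_000, 30
manual_hours = images * secs_per_image / 3600   # 83.3 hours, as above
assisted_hours = 1.5                            # midpoint of the 1-2 h estimate
print(f"manual: {manual_hours:.1f} h → assisted: {assisted_hours} h "
      f"(~{manual_hours / assisted_hours:.0f}× less annotator time)")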
Label inconsistency invisibly tanks models
Two annotators with different opinions = a model trained on contradictions. IAA built into the project surfaces it within the first 100 images, not after you've scaled to 10k.
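Numerically, "different opinions" is just low IoU. A minimal sketch of what a pairwise box-IoU check looks like (standard IoU, nothing mSightFlow-specific):
def iou(a, b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# Two annotators, same object, slightly different framing:
print(iou([100, 100, 200, 200], [110, 105, 215, 210]))  # ≈ 0.69 — already below a 0.70 bar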
Format conversion eats Fridays
Every team eventually writes their own COCO ↔ YOLO ↔ VOC converter. You shouldn't. The platform exports all three with deterministic splits and a webhook.
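For the curious, the core of that converter is small — COCO stores absolute [x_min, y_min, width, height], YOLO wants normalised [x_center, y_center, width, height]. It's the class maps, splits, and edge cases around it that eat the Friday:
def coco_bbox_to_yolo(bbox, img_w, img_h):
    """COCO [x_min, y_min, w, h] in pixels -> YOLO [x_c, y_c, w, h], normalised to [0, 1]."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

print(coco_bbox_to_yolo([40, 60, 120, 80], img_w=640, img_h=480))
# ≈ [0.156, 0.208, 0.188, 0.167] — class index + these four floats per line in YOLO TXT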
Closed-loop retraining is a real engineering build
Score-the-pool, pick-the-uncertain, annotate, retrain, repeat. mSightFlow's active-learning + assignments + export wires this together; you bring the trainer.
End-to-end workflow — 8 steps, one platform
From a folder of raw images to a model trained on labelled, augmented, quality-gated data — and back into the loop.
01 Import
JPG / PNG via drag-and-drop, REST upload, or cloud-bucket connector. Project-scoped — no cross-project leakage.
02 Auto-label
Dispatch detect + segment + pose + OCR + zero-shot + caption in one /label/auto call. COCO output with confidence scores.
03 SAM-refine
Click-to-segment in Studio with SAM ViT-Base. Add positive/negative points until the mask is right.
04 Active learning
/label/score-batch returns an uncertainty-sorted queue. Hand the top 50 to annotators for maximum accuracy-per-label.
05 QA + IAA
/quality/agreement computes per-image inter-annotator IoU. /quality/alerts flags class imbalance and label drift.
06 Augment
Server-side Albumentations with bbox-aware transforms. 3-5× growth typical. COCO / YOLO export of augmented set.
07 Export
COCO JSON, YOLO TXT (+ auto dataset.yaml), or Pascal VOC with deterministic split + DatasetVersion snapshot.
08 Retrain → loop
Train on export, score unlabelled pool with the fresh model, send top-N uncertain to annotators. Loop until plateau.
The four calls that run the loop
import os
import requests

resp = requests.post(
    "https://api.msightflow.ai/v1/label/auto",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    data={
        "project_id": "PROJECT_ID",
        "tasks": "detect,segment,pose",  # mix-and-match — 8 task types available
    },
)
# Suggestions land in your project automatically. Source flagged as 'ai_generated'.
print(resp.json()["summary"])
# Closed-loop: score-batch → annotate top-N → retrain → repeat
import os
import requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}

# 1. Score the unlabelled pool with your current model
queue = requests.get(
    "https://api.msightflow.ai/v1/label/score-batch",
    headers=hdr,
    params={
        "project_id": "PROJECT_ID",
        "strategy": "diverse_uncertainty",
        "limit": 50,
    },
).json()

# 2. Assign the top-50 to a specific annotator (PUT per image).
#    For algorithmic round-robin across all project members, swap this loop
#    for a single POST to /v1/projects/PROJECT_ID/auto-assign instead.
for q in queue["queue"]:
    requests.put(
        f"https://api.msightflow.ai/v1/projects/images/{q['image_id']}/assign",
        headers=hdr,
        json={"user_id": "ANNOTATOR_USER_ID"},
    )

# 3. (After human review) Export the latest verified set.
#    Sync GET — the response body is the dataset ZIP. Configure the
#    project's "project.exported" webhook in project settings to also
#    notify https://ci.example.com/cv-train-trigger.
export = requests.get(
    "https://api.msightflow.ai/v1/projects/PROJECT_ID/export",
    headers=hdr,
    params={"format": "yolo", "split": "80/10/10"},
    stream=True,
)
with open("dataset.zip", "wb") as f:
    for chunk in export.iter_content(1 << 20):
        f.write(chunk)
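From here, step 08 is whatever trainer you bring. As one illustrative example — not the only option — Ultralytics YOLO, assuming the ZIP unpacks with the auto-generated dataset.yaml at its root and with an arbitrary epoch count:
import zipfile

from ultralytics import YOLO  # pip install ultralytics — one example trainer, not required

with zipfile.ZipFile("dataset.zip") as z:
    z.extractall("dataset")

model = YOLO("yolov8n.pt")
model.train(data="dataset/dataset.yaml", epochs=50)  # dataset.yaml ships with the YOLO export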
# Augment-and-export in one request — server-side Albumentations
# (hdr is the same Authorization header defined in the snippet above)
requests.post(
    "https://api.msightflow.ai/v1/projects/PROJECT_ID/export-augmented",
    headers=hdr,
    json={
        "pipeline": [
            {"type": "HorizontalFlip", "p": 0.5},
            {"type": "Rotate", "limit": 15, "p": 0.7},
            {"type": "RandomBrightnessContrast", "brightness_limit": 0.2, "p": 0.6},
            {"type": "GaussNoise", "var_limit": [10, 50], "p": 0.3},
        ],
        "augmentations_per_image": 4,
        "format": "yolo",
        "split": {"train": 0.8, "val": 0.1, "test": 0.1},
        "webhook_url": "https://ci.example.com/cv-dataset-ready",
    },
)
# Inter-annotator agreement check before exporting — catches label drift
iaa = requests.get(
    "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality/agreement",
    headers=hdr,
    params={"min_annotators": 2},
).json()

if iaa["mean_iou"] < 0.70:
    print(f"⚠️ IAA = {iaa['mean_iou']:.3f} — refine your label spec before scaling up")
    # Look at iaa['by_image'] to find the worst-agreement cases for review
else:
    print(f"✅ IAA = {iaa['mean_iou']:.3f} — proceed to export")
Build-it-yourself vs mSightFlow — honest accounting
| | Build it yourself | mSightFlow |
|---|---|---|
| SAM ViT-Base hosting + serving | 2-3 weeks (GPU, CUDA, batching, queue) | Already there. REST in 200 ms. |
| Grounding DINO + YOLO + CLIP + BLIP + EasyOCR | 2-3 months (each = one project) | All five hosted, one bearer token. |
| Annotation UI with SAM-assist | 2-4 months (canvas, polygon ops, undo, multi-user) | Built into /studio. |
| Active-learning + IAA + class-balance | 1-2 months | Three REST endpoints, free in every tier. |
| COCO ↔ YOLO ↔ Pascal VOC export with split | 2-3 weeks (the “easy” part that always takes longer) | One API call, webhook on done. |
| Bbox-aware augmentation pipeline | 1-2 weeks if you already know Albumentations | JSON pipeline config, server-side. |
| Dataset versioning + reproducible splits | 1 week + ongoing maintenance | DatasetVersion snapshots, deterministic seeds. |
| Total before first model | ~6 months for a small team | 2 minutes via the quickstart. |
| Pricing | Engineering time + GPU compute + storage | $0 / $10 / $29 per month. |
Build-it-yourself is the right call when your scale or specificity justifies the engineering. For everyone else, the time saved building dataset infra is time spent on the model that matters.
Pricing — the same tiers as the rest of the platform
Free
$0
- 300 API calls / month
- 50 exports / month
- All annotation + QA + IAA features
- No credit card
Standard
$10/mo
- 5,400 API calls / month
- 500 exports / month
- Batch up to 10 images / call
- Webhook on export completion
Pro
$29/mo
- Unlimited calls + exports
- Custom strategy weights for active learning
- Higher per-provider quotas
The features that power this workflow
Auto-labelling
8 task types in one call, including detect, segment, pose, classify, OCR, zero-shot, and caption.
SAM segmentation
Click-to-segment with SAM ViT-Base. ~10× faster than polygon-drawing.
Active learning
Uncertainty + diversity sampling. 3-5× fewer labels for the same accuracy.
Annotation quality + IAA
Inter-annotator agreement, per-annotator stats, class-balance alerts. Free.
Data augmentation
Server-side Albumentations with bbox-aware transforms. JSON pipeline config.
Dataset export
COCO / YOLO / Pascal VOC with auto dataset.yaml + train/val/test split + webhook.
FAQ — for ML data engineers
How does this compare to building the pipeline myself?
Three things mSightFlow gets you that take weeks to build solo: (1) SAM + Grounding DINO + YOLO + CLIP + BLIP all hosted with consistent JSON output — saves the GPU + serving infra you'd otherwise have to maintain. (2) Active-learning uncertainty scoring + inter-annotator agreement already wired into the project model. (3) COCO / YOLO / Pascal VOC export with auto dataset.yaml + reproducible splits. You bring the data + the human reviewers; we handle the model serving and the connective tissue.
Can I use my own annotation team / vendor?
Yes. Either bring annotators into mSightFlow projects (they label inside our UI, IAA + quality controls apply automatically), or use external annotators on exported data and re-import via the REST API. Common hybrid: pre-label with auto-label inside mSightFlow, export to your vendor for human review, re-import their corrections.
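A sketch of the hybrid's re-import leg. The endpoint path and payload below are hypothetical placeholders — the REST reference has the real import route — but the shape of the workflow is as described:
import os
import requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}

# HYPOTHETICAL path + payload — substitute the documented import endpoint
with open("vendor_corrections.json") as f:  # vendor-reviewed annotations, COCO JSON
    corrections = f.read()

requests.post(
    "https://api.msightflow.ai/v1/projects/PROJECT_ID/annotations/import",  # placeholder
    headers={**hdr, "Content-Type": "application/json"},
    data=corrections,
)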
How do I close the loop — retrain → score → label → retrain?
The closed-loop pattern is: (a) export current labelled set to your trainer (YOLO / Detectron / etc.), train, (b) run /v1/label/score-batch on the unlabelled pool with your fresh model's predictions, (c) take the top-N uncertain images for human review inside mSightFlow, (d) export and retrain. Most teams hit a plateau within 3-5 rounds of this loop on detection tasks.
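As a runnable skeleton — train_and_eval is a stub for your own framework, and the 0.005-mAP plateau threshold is an arbitrary example, not a platform default:
import os
import requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}

def train_and_eval() -> float:
    """(a) Export, train in your framework, return val mAP. Stub — plug in YOLO / Detectron / etc."""
    raise NotImplementedError

prev_map = 0.0
for _ in range(5):  # most teams plateau within 3-5 rounds
    val_map = train_and_eval()
    queue = requests.get(  # (b) score the unlabelled pool
        "https://api.msightflow.ai/v1/label/score-batch",
        headers=hdr,
        params={"project_id": "PROJECT_ID", "strategy": "diverse_uncertainty", "limit": 50},
    ).json()
    for q in queue["queue"]:  # (c) route the top-N uncertain images to human review
        requests.put(
            f"https://api.msightflow.ai/v1/projects/images/{q['image_id']}/assign",
            headers=hdr,
            json={"user_id": "ANNOTATOR_USER_ID"},
        )
    if val_map - prev_map < 0.005:  # (d) gains flattened — stop looping
        break
    prev_map = val_map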
What's the bring-your-own-model story?
Pro tier hosts your ONNX or PyTorch model behind the same /v1/cv-tools/<your-tool>/run shape as the bundled tools. Your model joins the auto-label aggregator and active-learning scoring loop. Useful when you've spent time tuning a model on your specific data — keep it, gain the labelling / IAA / export tooling around it.
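Once hosted, calling it looks like any bundled tool — the /v1/cv-tools/<your-tool>/run shape is from the answer above; the body fields here are illustrative assumptions, not the documented schema:
import os
import requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}

resp = requests.post(
    "https://api.msightflow.ai/v1/cv-tools/my-defect-detector/run",  # <your-tool> slug
    headers=hdr,
    json={"project_id": "PROJECT_ID", "image_id": "IMAGE_ID"},  # ASSUMED body fields
)
print(resp.json())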
Reproducibility?
Every export creates a DatasetVersion snapshot — project state at export time is preserved so you can re-download the same dataset later, even if the project changes. Splits use a deterministic hash of image_id seeded by the export request, so the same project + ratios always produce the same train/val/test files. Pass a seed parameter to vary the split for cross-validation experiments.
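Conceptually, hash-based deterministic splitting works like this sketch — illustrative only, since the platform's exact hash and seeding are internal:
import hashlib

def split_bucket(image_id: str, seed: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Same (image_id, seed, ratios) in -> same 'train'/'val'/'test' out, every run."""
    h = int(hashlib.sha256(f"{seed}:{image_id}".encode()).hexdigest(), 16)
    x = (h % 10_000) / 10_000  # pseudo-uniform position in [0, 1)
    if x < ratios[0]:
        return "train"
    if x < ratios[0] + ratios[1]:
        return "val"
    return "test"

print(split_bucket("img_000123", seed="export-42"))  # stable across re-runs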
How does mSightFlow integrate with my training pipeline?
Export-via-webhook is the most common integration: POST your export request with webhook_url; your CI / Airflow / Argo / GitHub Action receives the dataset URL when ready and kicks off training. For continuous-pull, poll the project state via /v1/quality/overview to detect 'enough new labels' before retraining.
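On the receiving end, a minimal CI-side sketch with FastAPI (chosen arbitrarily) — the dataset_url field name is an assumption; check the webhook docs for the real payload schema:
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/cv-dataset-ready")
async def dataset_ready(req: Request):
    payload = await req.json()
    dataset_url = payload.get("dataset_url")  # ASSUMED field name — verify against docs
    # kick_off_training(dataset_url)  # your CI / Airflow / Argo hook goes here
    return {"ok": True}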
Stop writing dataset infra. Train the model that matters.
300 API calls + 50 exports / month, free. Auto-label, SAM-refine, active-learn, augment, export — all in one platform.