Catch label drift before it costs you a model.
The fastest way to train a bad model is to label inconsistently and find out at evaluation time. mSightFlow ships inter-annotator agreement, per-annotator accuracy, class-imbalance alerts, and a ground-truth benchmark mode in every project — no separate tool, no separate dashboard.
- Method: IoU agreement + class-distribution checks + benchmark vs ground truth
- Inputs: project_id
- Outputs: agreement matrix · per-annotator stats · imbalance alerts · review queue
- Endpoints: 6 (overview, agreement, …)
- Pricing: included in every tier
Most CV teams discover their annotation problems at the model-evaluation stage — when the mAP curve plateaus and the post-mortem reveals that two annotators were labelling the same class differently. mSightFlow surfaces the problem during labelling: agreement metrics, per-annotator scorecards, class-distribution alerts, and drift detection. The quality dashboard is built into every project on every tier.
When annotation QA is the right tool
Multi-annotator teams
Run IAA over a 5-20% overlap pool to confirm annotators agree on edge cases. Mean IoU below 0.7 means your label spec needs sharpening.
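To make that threshold concrete, here is a minimal sketch of the IoU math the agreement metric is built on. The `box_iou` helper and the (x1, y1, x2, y2) box format are illustrative, not part of the mSightFlow API.

```python
# Minimal sketch of the IoU math behind the agreement metric.
# Box format (x1, y1, x2, y2) is an assumption for illustration.
def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two annotators box the same object slightly differently:
print(box_iou((10, 10, 110, 110), (20, 15, 115, 120)))  # ≈ 0.75, above the 0.7 bar
```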
Class-imbalance auditing
Automatic alerts when the class distribution skews past a 10:1 ratio between the most and least frequent classes. Imbalanced datasets train models that ignore minority classes; catch it pre-training.
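The check is easy to reproduce locally, assuming you hold a flat list of class labels. The `labels` data here is made up; the 10:1 threshold mirrors what the /alerts endpoint flags (see the FAQ below).

```python
from collections import Counter

# Flag when the most frequent class outnumbers the least frequent by >10:1.
labels = ["person"] * 4500 + ["vehicle"] * 900 + ["helmet"] * 100  # illustrative
counts = Counter(labels)
ratio = max(counts.values()) / min(counts.values())
if ratio > 10:
    print(f"class imbalance: {ratio:.0f}:1 ({counts.most_common(1)[0][0]} dominates)")
```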
Onboarding new annotators
Benchmark mode against a small ground-truth pool. New annotators see their accuracy delta vs the team and self-correct.
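A sketch of what an onboarding check against the /quality/benchmark endpoint might look like. The response fields used here (annotators, accuracy, team_mean) are assumptions; confirm them against the API reference.

```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
api = "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality"

# Pull each annotator's accuracy against the ground-truth pool
# and show the delta vs the team mean (field names assumed).
bench = requests.get(f"{api}/benchmark", headers=hdr).json()
for a in bench["annotators"]:
    delta = a["accuracy"] - bench["team_mean"]
    print(f"{a['annotator_id']}: accuracy={a['accuracy']:.2f} (vs team {delta:+.2f})")
```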
Six endpoints, one quality story
| Endpoint | What it returns |
|---|---|
| /projects/{project_id}/quality/overview | Labelled count, approval rate, rejection rate, AI-assisted share |
| /projects/{project_id}/quality/agreement | Per-image inter-annotator IoU (when multiple annotators overlap) |
| /projects/{project_id}/quality/annotators | Per-annotator: speed, approval rate, agreement with peers |
| /projects/{project_id}/quality/images | Per-image QA score, flag reasons, review status |
| /projects/{project_id}/quality/benchmark | Per-annotator accuracy vs ground-truth set |
| /projects/{project_id}/quality/alerts | Class imbalance, low-approval annotators, IAA drift |
All six are read-only GET endpoints. Poll them at the start of a labelling sprint and again at the end to surface drift before it reaches training.
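A sketch of that sprint-bracketing pattern, assuming the /agreement response shape used in the code below. The 0.1 drop threshold mirrors the drift alert described in the FAQ.

```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
api = "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality"

def mean_iou() -> float:
    """Current project-wide mean inter-annotator IoU."""
    r = requests.get(f"{api}/agreement", headers=hdr,
                     params={"min_annotators": 2})
    return r.json()["mean_iou"]

baseline = mean_iou()            # poll at sprint start
# ... labelling sprint runs ...
if baseline - mean_iou() > 0.1:  # poll again at sprint end
    print("IAA dropped >0.1 since sprint start: tighten the label spec")
```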
Code — overview, IAA, annotators, alerts
```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
project_id = "PROJECT_ID"
api = f"https://api.msightflow.ai/v1/projects/{project_id}/quality"

# Project-level overview
overview = requests.get(f"{api}/overview", headers=hdr).json()
print(overview)
# → {
#   "labelled": 4231, "unlabelled": 569, "approval_rate": 0.93,
#   "rejection_rate": 0.04, "ai_assisted_share": 0.71
# }

# Per-image inter-annotator IoU (requires multiple annotators per image)
iaa = requests.get(f"{api}/agreement", headers=hdr,
                   params={"min_annotators": 2}).json()
print(f"mean IoU across {iaa['n_images']} images: {iaa['mean_iou']:.3f}")
for entry in iaa["by_image"][:10]:
    print(f"  {entry['image_id']}  IoU={entry['iou']:.3f}  n_annotators={entry['n']}")

# Per-annotator stats — speed, approval rate, agreement with peers
ann = requests.get(f"{api}/annotators", headers=hdr).json()
for a in ann["annotators"]:
    print(f"  {a['annotator_id']:>20} "
          f"labelled={a['n_labelled']:>5} "
          f"approval={a['approval_rate']:.2f} "
          f"speed={a['avg_seconds_per_image']:.1f}s "
          f"iaa_with_peers={a['mean_iou']:.3f}")

# Project-level alerts: imbalance, low-approval annotators, drift
alerts = requests.get(f"{api}/alerts", headers=hdr).json()
for a in alerts["alerts"]:
    print(f"[{a['severity']}] {a['type']}: {a['message']}")
# e.g. [warning] class_imbalance: 'person' has 45× more samples than 'helmet'
#      [error]   low_approval: annotator user_abc123 approval rate dropped to 0.61
```
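The remaining endpoint not exercised above, /quality/images, follows the same pattern (benchmark appears in the onboarding section). A sketch of building a review queue from flagged images; the query parameter and response fields used here (review_status, qa_score, flags) are assumptions to verify against the API reference.

```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
api = "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality"

# Lowest-scoring flagged images first: a ready-made review queue.
imgs = requests.get(f"{api}/images", headers=hdr,
                    params={"review_status": "flagged"}).json()
queue = sorted(imgs["images"], key=lambda i: i["qa_score"])
for img in queue[:20]:
    print(f"{img['image_id']}  score={img['qa_score']:.2f}  flags={img['flags']}")
```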
Pricing — included in every tier
Annotation QA is not a paid add-on. All six endpoints are free in every tier — including Free.
Related features
Active learning
QA flags images annotators disagree on. Active learning prioritises them for re-review.
Auto-labelling
QA helps audit AI-generated suggestions: filter unverified annotations before they reach training.
Dataset export
Filter exports by approval state, IAA threshold, or annotator quality. Ship only the verified subset.
FAQ
What is inter-annotator agreement (IAA) and why does it matter?
IAA measures how much two annotators agree when labelling the same image. For detection it's typically expressed as IoU (Intersection over Union) of bboxes; for classification it's Cohen's kappa. Low IAA means your label definitions are ambiguous — the model trained on those labels will be limited by that ambiguity. Tracking IAA early surfaces label-definition issues before you've labelled 10,000 images.
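For the classification case, a quick local kappa check is a few lines with scikit-learn; the labels below are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators classify the same 10 images; kappa corrects raw
# agreement for chance agreement. Illustrative data, not mSightFlow output.
ann_a = ["cat", "dog", "dog", "cat", "bird", "dog", "cat", "cat", "dog", "bird"]
ann_b = ["cat", "dog", "cat", "cat", "bird", "dog", "cat", "dog", "dog", "bird"]
print(cohen_kappa_score(ann_a, ann_b))  # ≈ 0.69, below the 0.8 bar
```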
When should I require multi-annotator overlap?
Best practice: overlap 5-10% of your dataset across two or three annotators to compute IAA (for a 10,000-image dataset, that's 500-1,000 doubly-labelled images). Push the overlap to ~20% for high-stakes or ambiguous-class projects (medical, legal, content moderation). Below 5%, the IAA estimate is noisy.
What's a good IAA target?
For bounding-box detection, IoU > 0.7 between annotators is healthy; > 0.85 is excellent. For classification (single-label-per-image), Cohen's kappa > 0.8 is the publishable threshold for reliable annotators. If you're below those bars, your label definitions probably need refinement before you scale up labelling.
How does benchmark mode work?
Provide a small set of ground-truth annotations (say, 50-100 images you've personally verified). The benchmark endpoint computes each annotator's accuracy against that ground truth, surfacing systematic errors a particular annotator makes. Useful for onboarding new annotators and for periodic audits.
What does the /alerts endpoint flag?
Four classes of alert: (1) class imbalance > 10:1 ratio between most and least frequent classes, (2) annotator approval rate < 70% over a rolling window, (3) outlier images that disagree with all auto-label suggestions, and (4) projects where IAA has dropped > 0.1 from the rolling average. Each alert has a recommended action.
Catch label drift before evaluation does.
Six QA endpoints, free in every tier. IAA, alerts, per-annotator scorecards.