IAA · approval rates · class-balance · drift alerts

Catch label drift before it costs you a model.

The fastest way to train a bad model is to label inconsistently and find out at evaluation time. mSightFlow ships inter-annotator agreement, per-annotator accuracy, class-imbalance alerts, and a ground-truth benchmark mode in every project — no separate tool, no separate dashboard.

Method      IoU agreement + class distribution + benchmark vs ground truth
Inputs      project_id
Outputs     agreement matrix · per-annotator stats · imbalance alerts · review queue
Endpoints   6 (overview, agreement, …)
Pricing     Included in every tier

Most CV teams discover their annotation problems at the model-evaluation stage — when the mAP curve plateaus and post-mortems reveal that two annotators were labelling the same class differently. mSightFlow surfaces the problem during labelling: agreement metrics, per-annotator scorecards, class-distribution alerts, and drift detection. The quality dashboard is built into every project on every tier.

When annotation QA is the right tool

Multi-annotator teams

Run IAA over a 5-20% overlap pool to confirm annotators agree on edge cases. Mean IoU below 0.7 means your label spec needs sharpening.

Class-imbalance auditing

Auto-alerts fire when class distribution skews past a 10:1 ratio between the most and least frequent classes. Imbalanced datasets train models that ignore minority classes; catch it pre-training.

Onboarding new annotators

Benchmark mode against a small ground-truth pool. New annotators see their accuracy delta vs the team and self-correct.

Six endpoints, one quality story

Endpoint                                       What it returns
/projects/{project_id}/quality/overview        Labelled count, approval rate, rejection rate, AI-assisted share
/projects/{project_id}/quality/agreement       Per-image inter-annotator IoU (when multiple annotators overlap)
/projects/{project_id}/quality/annotators      Per-annotator: speed, approval rate, agreement with peers
/projects/{project_id}/quality/images          Per-image QA score, flag reasons, review status
/projects/{project_id}/quality/benchmark       Per-annotator accuracy vs ground-truth set
/projects/{project_id}/quality/alerts          Class imbalance, low-approval annotators, IAA drift

All six are read-only GET endpoints. Polling them at the start of a labelling sprint, then again at the end, surfaces drift mid-project instead of at evaluation time.
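
A minimal sketch of that loop, reusing the overview endpoint and response fields shown in the code below; the sprint boundaries and the 0.05 drop threshold are illustrative choices, not part of the API:

import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
api = "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality"

def snapshot():
    # Read-only GET: project-level quality overview
    return requests.get(f"{api}/overview", headers=hdr).json()

start = snapshot()   # poll once at sprint kickoff
# ... labelling sprint runs ...
end = snapshot()     # poll again when the sprint wraps

# Compare the two snapshots; a falling approval rate is early drift.
drop = start["approval_rate"] - end["approval_rate"]
if drop > 0.05:      # illustrative threshold, tune per project
    print(f"approval rate fell {drop:.2f} over the sprint; review before training")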

Code — overview, IAA, annotators, alerts, benchmark

Overview
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
project_id = "PROJECT_ID"
api = f"https://api.msightflow.ai/v1/projects/{project_id}/quality"

# Project-level overview
overview = requests.get(f"{api}/overview", headers=hdr).json()
print(overview)
# → {
#     "labelled": 4231, "unlabelled": 569, "approval_rate": 0.93,
#     "rejection_rate": 0.04, "ai_assisted_share": 0.71
#   }
Inter-annotator IoU
# Per-image inter-annotator IoU (requires multiple annotators per image)
iaa = requests.get(f"{api}/agreement", headers=hdr,
                   params={"min_annotators": 2}).json()

print(f"mean IoU across {iaa['n_images']} images: {iaa['mean_iou']:.3f}")
for entry in iaa["by_image"][:10]:
    print(f"  {entry['image_id']}  IoU={entry['iou']:.3f}  n_annotators={entry['n']}")
Per-annotator stats
# Per-annotator stats — speed, approval rate, agreement with peers
ann = requests.get(f"{api}/annotators", headers=hdr).json()

for a in ann["annotators"]:
    print(f"  {a['annotator_id']:>20}  "
          f"labelled={a['n_labelled']:>5}  "
          f"approval={a['approval_rate']:.2f}  "
          f"speed={a['avg_seconds_per_image']:.1f}s  "
          f"iaa_with_peers={a['mean_iou']:.3f}")
Alerts
# Project-level alerts: imbalance, low-approval annotators, drift
alerts = requests.get(f"{api}/alerts", headers=hdr).json()

for a in alerts["alerts"]:
    print(f"[{a['severity']}] {a['type']}: {a['message']}")
    # e.g. [warning] class_imbalance: 'person' has 45× more samples than 'helmet'
    #      [error] low_approval: annotator user_abc123 approval rate dropped to 0.61
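Benchmark
The benchmark endpoint follows the same pattern as the blocks above. A sketch of a call, assuming a response shaped like /annotators; the field names accuracy and n_benchmark_images are assumptions, not documented fields.
# Per-annotator accuracy vs the verified ground-truth pool
# (response shape assumed; adjust field names to the live API)
bench = requests.get(f"{api}/benchmark", headers=hdr).json()

for a in bench["annotators"]:
    print(f"  {a['annotator_id']:>20}  "
          f"accuracy={a['accuracy']:.2f}  "
          f"n={a['n_benchmark_images']}")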

Pricing — included in every tier

Annotation QA is not a paid add-on. All six endpoints are free in every tier — including Free.

Free

$0

  • All 6 quality endpoints
  • IAA · class balance · annotator stats
Start free

Pro

$29/mo

  • Unlimited API calls
  • Higher per-provider quotas
  • Custom QA thresholds
Go Pro


FAQ

What is inter-annotator agreement (IAA) and why does it matter?

IAA measures how much two annotators agree when labelling the same image. For detection it's typically expressed as IoU (Intersection over Union) of bboxes; for classification it's Cohen's kappa. Low IAA means your label definitions are ambiguous — the model trained on those labels will be limited by that ambiguity. Tracking IAA early surfaces label-definition issues before you've labelled 10,000 images.
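
For concreteness, a self-contained sketch of both metrics; the boxes and label vectors are made-up data, and cohen_kappa_score is scikit-learn's standard implementation:

from sklearn.metrics import cohen_kappa_score

def box_iou(a, b):
    # IoU of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# Two annotators box the same object (made-up coordinates)
print(box_iou((10, 10, 110, 110), (20, 20, 120, 120)))  # ≈ 0.68: below the 0.7 bar

# Classification: Cohen's kappa over eight images (made-up labels)
ann_1 = ["cat", "dog", "cat", "cat", "dog", "cat", "dog", "dog"]
ann_2 = ["cat", "dog", "cat", "dog", "dog", "cat", "dog", "cat"]
print(cohen_kappa_score(ann_1, ann_2))  # 0.5: too low to scale up labelling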

When should I require multi-annotator overlap?

Best practice: overlap 5-10% of your dataset across two or three annotators to compute IAA. Push the threshold to ~20% for high-stakes / ambiguous-class projects (medical, legal, content moderation). Below 5%, the IAA estimate is noisy.
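
A sketch of carving out that overlap pool before assigning work; the image IDs and the 10% rate are placeholders:

import random

image_ids = [f"img_{i:05d}" for i in range(5000)]  # placeholder IDs

random.seed(7)  # reproducible pool
overlap = set(random.sample(image_ids, k=int(0.10 * len(image_ids))))

# Overlap images go to every annotator; the rest are split as usual.
solo = [i for i in image_ids if i not in overlap]
print(f"{len(overlap)} overlap images, {len(solo)} single-annotator images")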

What's a good IAA target?

For bounding-box detection, IoU > 0.7 between annotators is healthy; > 0.85 is excellent. For classification (single-label-per-image), Cohen's kappa > 0.8 is the publishable threshold for reliable annotators. If you're below those bars, your label definitions probably need refinement before you scale up labelling.

How does benchmark mode work?

Provide a small set of ground-truth annotations (say, 50-100 images you've personally verified). The benchmark endpoint computes each annotator's accuracy against that ground truth, surfacing systematic errors a particular annotator makes. Useful for onboarding new annotators and for periodic audits.

What does the /alerts endpoint flag?

Four classes of alert: (1) class imbalance > 10:1 ratio between most and least frequent classes, (2) annotator approval rate < 70% over a rolling window, (3) outlier images that disagree with all auto-label suggestions, and (4) projects where IAA has dropped > 0.1 from the rolling average. Each alert has a recommended action.
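
The first rule is easy to mirror locally as a pre-flight check before labelling starts; the class counts below are made up, and the 10:1 threshold matches the alert rule above:

# Made-up per-class label counts for a PPE-detection project
counts = {"person": 4500, "vest": 820, "helmet": 100}

ratio = max(counts.values()) / min(counts.values())
if ratio > 10:  # same 10:1 bar the /alerts endpoint applies
    rare = min(counts, key=counts.get)
    print(f"class_imbalance: '{rare}' is {ratio:.0f}x rarer than the top class")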

Catch label drift before evaluation does.

Six QA endpoints, free in every tier. IAA, alerts, per-annotator scorecards.