Catch label drift before it costs you a model.
The fastest way to train a bad model is to label inconsistently and find out at evaluation time. mSightFlow ships inter-annotator agreement, per-annotator accuracy, class-imbalance alerts, and a ground-truth benchmark mode in every project — no separate tool, no separate dashboard.
- Method: IoU agreement + class-distribution checks + benchmark vs ground truth
- Inputs: project_id
- Outputs: agreement matrix · per-annotator stats · imbalance alerts · review queue
- Endpoints: 6 (overview, agreement, …)
- Pricing: included in every tier
Most CV teams discover their annotation problems at the model-evaluation stage — when the mAP curve plateaus and the post-mortem reveals that two annotators were labelling the same class differently. mSightFlow surfaces the problem during labelling: agreement metrics, per-annotator scorecards, class-distribution alerts, and drift detection. The quality dashboard is built into every project on every tier.
When annotation QA is the right tool
Multi-annotator teams
Run IAA over a 5-20% overlap pool to confirm annotators agree on edge cases. Mean IoU below 0.7 means your label spec needs sharpening.
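To make that threshold concrete, here is a minimal sketch of the IoU math the agreement metric is built on. The `box_iou` helper and the (x1, y1, x2, y2) box format are illustrative, not part of the mSightFlow API.

```python
# Minimal sketch of the IoU math behind the agreement metric.
# Box format (x1, y1, x2, y2) is an assumption for illustration.
def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two annotators box the same object slightly differently:
print(box_iou((10, 10, 110, 110), (20, 15, 115, 120)))  # ≈ 0.75, above the 0.7 bar
```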
Class-imbalance auditing
Automatic alerts when the class distribution skews past a 10:1 ratio between the most and least frequent classes. Imbalanced datasets train models that ignore minority classes; catch it pre-training.
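The check is easy to reproduce locally, assuming you hold a flat list of class labels. The `labels` data here is made up; the 10:1 threshold mirrors what the /alerts endpoint flags (see the FAQ below).

```python
from collections import Counter

# Flag when the most frequent class outnumbers the least frequent by >10:1.
labels = ["person"] * 4500 + ["vehicle"] * 900 + ["helmet"] * 100  # illustrative
counts = Counter(labels)
ratio = max(counts.values()) / min(counts.values())
if ratio > 10:
    print(f"class imbalance: {ratio:.0f}:1 ({counts.most_common(1)[0][0]} dominates)")
```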
Onboarding new annotators
Benchmark mode against a small ground-truth pool. New annotators see their accuracy delta vs the team and self-correct.
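A sketch of what an onboarding check against the /quality/benchmark endpoint might look like. The response fields used here (annotators, accuracy, team_mean) are assumptions; confirm them against the API reference.

```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
api = "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality"

# Pull each annotator's accuracy against the ground-truth pool
# and show the delta vs the team mean (field names assumed).
bench = requests.get(f"{api}/benchmark", headers=hdr).json()
for a in bench["annotators"]:
    delta = a["accuracy"] - bench["team_mean"]
    print(f"{a['annotator_id']}: accuracy={a['accuracy']:.2f} (vs team {delta:+.2f})")
```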
Six endpoints, one quality story
| Endpoint | What it returns |
|---|---|
| /projects/{project_id}/quality/overview | Labelled count, approval rate, rejection rate, AI-assisted share |
| /projects/{project_id}/quality/agreement | Per-image inter-annotator IoU (when multiple annotators overlap) |
| /projects/{project_id}/quality/annotators | Per-annotator: speed, approval rate, agreement with peers |
| /projects/{project_id}/quality/images | Per-image QA score, flag reasons, review status |
| /projects/{project_id}/quality/benchmark | Per-annotator accuracy vs ground-truth set |
| /projects/{project_id}/quality/alerts | Class imbalance, low-approval annotators, IAA drift |
All six are read-only GET endpoints. Poll them at the start of a labelling sprint and again at the end to surface drift before it reaches training.
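A sketch of that sprint-bracketing pattern, assuming the /agreement response shape used in the code below. The 0.1 drop threshold mirrors the drift alert described in the FAQ.

```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
api = "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality"

def mean_iou() -> float:
    """Current project-wide mean inter-annotator IoU."""
    r = requests.get(f"{api}/agreement", headers=hdr,
                     params={"min_annotators": 2})
    return r.json()["mean_iou"]

baseline = mean_iou()            # poll at sprint start
# ... labelling sprint runs ...
if baseline - mean_iou() > 0.1:  # poll again at sprint end
    print("IAA dropped >0.1 since sprint start: tighten the label spec")
```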
Code — overview, IAA, annotators, alerts
```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
project_id = "PROJECT_ID"
api = f"https://api.msightflow.ai/v1/projects/{project_id}/quality"

# Project-level overview
overview = requests.get(f"{api}/overview", headers=hdr).json()
print(overview)
# → {
#   "labelled": 4231, "unlabelled": 569, "approval_rate": 0.93,
#   "rejection_rate": 0.04, "ai_assisted_share": 0.71
# }

# Per-image inter-annotator IoU (requires multiple annotators per image)
iaa = requests.get(f"{api}/agreement", headers=hdr,
                   params={"min_annotators": 2}).json()
print(f"mean IoU across {iaa['n_images']} images: {iaa['mean_iou']:.3f}")
for entry in iaa["by_image"][:10]:
    print(f"  {entry['image_id']}  IoU={entry['iou']:.3f}  n_annotators={entry['n']}")

# Per-annotator stats — speed, approval rate, agreement with peers
ann = requests.get(f"{api}/annotators", headers=hdr).json()
for a in ann["annotators"]:
    print(f"  {a['annotator_id']:>20} "
          f"labelled={a['n_labelled']:>5} "
          f"approval={a['approval_rate']:.2f} "
          f"speed={a['avg_seconds_per_image']:.1f}s "
          f"iaa_with_peers={a['mean_iou']:.3f}")

# Project-level alerts: imbalance, low-approval annotators, drift
alerts = requests.get(f"{api}/alerts", headers=hdr).json()
for a in alerts["alerts"]:
    print(f"[{a['severity']}] {a['type']}: {a['message']}")
# e.g. [warning] class_imbalance: 'person' has 45× more samples than 'helmet'
#      [error]   low_approval: annotator user_abc123 approval rate dropped to 0.61
```
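The remaining endpoint not exercised above, /quality/images, follows the same pattern (benchmark appears in the onboarding section). A sketch of building a review queue from flagged images; the query parameter and response fields used here (review_status, qa_score, flags) are assumptions to verify against the API reference.

```python
import os, requests

hdr = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}
api = "https://api.msightflow.ai/v1/projects/PROJECT_ID/quality"

# Lowest-scoring flagged images first: a ready-made review queue.
imgs = requests.get(f"{api}/images", headers=hdr,
                    params={"review_status": "flagged"}).json()
queue = sorted(imgs["images"], key=lambda i: i["qa_score"])
for img in queue[:20]:
    print(f"{img['image_id']}  score={img['qa_score']:.2f}  flags={img['flags']}")
```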
Pricing — included in every tier
Annotation QA is not a paid add-on. All six endpoints are free in every tier — including Free.
Related features
Active learning
QA flags images annotators disagree on. Active learning prioritises them for re-review.
Auto-labelling
QA helps audit AI-generated suggestions: filter unverified annotations before they reach training.
Dataset export
Filter exports by approval state, IAA threshold, or annotator quality. Ship only the verified subset.
FAQ
What is inter-annotator agreement (IAA) and why does it matter?
IAA measures how much two annotators agree when labelling the same image. For detection it's typically expressed as IoU (Intersection over Union) of bboxes; for classification it's Cohen's kappa. Low IAA means your label definitions are ambiguous — the model trained on those labels will be limited by that ambiguity. Tracking IAA early surfaces label-definition issues before you've labelled 10,000 images.
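For the classification case, a quick local kappa check is a few lines with scikit-learn; the labels below are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators classify the same 10 images; kappa corrects raw
# agreement for chance agreement. Illustrative data, not mSightFlow output.
ann_a = ["cat", "dog", "dog", "cat", "bird", "dog", "cat", "cat", "dog", "bird"]
ann_b = ["cat", "dog", "cat", "cat", "bird", "dog", "cat", "dog", "dog", "bird"]
print(cohen_kappa_score(ann_a, ann_b))  # ≈ 0.69, below the 0.8 bar
```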
When should I require multi-annotator overlap?
Best practice: overlap 5-10% of your dataset across two or three annotators to compute IAA (for a 10,000-image dataset, that's 500-1,000 doubly-labelled images). Push the overlap to ~20% for high-stakes or ambiguous-class projects (medical, legal, content moderation). Below 5%, the IAA estimate is noisy.
What's a good IAA target?
For bounding-box detection, IoU > 0.7 between annotators is healthy; > 0.85 is excellent. For classification (single-label-per-image), Cohen's kappa > 0.8 is the publishable threshold for reliable annotators. If you're below those bars, your label definitions probably need refinement before you scale up labelling.
How does benchmark mode work?
Provide a small set of ground-truth annotations (say, 50-100 images you've personally verified). The benchmark endpoint computes each annotator's accuracy against that ground truth, surfacing systematic errors a particular annotator makes. Useful for onboarding new annotators and for periodic audits.
What does the /alerts endpoint flag?
Four classes of alert: (1) class imbalance > 10:1 ratio between most and least frequent classes, (2) annotator approval rate < 70% over a rolling window, (3) outlier images that disagree with all auto-label suggestions, and (4) projects where IAA has dropped > 0.1 from the rolling average. Each alert has a recommended action.
Catch label drift before evaluation does.
Six QA endpoints, free in every tier. IAA, alerts, per-annotator scorecards.