Describe images, or answer questions about them.
Two modes, one endpoint. Caption describes the image; VQA answers a specific question. Choose BLIP-Base for speed and free-tier coverage, or cloud LLMs (GPT-4o / Claude Vision) when you need reasoning or structured JSON output.
- Model: BLIP-Base + optional GPT-4o / Claude Vision
- Inputs: JPG/PNG ≤ 25 MB · optional text question
- Outputs: caption string OR VQA answer · optional JSON schema
- Latency: ~300 ms (BLIP) / ~2 s (cloud)
- Free quota: 300 calls / month
Captioning answers “what is in this image?” in natural language; VQA answers a specific follow-up such as “is this thing present?” or “how many are there?”. Both modes sit behind one endpoint with the same provider switch (BLIP for cheap-and-fast, cloud LLMs for hard reasoning), so you can prototype on BLIP and graduate hot paths to GPT-4o without changing your code.
When captioning + VQA are the right tools
Accessibility — auto alt-text
Generate WCAG-compliant alt-text for image-heavy products and CMSs. BLIP for the bulk; cloud fallback when the image carries meaning a one-line caption won't capture.
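A minimal sketch of that split, reusing the caption and VQA calls from the Code section below. The 20-character threshold, the fallback question, and the uploads/ directory are illustrative assumptions, not part of the API:
# Sketch: bulk alt-text with BLIP, escalating to a cloud model when the
# caption looks too thin to serve as accessibility text.
# The 20-character threshold is an illustrative heuristic, not an API feature.
import os, requests
from pathlib import Path

API = "https://api.msightflow.ai/v1/describe"
HEADERS = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}

def alt_text(path):
    # First pass: cheap, fast BLIP caption.
    resp = requests.post(
        API,
        headers=HEADERS,
        files={"image": Path(path).read_bytes()},
        data={"mode": "caption", "provider": "blip"},
    )
    caption = resp.json()["caption"]
    if len(caption) >= 20:
        return caption
    # Fallback: ask a cloud model for a fuller, accessibility-grade description.
    resp = requests.post(
        API,
        headers=HEADERS,
        files={"image": Path(path).read_bytes()},
        data={
            "mode": "vqa",
            "question": "Write one sentence of descriptive alt-text for this image.",
            "provider": "cloud",
            "model": "gpt-4o",
        },
    )
    return resp.json()["answer"]

for image in Path("uploads").glob("*.jpg"):
    print(image.name, "→", alt_text(image))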
E-commerce descriptions
Auto-generate product titles, colour tags, material guesses, and category labels from photos. Pair with output_schema for structured catalogues.
Moderation & flagging
Natural-language moderation: “is this image safe for under-13?”, “is there alcohol visible?”. Flexible policies without per-class training.
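A sketch of one such policy check via VQA. The policy question and the naive yes/no parsing are illustrative assumptions; the endpoint itself just returns prose:
# Sketch: natural-language moderation as a VQA call. The policy question and
# the yes/no parsing below are illustrative, not part of the endpoint.
import os, requests
from pathlib import Path

resp = requests.post(
    "https://api.msightflow.ai/v1/describe",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files={"image": Path("upload.jpg").read_bytes()},
    data={
        "mode": "vqa",
        "question": "Is there alcohol visible in this image? Answer yes or no.",
        "provider": "cloud",
        "model": "gpt-4o",
    },
)
answer = resp.json()["answer"]
flagged = answer.strip().lower().startswith("yes")
print("flag for review" if flagged else "pass")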
Two modes, three providers
| | BLIP-Base (default) | GPT-4o | Claude Vision |
|---|---|---|---|
| Latency | ~300 ms | ~2 s | ~2.5 s |
| Cost | 1 quota unit | 1 + provider quota | 1 + provider quota |
| Free tier? | Yes | Provider quota required | Provider quota required |
| Reasoning / counting | Limited | Strong | Strongest |
| Structured JSON | No | Yes | Yes |
| Best for | Alt-text, simple captions, batch | Reasoning, structured extraction | Long-form, careful description |
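In practice this table usually collapses into a small routing rule per request. A minimal sketch, where needs_reasoning and wants_schema stand in for whatever your application already knows about the request; they are not API fields:
# Sketch: pick request parameters from the comparison above.
# needs_reasoning / wants_schema are application-level flags, not API fields.
def provider_params(needs_reasoning=False, wants_schema=False):
    if needs_reasoning or wants_schema:
        return {"provider": "cloud", "model": "gpt-4o"}
    return {"provider": "blip"}

# e.g. a bulk alt-text job stays on BLIP; a counting question goes to GPT-4o.
data = {"mode": "vqa", "question": "How many pallets are visible?",
        **provider_params(needs_reasoning=True)}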
Code — Caption · VQA · Structured
# Caption mode: a one-line description from BLIP, the default provider.
import os, requests
from pathlib import Path
resp = requests.post(
"https://api.msightflow.ai/v1/describe",
headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
files={"image": Path("product.jpg").read_bytes()},
data={"mode": "caption", "provider": "blip"},
)
print(resp.json()["caption"])
# → "a pair of red sneakers on a wooden floor"
# VQA mode — ask a specific question.
resp = requests.post(
"https://api.msightflow.ai/v1/describe",
headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
files={"image": Path("warehouse.jpg").read_bytes()},
data={
"mode": "vqa",
"question": "How many people are wearing hard hats?",
"provider": "cloud",
"model": "gpt-4o",
},
)
print(resp.json()["answer"])
# → "Three people are wearing hard hats."
# Structured extraction — pass a JSON schema.
resp = requests.post(
"https://api.msightflow.ai/v1/describe",
headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
files={"image": Path("listing.jpg").read_bytes()},
data={
"mode": "vqa",
"question": "Describe this product listing for an e-commerce catalogue.",
"provider": "cloud",
"model": "gpt-4o",
"output_schema": '''{
"title": "string",
"main_colour": "string",
"material": "string",
"size_label": "string",
"alt_text": "string"
}''',
},
)
print(resp.json()["structured"])
# → {"title": "Red sneakers", "main_colour": "red", "material": "canvas", ...}
// Same caption call from Node.js.
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";
const form = new FormData();
form.append("image", fs.createReadStream("product.jpg"));
form.append("mode", "caption");
const resp = await fetch("https://api.msightflow.ai/v1/describe", {
method: "POST",
headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
body: form,
});
const { caption } = await resp.json();
console.log(caption);
Pricing — same as every other endpoint
Related features
OCR
OCR reads the text in the image; captioning/VQA describes what the image shows. Pair them to both read a receipt and understand what it is for.
Zero-shot detection
Where VQA returns prose, zero-shot detection returns bounding boxes. Use both for visual reasoning + grounding.
AI / deepfake detection
VQA can answer “does this look AI-generated?” — but the specialised detector is more accurate. Use both for defence in depth.
FAQ
BLIP vs GPT-4o vs Claude — when does which win?
BLIP-Base is fast (~300 ms), free, and gives competent literal descriptions. GPT-4o and Claude Vision are slower (~2-3 s) and use a paid provider quota, but handle reasoning ('what's unusual about this scene?'), counting, and structured extraction much better. For alt-text and bulk catalogue captioning, BLIP is enough. For UGC moderation, ad copy, accessibility-grade alt-text, and structured field extraction, use the cloud fallback.
What's VQA and when do I use it?
Visual Question Answering: you send an image AND a question; the model returns an answer. 'How many people in this image?', 'Is there a forklift?', 'What colour is the car?'. VQA replaces a custom classification model for ad-hoc questions you don't want to train for. It's also a fast way to extract structured info from receipts and forms when paired with a JSON output schema.
Can the output be structured JSON?
Yes, with the cloud provider. Pass output_schema as a JSON-Schema string, e.g. {"product_name":"string","colour":"string","defects":["string"]}. The model is prompted to return JSON matching that schema. BLIP doesn't support structured output — fall back to cloud when you need schemas.
How does this differ from OCR?
OCR extracts text that appears IN the image. Captioning/VQA describes WHAT the image shows. Both are often paired: OCR to read the receipt text, captioning/VQA to interpret what the receipt is for. The endpoints are complementary, not overlapping.
Is the model hallucination-prone?
BLIP is grounded enough to mostly stick to what's in the image but will occasionally invent details on busy scenes. GPT-4o and Claude Vision are stronger here. For high-stakes pipelines (KYC, medical, legal), pair with structured output schemas + low temperature + downstream validation.
Describe. Ask. Answer.
300 free API calls / month. BLIP + cloud LLMs. Structured output.