Describe images, or answer questions about them.
Two modes, one endpoint. Caption describes the image; VQA answers a specific question. Choose BLIP-Base for speed and free-tier coverage, or cloud LLMs (GPT-4o / Claude Vision) when you need reasoning or structured JSON output.
- Model: BLIP-Base + optional GPT-4o / Claude Vision
- Inputs: JPG/PNG ≤ 25 MB · optional text question
- Outputs: caption string OR VQA answer · optional JSON schema
- Latency: ~300 ms (BLIP) / ~2 s (cloud)
- Free quota: 300 calls / month
Captioning answers “what is in this image?” in natural language; VQA answers a specific follow-up such as “is this thing present?” or “how many are there?”. Both modes sit behind one endpoint with the same provider switch (BLIP for cheap-and-fast, cloud LLMs for hard reasoning), so you can prototype on BLIP and graduate hot paths to GPT-4o without changing your code.
When captioning + VQA are the right tools
Accessibility — auto alt-text
Generate WCAG-compliant alt-text for image-heavy products and CMSs. BLIP for the bulk; cloud fallback when the image carries meaning a one-line caption won't capture.
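A minimal sketch of that split, reusing the caption and VQA calls from the Code section below. The 20-character threshold, the fallback question, and the uploads/ directory are illustrative assumptions, not part of the API:
# Sketch: bulk alt-text with BLIP, escalating to a cloud model when the
# caption looks too thin to serve as accessibility text.
# The 20-character threshold is an illustrative heuristic, not an API feature.
import os, requests
from pathlib import Path

API = "https://api.msightflow.ai/v1/describe"
HEADERS = {"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"}

def alt_text(path):
    # First pass: cheap, fast BLIP caption.
    resp = requests.post(
        API,
        headers=HEADERS,
        files={"image": Path(path).read_bytes()},
        data={"mode": "caption", "provider": "blip"},
    )
    caption = resp.json()["caption"]
    if len(caption) >= 20:
        return caption
    # Fallback: ask a cloud model for a fuller, accessibility-grade description.
    resp = requests.post(
        API,
        headers=HEADERS,
        files={"image": Path(path).read_bytes()},
        data={
            "mode": "vqa",
            "question": "Write one sentence of descriptive alt-text for this image.",
            "provider": "cloud",
            "model": "gpt-4o",
        },
    )
    return resp.json()["answer"]

for image in Path("uploads").glob("*.jpg"):
    print(image.name, "→", alt_text(image))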
E-commerce descriptions
Auto-generate product titles, colour tags, material guesses, and category labels from photos. Pair with output_schema for structured catalogues.
Moderation & flagging
Natural-language moderation: “is this image safe for under-13?”, “is there alcohol visible?”. Flexible policies without per-class training.
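A sketch of one such policy check via VQA. The policy question and the naive yes/no parsing are illustrative assumptions; the endpoint itself just returns prose:
# Sketch: natural-language moderation as a VQA call. The policy question and
# the yes/no parsing below are illustrative, not part of the endpoint.
import os, requests
from pathlib import Path

resp = requests.post(
    "https://api.msightflow.ai/v1/describe",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files={"image": Path("upload.jpg").read_bytes()},
    data={
        "mode": "vqa",
        "question": "Is there alcohol visible in this image? Answer yes or no.",
        "provider": "cloud",
        "model": "gpt-4o",
    },
)
answer = resp.json()["answer"]
flagged = answer.strip().lower().startswith("yes")
print("flag for review" if flagged else "pass")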
Two modes, three providers
| | BLIP-Base (default) | GPT-4o | Claude Vision |
|---|---|---|---|
| Latency | ~300 ms | ~2 s | ~2.5 s |
| Cost | 1 quota unit | 1 + provider quota | 1 + provider quota |
| Free tier? | Yes | Provider quota required | Provider quota required |
| Reasoning / counting | Limited | Strong | Strongest |
| Structured JSON | No | Yes | Yes |
| Best for | Alt-text, simple captions, batch | Reasoning, structured extraction | Long-form, careful description |
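In practice this table usually collapses into a small routing rule per request. A minimal sketch, where needs_reasoning and wants_schema stand in for whatever your application already knows about the request; they are not API fields:
# Sketch: pick request parameters from the comparison above.
# needs_reasoning / wants_schema are application-level flags, not API fields.
def provider_params(needs_reasoning=False, wants_schema=False):
    if needs_reasoning or wants_schema:
        return {"provider": "cloud", "model": "gpt-4o"}
    return {"provider": "blip"}

# e.g. a bulk alt-text job stays on BLIP; a counting question goes to GPT-4o.
data = {"mode": "vqa", "question": "How many pallets are visible?",
        **provider_params(needs_reasoning=True)}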
Code — Caption · VQA · Structured
# Caption mode: a one-line description from BLIP, the default provider.
import os, requests
from pathlib import Path
resp = requests.post(
"https://api.msightflow.ai/v1/describe",
headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
files={"image": Path("product.jpg").read_bytes()},
data={"mode": "caption", "provider": "blip"},
)
print(resp.json()["caption"])
# → "a pair of red sneakers on a wooden floor"
# VQA mode — ask a specific question.
resp = requests.post(
"https://api.msightflow.ai/v1/describe",
headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
files={"image": Path("warehouse.jpg").read_bytes()},
data={
"mode": "vqa",
"question": "How many people are wearing hard hats?",
"provider": "cloud",
"model": "gpt-4o",
},
)
print(resp.json()["answer"])
# → "Three people are wearing hard hats."
# Structured extraction — pass a JSON schema.
resp = requests.post(
"https://api.msightflow.ai/v1/describe",
headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
files={"image": Path("listing.jpg").read_bytes()},
data={
"mode": "vqa",
"question": "Describe this product listing for an e-commerce catalogue.",
"provider": "cloud",
"model": "gpt-4o",
"output_schema": '''{
"title": "string",
"main_colour": "string",
"material": "string",
"size_label": "string",
"alt_text": "string"
}''',
},
)
print(resp.json()["structured"])
# → {"title": "Red sneakers", "main_colour": "red", "material": "canvas", ...}
// Same caption call from Node.js.
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";
const form = new FormData();
form.append("image", fs.createReadStream("product.jpg"));
form.append("mode", "caption");
const resp = await fetch("https://api.msightflow.ai/v1/describe", {
method: "POST",
headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
body: form,
});
const { caption } = await resp.json();
console.log(caption);
Pricing — same as every other endpoint
Related features
OCR
OCR reads the text in the image; captioning/VQA describes what the image shows. Pair them to both read a receipt and understand what it is for.
Zero-shot detection
Where VQA returns prose, zero-shot detection returns bounding boxes. Use both for visual reasoning + grounding.
AI / deepfake detection
VQA can answer “does this look AI-generated?” — but the specialised detector is more accurate. Use both for defence in depth.
FAQ
BLIP vs GPT-4o vs Claude — when does which win?
BLIP-Base is fast (~300 ms), free, and gives competent literal descriptions. GPT-4o and Claude Vision are slower (~2-3 s) and use a paid provider quota, but handle reasoning ('what's unusual about this scene?'), counting, and structured extraction much better. For alt-text and bulk catalogue captioning, BLIP is enough. For UGC moderation, ad copy, accessibility-grade alt-text, and structured field extraction, use the cloud fallback.
What's VQA and when do I use it?
Visual Question Answering: you send an image AND a question; the model returns an answer. 'How many people in this image?', 'Is there a forklift?', 'What colour is the car?'. VQA replaces a custom classification model for ad-hoc questions you don't want to train for. It's also a fast way to extract structured info from receipts and forms when paired with a JSON output schema.
Can the output be structured JSON?
Yes, with the cloud provider. Pass output_schema as a JSON-Schema string, e.g. {"product_name":"string","colour":"string","defects":["string"]}. The model is prompted to return JSON matching that schema. BLIP doesn't support structured output — fall back to cloud when you need schemas.
How does this differ from OCR?
OCR extracts text that appears IN the image. Captioning/VQA describes WHAT the image shows. Both are often paired: OCR to read the receipt text, captioning/VQA to interpret what the receipt is for. The endpoints are complementary, not overlapping.
Is the model hallucination-prone?
BLIP is grounded enough to mostly stick to what's in the image but will occasionally invent details on busy scenes. GPT-4o and Claude Vision are stronger here. For high-stakes pipelines (KYC, medical, legal), pair with structured output schemas + low temperature + downstream validation.
Describe. Ask. Answer.
300 free API calls / month. BLIP + cloud LLMs. Structured output.