CLIP ViT-B/32 — visual similarity foundation model

Find similar images across your dataset — no training, no labels.

Reverse image search, dataset deduplication, content moderation, e-commerce catalogue search. CLIP gives you a 512-d vector per image; visually similar images cluster together in that space. mSightFlow exposes both the embedding endpoint and a hosted search API.

Model         CLIP ViT-B/32 (OpenAI)
Inputs        JPG/PNG ≤ 25 MB
Outputs       512-d L2-normalised vector OR top-k ranked matches
Latency       ~90 ms p50
Free quota    300 calls / month

OpenAI's CLIP was trained on 400 million image-text pairs scraped from the web, and learned to map images into a 512-dimensional vector space where semantically similar images cluster together. That geometry lets you do search, deduplication, clustering, and content moderation with cosine similarity instead of trained classifiers.

mSightFlow exposes CLIP ViT-B/32 in two shapes: the raw /v1/embed endpoint (returns the 512-d vector for you to index in your own vector DB), and /v1/embed/search (hosted in-memory search over a posted dataset, top-k ranked, no infra to set up).
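
Because the vectors come back L2-normalised, cosine similarity is just a dot product, with no division by norms needed. The toy snippet below illustrates that with made-up vectors; no API call involved.

import numpy as np

# Stand-ins for two /v1/embed responses (random values, for illustration only).
rng = np.random.default_rng(0)
a = rng.normal(size=512).astype(np.float32)
b = rng.normal(size=512).astype(np.float32)
a /= np.linalg.norm(a)  # the API already returns unit-length vectors;
b /= np.linalg.norm(b)  # normalising here just keeps the toy example honest

similarity = float(a @ b)  # dot product == cosine similarity on unit vectors
print(f"cosine similarity: {similarity:.3f}")  # ~1.0 near-identical, ~0.0 unrelated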

When CLIP search is the right tool

E-commerce catalogue search

"Show me products like this one." Reverse image search for retail. Snap a sneaker, get similar listings.

Dataset deduplication

Find semantically-duplicate images (different angles of the same scene, recompressed copies) before training to avoid bias; a minimal sketch follows this list.

Content moderation

"Does this image look like any of these flagged references?" Soft matching for UGC platforms. Pair with deepfake detection.

Two endpoints — pick by infra fit

                     /v1/embed                          /v1/embed/search
Returns              512-d L2-normalised vector         Top-k ranked matches with similarity scores
Vector DB needed?    Yes (yours)                        No
Best for             Production-scale search            Prototypes, small catalogues, one-off lookups
                     (millions of images)
Dataset size         Unlimited (you index)              ≤ 10k images per call
Latency              ~90 ms per image                   ~90 ms × N + cosine pass

Code — Python, Node, cURL

Python — embed yourself
import os, requests, numpy as np
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

def embed(path):
    r = requests.post(
        "https://api.msightflow.ai/v1/embed",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": Path(path).read_bytes()},
    ).json()
    return np.array(r["vector"], dtype=np.float32)   # already L2-normalised

# Build a tiny index
vectors = {p.name: embed(p) for p in Path("dataset/").glob("*.jpg")}

# Query
query = embed("query.jpg")
scores = {name: float(query @ v) for name, v in vectors.items()}
for name, sim in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{sim:.3f}  {name}")
Python — hosted search
# Hosted search — no vector DB required for small datasets.
import os, requests
from pathlib import Path

files = [("query", open("query.jpg", "rb"))]
for p in Path("dataset/").glob("*.jpg"):
    files.append(("dataset", open(p, "rb")))

resp = requests.post(
    "https://api.msightflow.ai/v1/embed/search",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files=files,
    data={"top_k": "5"},
)
for hit in resp.json()["results"]:
    print(f"{hit['similarity']:.3f}  {hit['filename']}")
Node.js — embed
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("query.jpg"));

const resp = await fetch("https://api.msightflow.ai/v1/embed", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
  body: form,
});
const { vector } = await resp.json();
console.log("dim:", vector.length, "norm:", Math.sqrt(vector.reduce((s,v)=>s+v*v,0)));
cURL
curl -X POST https://api.msightflow.ai/v1/embed \
  -H "Authorization: Bearer $MSF_API_KEY" \
  -F "image=@query.jpg"

Pricing — same as every other endpoint

Free

$0

  • 300 API calls / month
  • CLIP ViT-B/32 embeddings
  • Hosted search ≤ 10k images
  • No credit card
Start free

Pro

$29/mo

  • Unlimited calls
  • Higher per-provider quotas
Go Pro


FAQ

What does the 512-d vector mean?

CLIP ViT-B/32 maps every image to a point in 512-dimensional space such that visually similar images are close in cosine distance. The vectors are L2-normalised, so cosine similarity reduces to a dot product. Vectors are stable — re-embed an image and you get the same vector ±floating-point noise.

When should I use /v1/embed vs /v1/embed/search?

Use /v1/embed when you have your own vector database (Pinecone, Weaviate, pgvector, etc.) and want to integrate CLIP embeddings into existing search infrastructure. Use /v1/embed/search for one-off searches over small datasets (≤ 10k images) where standing up a vector DB is overkill — we run the cosine search in memory and return ranked results.

Can I do text-to-image search?

Roadmapped. CLIP's text encoder maps phrases to the same 512-d space as the image encoder, so text-to-image search ('a red sneaker', 'a cracked weld') works in principle. We'll ship a /search/text endpoint in a future release; for now use /v1/detect/zero-shot for text-prompted detection.

Is this better than perceptual hashing for deduplication?

Yes for semantic dedup (different angles of the same product, retouched variants), no for exact-byte dedup (which dhash / phash do better). The right tool depends on the threat model. mSightFlow also ships /cv_tools/perceptual_dedup for the dhash use case.

Can I use a larger CLIP model?

ViT-L/14 and OpenCLIP variants are roadmapped for Pro tier. ViT-B/32 is the speed/quality sweet spot for most search and dedup use cases; the larger checkpoints help mostly on long-tail visual distinctions.

One vector. Infinite matches.

300 free API calls / month. CLIP ViT-B/32. Embed or search.