CLIP ViT-B/32 — visual similarity foundation model

Find similar images across your dataset — no training, no labels.

Reverse image search, dataset deduplication, content moderation, e-commerce catalogue search. CLIP gives you a 512-d vector per image; visually similar images cluster together in that space. mSightFlow exposes both the embedding endpoint and a hosted search API.

Model         CLIP ViT-B/32 (OpenAI)
Inputs        JPG/PNG ≤ 25 MB
Outputs       512-d L2-normalised vector OR top-k ranked matches
Latency       ~90 ms p50
Free quota    300 calls / month

OpenAI's CLIP was trained on 400 million image-text pairs scraped from the web, and learned to map images into a 512-dimensional vector space where semantically similar images cluster together. That geometry lets you do search, deduplication, clustering, and content moderation with cosine similarity instead of trained classifiers.

mSightFlow exposes CLIP ViT-B/32 in two shapes: the raw /v1/embed endpoint (returns the 512-d vector for you to index in your own vector DB), and /v1/embed/search (hosted in-memory search over a posted dataset, top-k ranked, no infra to set up).
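
Because the vectors come back L2-normalised, cosine similarity is just a dot product, with no division by norms needed. The toy snippet below illustrates that with made-up vectors; no API call involved.

import numpy as np

# Stand-ins for two /v1/embed responses (random values, for illustration only).
rng = np.random.default_rng(0)
a = rng.normal(size=512).astype(np.float32)
b = rng.normal(size=512).astype(np.float32)
a /= np.linalg.norm(a)  # the API already returns unit-length vectors;
b /= np.linalg.norm(b)  # normalising here just keeps the toy example honest

similarity = float(a @ b)  # dot product == cosine similarity on unit vectors
print(f"cosine similarity: {similarity:.3f}")  # ~1.0 near-identical, ~0.0 unrelated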

When CLIP search is the right tool

E-commerce catalogue search

"Show me products like this one." Reverse image search for retail. Snap a sneaker, get similar listings.

Dataset deduplication

Find semantically-duplicate images (different angles of the same scene, recompressed copies) before training to avoid bias; a minimal sketch follows this list.

Content moderation

"Does this image look like any of these flagged references?" Soft matching for UGC platforms. Pair with deepfake detection.

Two endpoints — pick by infra fit

                     /v1/embed                          /v1/embed/search
Returns              512-d L2-normalised vector         Top-k ranked matches with similarity scores
Vector DB needed?    Yes (yours)                        No
Best for             Production-scale search            Prototypes, small catalogues, one-off lookups
                     (millions of images)
Dataset size         Unlimited (you index)              ≤ 10k images per call
Latency              ~90 ms per image                   ~90 ms × N + cosine pass

Code — Python, Node, cURL

Python — embed yourself
import os, requests, numpy as np
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

def embed(path):
    r = requests.post(
        "https://api.msightflow.ai/v1/embed",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": Path(path).read_bytes()},
    ).json()
    return np.array(r["vector"], dtype=np.float32)   # already L2-normalised

# Build a tiny index
vectors = {p.name: embed(p) for p in Path("dataset/").glob("*.jpg")}

# Query
query = embed("query.jpg")
scores = {name: float(query @ v) for name, v in vectors.items()}
for name, sim in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{sim:.3f}  {name}")
Python — hosted search
# Hosted search — no vector DB required for small datasets.
import os, requests
from pathlib import Path

files = [("query", open("query.jpg", "rb"))]
for p in Path("dataset/").glob("*.jpg"):
    files.append(("dataset", open(p, "rb")))

resp = requests.post(
    "https://api.msightflow.ai/v1/embed/search",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files=files,
    data={"top_k": "5"},
)
for hit in resp.json()["results"]:
    print(f"{hit['similarity']:.3f}  {hit['filename']}")
Node.js — embed
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("query.jpg"));

const resp = await fetch("https://api.msightflow.ai/v1/embed", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
  body: form,
});
const { vector } = await resp.json();
console.log("dim:", vector.length, "norm:", Math.sqrt(vector.reduce((s,v)=>s+v*v,0)));
cURL
curl -X POST https://api.msightflow.ai/v1/embed \
  -H "Authorization: Bearer $MSF_API_KEY" \
  -F "image=@query.jpg"

Pricing — same as every other endpoint

Free

$0

  • 300 API calls / month
  • CLIP ViT-B/32 embeddings
  • Hosted search ≤ 10k images
  • No credit card
Start free

Pro

$29/mo

  • Unlimited calls
  • Higher per-provider quotas
Go Pro


FAQ

What does the 512-d vector mean?

CLIP ViT-B/32 maps every image to a point in 512-dimensional space such that visually similar images are close in cosine distance. The vectors are L2-normalised, so cosine similarity reduces to a dot product. Vectors are stable — re-embed an image and you get the same vector ±floating-point noise.

When should I use /v1/embed vs /v1/embed/search?

Use /v1/embed when you have your own vector database (Pinecone, Weaviate, pgvector, etc.) and want to integrate CLIP embeddings into existing search infrastructure. Use /v1/embed/search for one-off searches over small datasets (≤ 10k images) where standing up a vector DB is overkill — we run the cosine search in memory and return ranked results.

Can I do text-to-image search?

Roadmapped. CLIP's text encoder maps phrases to the same 512-d space as the image encoder, so text-to-image search ('a red sneaker', 'a cracked weld') works in principle. We'll ship a /search/text endpoint in a future release; for now use /v1/detect/zero-shot for text-prompted detection.

Is this better than perceptual hashing for deduplication?

Yes for semantic dedup (different angles of the same product, retouched variants), no for exact-byte dedup (which dhash / phash do better). The right tool depends on the threat model. mSightFlow also ships /cv_tools/perceptual_dedup for the dhash use case.

Can I use a larger CLIP model?

ViT-L/14 and OpenCLIP variants are roadmapped for Pro tier. ViT-B/32 is the speed/quality sweet spot for most search and dedup use cases; the larger checkpoints help mostly on long-tail visual distinctions.

One vector. Infinite matches.

300 free API calls / month. CLIP ViT-B/32. Embed or search.