Find similar images across your dataset — no training, no labels.
Reverse image search, dataset deduplication, content moderation, e-commerce catalogue search. CLIP gives you a 512-d vector per image; visually similar images cluster together in that space. mSightFlow exposes both the embedding endpoint and a hosted search API.
- Model: CLIP ViT-B/32 (OpenAI)
- Inputs: JPG/PNG ≤ 25 MB
- Outputs: 512-d L2-normalised vector OR top-k ranked matches
- Latency: ~90 ms p50
- Free quota: 300 calls / month
OpenAI's CLIP, trained on 400 million image-text pairs scraped from the web, maps images into a 512-dimensional vector space where semantically similar images cluster together. That geometry lets you do search, deduplication, clustering, and content moderation with cosine similarity instead of trained classifiers.
mSightFlow exposes CLIP ViT-B/32 in two shapes: the raw /v1/embed endpoint (returns the 512-d vector for you to index in your own vector DB), and /v1/embed/search (hosted in-memory search over a posted dataset, top-k ranked, no infra to set up).
When CLIP search is the right tool
E-commerce catalogue search
"Show me products like this one." Reverse image search for retail. Snap a sneaker, get similar listings.
Dataset deduplication
Find semantic duplicates (different angles of the same scene, recompressed copies) before training to avoid dataset bias.
Content moderation
"Does this image look like any of these flagged references?" Soft matching for UGC platforms. Pair with deepfake detection.
Two endpoints — pick by infra fit
| | /v1/embed | /v1/embed/search |
|---|---|---|
| Returns | 512-d L2-normalised vector | Top-k ranked matches with similarity scores |
| Vector DB needed? | Yes (yours) | No |
| Best for | Production-scale search (millions of images) | Prototypes, small catalogues, one-off lookups |
| Dataset size | Unlimited (you index) | ≤ 10k images per call |
| Latency | ~90 ms per image | ~90 ms × N + cosine pass |
Code — Python, Node, cURL
```python
import os, requests, numpy as np
from pathlib import Path

api_key = os.environ["MSF_API_KEY"]

def embed(path):
    r = requests.post(
        "https://api.msightflow.ai/v1/embed",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"image": Path(path).read_bytes()},
    ).json()
    return np.array(r["vector"], dtype=np.float32)  # already L2-normalised

# Build a tiny index
vectors = {p.name: embed(p) for p in Path("dataset/").glob("*.jpg")}

# Query: on normalised vectors, cosine similarity is a dot product
query = embed("query.jpg")
scores = {name: float(query @ v) for name, v in vectors.items()}
for name, sim in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{sim:.3f}  {name}")
```
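Deduplication builds on the same primitives: flag every pair whose cosine similarity exceeds a threshold. A minimal sketch — synthetic unit vectors stand in for real `embed()` output, and the 0.95 threshold is an illustrative assumption to tune on your own data:

```python
import numpy as np

def find_duplicates(vectors: dict, threshold: float = 0.95):
    """Return (name_a, name_b, similarity) for pairs above threshold.

    Assumes vectors are already L2-normalised, so cosine similarity
    is a plain dot product.
    """
    names = list(vectors)
    mat = np.stack([vectors[n] for n in names])  # (N, 512)
    sims = mat @ mat.T                           # pairwise cosine similarities
    dupes = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sims[i, j] >= threshold:
                dupes.append((names[i], names[j], float(sims[i, j])))
    return dupes

# Synthetic demo: b is a near-copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=512); a /= np.linalg.norm(a)
b = a + rng.normal(scale=0.01, size=512); b /= np.linalg.norm(b)
c = rng.normal(size=512); c /= np.linalg.norm(c)
pairs = find_duplicates({"a.jpg": a, "b.jpg": b, "c.jpg": c})
print(pairs)  # only the (a.jpg, b.jpg) pair should exceed the threshold
```

The O(N²) pairwise pass is fine for a few thousand images; beyond that, index the vectors and query each one's nearest neighbours instead.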
```python
# Hosted search — no vector DB required for small datasets.
import os, requests
from pathlib import Path

files = [("query", open("query.jpg", "rb"))]
for p in Path("dataset/").glob("*.jpg"):
    files.append(("dataset", open(p, "rb")))

resp = requests.post(
    "https://api.msightflow.ai/v1/embed/search",
    headers={"Authorization": f"Bearer {os.environ['MSF_API_KEY']}"},
    files=files,
    data={"top_k": "5"},
)
for hit in resp.json()["results"]:
    print(f"{hit['similarity']:.3f}  {hit['filename']}")
```
```js
import fetch from "node-fetch";
import FormData from "form-data";
import fs from "fs";

const form = new FormData();
form.append("image", fs.createReadStream("query.jpg"));

const resp = await fetch("https://api.msightflow.ai/v1/embed", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.MSF_API_KEY}` },
  body: form,
});
const { vector } = await resp.json();
console.log("dim:", vector.length, "norm:", Math.sqrt(vector.reduce((s, v) => s + v * v, 0)));
```
```bash
curl -X POST https://api.msightflow.ai/v1/embed \
  -H "Authorization: Bearer $MSF_API_KEY" \
  -F "image=@query.jpg"
```
Pricing — same as every other endpoint
Free
$0
- 300 API calls / month
- CLIP ViT-B/32 embeddings
- Hosted search ≤ 10k images
- No credit card
Related features
Auto-labelling
CLIP embeddings cluster a dataset so you can label one image per cluster — then propagate labels by similarity.
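That propagation step can be sketched with plain numpy. Synthetic, well-separated clusters stand in for real CLIP embeddings, and the function name is illustrative, not part of the API:

```python
import numpy as np

def propagate_labels(anchors, anchor_labels, vectors):
    """anchors: (K, D) labelled embeddings; vectors: (N, D) unlabelled.

    Returns each vector's nearest anchor's label. Argmax of the dot
    product is valid because the embeddings are L2-normalised.
    """
    sims = vectors @ anchors.T  # (N, K) cosine similarities
    return [anchor_labels[i] for i in sims.argmax(axis=1)]

# Synthetic demo: two cluster centres, three noisy points around each
rng = np.random.default_rng(1)
def unit(v): return v / np.linalg.norm(v)
center_a, center_b = unit(rng.normal(size=512)), unit(rng.normal(size=512))
anchors = np.stack([center_a, center_b])
points = np.stack(
    [unit(center_a + rng.normal(scale=0.05, size=512)) for _ in range(3)]
    + [unit(center_b + rng.normal(scale=0.05, size=512)) for _ in range(3)]
)
labels = propagate_labels(anchors, ["cat", "dog"], points)
print(labels)  # first three "cat", last three "dog"
```

In practice you would cluster first (e.g. k-means on the embeddings), hand-label one image per cluster, and use those as the anchors.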
Zero-shot detection
Grounding DINO for text-prompted detection. CLIP shares the same image-text space and powers retrieval.
AI / deepfake detection
One of the four detectors (UniversalFakeDetect) is a CLIP linear-probe. Same backbone, different head.
FAQ
What does the 512-d vector mean?
CLIP ViT-B/32 maps every image to a point in 512-dimensional space such that visually similar images are close in cosine distance. The vectors are L2-normalised, so cosine similarity reduces to a dot product. Vectors are stable — re-embed an image and you get the same vector ±floating-point noise.
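Concretely, "L2-normalised" means each vector has unit length, so the cosine denominator is 1 and the similarity collapses to a dot product. A quick numpy check, with random unit vectors standing in for API responses:

```python
import numpy as np

rng = np.random.default_rng(42)
v = rng.normal(size=512)
v = v / np.linalg.norm(v)  # unit length, as the API returns
w = rng.normal(size=512)
w = w / np.linalg.norm(w)

cosine = float(v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))
dot = float(v @ w)
print(np.isclose(cosine, dot))  # True: both denominators are 1
```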
When should I use /v1/embed vs /v1/embed/search?
Use /v1/embed when you have your own vector database (Pinecone, Weaviate, pgvector, etc.) and want to integrate CLIP embeddings into existing search infrastructure. Use /v1/embed/search for one-off searches over small datasets (≤ 10k images) where standing up a vector DB is overkill — we run the cosine search in memory and return ranked results.
Can I do text-to-image search?
Roadmapped. CLIP's text encoder maps phrases to the same 512-d space as the image encoder, so text-to-image search ('a red sneaker', 'a cracked weld') works in principle. We'll ship a /search/text endpoint in a future release; for now use /v1/detect/zero-shot for text-prompted detection.
Is this better than perceptual hashing for deduplication?
Yes for semantic dedup (different angles of the same product, retouched variants), no for exact-byte dedup (which dhash / phash do better). The right tool depends on the threat model. mSightFlow also ships /cv_tools/perceptual_dedup for the dhash use case.
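To make the contrast concrete, here is a pure-Python difference hash (not the mSightFlow implementation; a textbook dhash sketch using Pillow). It is robust to recompression and small brightness shifts but blind to semantic similarity — two photos of the same product from different angles hash to unrelated bit strings:

```python
from PIL import Image

def dhash(img: Image.Image, size: int = 8) -> int:
    """Difference hash: 1 bit per adjacent-pixel comparison on a
    (size+1) x size grayscale thumbnail."""
    g = img.convert("L").resize((size + 1, size))
    px = list(g.getdata())
    bits = 0
    for row in range(size):
        for col in range(size):
            left = px[row * (size + 1) + col]
            right = px[row * (size + 1) + col + 1]
            bits = (bits << 1) | (left < right)
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Synthetic demo: a horizontal gradient and a lightly brightened copy
img = Image.new("L", (64, 64))
img.putdata([min(x * 4, 255) for y in range(64) for x in range(64)])
copy = img.point(lambda p: min(p + 10, 255))  # recompression-style tweak
print(hamming(dhash(img), dhash(copy)))  # small distance: near-duplicate
```

A small Hamming distance (a handful of bits out of 64) flags byte-level near-duplicates; for "same scene, different shot" you need the CLIP embedding route above.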
Can I use a larger CLIP model?
ViT-L/14 and OpenCLIP variants are roadmapped for Pro tier. ViT-B/32 is the speed/quality sweet spot for most search and dedup use cases; the larger checkpoints help mostly on long-tail visual distinctions.
One vector. Infinite matches.
300 free API calls / month. CLIP ViT-B/32. Embed or search.