Implementing a Multi-Modal Search Engine Using CLIP and a Vector Database


For decades, search has been dominated by text-based keyword matching, built on ranking schemes like TF-IDF and BM25. While effective, this paradigm breaks down for the web's most prevalent data type: visual media. Users increasingly want to search with images and find images using natural language descriptions, not just predefined tags. This is the domain of multi-modal search.

The challenge has been bridging the semantic gap between unstructured pixel data (images) and unstructured text (language). A system that can understand "a golden retriever catching a red frisbee" and find a matching image without relying on explicit tags has, until recently, been computationally prohibitive or insufficiently accurate.

This article provides a technical blueprint for building a high-performance, scalable multi-modal search engine. We will leverage two key technologies:

  1. CLIP (Contrastive Language-Image Pre-Training): An OpenAI model that embeds both text and images into a shared, high-dimensional vector space.
  2. Vector Databases (e.g., Milvus, Pinecone, Weaviate): Specialized databases designed to store, index, and perform ultra-fast similarity searches on billions of these embedding vectors.

This guide is intended for CTOs and engineers, focusing on architectural patterns, practical implementation, and the performance trade-offs inherent in such a system.

The Core Technology Stack

A successful multi-modal search system is built on two pillars: the Encoder (which understands content) and the Index (which finds content).

The Encoder: CLIP

CLIP is the engine that creates a "shared language" between text and images. It is not one model, but two (a text encoder and an image encoder) that are trained jointly. Their goal is to ensure that the vector for the text "a photo of a dog" is placed near the vector for an actual photo of a dog in the embedding space.

This "nearness" is typically measured by cosine similarity, which calculates the angle between two vectors. A high similarity (close to 1.0) means the concepts are semantically related.

Practical Implementation: Generating Embeddings

We will use the transformers library from Hugging Face, which provides an easy-to-use interface for CLIP models.

import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests

# Load the pre-trained model and processor
# "openai/clip-vit-base-patch32" is a common choice.
# For higher accuracy, consider "openai/clip-vit-large-patch14" (768-dim embeddings)
MODEL_ID = "openai/clip-vit-base-patch32"

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained(MODEL_ID).to(device)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def get_image_embedding(image_path_or_url: str) -> list[float] | None:
    """
    Generates an embedding vector for a given image
    (512 dimensions for clip-vit-base-patch32).
    """
    try:
        if image_path_or_url.startswith("http"):
            image = Image.open(requests.get(image_path_or_url, stream=True).raw)
        else:
            image = Image.open(image_path_or_url)
        image = image.convert("RGB")  # CLIP expects 3-channel RGB input
    except Exception as e:
        print(f"Error loading image: {e}")
        return None

    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt", padding=True).to(device)
        image_features = model.get_image_features(**inputs)
        
        # Normalize for cosine similarity search
        image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
        
        return image_features.cpu().numpy()[0].tolist()

def get_text_embedding(text: str) -> list[float]:
    """
    Generates an embedding vector for a given text string
    (512 dimensions for clip-vit-base-patch32).
    """
    with torch.no_grad():
        inputs = processor(text=[text], return_tensors="pt", padding=True).to(device)
        text_features = model.get_text_features(**inputs)
        
        # Normalize for cosine similarity search
        text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

        return text_features.cpu().numpy()[0].tolist()

# --- Example Usage ---
text_emb = get_text_embedding("a panorama of a mountain range at sunrise")
image_emb = get_image_embedding("https://example.com/images/mountain.jpg")

# The output vectors (text_emb, image_emb) are now ready for
# storage or comparison.
print(f"Generated text embedding of shape: {len(text_emb)}")
print(f"Generated image embedding of shape: {len(image_emb)}")

The Index: Vector Databases

A 512-dimension vector is a dense list of 512 floating-point numbers. Finding the "nearest" vectors to a query vector among billions of entries requires a specialized index. An exact-match query such as SELECT * FROM images WHERE embedding = ? is useless for similarity, and a linear scan (calculating cosine similarity against every vector) is prohibitively slow at scale.

Vector databases solve this by implementing Approximate Nearest Neighbor (ANN) search algorithms, such as HNSW (Hierarchical Navigable Small World).

  • What it does: HNSW builds a multi-layered graph structure that allows for approximately logarithmic-time (extremely fast) search.
  • The Trade-off: It's "approximate" for a reason. You trade perfect 100% recall (finding the absolute closest match) for immense speed. For semantic search, 99% recall is indistinguishable from perfect, as the 2nd or 3rd match is often just as semantically relevant as the 1st.
  • Key Players: Milvus, Pinecone, Weaviate, Qdrant, and Faiss (a library, not a full DB).

These databases provide a simple API: upsert (insert/update) a vector with an ID, and query with a vector to get back the IDs of the top_k nearest neighbors.
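
As a concrete illustration of that upsert/query pattern, here is a minimal sketch using Qdrant's Python client in local, in-memory mode; the collection name, IDs, and vectors are placeholders, and the exact API differs between providers and client versions.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # swap for a real server URL in production

client.create_collection(
    collection_name="image_embeddings",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Upsert: a numeric point ID plus the vector; the application-level image_id
# is kept in the payload (Qdrant point IDs must be integers or UUIDs)
client.upsert(
    collection_name="image_embeddings",
    points=[PointStruct(id=1, vector=[0.01] * 512, payload={"image_id": "img_abc_123"})],
)

# Query: top_k nearest neighbors to a query vector
hits = client.search(
    collection_name="image_embeddings",
    query_vector=[0.01] * 512,
    limit=5,
)
for hit in hits:
    print(hit.payload["image_id"], hit.score)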

System Architecture and Data Ingestion

We need two distinct pipelines: one for Ingestion (populating the database) and one for Querying (serving search requests).

The Ingestion Pipeline (Batch/Streaming)

The goal is to process every image in your collection, generate its CLIP embedding, and store it. This is a highly parallelizable, asynchronous task.

Architecture:

  1. Image Source: An S3 bucket, local file system, or existing database.
  2. Message Queue (e.g., SQS, RabbitMQ, Kafka): An ImageAdded event is published to a queue. The message contains a unique image_id and its location (e.g., s3://my-bucket/image-123.jpg).
  3. Embedding Workers (e.g., Lambda, Kubernetes Pods, Celery):
    • These workers consume messages from the queue.
    • They download the image.
    • They run the get_image_embedding() function defined earlier.
    • They upsert the result into two databases:
      1. Vector Database: vector_db.upsert(id=image_id, vector=embedding_vector)
      2. Metadata Database (e.g., PostgreSQL, DynamoDB): metadata_db.insert(id=image_id, url=image_url, description="...")
        • Crucial: In this design, the vector database stores only vectors and IDs. You must store the mapping from image_id to its actual data (like the image URL) in a separate, conventional database.

Pseudo-code for an Ingestion Worker:

# Assume vector_db_client and metadata_db_client are initialized clients
# Assume 'message' is a consumed object from SQS/Kafka
# message_body = {"image_id": "img_abc_123", "image_url": "s3://..."}

def process_ingestion_message(message_body):
    image_id = message_body.get("image_id")
    image_url = message_body.get("image_url")
    
    if not image_id or not image_url:
        print("Invalid message, skipping.")
        return

    # 1. Generate Embedding
    # Note: Model loading is slow. In production, the model
    # should be pre-loaded in the worker's global scope.
    embedding = get_image_embedding(image_url)
    
    if embedding is None:
        print(f"Failed to generate embedding for {image_id}")
        return

    try:
        # 2. Upsert to Vector Database
        # API will vary by provider (Pinecone, Milvus, etc.)
        vector_db_client.upsert(
            collection_name="image_embeddings",
            vectors=[
                {"id": image_id, "values": embedding}
            ]
        )
        
        # 3. Store metadata
        metadata_db_client.put_item(
            TableName="image_metadata",
            Item={
                "image_id": image_id,
                "s3_url": image_url,
                "created_at": "..."
            }
        )
        print(f"Successfully ingested {image_id}")

    except Exception as e:
        print(f"Error during DB upsert: {e}")
        # Implement retry logic or move to Dead Letter Queue (DLQ)

The Real-Time Query Pipeline

This is the user-facing part of the system, exposed via an API. It must be low-latency.

Text-to-image search:

  1. The user sends a POST /search/text request with {"query": "a red car on a sunny day"}.
  2. The API server calls get_text_embedding("a red car on a sunny day").
  3. The resulting query vector is sent to the Vector Database: vector_db.query(vector=query_vector, top_k=10).
  4. The Vector DB returns a list of Match objects, e.g., [{"id": "img_xyz_789", "score": 0.92}, {"id": "img_abc_123", "score": 0.88}].
  5. The API server takes the list of IDs (["img_xyz_789", "img_abc_123"]) and queries the Metadata Database to fetch the corresponding URLs.
  6. The server returns the list of URLs and scores to the user.

Image-to-image search:

  1. The user sends a POST /search/image request with an uploaded image file.
  2. The API server calls get_image_embedding(uploaded_image_file).
  3. The pipeline is now identical to steps 3-6 of the text-to-image flow.

Example API Implementation (using FastAPI):

from fastapi import FastAPI, File, UploadFile, Form
from pydantic import BaseModel
import os
import shutil

# --- Assume the embedding functions defined earlier are available here ---
# --- Assume vector_db_client and metadata_db_client are initialized ---

app = FastAPI(title="Multi-Modal Search API")

class TextSearchQuery(BaseModel):
    query: str
    top_k: int = 10

class SearchResult(BaseModel):
    id: str
    url: str
    score: float

# This is a placeholder. Use a real DB client (e.g., boto3 for DynamoDB)
def fetch_metadata_from_db(image_ids: list[str]) -> dict:
    # MOCKUP: Simulating a batch lookup
    # In reality: SELECT * FROM image_metadata WHERE image_id IN (...)
    mock_db = {
        "img_abc_123": "https://.../image1.jpg",
        "img_xyz_789": "https://.../image2.png",
    }
    return {img_id: mock_db.get(img_id) for img_id in image_ids if img_id in mock_db}

@app.post("/search/text", response_model=list[SearchResult])
async def search_by_text(query: TextSearchQuery):
    """
    Search for images using a natural language text query.
    """
    # 1. Generate text embedding for the query
    query_embedding = get_text_embedding(query.query)
    
    # 2. Query the Vector Database
    # API format depends on the provider
    query_response = vector_db_client.query(
        collection_name="image_embeddings",
        query_vector=query_embedding,
        top_k=query.top_k
    ) # Example response: [{"id": "img_abc_123", "score": 0.92}, ...]

    # 3. Extract IDs and fetch metadata
    matches = query_response.get("matches", [])
    image_ids = [match["id"] for match in matches]
    metadata_map = fetch_metadata_from_db(image_ids)
    
    # 4. Format and return results
    results = []
    for match in matches:
        image_id = match["id"]
        url = metadata_map.get(image_id)
        if url:
            results.append(
                SearchResult(id=image_id, url=url, score=match["score"])
            )
    return results

@app.post("/search/image", response_model=list[SearchResult])
async def search_by_image(file: UploadFile = File(...), top_k: int = Form(10)):
    """
    Search for similar images using an uploaded image.
    """
    # Save the upload to a temp file so PIL can open it
    temp_file_path = f"/tmp/{file.filename}"
    with open(temp_file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)

    try:
        # 1. Generate image embedding for the query image
        query_embedding = get_image_embedding(temp_file_path)
    finally:
        # Clean up the temp file regardless of success
        os.remove(temp_file_path)
    
    # 2. Query the Vector Database (identical to text search logic)
    query_response = vector_db_client.query(
        collection_name="image_embeddings",
        query_vector=query_embedding,
        top_k=top_k
    )
    
    # 3. Extract IDs and fetch metadata (identical to text search logic)
    matches = query_response.get("matches", [])
    image_ids = [match["id"] for match in matches]
    metadata_map = fetch_metadata_from_db(image_ids)
    
    # 4. Format and return results (identical to text search logic)
    results = []
    for match in matches:
        image_id = match["id"]
        url = metadata_map.get(image_id)
        if url:
            results.append(
                SearchResult(id=image_id, url=url, score=match["score"])
            )
    return results
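
With the API running (for example via uvicorn), the endpoints can be exercised with a short client script; the host, port, and file name below are assumptions for illustration.

import requests

BASE_URL = "http://localhost:8000"  # placeholder for wherever the API is deployed

# Text-to-image search
resp = requests.post(
    f"{BASE_URL}/search/text",
    json={"query": "a red car on a sunny day", "top_k": 5},
)
print(resp.json())

# Image-to-image search with a local file
with open("query.jpg", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/search/image",
        files={"file": ("query.jpg", f, "image/jpeg")},
        data={"top_k": 5},
    )
print(resp.json())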

Architectural Considerations for CTOs

Building the prototype is straightforward. Scaling it to billions of images and sub-100ms p99 latency introduces critical challenges.

  • Indexing Performance vs. Recall: The HNSW algorithm has two key build-time parameters: M (max connections per node) and efConstruction (size of the dynamic candidate list during graph construction). Increasing these improves the quality (recall) of the graph index at the cost of higher build times and a larger memory footprint. A high ef (the search-time counterpart, often called efSearch) increases accuracy at the cost of latency. This is your primary tuning knob. Start with sensible defaults and benchmark recall vs. latency on your own data; see the benchmark sketch after this list.
  • The Hardware is Not Optional: Vector search is memory-bound. The HNSW index must, in most architectures (like Milvus), reside entirely in RAM. For a billion 512-dim vectors (as float32), you need: 1,000,000,000 (vectors) * 512 (dims) * 4 (bytes/float32) ≈ 2.048 TB. This is just for the raw vectors; the graph index itself adds 1.5x-2x overhead. This system requires memory-optimized machines, and sharding the index across a cluster becomes mandatory at scale.
  • Model Deployment: Warmth is Key: The CLIP model (e.g., clip-vit-large-patch14) is large (over 1 GB). If you serve the API endpoints via a serverless function (like AWS Lambda), you will suffer catastrophic cold-start latencies (10-15 seconds) as the model is downloaded and loaded into memory. Solution: use provisioned concurrency (to keep functions warm) or, more appropriately, deploy the API to a persistent container-based service (ECS, Kubernetes) where the model is loaded once at boot.
  • Domain-Specific Finetuning: CLIP is trained on the general web. It may struggle with highly specialized domains (e.g., medical X-rays, satellite imagery, fashion SKUs). For a true competitive advantage, you must finetune CLIP on your own dataset. This involves creating a dataset of (image, text) pairs specific to your domain and continuing the training process. This adapts the embedding space to understand your niche's specific semantics, dramatically improving search relevance.
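
To ground the recall-vs-latency trade-off described above, here is a minimal, self-contained benchmark sketch using Faiss, with random vectors standing in for real CLIP embeddings; the dataset sizes and parameter values are illustrative assumptions only.

import time
import numpy as np
import faiss

d, n_db, n_queries, k = 512, 100_000, 1_000, 10

# Random unit vectors stand in for real CLIP embeddings
xb = np.random.rand(n_db, d).astype("float32")
xq = np.random.rand(n_queries, d).astype("float32")
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

# Exact search (inner product == cosine on normalized vectors) as ground truth
flat = faiss.IndexFlatIP(d)
flat.add(xb)
_, ground_truth = flat.search(xq, k)

# HNSW index: M=32, efConstruction=200 are common starting points
hnsw = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)

for ef in (16, 64, 256):
    hnsw.hnsw.efSearch = ef
    start = time.perf_counter()
    _, approx = hnsw.search(xq, k)
    ms_per_query = (time.perf_counter() - start) * 1000 / n_queries
    recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, ground_truth)])
    print(f"efSearch={ef}: recall@{k}={recall:.3f}, {ms_per_query:.2f} ms/query")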

Conclusion

The combination of CLIP and vector databases has democratized multi-modal search. This architecture moves beyond simple tagging and allows applications to achieve a true, semantic-level understanding of visual and textual data.

By decoupling the asynchronous, heavy-lifting of ingestion from the low-latency, real-time demands of querying, you can build a scalable and resilient system. The primary challenges are not conceptual but operational: managing the memory footprint of the vector index, tuning the ANN parameters for the right speed/accuracy trade-off, and optimizing the model-serving infrastructure to eliminate cold starts.

This system is no longer a research project; it is a practical and essential component for any modern application dealing with large-scale media assets.
