In this post, I’ll share how we approached the challenge of ranking image search results by combining traditional IR techniques with machine learning. Our system needed to handle multiple data sources—including blogs, web pages, and shopping platforms—while adapting to user behavior and capturing semantic relevance.
🔍 Setting the Context
Our search engine powered image discovery across multiple heterogeneous sources:
- Blogs: Rich in user-generated content and long-form text
- Web pages: Often structured and SEO-optimized
- Shopping feeds: Containing products with metadata, prices, and images
The diversity of these sources made ranking quality especially important, since a single query could return vastly different types of content.
⚙️ Traditional Ranking: BM25 + Engagement Signals
Our baseline search system relied on a linear scoring function that combined:
- BM25: A keyword-based ranking algorithm that scores documents based on term frequency (TF), inverse document frequency (IDF), and document length normalization.
- Engagement metrics: Clicks, tags, and favorites helped adjust rankings based on user behavior.
This hybrid scoring approach was inspired by how engines like Elasticsearch operate, but we layered in behavioral data to surface more relevant results.
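To make the baseline concrete, here is a minimal sketch of that kind of linear combination in Python. The BM25 implementation, the engagement fields, and the weights are illustrative assumptions, not our production configuration.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.5, b=0.75):
    """Classic BM25: term frequency, inverse document frequency, and length normalization."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Standard smoothed IDF
        idf = math.log((num_docs - doc_freqs[term] + 0.5) / (doc_freqs[term] + 0.5) + 1)
        # Length-normalized term frequency
        norm_tf = (tf[term] * (k1 + 1)) / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm_tf
    return score

def final_score(bm25, clicks, tags, favorites,
                w_bm25=1.0, w_clicks=0.3, w_tags=0.1, w_favs=0.2):
    """Linear blend of keyword relevance and engagement signals.
    The weights are placeholders; in practice they would be tuned offline."""
    # Log-damping keeps very popular items from dominating on engagement alone.
    return (w_bm25 * bm25
            + w_clicks * math.log1p(clicks)
            + w_tags * math.log1p(tags)
            + w_favs * math.log1p(favorites))
```

Ranking then amounts to sorting the candidate documents by `final_score`, which keeps the whole pipeline fast and easy to debug.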
💡 BM25 is excellent for precision on keyword matches but lacks understanding of meaning or synonyms.
🧠 Enhancing Ranking with Machine Learning: DPR + CLIP
To overcome limitations like semantic mismatch and the long-tail problem (where only popular content gets visibility), we implemented a Dense Passage Retrieval (DPR) system.
Key Components
- CLIP encoders: Used to generate embeddings for both image-text documents and text-only queries
- Multimodal representation: Combined text and visual data into a shared vector space
- Similarity search: Performed nearest neighbor search between query and indexed document embeddings
Embeddings were generated daily through batch processing and stored in a retrieval index optimized for vector search.
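As a rough sketch of the retrieval side, the snippet below encodes image + text documents and a text-only query into the same vector space, then runs a brute-force cosine-similarity search. The open-source `clip-ViT-B-32` checkpoint from sentence-transformers stands in for our encoders, and averaging the image and text embeddings is just one simple fusion choice; the production models, fusion, and vector index differed.

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Open-source CLIP checkpoint used here as a stand-in for the production encoders.
model = SentenceTransformer("clip-ViT-B-32")

def embed_documents(docs):
    """Embed each document as the average of its image and text embeddings,
    so both modalities contribute to the indexed vector."""
    image_embs = model.encode([Image.open(d["image_path"]) for d in docs],
                              normalize_embeddings=True)
    text_embs = model.encode([d["text"] for d in docs], normalize_embeddings=True)
    doc_embs = (image_embs + text_embs) / 2.0
    # Re-normalize so dot products below are cosine similarities.
    return doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)

def search(query, doc_embs, top_k=10):
    """Nearest-neighbor search between a text query and the indexed documents."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_embs @ q
    return np.argsort(-scores)[:top_k]
```

In a real deployment, the document embeddings would come from the daily batch job and the brute-force dot product would be replaced by an approximate nearest-neighbor index.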
🧪 Evaluation Metrics & Experimentation
User-Side Evaluation: Click-Through Rate (CTR)
CTR was our primary business metric. However, it wasn’t always a reliable measure of ranking quality, because the final search results page included results from multiple ranking systems managed by other teams. As a result, changes to our ranking model were often diluted in the aggregated CTR numbers.
Model Evaluation: NDCG and MRR
To better evaluate ranking quality in isolation, we created curated evaluation sets and used:
- NDCG (Normalized Discounted Cumulative Gain): Measures how well the model ranks relevant items near the top.
- MRR (Mean Reciprocal Rank): Focuses on the position of the first relevant result—useful for queries with single-click intents.
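For reference, a minimal implementation of both metrics over graded relevance labels might look like the following (simplified: linear DCG gains, no per-query weighting, and none of the label-collection tooling around it).

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k over graded relevance labels listed in ranked order (linear gain)."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for rels in ranked_relevance_lists:
        first = next((i for i, rel in enumerate(rels) if rel > 0), None)
        total += 1.0 / (first + 1) if first is not None else 0.0
    return total / len(ranked_relevance_lists)

# Example: relevance labels of the top results
print(ndcg_at_k([3, 2, 0, 1], k=4))   # single query
print(mrr([[0, 1, 0], [1, 0, 0]]))    # 0.75 across two queries
```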
🧩 Challenges & Trade-offs
- Multimodal Mismatch: Matching text-only queries against image + text documents was challenging. We relied on CLIP’s cross-modal training but still encountered misalignment in certain domains.
- Cold Start: Items with few or no engagement signals performed poorly in the traditional model. DPR helped by retrieving semantically similar, low-engagement content, but required expensive inference.
- Latency & Caching: DPR was not fast enough for cold requests. While we cached popular queries, initial requests often returned empty or slow results (a simple caching sketch follows this list).
- Experiment Complexity: Because final pages blended results from other pipelines, isolating our model’s impact during A/B tests required careful result bucketing and experiment design.
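One common mitigation for the latency issue is to cache retrieval results for popular (normalized) queries so only cold queries pay the full inference cost. A stripped-down sketch, where `dense_retrieve` is a placeholder for the expensive CLIP-encode-plus-vector-search path:

```python
from functools import lru_cache

def dense_retrieve(query: str, top_k: int = 50) -> list:
    """Placeholder for the expensive path: query encoding + vector index search."""
    return []  # stand-in; the real system would query the retrieval index here

def normalize(query: str) -> str:
    """Light normalization so trivially different query strings share a cache entry."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=50_000)
def _cached_retrieve(normalized_query: str) -> tuple:
    return tuple(dense_retrieve(normalized_query, top_k=50))

def retrieve(query: str) -> list:
    """Serve popular queries from cache; only cold queries pay full DPR latency."""
    return list(_cached_retrieve(normalize(query)))
```

In practice the cache would live in a shared store with TTLs rather than in-process memory, but the shape of the trade-off is the same: fast repeats, slow first hits.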
🌱 Key Takeaways
- BM25 + engagement is fast and interpretable but limited for semantic or long-tail queries.
- Dense retrieval improves relevance, especially for ambiguous or rare queries.
- NDCG/MRR provide more reliable offline evaluation than CTR alone, especially when multiple systems contribute to results.
- Trade-offs between accuracy, speed, and interpretability are constant in real-world search applications.
