In this post, I’ll share how we approached the challenge of ranking image search results by combining traditional IR techniques with machine learning. Our system needed to handle multiple data sources—including blogs, web pages, and shopping platforms—while adapting to user behavior and capturing semantic relevance.
🔍 Setting the Context
Our search engine powered image discovery across multiple heterogeneous sources:
- Blogs: Rich in user-generated content and long-form text
- Web pages: Often structured and SEO-optimized
- Shopping feeds: Containing products with metadata, prices, and images
The diversity of these sources made ranking quality especially important, since a single query could return vastly different types of content.
⚙️ Traditional Ranking: BM25 + Engagement Signals
Our baseline search system relied on a linear scoring function that combined:
- BM25: A keyword-based ranking algorithm that scores documents based on term frequency (TF), inverse document frequency (IDF), and document length normalization.
- Engagement metrics: Clicks, tags, and favorites helped adjust rankings based on user behavior.
This hybrid scoring approach was inspired by how engines like Elasticsearch operate, but we layered in behavioral data to surface more relevant results.
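To make the baseline concrete, here is a minimal sketch of that kind of linear combination in Python. The BM25 implementation, the engagement fields, and the weights are illustrative assumptions, not our production configuration.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.5, b=0.75):
    """Classic BM25: term frequency, inverse document frequency, and length normalization."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Standard smoothed IDF
        idf = math.log((num_docs - doc_freqs[term] + 0.5) / (doc_freqs[term] + 0.5) + 1)
        # Length-normalized term frequency
        norm_tf = (tf[term] * (k1 + 1)) / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm_tf
    return score

def final_score(bm25, clicks, tags, favorites,
                w_bm25=1.0, w_clicks=0.3, w_tags=0.1, w_favs=0.2):
    """Linear blend of keyword relevance and engagement signals.
    The weights are placeholders; in practice they would be tuned offline."""
    # Log-damping keeps very popular items from dominating on engagement alone.
    return (w_bm25 * bm25
            + w_clicks * math.log1p(clicks)
            + w_tags * math.log1p(tags)
            + w_favs * math.log1p(favorites))
```

Ranking then amounts to sorting the candidate documents by `final_score`, which keeps the whole pipeline fast and easy to debug.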
💡 BM25 is excellent for precision on keyword matches but lacks understanding of meaning or synonyms.
🧠 Enhancing Ranking with Machine Learning: DPR + CLIP
To overcome limitations like semantic mismatch and the long-tail problem (where only popular content gets visibility), we implemented a Dense Passage Retrieval (DPR) system.
Key Components
- CLIP encoders: Used to generate embeddings for both image-text documents and text-only queries
- Multimodal representation: Combined text and visual data into a shared vector space
- Similarity search: Performed nearest neighbor search between query and indexed document embeddings
Embeddings were generated daily through batch processing and stored in a retrieval index optimized for vector search.
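As a rough sketch of the retrieval side, the snippet below encodes image + text documents and a text-only query into the same vector space, then runs a brute-force cosine-similarity search. The open-source `clip-ViT-B-32` checkpoint from sentence-transformers stands in for our encoders, and averaging the image and text embeddings is just one simple fusion choice; the production models, fusion, and vector index differed.

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Open-source CLIP checkpoint used here as a stand-in for the production encoders.
model = SentenceTransformer("clip-ViT-B-32")

def embed_documents(docs):
    """Embed each document as the average of its image and text embeddings,
    so both modalities contribute to the indexed vector."""
    image_embs = model.encode([Image.open(d["image_path"]) for d in docs],
                              normalize_embeddings=True)
    text_embs = model.encode([d["text"] for d in docs], normalize_embeddings=True)
    doc_embs = (image_embs + text_embs) / 2.0
    # Re-normalize so dot products below are cosine similarities.
    return doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)

def search(query, doc_embs, top_k=10):
    """Nearest-neighbor search between a text query and the indexed documents."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_embs @ q
    return np.argsort(-scores)[:top_k]
```

In a real deployment, the document embeddings would come from the daily batch job and the brute-force dot product would be replaced by an approximate nearest-neighbor index.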
🧪 Evaluation Metrics & Experimentation
User-Side Evaluation: Click-Through Rate (CTR)
CTR was our primary business metric. However, it wasn’t always a reliable measure of ranking quality, because the final search results page included results from multiple ranking systems managed by other teams. As a result, changes to our ranking model were often diluted in the aggregated CTR numbers.
Model Evaluation: NDCG and MRR
To better evaluate ranking quality in isolation, we created curated evaluation sets and used:
- NDCG (Normalized Discounted Cumulative Gain): Measures how well the model ranks relevant items near the top.
- MRR (Mean Reciprocal Rank): Focuses on the position of the first relevant result—useful for queries with single-click intents.
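For reference, a minimal implementation of both metrics over graded relevance labels might look like the following (simplified: linear DCG gains, no per-query weighting, and none of the label-collection tooling around it).

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k over graded relevance labels listed in ranked order (linear gain)."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for rels in ranked_relevance_lists:
        first = next((i for i, rel in enumerate(rels) if rel > 0), None)
        total += 1.0 / (first + 1) if first is not None else 0.0
    return total / len(ranked_relevance_lists)

# Example: relevance labels of the top results
print(ndcg_at_k([3, 2, 0, 1], k=4))   # single query
print(mrr([[0, 1, 0], [1, 0, 0]]))    # 0.75 across two queries
```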
🧩 Challenges & Trade-offs
- Multimodal Mismatch: Matching text-only queries against image + text documents was challenging. We relied on CLIP’s cross-modal training but still encountered misalignment in certain domains.
- Cold Start: Items with few or no engagement signals performed poorly in the traditional model. DPR helped by retrieving semantically similar, low-engagement content, but required expensive inference.
- Latency & Caching: DPR was not fast enough for cold requests. While we cached popular queries, initial requests often returned empty or slow results (a simple caching sketch follows this list).
- Experiment Complexity: Because final pages blended results from other pipelines, isolating our model’s impact during A/B tests required careful result bucketing and experiment design.
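One common mitigation for the latency issue is to cache retrieval results for popular (normalized) queries so only cold queries pay the full inference cost. A stripped-down sketch, where `dense_retrieve` is a placeholder for the expensive CLIP-encode-plus-vector-search path:

```python
from functools import lru_cache

def dense_retrieve(query: str, top_k: int = 50) -> list:
    """Placeholder for the expensive path: query encoding + vector index search."""
    return []  # stand-in; the real system would query the retrieval index here

def normalize(query: str) -> str:
    """Light normalization so trivially different query strings share a cache entry."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=50_000)
def _cached_retrieve(normalized_query: str) -> tuple:
    return tuple(dense_retrieve(normalized_query, top_k=50))

def retrieve(query: str) -> list:
    """Serve popular queries from cache; only cold queries pay full DPR latency."""
    return list(_cached_retrieve(normalize(query)))
```

In practice the cache would live in a shared store with TTLs rather than in-process memory, but the shape of the trade-off is the same: fast repeats, slow first hits.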
🌱 Key Takeaways
- BM25 + engagement is fast and interpretable but limited for semantic or long-tail queries.
- Dense retrieval improves relevance, especially for ambiguous or rare queries.
- NDCG/MRR provide more reliable offline evaluation than CTR alone, especially when multiple systems contribute to results.
- Trade-offs between accuracy, speed, and interpretability are constant in real-world search applications.
