Why Dense and Sparse Vectors Work Better Together: A Beginner's Guide to Multi-Vector Collections
Imagine you're looking for a book in a massive library. You might search by the exact title (keyword search) or describe what the book is about (semantic search). Sometimes you need both approaches to find exactly what you're looking for. This is precisely why we use multi-vector collections in vector databases!
The Problem with the Single Vector Approach
Let's start with a real-world example to understand the limitation:
Document: "The neural network model exhibits overfitting behaviour on the training dataset"
User Query 1: "overfitting neural network"
User Query 2: "AI model performing poorly on training data"
With a single-vector approach:
Dense only: Query 1 may rank the document lower than expected, because exact terms like "overfitting" get blurred into the overall meaning
Sparse only: Query 2 misses the document entirely, because it shares almost no exact terms with it ("AI" and "performing poorly" never appear)
This is where the magic of combining different vector types comes in!
What Are Multi-Vector Collections?
Multi-vector collections allow you to store multiple different representations of the same content within a single collection. Think of it as having different "lenses" through which you can view and search your data:
Dense Vectors: Understand meaning and context
Sparse Vectors: Focus on exact keywords and terms
Hybrid Approach: Combines both for superior search results
# Example of multi-vector structure
document_vectors = {
    "semantic": [0.123, -0.456, 0.789, ...],    # Dense vector (1536 dims)
    "keywords": [0, 0, 0.67, 0, 0.45, 0, ...],  # Sparse vector (5000 dims)
    "metadata": {"title": "ML Paper", "author": "John Doe"}
}
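In practice, a mostly-zero array like the "keywords" vector above is rarely stored as a full array; it is compressed into parallel lists of indices and values. A minimal pure-Python sketch of that conversion (illustrative only):

```python
def to_sparse_pairs(dense_array):
    """Convert a mostly-zero array into (indices, values) pairs,
    the compressed form most vector databases use for sparse vectors."""
    indices = [i for i, v in enumerate(dense_array) if v != 0]
    values = [dense_array[i] for i in indices]
    return indices, values

keywords = [0, 0, 0.67, 0, 0.45, 0]
indices, values = to_sparse_pairs(keywords)
print(indices, values)  # [2, 4] [0.67, 0.45]
```

Only the non-zero positions are kept, which is why sparse vectors stay cheap even with vocabularies of tens of thousands of terms.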
Why Do We Need a Multi-Vector Approach?
1. Complementary Strengths
Dense Vectors Excel At:
Understanding synonyms ("car" ≈ "automobile")
Capturing context and meaning
Finding conceptually similar content
Handling paraphrases and different ways of expressing ideas
Sparse Vectors Excel At:
Exact keyword matching
Finding specific terms or phrases
Technical terminology searches
Proper nouns and unique identifiers
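A tiny illustration of the trade-off, using hand-picked 2-D "embeddings" as a stand-in for a real dense model (the numbers are invented for this example):

```python
import math

# Hypothetical 2-D embeddings standing in for a real dense model
dense = {"car": [0.9, 0.4], "automobile": [0.88, 0.45], "banana": [0.1, 0.95]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Sparse view: exact-term overlap between query and document
doc_terms = {"car", "for", "sale"}
query = "automobile"

print(query in doc_terms)                         # False: exact match fails
print(cosine(dense["car"], dense["automobile"]))  # near 1.0: dense match succeeds
print(cosine(dense["car"], dense["banana"]))      # much lower
```

The sparse view cannot connect "automobile" to "car", while the dense view places them almost on top of each other; combining the two covers both failure modes.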
2. Real-World Search Scenarios
Consider an e-commerce product search:
# Product: "Apple MacBook Pro 16-inch M2 laptop computer"
# User searches: "16 inch Apple laptop"
# Dense vector: Understands "laptop" ≈ "computer" ≈ "MacBook"
# Sparse vector: Matches exact terms "16", "inch", "Apple", "laptop"
# Combined: Perfect match!
# User searches: "portable workstation for developers"
# Dense vector: Connects "portable workstation" with "laptop computer"
# Sparse vector: Might miss due to different terminology
# Combined: Dense carries the weight, sparse provides precision
3. Improved Retrieval Quality
Benchmark results vary by dataset and query mix, but hybrid search (dense + sparse) is commonly reported to achieve:
20-40% better recall than dense alone
15-30% better precision than sparse alone
More robust results across different query types
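For reference, recall and precision are computed as follows; the result sets below are made up for illustration, not taken from a benchmark:

```python
def recall_precision(retrieved, relevant):
    """recall = fraction of relevant docs found;
    precision = fraction of retrieved docs that are relevant."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

relevant = {"d1", "d2", "d3", "d4"}
dense_only = ["d1", "d2", "d9", "d8"]  # finds 2 of the 4 relevant docs
hybrid = ["d1", "d2", "d3", "d8"]      # the sparse signal surfaces d3 as well

print(recall_precision(dense_only, relevant))  # (0.5, 0.5)
print(recall_precision(hybrid, relevant))      # (0.75, 0.75)
```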
Types of Sparse Vector Creation Methods
1. TF-IDF (Term Frequency-Inverse Document Frequency)
The classic statistical approach that weighs terms by their importance.
from sklearn.feature_extraction.text import TfidfVectorizer
# Simple TF-IDF example
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "artificial intelligence applications"
]
vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = vectorizer.fit_transform(documents)
# For new document
new_doc = "machine learning models"
sparse_vector = vectorizer.transform([new_doc]).toarray()[0]
print(f"Sparse vector shape: {sparse_vector.shape}")  # (vocab_size,), capped at 1000 dims
print(f"Non-zero elements: {(sparse_vector != 0).sum()}")  # Only a few non-zero
When to use TF-IDF:
General-purpose keyword matching
When you have a well-defined vocabulary
Documents with clear term boundaries
2. BM25 (Okapi Best Match 25)
A refinement of TF-IDF that normalises for document length and saturates term-frequency contributions.
from rank_bm25 import BM25Okapi
import numpy as np
# BM25 implementation
documents = [
    "machine learning algorithms for data science",
    "deep learning and neural network architectures",
    "natural language processing with transformers"
]
# Tokenize documents
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
# Create sparse vector for query
query = "machine learning data"
query_tokens = query.split()
# Get BM25 scores for all documents
scores = bm25.get_scores(query_tokens)
print(f"BM25 scores: {scores}")
# Convert to sparse vector representation
def create_bm25_sparse_vector(query_tokens, bm25_model, vocab):
    """Simplified sparse representation: place each query token's
    BM25 IDF weight at its position in a fixed vocabulary."""
    sparse_vector = np.zeros(len(vocab))
    for token in query_tokens:
        if token in vocab:
            sparse_vector[vocab[token]] = bm25_model.idf.get(token, 0.0)
    return sparse_vector

# Vocabulary mapping built from the tokenized corpus
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokenized_docs for t in doc}))}
query_sparse = create_bm25_sparse_vector(query_tokens, bm25, vocab)
When to use BM25:
Document retrieval systems
When document length varies significantly
Search engines and information retrieval
3. SPLADE (SParse Lexical AnD Expansion)
A neural approach that learns to create sparse vectors.
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import numpy as np

# SPLADE creates learned sparse vectors over the model's vocabulary
model_name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def create_splade_vector(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits  # (1, seq_len, vocab_size)
    # SPLADE aggregation: log-saturated ReLU, max-pooled over the
    # sequence, yielding one weight per vocabulary term
    sparse_vector = torch.max(
        torch.log1p(torch.relu(logits)), dim=1
    ).values.squeeze().cpu().numpy()
    # Keep only the top-k dimensions to enforce sparsity
    top_k = 100
    top_indices = np.argsort(sparse_vector)[-top_k:]
    final_sparse = np.zeros_like(sparse_vector)
    final_sparse[top_indices] = sparse_vector[top_indices]
    return final_sparse
# Usage
text = "machine learning model optimization"
splade_vector = create_splade_vector(text)
print(f"SPLADE vector sparsity: {(splade_vector == 0).sum() / len(splade_vector):.2%}")
When to use SPLADE:
When you need learned sparse representations
Complex domain-specific terminology
When you can afford the computational cost
4. ColBERT (Contextualized Late Interaction over BERT)
Stores one contextualized vector per token, enabling fine-grained, token-level matching between queries and documents.
# ColBERT conceptual approach (assumes a HuggingFace tokenizer/encoder pair)
def colbert_token_vectors(text, tokenizer, model):
    # One forward pass over the full text, so every token vector is
    # contextualized by its neighbours (encoding tokens one at a time
    # would lose exactly the context ColBERT relies on)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One dense vector per token: shape (seq_len, hidden_size)
    return outputs.last_hidden_state.squeeze(0)

# Note: unlike the methods above, these per-token vectors are dense;
# relevance is computed at query time via "late interaction" (MaxSim)
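ColBERT's late-interaction scoring (MaxSim) works like this: each query token vector takes the maximum similarity over all document token vectors, and those maxima are summed. A minimal NumPy sketch with toy vectors (invented for illustration):

```python
import numpy as np

def maxsim_score(query_vectors, doc_vectors):
    """ColBERT late interaction: for each query token, take its
    best-matching document token (max dot product), then sum."""
    sims = query_vectors @ doc_vectors.T  # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()

# Toy per-token vectors
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [0.5, 0.5]])

print(maxsim_score(query_vecs, doc_vecs))  # 1.5
```

Because each query token matches its best document token independently, MaxSim rewards documents that cover every part of the query, not just its overall gist.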
Practical Implementation with Qdrant
Here's how to implement multi-vector collections in your setup:
1. Collection Setup
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
client = QdrantClient("localhost", port=6333)
# Create a collection with multiple named vectors
# (recent Qdrant versions also support native sparse storage via
# sparse_vectors_config; dense arrays are used here for simplicity)
client.create_collection(
    collection_name="hybrid_search",
    vectors_config={
        "semantic": VectorParams(size=1536, distance=Distance.COSINE),  # OpenAI
        "keywords": VectorParams(size=5000, distance=Distance.DOT),     # TF-IDF
        "bm25": VectorParams(size=3000, distance=Distance.DOT),         # BM25
    }
)
2. Document Processing Pipeline
import time
import openai
from sklearn.feature_extraction.text import TfidfVectorizer

class MultiVectorProcessor:
    def __init__(self):
        # NB: the vectorizer must be fitted on your corpus
        # (self.tfidf.fit(corpus)) before create_sparse_vector() is used
        self.tfidf = TfidfVectorizer(max_features=5000)
        self.openai_client = openai.Client()

    def create_dense_vector(self, text):
        """Create semantic dense vector using OpenAI"""
        response = self.openai_client.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )
        return response.data[0].embedding

    def create_sparse_vector(self, text, method="tfidf"):
        """Create sparse vector using specified method"""
        if method == "tfidf":
            return self.tfidf.transform([text]).toarray()[0].tolist()
        # Add other methods (e.g. BM25, SPLADE) as needed

    def process_document(self, text, doc_id):
        """Process single document into multi-vector format"""
        return {
            "id": doc_id,
            "vectors": {
                "semantic": self.create_dense_vector(text),
                "keywords": self.create_sparse_vector(text, "tfidf"),
            },
            "payload": {
                "text": text,
                "processed_at": time.time()
            }
        }
3. Hybrid Search Implementation
def hybrid_search(query, collection_name, weights=None):
    """
    Perform hybrid search combining multiple vector types
    """
    if weights is None:
        weights = {"semantic": 0.7, "keywords": 0.3}
    processor = MultiVectorProcessor()
    # Create query vectors
    query_semantic = processor.create_dense_vector(query)
    query_keywords = processor.create_sparse_vector(query)
    # Search with semantic vector
    semantic_results = client.search(
        collection_name=collection_name,
        query_vector=("semantic", query_semantic),
        limit=20
    )
    # Search with keyword vector
    keyword_results = client.search(
        collection_name=collection_name,
        query_vector=("keywords", query_keywords),
        limit=20
    )
    # Combine results with weighted scoring
    combined_results = combine_and_rerank(
        semantic_results, keyword_results, weights
    )
    return combined_results[:10]  # Top 10 results
def combine_and_rerank(semantic_results, keyword_results, weights):
    """Combine and rerank results from different vector searches"""
    result_scores = {}
    # Score semantic results
    for result in semantic_results:
        doc_id = result.id
        result_scores[doc_id] = result_scores.get(doc_id, 0) + \
            (result.score * weights["semantic"])
    # Score keyword results
    for result in keyword_results:
        doc_id = result.id
        result_scores[doc_id] = result_scores.get(doc_id, 0) + \
            (result.score * weights["keywords"])
    # Sort by combined score
    sorted_results = sorted(result_scores.items(),
                            key=lambda x: x[1], reverse=True)
    return sorted_results
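An alternative to weighted score fusion is Reciprocal Rank Fusion (RRF), which combines rankings by position only, so the dense and sparse scores never need to be on comparable scales. A minimal sketch (k=60 is the conventional default constant):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank)
    per document; ranks only, so no score normalisation is needed."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

semantic_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_b", "doc_d", "doc_a"]
print(rrf_fuse([semantic_ranking, keyword_ranking]))
```

Here doc_b wins because it ranks well in both lists, even though neither search put it first; this robustness to score-scale mismatch is why RRF is a popular default for hybrid search.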
When to Use a Multi-Vector Approach
Ideal Use Cases:
E-commerce Search: Product catalogues need both exact matches and semantic understanding
Legal Document Retrieval: Exact legal terms + conceptual case law matching
Academic Paper Search: Technical keywords + research concept similarity
Customer Support: FAQ systems benefit from keyword precision + intent understanding
Enterprise Search: Internal documents with domain-specific terminology
