Why Dense and Sparse Vectors Work Better Together: A Beginner's Guide to Multi-Vector Collections
Imagine you're looking for a book in a massive library. You might search by the exact title (keyword search) or describe what the book is about (semantic search). Sometimes you need both approaches to find exactly what you're looking for. This is precisely why we use multi-vector collections in vector databases!
The Problem with the Single Vector Approach
Let's start with a real-world example to understand the limitation:
Document: "The neural network model exhibits overfitting behaviour on the training dataset"
User Query 1: "overfitting neural network"
User Query 2: "AI model performing poorly on training data"
With a single-vector approach:
Dense only: Query 1 may rank the document lower than expected, because exact terms like "overfitting" get blurred into the overall meaning
Sparse only: Query 2 misses the document entirely, because it shares almost no exact terms with it ("AI" and "performing poorly" never appear)
This is where the magic of combining different vector types comes in!
What Are Multi-Vector Collections?
Multi-vector collections allow you to store multiple different representations of the same content within a single collection. Think of it as having different "lenses" through which you can view and search your data:
Dense Vectors: Understand meaning and context
Sparse Vectors: Focus on exact keywords and terms
Hybrid Approach: Combines both for superior search results
# Example of multi-vector structure
document_vectors = {
    "semantic": [0.123, -0.456, 0.789, ...],    # Dense vector (1536 dims)
    "keywords": [0, 0, 0.67, 0, 0.45, 0, ...],  # Sparse vector (5000 dims)
    "metadata": {"title": "ML Paper", "author": "John Doe"}
}
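In practice, a mostly-zero array like the "keywords" vector above is rarely stored as a full array; it is compressed into parallel lists of indices and values. A minimal pure-Python sketch of that conversion (illustrative only):

```python
def to_sparse_pairs(dense_array):
    """Convert a mostly-zero array into (indices, values) pairs,
    the compressed form most vector databases use for sparse vectors."""
    indices = [i for i, v in enumerate(dense_array) if v != 0]
    values = [dense_array[i] for i in indices]
    return indices, values

keywords = [0, 0, 0.67, 0, 0.45, 0]
indices, values = to_sparse_pairs(keywords)
print(indices, values)  # [2, 4] [0.67, 0.45]
```

Only the non-zero positions are kept, which is why sparse vectors stay cheap even with vocabularies of tens of thousands of terms.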
Why Do We Need a Multi-Vector Approach?
1. Complementary Strengths
Dense Vectors Excel At:
Understanding synonyms ("car" ≈ "automobile")
Capturing context and meaning
Finding conceptually similar content
Handling paraphrases and different ways of expressing ideas
Sparse Vectors Excel At:
Exact keyword matching
Finding specific terms or phrases
Technical terminology searches
Proper nouns and unique identifiers
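A tiny illustration of the trade-off, using hand-picked 2-D "embeddings" as a stand-in for a real dense model (the numbers are invented for this example):

```python
import math

# Hypothetical 2-D embeddings standing in for a real dense model
dense = {"car": [0.9, 0.4], "automobile": [0.88, 0.45], "banana": [0.1, 0.95]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Sparse view: exact-term overlap between query and document
doc_terms = {"car", "for", "sale"}
query = "automobile"

print(query in doc_terms)                         # False: exact match fails
print(cosine(dense["car"], dense["automobile"]))  # near 1.0: dense match succeeds
print(cosine(dense["car"], dense["banana"]))      # much lower
```

The sparse view cannot connect "automobile" to "car", while the dense view places them almost on top of each other; combining the two covers both failure modes.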
2. Real-World Search Scenarios
Consider an e-commerce product search:
# Product: "Apple MacBook Pro 16-inch M2 laptop computer"
# User searches: "16 inch Apple laptop"
# Dense vector: Understands "laptop" ≈ "computer" ≈ "MacBook"
# Sparse vector: Matches exact terms "16", "inch", "Apple", "laptop"
# Combined: Perfect match!
# User searches: "portable workstation for developers"
# Dense vector: Connects "portable workstation" with "laptop computer"
# Sparse vector: Might miss due to different terminology
# Combined: Dense carries the weight, sparse provides precision
3. Improved Retrieval Quality
Benchmark results vary by dataset and query mix, but hybrid search (dense + sparse) is commonly reported to achieve:
20-40% better recall than dense alone
15-30% better precision than sparse alone
More robust results across different query types
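For reference, recall and precision are computed as follows; the result sets below are made up for illustration, not taken from a benchmark:

```python
def recall_precision(retrieved, relevant):
    """recall = fraction of relevant docs found;
    precision = fraction of retrieved docs that are relevant."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

relevant = {"d1", "d2", "d3", "d4"}
dense_only = ["d1", "d2", "d9", "d8"]  # finds 2 of the 4 relevant docs
hybrid = ["d1", "d2", "d3", "d8"]      # the sparse signal surfaces d3 as well

print(recall_precision(dense_only, relevant))  # (0.5, 0.5)
print(recall_precision(hybrid, relevant))      # (0.75, 0.75)
```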
Types of Sparse Vector Creation Methods
1. TF-IDF (Term Frequency-Inverse Document Frequency)
The classic statistical approach that weighs terms by their importance.
from sklearn.feature_extraction.text import TfidfVectorizer
# Simple TF-IDF example
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "artificial intelligence applications"
]
vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = vectorizer.fit_transform(documents)
# For new document
new_doc = "machine learning models"
sparse_vector = vectorizer.transform([new_doc]).toarray()[0]
print(f"Sparse vector shape: {sparse_vector.shape}")  # (vocab_size,), capped at 1000 dims
print(f"Non-zero elements: {(sparse_vector != 0).sum()}")  # Only a few non-zero
When to use TF-IDF:
General-purpose keyword matching
When you have a well-defined vocabulary
Documents with clear term boundaries
2. BM25 (Okapi Best Match 25)
A refinement of TF-IDF that normalises for document length and saturates term-frequency contributions.
from rank_bm25 import BM25Okapi
import numpy as np
# BM25 implementation
documents = [
    "machine learning algorithms for data science",
    "deep learning and neural network architectures",
    "natural language processing with transformers"
]
# Tokenize documents
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
# Create sparse vector for query
query = "machine learning data"
query_tokens = query.split()
# Get BM25 scores for all documents
scores = bm25.get_scores(query_tokens)
print(f"BM25 scores: {scores}")
# Convert to sparse vector representation
def create_bm25_sparse_vector(query_tokens, bm25_model, vocab):
    """Simplified sparse representation: place each query token's
    BM25 IDF weight at its position in a fixed vocabulary."""
    sparse_vector = np.zeros(len(vocab))
    for token in query_tokens:
        if token in vocab:
            sparse_vector[vocab[token]] = bm25_model.idf.get(token, 0.0)
    return sparse_vector

# Vocabulary mapping built from the tokenized corpus
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokenized_docs for t in doc}))}
query_sparse = create_bm25_sparse_vector(query_tokens, bm25, vocab)
When to use BM25:
Document retrieval systems
When document length varies significantly
Search engines and information retrieval
3. SPLADE (SParse Lexical AnD Expansion)
A neural approach that learns to create sparse vectors.
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import numpy as np

# SPLADE creates learned sparse vectors over the model's vocabulary
model_name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def create_splade_vector(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits  # (1, seq_len, vocab_size)
    # SPLADE aggregation: log-saturated ReLU, max-pooled over the
    # sequence, yielding one weight per vocabulary term
    sparse_vector = torch.max(
        torch.log1p(torch.relu(logits)), dim=1
    ).values.squeeze().cpu().numpy()
    # Keep only the top-k dimensions to enforce sparsity
    top_k = 100
    top_indices = np.argsort(sparse_vector)[-top_k:]
    final_sparse = np.zeros_like(sparse_vector)
    final_sparse[top_indices] = sparse_vector[top_indices]
    return final_sparse
# Usage
text = "machine learning model optimization"
splade_vector = create_splade_vector(text)
print(f"SPLADE vector sparsity: {(splade_vector == 0).sum() / len(splade_vector):.2%}")
When to use SPLADE:
When you need learned sparse representations
Complex domain-specific terminology
When you can afford the computational cost
4. ColBERT (Contextualized Late Interaction over BERT)
Stores one contextualized vector per token, enabling fine-grained, token-level matching between queries and documents.
# ColBERT conceptual approach (assumes a HuggingFace tokenizer/encoder pair)
def colbert_token_vectors(text, tokenizer, model):
    # One forward pass over the full text, so every token vector is
    # contextualized by its neighbours (encoding tokens one at a time
    # would lose exactly the context ColBERT relies on)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One dense vector per token: shape (seq_len, hidden_size)
    return outputs.last_hidden_state.squeeze(0)

# Note: unlike the methods above, these per-token vectors are dense;
# relevance is computed at query time via "late interaction" (MaxSim)
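ColBERT's late-interaction scoring (MaxSim) works like this: each query token vector takes the maximum similarity over all document token vectors, and those maxima are summed. A minimal NumPy sketch with toy vectors (invented for illustration):

```python
import numpy as np

def maxsim_score(query_vectors, doc_vectors):
    """ColBERT late interaction: for each query token, take its
    best-matching document token (max dot product), then sum."""
    sims = query_vectors @ doc_vectors.T  # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()

# Toy per-token vectors
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [0.5, 0.5]])

print(maxsim_score(query_vecs, doc_vecs))  # 1.5
```

Because each query token matches its best document token independently, MaxSim rewards documents that cover every part of the query, not just its overall gist.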
Practical Implementation with Qdrant
Here's how to implement multi-vector collections in your setup:
1. Collection Setup
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
client = QdrantClient("localhost", port=6333)
# Create a collection with multiple named vectors
# (recent Qdrant versions also support native sparse storage via
# sparse_vectors_config; dense arrays are used here for simplicity)
client.create_collection(
    collection_name="hybrid_search",
    vectors_config={
        "semantic": VectorParams(size=1536, distance=Distance.COSINE),  # OpenAI
        "keywords": VectorParams(size=5000, distance=Distance.DOT),     # TF-IDF
        "bm25": VectorParams(size=3000, distance=Distance.DOT),         # BM25
    }
)
2. Document Processing Pipeline
import time
import openai
from sklearn.feature_extraction.text import TfidfVectorizer

class MultiVectorProcessor:
    def __init__(self):
        # NB: the vectorizer must be fitted on your corpus
        # (self.tfidf.fit(corpus)) before create_sparse_vector() is used
        self.tfidf = TfidfVectorizer(max_features=5000)
        self.openai_client = openai.Client()

    def create_dense_vector(self, text):
        """Create semantic dense vector using OpenAI"""
        response = self.openai_client.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )
        return response.data[0].embedding

    def create_sparse_vector(self, text, method="tfidf"):
        """Create sparse vector using specified method"""
        if method == "tfidf":
            return self.tfidf.transform([text]).toarray()[0].tolist()
        # Add other methods (e.g. BM25, SPLADE) as needed

    def process_document(self, text, doc_id):
        """Process single document into multi-vector format"""
        return {
            "id": doc_id,
            "vectors": {
                "semantic": self.create_dense_vector(text),
                "keywords": self.create_sparse_vector(text, "tfidf"),
            },
            "payload": {
                "text": text,
                "processed_at": time.time()
            }
        }
3. Hybrid Search Implementation
def hybrid_search(query, collection_name, weights=None):
    """
    Perform hybrid search combining multiple vector types
    """
    if weights is None:
        weights = {"semantic": 0.7, "keywords": 0.3}
    processor = MultiVectorProcessor()
    # Create query vectors
    query_semantic = processor.create_dense_vector(query)
    query_keywords = processor.create_sparse_vector(query)
    # Search with semantic vector
    semantic_results = client.search(
        collection_name=collection_name,
        query_vector=("semantic", query_semantic),
        limit=20
    )
    # Search with keyword vector
    keyword_results = client.search(
        collection_name=collection_name,
        query_vector=("keywords", query_keywords),
        limit=20
    )
    # Combine results with weighted scoring
    combined_results = combine_and_rerank(
        semantic_results, keyword_results, weights
    )
    return combined_results[:10]  # Top 10 results
def combine_and_rerank(semantic_results, keyword_results, weights):
    """Combine and rerank results from different vector searches"""
    result_scores = {}
    # Score semantic results
    for result in semantic_results:
        doc_id = result.id
        result_scores[doc_id] = result_scores.get(doc_id, 0) + \
            (result.score * weights["semantic"])
    # Score keyword results
    for result in keyword_results:
        doc_id = result.id
        result_scores[doc_id] = result_scores.get(doc_id, 0) + \
            (result.score * weights["keywords"])
    # Sort by combined score
    sorted_results = sorted(result_scores.items(),
                            key=lambda x: x[1], reverse=True)
    return sorted_results
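An alternative to weighted score fusion is Reciprocal Rank Fusion (RRF), which combines rankings by position only, so the dense and sparse scores never need to be on comparable scales. A minimal sketch (k=60 is the conventional default constant):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank)
    per document; ranks only, so no score normalisation is needed."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

semantic_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_b", "doc_d", "doc_a"]
print(rrf_fuse([semantic_ranking, keyword_ranking]))
```

Here doc_b wins because it ranks well in both lists, even though neither search put it first; this robustness to score-scale mismatch is why RRF is a popular default for hybrid search.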
When to Use a Multi-Vector Approach
Ideal Use Cases:
E-commerce Search: Product catalogues need both exact matches and semantic understanding
Legal Document Retrieval: Exact legal terms + conceptual case law matching
Academic Paper Search: Technical keywords + research concept similarity
Customer Support: FAQ systems benefit from keyword precision + intent understanding
Enterprise Search: Internal documents with domain-specific terminology
