<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Untitled Publication]]></title><description><![CDATA[Untitled Publication]]></description><link>https://blogs.ummerfarooq.dev</link><generator>RSS for Node</generator><lastBuildDate>Sun, 26 Apr 2026 11:15:12 GMT</lastBuildDate><atom:link href="https://blogs.ummerfarooq.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Qdrant 101]]></title><description><![CDATA[Introduction
Qdrant is an open-source vector database designed for storing, searching, and managing high-dimensional vectors. It's particularly useful for AI applications like semantic search, recommendation systems, and RAG (Retrieval-Augmented Gene...]]></description><link>https://blogs.ummerfarooq.dev/qdrant-101</link><guid isPermaLink="true">https://blogs.ummerfarooq.dev/qdrant-101</guid><category><![CDATA[qdrant]]></category><category><![CDATA[vector database]]></category><dc:creator><![CDATA[Ummer Farooq]]></dc:creator><pubDate>Thu, 21 Aug 2025 06:41:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755758453536/5b35ce87-bd10-465d-bce5-74c7e42388c6.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Qdrant is an open-source vector database designed for storing, searching, and managing high-dimensional vectors. It's particularly useful for AI applications like semantic search, recommendation systems, and RAG (Retrieval-Augmented Generation) implementations.</p>
<h2 id="heading-why-qdrant">Why Qdrant?</h2>
<ul>
<li><p><strong>High Performance</strong>: Written in Rust for optimal speed</p>
</li>
<li><p><strong>Scalability</strong>: Handles millions of vectors efficiently</p>
</li>
<li><p><strong>Rich Filtering</strong>: Complex metadata filtering capabilities</p>
</li>
<li><p><strong>Multiple Metrics</strong>: Support for various similarity metrics</p>
</li>
<li><p><strong>Easy Integration</strong>: REST API and multiple language clients</p>
</li>
</ul>
<h2 id="heading-key-terminologies">Key Terminologies</h2>
<h3 id="heading-core-concepts">Core Concepts</h3>
<p><strong>Vector</strong>: A high-dimensional numerical representation of data (text, images, etc.)</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example vector (384 dimensions)</span>
vector = [<span class="hljs-number">0.1</span>, <span class="hljs-number">-0.2</span>, <span class="hljs-number">0.3</span>, ..., <span class="hljs-number">0.5</span>]
</code></pre>
<p><strong>Embedding</strong>: The process of converting raw data into vector representations</p>
<pre><code class="lang-python"><span class="hljs-comment"># Text to vector embedding</span>
text = <span class="hljs-string">"Artificial intelligence is transforming healthcare"</span>
<span class="hljs-comment"># After embedding: [0.023, -0.156, 0.891, ...]</span>
</code></pre>
<p><strong>Point</strong>: A vector combined with an optional payload (metadata) and unique ID</p>
<pre><code class="lang-python">point = {
    <span class="hljs-string">"id"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"vector"</span>: [<span class="hljs-number">0.1</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.3</span>, ...],  <span class="hljs-comment"># 1536-dim vector</span>
    <span class="hljs-string">"payload"</span>: {
        <span class="hljs-string">"title"</span>: <span class="hljs-string">"AI in Medical Diagnosis"</span>, 
        <span class="hljs-string">"category"</span>: <span class="hljs-string">"healthcare"</span>,
        <span class="hljs-string">"confidence"</span>: <span class="hljs-number">0.95</span>
    }
}
</code></pre>
<p><strong>Collection</strong>: A named set of vectors with the same dimensionality and distance metric</p>
<pre><code class="lang-python"><span class="hljs-comment"># Collections are like tables in traditional databases</span>
collection_name = <span class="hljs-string">"documents"</span>
</code></pre>
<p><strong>Payload</strong>: Metadata associated with a vector for filtering and additional information</p>
<pre><code class="lang-python">payload = {
    <span class="hljs-string">"title"</span>: <span class="hljs-string">"AI in Healthcare"</span>,
    <span class="hljs-string">"author"</span>: <span class="hljs-string">"John Doe"</span>,
    <span class="hljs-string">"published_date"</span>: <span class="hljs-string">"2024-01-15"</span>,
    <span class="hljs-string">"tags"</span>: [<span class="hljs-string">"AI"</span>, <span class="hljs-string">"healthcare"</span>, <span class="hljs-string">"machine learning"</span>]
}
</code></pre>
<p><strong>Distance Metric</strong>: Method to measure similarity between vectors</p>
<ul>
<li><p><strong>Cosine Distance</strong>: Measures angle between vectors</p>
<ul>
<li><p>Range: 0 (identical) to 2 (opposite)</p>
</li>
<li><p>Best for: Text embeddings, when magnitude doesn't matter</p>
</li>
<li><p>Use case: Document similarity, semantic search</p>
</li>
</ul>
</li>
<li><p><strong>Euclidean Distance</strong>: Measures straight-line distance in space</p>
<ul>
<li><p>Range: 0 (identical) to ∞</p>
</li>
<li><p>Best for: When both direction and magnitude matter</p>
</li>
<li><p>Use case: Image features, spatial data</p>
</li>
</ul>
</li>
<li><p><strong>Dot Product</strong>: Measures alignment and magnitude</p>
<ul>
<li><p>Range: -∞ to +∞ (higher = more similar)</p>
</li>
<li><p>Best for: When you want to consider vector magnitude</p>
</li>
<li><p>Use case: Recommendation systems with user preferences</p>
</li>
</ul>
</li>
</ul>
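<p>To make these trade-offs concrete, here is a small standalone sketch (plain <code>numpy</code>, no Qdrant involved; the vectors are made up) comparing the three metrics on two vectors that point in the same direction but differ in magnitude:</p>
<pre><code class="lang-python">import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

# Cosine distance: 1 - cos(angle); magnitude is ignored
cosine_dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance; magnitude matters
euclidean_dist = np.linalg.norm(a - b)

# Dot product: rewards both alignment and magnitude
dot = np.dot(a, b)

print(cosine_dist)     # ~0, "identical" by direction
print(euclidean_dist)  # ~3.742, clearly separated in space
print(dot)             # 28.0
</code></pre>
<p>Cosine sees the two vectors as the same document said twice as loudly; Euclidean distance and dot product both register the difference. The right metric therefore depends on whether magnitude carries meaning in your embeddings.</p>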
<hr />
<h2 id="heading-advanced-concepts">Advanced Concepts</h2>
<p><strong>Quantization</strong>: Technique to reduce memory usage by compressing vectors</p>
<ul>
<li><p><strong>Why</strong>: Reduces memory footprint by 2-8x, enables larger datasets</p>
</li>
<li><p><strong>Trade-off</strong>: Slight accuracy loss for significant memory savings</p>
</li>
<li><p><strong>Types</strong>: Scalar (INT8), Product (PQ), Binary</p>
</li>
</ul>
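<p>The memory claims above are easy to sanity-check with back-of-envelope arithmetic (the corpus size here is illustrative, and index/payload overhead is ignored):</p>
<pre><code class="lang-python"># Raw vector storage only
num_vectors = 1_000_000
dims = 1536  # OpenAI ada-002

float32_bytes = num_vectors * dims * 4  # 4 bytes per float32 component
int8_bytes = num_vectors * dims * 1     # 1 byte per component after INT8 scalar quantization

print(f"float32: {float32_bytes / 1024**3:.2f} GiB")  # 5.72 GiB
print(f"int8:    {int8_bytes / 1024**3:.2f} GiB")     # 1.43 GiB
</code></pre>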
<p><strong>Indexing</strong>: Data structure optimization for faster search</p>
<ul>
<li><p><strong>HNSW</strong>: Hierarchical Navigable Small World graphs</p>
</li>
<li><p><strong>Purpose</strong>: Trade memory for search speed</p>
</li>
<li><p><strong>Configuration</strong>: Affects build time, memory usage, and search accuracy</p>
</li>
</ul>
<p><strong>Sharding</strong>: Distributing data across multiple nodes</p>
<ul>
<li><p><strong>Horizontal scaling</strong>: Split collection across multiple machines</p>
</li>
<li><p><strong>Load balancing</strong>: Distribute query load</p>
</li>
<li><p><strong>Fault tolerance</strong>: Replicas ensure availability</p>
</li>
</ul>
<p><strong>Replication</strong>: Creating copies for fault tolerance</p>
<ul>
<li><p><strong>Data safety</strong>: Multiple copies prevent data loss</p>
</li>
<li><p><strong>Read scaling</strong>: Distribute read queries across replicas</p>
</li>
<li><p><strong>Consistency</strong>: Strong or eventual consistency models</p>
</li>
</ul>
<hr />
<h2 id="heading-installation-amp-setup">Installation &amp; Setup</h2>
<h3 id="heading-docker-installation-recommended">Docker Installation (Recommended)</h3>
<p><strong>Basic Setup</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Basic setup</span>
docker run -p 6333:6333 qdrant/qdrant

<span class="hljs-comment"># With persistent storage</span>
docker run -p 6333:6333 -v $(<span class="hljs-built_in">pwd</span>)/storage:/qdrant/storage qdrant/qdrant
</code></pre>
<p><strong>Docker Compose for Production</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># docker-compose.yml</span>
version: <span class="hljs-string">'3.7'</span>
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - <span class="hljs-string">"6333:6333"</span>
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__STORAGE__STORAGE_PATH=/qdrant/storage
    restart: unless-stopped

volumes:
  qdrant_storage:
</code></pre>
<h3 id="heading-python-client-installation">Python Client Installation</h3>
<pre><code class="lang-bash">pip install qdrant-client openai python-dotenv
</code></pre>
<h3 id="heading-basic-connection-setup">Basic Connection Setup</h3>
<pre><code class="lang-python">import os
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Setup (openai&gt;=1.0 client interface)
# Set OPENAI_API_KEY in your environment, or load it from a .env file with python-dotenv
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Connect to Qdrant
client = QdrantClient("localhost", port=6333)

# Generate an OpenAI embedding
def get_embedding(text):
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

# Test
test_embedding = get_embedding("Hello world")
print(f"Embedding dimension: {len(test_embedding)}")  # Should be 1536
</code></pre>
<hr />
<h2 id="heading-core-concepts-1">Core Concepts</h2>
<h3 id="heading-vector-similarity">Vector Similarity</h3>
<p>Vector similarity is the foundation of semantic search. Let's explore how it works:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">cosine_similarity</span>(<span class="hljs-params">a, b</span>):</span>
    <span class="hljs-keyword">return</span> np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

<span class="hljs-comment"># Similar medical texts</span>
text1 = <span class="hljs-string">"AI is revolutionizing medical diagnosis"</span>
text2 = <span class="hljs-string">"Artificial intelligence transforms healthcare"</span>
text3 = <span class="hljs-string">"The weather is sunny today"</span>

vec1 = get_embedding(text1)
vec2 = get_embedding(text2)
vec3 = get_embedding(text3)

print(<span class="hljs-string">f"Medical similarity: <span class="hljs-subst">{cosine_similarity(vec1, vec2):<span class="hljs-number">.3</span>f}</span>"</span>)  <span class="hljs-comment"># High</span>
print(<span class="hljs-string">f"Different topic: <span class="hljs-subst">{cosine_similarity(vec1, vec3):<span class="hljs-number">.3</span>f}</span>"</span>)    <span class="hljs-comment"># Low</span>
</code></pre>
<hr />
<h2 id="heading-basic-operations">Basic Operations</h2>
<h3 id="heading-1-create-collection">1. Create Collection</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Create collection for OpenAI embeddings</span>
client.create_collection(
    collection_name=<span class="hljs-string">"documents"</span>,
    vectors_config=VectorParams(
        size=<span class="hljs-number">1536</span>,  <span class="hljs-comment"># OpenAI ada-002 dimension</span>
        distance=Distance.COSINE
    )
)
</code></pre>
<h3 id="heading-2-insert-documents">2. Insert Documents</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> qdrant_client.models <span class="hljs-keyword">import</span> PointStruct
<span class="hljs-keyword">import</span> uuid

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">insert_document</span>(<span class="hljs-params">title, content, category</span>):</span>
    <span class="hljs-comment"># Generate embedding</span>
    text = <span class="hljs-string">f"<span class="hljs-subst">{title}</span>. <span class="hljs-subst">{content}</span>"</span>
    vector = get_embedding(text)

    <span class="hljs-comment"># Create point</span>
    point = PointStruct(
        id=str(uuid.uuid4()),
        vector=vector,
        payload={
            <span class="hljs-string">"title"</span>: title,
            <span class="hljs-string">"content"</span>: content,
            <span class="hljs-string">"category"</span>: category,
            <span class="hljs-string">"word_count"</span>: len(content.split())
        }
    )

    <span class="hljs-comment"># Insert</span>
    client.upsert(
        collection_name=<span class="hljs-string">"documents"</span>,
        points=[point]
    )
    <span class="hljs-keyword">return</span> point.id

<span class="hljs-comment"># Example usage</span>
doc_id = insert_document(
    <span class="hljs-string">"AI in Healthcare"</span>, 
    <span class="hljs-string">"Machine learning is transforming medical diagnosis..."</span>,
    <span class="hljs-string">"technology"</span>
)
</code></pre>
<h3 id="heading-3-advanced-search-with-openai-embeddings">3. Advanced Search with OpenAI Embeddings</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">search_documents</span>(<span class="hljs-params">query, limit=<span class="hljs-number">5</span></span>):</span>
    <span class="hljs-comment"># Generate query embedding</span>
    query_vector = get_embedding(query)

    <span class="hljs-comment"># Search</span>
    results = client.search(
        collection_name=<span class="hljs-string">"documents"</span>,
        query_vector=query_vector,
        limit=limit,
        with_payload=<span class="hljs-literal">True</span>
    )

    <span class="hljs-comment"># Format results</span>
    <span class="hljs-keyword">return</span> [{
        <span class="hljs-string">"title"</span>: r.payload[<span class="hljs-string">"title"</span>],
        <span class="hljs-string">"score"</span>: r.score,
        <span class="hljs-string">"category"</span>: r.payload[<span class="hljs-string">"category"</span>]
    } <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> results]

<span class="hljs-comment"># Search example</span>
results = search_documents(<span class="hljs-string">"artificial intelligence medicine"</span>)
<span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> results:
    print(<span class="hljs-string">f"<span class="hljs-subst">{result[<span class="hljs-string">'title'</span>]}</span> - Score: <span class="hljs-subst">{result[<span class="hljs-string">'score'</span>]:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-4-filter-search">4. Filter Search</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> qdrant_client.models <span class="hljs-keyword">import</span> Filter, FieldCondition, MatchValue

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">search_with_filter</span>(<span class="hljs-params">query, category, limit=<span class="hljs-number">5</span></span>):</span>
    query_vector = get_embedding(query)

    category_filter = Filter(
        must=[
            FieldCondition(
                key=<span class="hljs-string">"category"</span>,
                match=MatchValue(value=category)
            )
        ]
    )

    results = client.search(
        collection_name=<span class="hljs-string">"documents"</span>,
        query_vector=query_vector,
        query_filter=category_filter,
        limit=limit
    )

    <span class="hljs-keyword">return</span> results

<span class="hljs-comment"># Search only in technology category</span>
tech_results = search_with_filter(<span class="hljs-string">"machine learning"</span>, <span class="hljs-string">"technology"</span>)
</code></pre>
<hr />
<h2 id="heading-performance-optimization">Performance Optimization</h2>
<h3 id="heading-quantization">Quantization</h3>
<p><strong>Why Quantization?</strong> Quantization reduces memory usage by representing vectors with lower precision. This is crucial when dealing with millions of vectors.</p>
<p><strong>Memory Savings:</strong></p>
<ul>
<li><p><strong>Scalar (INT8)</strong>: General purpose, 75% memory reduction, 1-3% accuracy loss</p>
</li>
<li><p><strong>Product</strong>: Maximum compression, 87-95% reduction, 5-15% accuracy loss</p>
</li>
<li><p><strong>Binary</strong>: Extreme compression, 96% reduction, 20-40% accuracy loss</p>
</li>
</ul>
<pre><code class="lang-python">from qdrant_client import models

# Apply INT8 scalar quantization (recommended for most cases)
client.update_collection(
    collection_name="documents",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)
</code></pre>
<h3 id="heading-indexing-optimization">Indexing Optimization</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Optimize for search speed</span>
client.update_collection(
    collection_name=<span class="hljs-string">"documents"</span>,
    hnsw_config=models.HnswConfigDiff(
        m=<span class="hljs-number">32</span>,              <span class="hljs-comment"># Higher = better quality, more memory</span>
        ef_construct=<span class="hljs-number">200</span>,  <span class="hljs-comment"># Higher = better quality, slower build</span>
        full_scan_threshold=<span class="hljs-number">10000</span>
    )
)
</code></pre>
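<p>Build-time parameters like <code>ef_construct</code> are fixed per collection, but recall can also be traded for latency per query via <code>hnsw_ef</code> in <code>SearchParams</code>. The sketch below uses an in-memory client and a made-up point so it runs standalone; against a real server you would reuse your existing <code>client</code>:</p>
<pre><code class="lang-python">from qdrant_client import QdrantClient, models
from qdrant_client.models import Distance, VectorParams, PointStruct

# In-memory client so the snippet is self-contained
demo = QdrantClient(":memory:")
demo.create_collection(
    collection_name="demo",
    vectors_config=VectorParams(size=3, distance=Distance.COSINE)
)
demo.upsert(
    collection_name="demo",
    points=[PointStruct(id=1, vector=[0.1, 0.2, 0.3], payload={"title": "doc"})]
)

# hnsw_ef is the search-time beam width: higher = better recall, slower queries
hits = demo.search(
    collection_name="demo",
    query_vector=[0.1, 0.2, 0.3],
    search_params=models.SearchParams(hnsw_ef=128),
    limit=1
)
print(hits[0].id)  # 1
</code></pre>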
<hr />
<h2 id="heading-advanced-operations">Advanced Operations</h2>
<h3 id="heading-complex-filtering">Complex Filtering</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> qdrant_client.models <span class="hljs-keyword">import</span> Range, MatchAny

<span class="hljs-comment"># Complex filter: technology OR healthcare, with substantial content</span>
complex_filter = Filter(
    must=[
        FieldCondition(key=<span class="hljs-string">"category"</span>, match=MatchAny(any=[<span class="hljs-string">"technology"</span>, <span class="hljs-string">"healthcare"</span>])),
        FieldCondition(key=<span class="hljs-string">"word_count"</span>, range=Range(gte=<span class="hljs-number">100</span>))
    ],
    must_not=[
        FieldCondition(key=<span class="hljs-string">"status"</span>, match=MatchValue(value=<span class="hljs-string">"draft"</span>))
    ]
)

results = client.search(
    collection_name=<span class="hljs-string">"documents"</span>,
    query_vector=query_vector,
    query_filter=complex_filter,
    limit=<span class="hljs-number">10</span>
)
</code></pre>
<h3 id="heading-batch-operations">Batch Operations</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">batch_insert_documents</span>(<span class="hljs-params">documents, batch_size=<span class="hljs-number">100</span></span>):</span>
    points = []

    <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> documents:
        text = <span class="hljs-string">f"<span class="hljs-subst">{doc[<span class="hljs-string">'title'</span>]}</span>. <span class="hljs-subst">{doc[<span class="hljs-string">'content'</span>]}</span>"</span>
        vector = get_embedding(text)

        points.append(PointStruct(
            id=str(uuid.uuid4()),
            vector=vector,
            payload=doc
        ))

    <span class="hljs-comment"># Insert in batches</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, len(points), batch_size):
        batch = points[i:i + batch_size]
        client.upsert(collection_name=<span class="hljs-string">"documents"</span>, points=batch)
        print(<span class="hljs-string">f"Inserted batch <span class="hljs-subst">{i//batch_size + <span class="hljs-number">1</span>}</span>"</span>)

<span class="hljs-comment"># Usage</span>
sample_docs = [
    {<span class="hljs-string">"title"</span>: <span class="hljs-string">"Doc 1"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Content 1..."</span>, <span class="hljs-string">"category"</span>: <span class="hljs-string">"tech"</span>},
    {<span class="hljs-string">"title"</span>: <span class="hljs-string">"Doc 2"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"Content 2..."</span>, <span class="hljs-string">"category"</span>: <span class="hljs-string">"health"</span>}
]
batch_insert_documents(sample_docs)
</code></pre>
<hr />
<h2 id="heading-best-practices">Best Practices</h2>
<h3 id="heading-1-embedding-generation">1. Embedding Generation</h3>
<pre><code class="lang-python">from openai import OpenAI

class EmbeddingManager:
    def __init__(self):
        self.model = "text-embedding-ada-002"
        self.client = OpenAI()  # openai&gt;=1.0 client; reads OPENAI_API_KEY from the environment

    def preprocess_text(self, text):
        # Clean and normalize text
        text = text.strip()
        text = ' '.join(text.split())  # Normalize whitespace

        # Truncate if too long (8191 tokens max for ada-002)
        max_chars = 8191 * 4  # ~4 chars per token
        if len(text) &gt; max_chars:
            text = text[:max_chars]

        return text

    def get_embedding(self, text):
        processed_text = self.preprocess_text(text)

        response = self.client.embeddings.create(
            input=processed_text,
            model=self.model
        )
        return response.data[0].embedding

    def optimize_for_search(self, title, content):
        # Weight the title more heavily by repeating it
        return f"{title}. {title}. {content}"
</code></pre>
<h3 id="heading-2-error-handling">2. Error Handling</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> wraps

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">retry_on_failure</span>(<span class="hljs-params">max_retries=<span class="hljs-number">3</span></span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">decorator</span>(<span class="hljs-params">func</span>):</span>
<span class="hljs-meta">        @wraps(func)</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">wrapper</span>(<span class="hljs-params">*args, **kwargs</span>):</span>
            <span class="hljs-keyword">for</span> attempt <span class="hljs-keyword">in</span> range(max_retries):
                <span class="hljs-keyword">try</span>:
                    <span class="hljs-keyword">return</span> func(*args, **kwargs)
                <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
                    <span class="hljs-keyword">if</span> attempt == max_retries - <span class="hljs-number">1</span>:
                        <span class="hljs-keyword">raise</span> e
                    wait_time = <span class="hljs-number">2</span> ** attempt
                    print(<span class="hljs-string">f"Attempt <span class="hljs-subst">{attempt + <span class="hljs-number">1</span>}</span> failed, retrying in <span class="hljs-subst">{wait_time}</span>s..."</span>)
                    time.sleep(wait_time)
        <span class="hljs-keyword">return</span> wrapper
    <span class="hljs-keyword">return</span> decorator

<span class="hljs-meta">@retry_on_failure(max_retries=3)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_upsert</span>(<span class="hljs-params">collection_name, points</span>):</span>
    <span class="hljs-keyword">return</span> client.upsert(collection_name=collection_name, points=points)
</code></pre>
<h3 id="heading-3-collection-design">3. Collection Design</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_optimized_collection</span>(<span class="hljs-params">name, expected_size</span>):</span>
    <span class="hljs-keyword">if</span> expected_size &lt; <span class="hljs-number">10000</span>:
        <span class="hljs-comment"># Small collection - no quantization needed</span>
        config = VectorParams(size=<span class="hljs-number">1536</span>, distance=Distance.COSINE)
        quantization = <span class="hljs-literal">None</span>
    <span class="hljs-keyword">elif</span> expected_size &lt; <span class="hljs-number">1000000</span>:
        <span class="hljs-comment"># Medium collection - use scalar quantization</span>
        config = VectorParams(size=<span class="hljs-number">1536</span>, distance=Distance.COSINE)
        quantization = ScalarQuantization(
            scalar=models.ScalarQuantizationConfig(
                type=models.ScalarType.INT8,
                quantile=<span class="hljs-number">0.99</span>
            )
        )
    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># Large collection - aggressive optimization</span>
        config = VectorParams(
            size=<span class="hljs-number">1536</span>, 
            distance=Distance.COSINE,
            hnsw_config=models.HnswConfigDiff(m=<span class="hljs-number">16</span>, ef_construct=<span class="hljs-number">100</span>)
        )
        quantization = models.ProductQuantization(
            product=models.ProductQuantizationConfig(
                compression=models.CompressionRatio.X8
            )
        )

    client.create_collection(
        collection_name=name,
        vectors_config=config,
        quantization_config=quantization
    )
</code></pre>
<hr />
<h2 id="heading-production-considerations">Production Considerations</h2>
<h3 id="heading-health-monitoring">Health Monitoring</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">health_check</span>():</span>
    <span class="hljs-keyword">try</span>:
        collections = client.get_collections()

        <span class="hljs-keyword">for</span> collection <span class="hljs-keyword">in</span> collections.collections:
            info = client.get_collection(collection.name)
            print(<span class="hljs-string">f"Collection: <span class="hljs-subst">{collection.name}</span>"</span>)
            print(<span class="hljs-string">f"  Points: <span class="hljs-subst">{info.points_count}</span>"</span>)
            print(<span class="hljs-string">f"  Indexed: <span class="hljs-subst">{info.indexed_vectors_count}</span>"</span>)

            <span class="hljs-keyword">if</span> info.vectors_count != info.indexed_vectors_count:
                print(<span class="hljs-string">f"  Warning: Indexing behind!"</span>)

        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Health check failed: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>
</code></pre>
<h3 id="heading-backup-strategy">Backup Strategy</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">backup_collection</span>(<span class="hljs-params">collection_name, backup_file</span>):</span>
    all_points = []
    offset = <span class="hljs-literal">None</span>

    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        result = client.scroll(
            collection_name=collection_name,
            offset=offset,
            limit=<span class="hljs-number">1000</span>,
            with_payload=<span class="hljs-literal">True</span>,
            with_vectors=<span class="hljs-literal">True</span>
        )

        points, next_offset = result
        all_points.extend([{
            <span class="hljs-string">"id"</span>: p.id,
            <span class="hljs-string">"vector"</span>: p.vector,
            <span class="hljs-string">"payload"</span>: p.payload
        } <span class="hljs-keyword">for</span> p <span class="hljs-keyword">in</span> points])

        <span class="hljs-keyword">if</span> next_offset <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
            <span class="hljs-keyword">break</span>
        offset = next_offset

    <span class="hljs-keyword">import</span> json
    <span class="hljs-keyword">with</span> open(backup_file, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(all_points, f)

    print(<span class="hljs-string">f"Backed up <span class="hljs-subst">{len(all_points)}</span> points"</span>)
</code></pre>
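<p>A matching restore helper is just the reverse: read the dump and re-upsert in batches. This sketch assumes the target collection already exists with the right vector size, and reuses the <code>client</code> from the setup section:</p>
<pre><code class="lang-python">import json

from qdrant_client.models import PointStruct

def restore_collection(collection_name, backup_file, batch_size=1000):
    with open(backup_file) as f:
        records = json.load(f)

    points = [
        PointStruct(id=r["id"], vector=r["vector"], payload=r["payload"])
        for r in records
    ]

    # Re-insert in batches to keep request sizes bounded
    for i in range(0, len(points), batch_size):
        client.upsert(collection_name=collection_name, points=points[i:i + batch_size])

    print(f"Restored {len(points)} points")
</code></pre>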
]]></content:encoded></item><item><title><![CDATA[Why Dense and Sparse Vectors Work Better Together: A Beginner's Guide to Multi-Vector Collections]]></title><description><![CDATA[Imagine you're looking for a book in a massive library. You might search by the exact title (keyword search) or describe what the book is about (semantic search). Sometimes you need both approaches to find exactly what you're looking for. This is pre...]]></description><link>https://blogs.ummerfarooq.dev/why-dense-and-sparse-vectors-work-better-together-a-beginners-guide-to-multi-vector-collections</link><guid isPermaLink="true">https://blogs.ummerfarooq.dev/why-dense-and-sparse-vectors-work-better-together-a-beginners-guide-to-multi-vector-collections</guid><category><![CDATA[qdrant]]></category><category><![CDATA[embedding]]></category><category><![CDATA[similarity search]]></category><category><![CDATA[semantic search]]></category><dc:creator><![CDATA[Ummer Farooq]]></dc:creator><pubDate>Thu, 14 Aug 2025 12:22:05 GMT</pubDate><content:encoded><![CDATA[<p>Imagine you're looking for a book in a massive library. You might search by the exact title (keyword search) or describe what the book is about (semantic search). Sometimes you need both approaches to find exactly what you're looking for. This is precisely why we use multi-vector collections in vector databases!</p>
<h2 id="heading-the-problem-with-the-single-vector-approach">The Problem with the Single Vector Approach</h2>
<p>Let's start with a real-world example to understand the limitation:</p>
<p><strong>Document</strong>: "The neural network model exhibits overfitting behaviour on the training dataset"</p>
<p><strong>User Query 1</strong>: "overfitting neural network"<br /><strong>User Query 2</strong>: "AI model performing poorly on training data"</p>
<p>With a single dense vector approach:</p>
<ul>
<li><p>Query 1 might not match well because dense vectors focus on the overall meaning</p>
</li>
<li><p>Query 2 might miss the document because it doesn't contain exact terms like "AI" or "performing poorly"</p>
</li>
</ul>
<p>This is where the magic of combining different vector types comes in!</p>
<h2 id="heading-what-are-multi-vector-collections">What Are Multi-Vector Collections?</h2>
<p>Multi-vector collections allow you to store multiple different representations of the same content within a single collection. Think of it as having different "lenses" through which you can view and search your data:</p>
<ul>
<li><p><strong>Dense Vectors</strong>: Understand meaning and context</p>
</li>
<li><p><strong>Sparse Vectors</strong>: Focus on exact keywords and terms</p>
</li>
<li><p><strong>Hybrid Approach</strong>: Combines both for superior search results</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Example of multi-vector structure</span>
document_vectors = {
    <span class="hljs-string">"semantic"</span>: [<span class="hljs-number">0.123</span>, <span class="hljs-number">-0.456</span>, <span class="hljs-number">0.789</span>, ...],      <span class="hljs-comment"># Dense vector (1536 dims)</span>
    <span class="hljs-string">"keywords"</span>: [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.67</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.45</span>, <span class="hljs-number">0</span>, ...],    <span class="hljs-comment"># Sparse vector (5000 dims)</span>
    <span class="hljs-string">"metadata"</span>: {<span class="hljs-string">"title"</span>: <span class="hljs-string">"ML Paper"</span>, <span class="hljs-string">"author"</span>: <span class="hljs-string">"John Doe"</span>}
}
</code></pre>
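<p>In practice, a mostly-zero list like the <code>"keywords"</code> vector above is rarely stored as a full array; it is kept as index/value pairs. A minimal, dependency-free sketch of that conversion (the function name is illustrative):</p>
<pre><code class="lang-python">def to_sparse(dense):
    """Convert a mostly-zero dense list into (indices, values) pairs."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return indices, values

# Only the non-zero keyword weights survive
print(to_sparse([0, 0, 0.67, 0, 0.45, 0]))  # ([2, 4], [0.67, 0.45])
</code></pre>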
<h2 id="heading-why-do-we-need-a-multi-vector-approach">Why Do We Need a Multi-Vector Approach?</h2>
<h3 id="heading-1-complementary-strengths">1. <strong>Complementary Strengths</strong></h3>
<p><strong>Dense Vectors Excel At:</strong></p>
<ul>
<li><p>Understanding synonyms ("car" ≈ "automobile")</p>
</li>
<li><p>Capturing context and meaning</p>
</li>
<li><p>Finding conceptually similar content</p>
</li>
<li><p>Handling paraphrases and different ways of expressing ideas</p>
</li>
</ul>
<p><strong>Sparse Vectors Excel At:</strong></p>
<ul>
<li><p>Exact keyword matching</p>
</li>
<li><p>Finding specific terms or phrases</p>
</li>
<li><p>Technical terminology searches</p>
</li>
<li><p>Proper nouns and unique identifiers</p>
</li>
</ul>
<h3 id="heading-2-real-world-search-scenarios">2. <strong>Real-World Search Scenarios</strong></h3>
<p>Consider an e-commerce product search:</p>
<pre><code class="lang-plaintext"># Product: "Apple MacBook Pro 16-inch M2 laptop computer"

# User searches: "16 inch Apple laptop"
# Dense vector: Understands "laptop" ≈ "computer" ≈ "MacBook"
# Sparse vector: Matches exact terms "16", "inch", "Apple", "laptop"
# Combined: Perfect match!

# User searches: "portable workstation for developers"  
# Dense vector: Connects "portable workstation" with "laptop computer"
# Sparse vector: Might miss due to different terminology
# Combined: Dense carries the weight, sparse provides precision
</code></pre>
<h3 id="heading-3-improved-retrieval-quality">3. <strong>Improved Retrieval Quality</strong></h3>
<p>In published benchmarks, hybrid search (dense + sparse) typically achieves:</p>
<ul>
<li><p><strong>20-40% better recall</strong> than dense alone</p>
</li>
<li><p><strong>15-30% better precision</strong> than sparse alone</p>
</li>
<li><p>More robust results across different query types</p>
</li>
</ul>
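<p>A popular way to merge the two ranked lists without tuning score weights is Reciprocal Rank Fusion (RRF), which combines results using only each document's rank. A minimal sketch (k=60 is the commonly used default constant):</p>
<pre><code class="lang-python">def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids by summing 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # ranked by semantic similarity
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # ranked by keyword overlap
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
</code></pre>
<p>Documents that appear high in both lists float to the top, even though the raw cosine and keyword scores are never compared directly.</p>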
<hr />
<h2 id="heading-types-of-sparse-vector-creation-methods">Types of Sparse Vector Creation Methods</h2>
<h3 id="heading-1-tf-idf-term-frequency-inverse-document-frequency">1. <strong>TF-IDF (Term Frequency-Inverse Document Frequency)</strong></h3>
<p>The classic statistical approach that weighs terms by their importance.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer

<span class="hljs-comment"># Simple TF-IDF example</span>
documents = [
    <span class="hljs-string">"machine learning algorithms"</span>,
    <span class="hljs-string">"deep learning neural networks"</span>, 
    <span class="hljs-string">"artificial intelligence applications"</span>
]

vectorizer = TfidfVectorizer(max_features=<span class="hljs-number">1000</span>)
tfidf_matrix = vectorizer.fit_transform(documents)

<span class="hljs-comment"># For new document</span>
new_doc = <span class="hljs-string">"machine learning models"</span>
sparse_vector = vectorizer.transform([new_doc]).toarray()[<span class="hljs-number">0</span>]
print(<span class="hljs-string">f"Sparse vector shape: <span class="hljs-subst">{sparse_vector.shape}</span>"</span>)  <span class="hljs-comment"># (9,) here; max_features only caps larger vocabularies</span>
print(<span class="hljs-string">f"Non-zero elements: <span class="hljs-subst">{(sparse_vector != <span class="hljs-number">0</span>).sum()}</span>"</span>)  <span class="hljs-comment"># only a few are non-zero</span>
</code></pre>
<p><strong>When to use TF-IDF:</strong></p>
<ul>
<li><p>General-purpose keyword matching</p>
</li>
<li><p>When you have a well-defined vocabulary</p>
</li>
<li><p>Documents with clear term boundaries</p>
</li>
</ul>
<h3 id="heading-2-bm25-best-matching-25">2. <strong>BM25 (Best Matching 25)</strong></h3>
<p>An improved version of TF-IDF that handles document length better.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> rank_bm25 <span class="hljs-keyword">import</span> BM25Okapi
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># BM25 implementation</span>
documents = [
    <span class="hljs-string">"machine learning algorithms for data science"</span>,
    <span class="hljs-string">"deep learning and neural network architectures"</span>,
    <span class="hljs-string">"natural language processing with transformers"</span>
]

<span class="hljs-comment"># Tokenize documents</span>
tokenized_docs = [doc.split() <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> documents]
bm25 = BM25Okapi(tokenized_docs)

<span class="hljs-comment"># Create sparse vector for query</span>
query = <span class="hljs-string">"machine learning data"</span>
query_tokens = query.split()

<span class="hljs-comment"># Get BM25 scores for all documents</span>
scores = bm25.get_scores(query_tokens)
print(<span class="hljs-string">f"BM25 scores: <span class="hljs-subst">{scores}</span>"</span>)

<span class="hljs-comment"># Convert a query to a sparse vector using the model's learned IDF weights</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_bm25_sparse_vector</span>(<span class="hljs-params">query_tokens, bm25_model, vocab</span>):</span>
    <span class="hljs-comment"># vocab maps each term to a fixed dimension index, e.g. {"machine": 0, ...}</span>
    sparse_vector = np.zeros(len(vocab))
    <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> query_tokens:
        <span class="hljs-keyword">if</span> token <span class="hljs-keyword">in</span> vocab:
            sparse_vector[vocab[token]] = bm25_model.idf.get(token, <span class="hljs-number">0.0</span>)
    <span class="hljs-keyword">return</span> sparse_vector
</code></pre>
<p><strong>When to use BM25:</strong></p>
<ul>
<li><p>Document retrieval systems</p>
</li>
<li><p>When document length varies significantly</p>
</li>
<li><p>Search engines and information retrieval</p>
</li>
</ul>
<h3 id="heading-3-splade-sparse-lexical-and-expansion">3. <strong>SPLADE (Sparse Lexical and Expansion)</strong></h3>
<p>A neural approach that learns to create sparse vectors.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer, AutoModelForMaskedLM
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># SPLADE creates learned sparse vectors</span>
model_name = <span class="hljs-string">"naver/splade-cocondenser-ensembledistil"</span>
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_splade_vector</span>(<span class="hljs-params">text</span>):</span>
    inputs = tokenizer(text, return_tensors=<span class="hljs-string">"pt"</span>, truncation=<span class="hljs-literal">True</span>, max_length=<span class="hljs-number">512</span>)

    <span class="hljs-keyword">with</span> torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    <span class="hljs-comment"># SPLADE aggregation: log-saturated ReLU, max-pooled over the token dimension</span>
    sparse_vector = torch.max(torch.log1p(torch.relu(logits)), dim=<span class="hljs-number">1</span>).values.squeeze().cpu().numpy()

    <span class="hljs-comment"># Keep only top-k dimensions (for sparsity)</span>
    top_k = <span class="hljs-number">100</span>
    top_indices = np.argsort(sparse_vector)[-top_k:]
    final_sparse = np.zeros_like(sparse_vector)
    final_sparse[top_indices] = sparse_vector[top_indices]

    <span class="hljs-keyword">return</span> final_sparse

<span class="hljs-comment"># Usage</span>
text = <span class="hljs-string">"machine learning model optimization"</span>
splade_vector = create_splade_vector(text)
print(<span class="hljs-string">f"SPLADE vector sparsity: <span class="hljs-subst">{(splade_vector == <span class="hljs-number">0</span>).sum() / len(splade_vector):<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<p><strong>When to use SPLADE:</strong></p>
<ul>
<li><p>When you need learned sparse representations</p>
</li>
<li><p>Complex domain-specific terminology</p>
</li>
<li><p>When you can afford the computational cost</p>
</li>
</ul>
<h3 id="heading-4-colbert-contextualized-late-interaction">4. <strong>ColBERT (Contextualized Late Interaction)</strong></h3>
<p>Creates multiple vectors per document for fine-grained matching.</p>
<pre><code class="lang-python"><span class="hljs-comment"># ColBERT conceptual approach</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">colbert_sparse_simulation</span>(<span class="hljs-params">text, model</span>):</span>
    <span class="hljs-comment"># ColBERT creates a vector for each token</span>
    tokens = text.split()
    token_vectors = []

    <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> tokens:
        <span class="hljs-comment"># Each token gets its own contextualized vector</span>
        token_embedding = model.encode(token)  <span class="hljs-comment"># Simplified</span>
        token_vectors.append(token_embedding)

    <span class="hljs-keyword">return</span> token_vectors

<span class="hljs-comment"># This creates multiple sparse-like representations</span>
<span class="hljs-comment"># that can be stored and searched efficiently</span>
</code></pre>
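<p>At query time, ColBERT scores documents with "late interaction": every query-token vector is compared against every document-token vector, and the best match per query token is summed (the MaxSim operator). A toy sketch with hand-made 2-dimensional embeddings:</p>
<pre><code class="lang-python">def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best match among document tokens."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

query_vecs = [[1.0, 0.0], [0.0, 1.0]]            # two query tokens
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # three document tokens
score = maxsim_score(query_vecs, doc_vecs)       # best matches: 0.9 and 0.8
</code></pre>
<p>Because each query token independently picks its best document token, MaxSim keeps token-level precision while still using contextual embeddings.</p>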
<hr />
<h2 id="heading-practical-implementation-with-qdrant">Practical Implementation with Qdrant</h2>
<p>Here's how to implement multi-vector collections in your setup:</p>
<h3 id="heading-1-collection-setup">1. <strong>Collection Setup</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> qdrant_client <span class="hljs-keyword">import</span> QdrantClient
<span class="hljs-keyword">from</span> qdrant_client.models <span class="hljs-keyword">import</span> VectorParams, Distance

client = QdrantClient(<span class="hljs-string">"localhost"</span>, port=<span class="hljs-number">6333</span>)

<span class="hljs-comment"># Create collection with multiple vectors</span>
client.create_collection(
    collection_name=<span class="hljs-string">"hybrid_search"</span>,
    vectors_config={
        <span class="hljs-string">"semantic"</span>: VectorParams(size=<span class="hljs-number">1536</span>, distance=Distance.COSINE),  <span class="hljs-comment"># OpenAI</span>
        <span class="hljs-string">"keywords"</span>: VectorParams(size=<span class="hljs-number">5000</span>, distance=Distance.DOT),     <span class="hljs-comment"># TF-IDF</span>
        <span class="hljs-string">"bm25"</span>: VectorParams(size=<span class="hljs-number">3000</span>, distance=Distance.DOT),         <span class="hljs-comment"># BM25</span>
    }
)
</code></pre>
<h3 id="heading-2-document-processing-pipeline">2. <strong>Document Processing Pipeline</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> openai
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MultiVectorProcessor</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        self.tfidf = TfidfVectorizer(max_features=<span class="hljs-number">5000</span>)
        self.openai_client = openai.Client()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_dense_vector</span>(<span class="hljs-params">self, text</span>):</span>
        <span class="hljs-string">"""Create semantic dense vector using OpenAI"""</span>
        response = self.openai_client.embeddings.create(
            input=text,
            model=<span class="hljs-string">"text-embedding-ada-002"</span>
        )
        <span class="hljs-keyword">return</span> response.data[<span class="hljs-number">0</span>].embedding

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_sparse_vector</span>(<span class="hljs-params">self, text, method=<span class="hljs-string">"tfidf"</span></span>):</span>
        <span class="hljs-string">"""Create sparse vector using specified method.

        Note: fit self.tfidf on your corpus (self.tfidf.fit(corpus))
        before the first call, or transform() will raise an error.
        """</span>
        <span class="hljs-keyword">if</span> method == <span class="hljs-string">"tfidf"</span>:
            <span class="hljs-keyword">return</span> self.tfidf.transform([text]).toarray()[<span class="hljs-number">0</span>].tolist()
        <span class="hljs-comment"># Add other methods as needed</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_document</span>(<span class="hljs-params">self, text, doc_id</span>):</span>
        <span class="hljs-string">"""Process single document into multi-vector format"""</span>
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">"id"</span>: doc_id,
            <span class="hljs-string">"vectors"</span>: {
                <span class="hljs-string">"semantic"</span>: self.create_dense_vector(text),
                <span class="hljs-string">"keywords"</span>: self.create_sparse_vector(text, <span class="hljs-string">"tfidf"</span>),
            },
            <span class="hljs-string">"payload"</span>: {
                <span class="hljs-string">"text"</span>: text,
                <span class="hljs-string">"processed_at"</span>: time.time()
            }
        }
</code></pre>
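<p>The processed documents still need to be written into Qdrant. A minimal sketch of upserting a point that carries both named vectors (toy 4-dimensional vectors and the in-process <code>:memory:</code> client are used here so the snippet runs without a server; swap in your real sizes and host):</p>
<pre><code class="lang-python">from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(":memory:")  # in-process mode, handy for experiments

client.create_collection(
    collection_name="hybrid_search_demo",
    vectors_config={
        "semantic": VectorParams(size=4, distance=Distance.COSINE),
        "keywords": VectorParams(size=4, distance=Distance.DOT),
    },
)

# One point carrying both named vectors plus its payload
client.upsert(
    collection_name="hybrid_search_demo",
    points=[PointStruct(
        id=1,
        vector={"semantic": [0.1, 0.2, 0.3, 0.4],
                "keywords": [0.0, 0.0, 0.7, 0.0]},
        payload={"text": "example document"},
    )],
)

print(client.count("hybrid_search_demo").count)  # 1
</code></pre>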
<h3 id="heading-3-hybrid-search-implementation">3. <strong>Hybrid Search Implementation</strong></h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">hybrid_search</span>(<span class="hljs-params">query, collection_name, weights=None</span>):</span>
    <span class="hljs-string">"""
    Perform hybrid search combining multiple vector types
    """</span>
    <span class="hljs-keyword">if</span> weights <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        weights = {<span class="hljs-string">"semantic"</span>: <span class="hljs-number">0.7</span>, <span class="hljs-string">"keywords"</span>: <span class="hljs-number">0.3</span>}

    processor = MultiVectorProcessor()

    <span class="hljs-comment"># Create query vectors</span>
    query_semantic = processor.create_dense_vector(query)
    query_keywords = processor.create_sparse_vector(query)

    <span class="hljs-comment"># Search with semantic vector</span>
    semantic_results = client.search(
        collection_name=collection_name,
        query_vector=(<span class="hljs-string">"semantic"</span>, query_semantic),
        limit=<span class="hljs-number">20</span>
    )

    <span class="hljs-comment"># Search with keyword vector</span>
    keyword_results = client.search(
        collection_name=collection_name,
        query_vector=(<span class="hljs-string">"keywords"</span>, query_keywords),
        limit=<span class="hljs-number">20</span>
    )

    <span class="hljs-comment"># Combine results with weighted scoring</span>
    combined_results = combine_and_rerank(
        semantic_results, keyword_results, weights
    )

    <span class="hljs-keyword">return</span> combined_results[:<span class="hljs-number">10</span>]  <span class="hljs-comment"># Top 10 results</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">combine_and_rerank</span>(<span class="hljs-params">semantic_results, keyword_results, weights</span>):</span>
    <span class="hljs-string">"""Combine and rerank results from different vector searches"""</span>
    result_scores = {}

    <span class="hljs-comment"># Score semantic results</span>
    <span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> semantic_results:
        doc_id = result.id
        result_scores[doc_id] = result_scores.get(doc_id, <span class="hljs-number">0</span>) + \
                               (result.score * weights[<span class="hljs-string">"semantic"</span>])

    <span class="hljs-comment"># Score keyword results  </span>
    <span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> keyword_results:
        doc_id = result.id
        result_scores[doc_id] = result_scores.get(doc_id, <span class="hljs-number">0</span>) + \
                               (result.score * weights[<span class="hljs-string">"keywords"</span>])

    <span class="hljs-comment"># Sort by combined score</span>
    sorted_results = sorted(result_scores.items(), 
                          key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)
    <span class="hljs-keyword">return</span> sorted_results
</code></pre>
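<p>One caveat with the weighted combination above: cosine and dot-product scores live on different scales, so it usually pays to normalise each result list before mixing. A minimal min-max sketch:</p>
<pre><code class="lang-python">def minmax_normalize(scores):
    """Rescale raw scores into [0, 1] so lists from different metrics are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores equal: avoid division by zero
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(minmax_normalize([2.0, 4.0, 8.0]))  # [0.0, 0.333..., 1.0]
</code></pre>
<p>Apply it to the semantic and keyword score lists separately before multiplying by the weights.</p>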
<hr />
<h2 id="heading-when-to-use-a-multi-vector-approach">When to Use a Multi-Vector Approach</h2>
<h3 id="heading-ideal-use-cases"><strong>Ideal Use Cases:</strong></h3>
<ol>
<li><p><strong>E-commerce Search</strong>: Product catalogues need both exact matches and semantic understanding</p>
</li>
<li><p><strong>Legal Document Retrieval</strong>: Exact legal terms + conceptual case law matching</p>
</li>
<li><p><strong>Academic Paper Search</strong>: Technical keywords + research concept similarity</p>
</li>
<li><p><strong>Customer Support</strong>: FAQ systems benefit from keyword precision + intent understanding</p>
</li>
<li><p><strong>Enterprise Search</strong>: Internal documents with domain-specific terminology</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[A Guide to Document Chunking and Vector Search]]></title><description><![CDATA[Introduction: Why Traditional Search Falls Short
Imagine you're searching through a massive company knowledge base for information about "machine learning best practices." Traditional keyword search might return hundreds of documents, but you end up ...]]></description><link>https://blogs.ummerfarooq.dev/a-guide-to-document-chunking-and-vector-search</link><guid isPermaLink="true">https://blogs.ummerfarooq.dev/a-guide-to-document-chunking-and-vector-search</guid><category><![CDATA[qdrant]]></category><category><![CDATA[semantic search]]></category><category><![CDATA[langchain]]></category><category><![CDATA[chunking]]></category><dc:creator><![CDATA[Ummer Farooq]]></dc:creator><pubDate>Thu, 14 Aug 2025 11:17:59 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction-why-traditional-search-falls-short">Introduction: Why Traditional Search Falls Short</h2>
<p>Imagine you're searching through a massive company knowledge base for information about "machine learning best practices." Traditional keyword search might return hundreds of documents, but you end up scrolling through irrelevant results because:</p>
<ul>
<li><p>The term "machine learning" appears in random sentences throughout documents</p>
</li>
<li><p>You get the entire 50-page document when you need a specific section</p>
</li>
<li><p>Important documents are buried because they use synonyms like "AI" or "artificial intelligence"</p>
</li>
</ul>
<p>This is where <strong>intelligent document chunking</strong> and <strong>vector search</strong> come to the rescue. Instead of treating documents like black boxes, we break them down intelligently and search through them using AI that understands meaning, not just keywords.</p>
<p>But here's the thing: there are multiple ways to approach this problem, each with its strengths and use cases. Let's examine the primary strategies that production systems employ today.</p>
<h2 id="heading-the-three-main-approaches-explained">The Three Main Approaches Explained</h2>
<h3 id="heading-1-fixed-size-chunking-the-simple-approach">1. Fixed-Size Chunking: The Simple Approach</h3>
<p><strong>What it is</strong>: Cut documents into equal-sized pieces, like slicing bread.</p>
<p><strong>How it works</strong>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.text_splitter <span class="hljs-keyword">import</span> CharacterTextSplitter

<span class="hljs-comment"># Simple character-based chunking</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simple_chunking</span>(<span class="hljs-params">document, chunk_size=<span class="hljs-number">500</span>, chunk_overlap=<span class="hljs-number">50</span></span>):</span>
    text_splitter = CharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separator=<span class="hljs-string">"\n"</span>
    )
    chunks = text_splitter.split_text(document)
    <span class="hljs-keyword">return</span> chunks

<span class="hljs-comment"># Example</span>
document = <span class="hljs-string">"Machine learning is revolutionizing healthcare..."</span>
chunks = simple_chunking(document, <span class="hljs-number">200</span>)
<span class="hljs-comment"># Result: ["Machine learning is revolutionizing healthcare by enabling...", </span>
<span class="hljs-comment">#          "...doctors to diagnose diseases faster. Recent studies show..."]</span>
</code></pre>
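<p>Under the hood this is just a sliding window: each step advances by <code>chunk_size - chunk_overlap</code> characters, so consecutive chunks share some context. A dependency-free sketch of the same mechanics:</p>
<pre><code class="lang-python">def sliding_window_chunks(text, chunk_size=200, chunk_overlap=50):
    """Cut text into fixed windows, re-reading chunk_overlap characters each step."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
</code></pre>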
<p><strong>Real-world example</strong>: Netflix might use this approach for subtitles or movie descriptions where the content is relatively uniform.</p>
<p><strong>Pros</strong>:</p>
<ul>
<li><p>✅ Simple to implement</p>
</li>
<li><p>✅ Predictable memory usage</p>
</li>
<li><p>✅ Works well for uniform content (novels, articles)</p>
</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li><p>❌ Cuts through sentences mid-thought</p>
</li>
<li><p>❌ Loses document structure</p>
</li>
<li><p>❌ Poor for complex documents</p>
</li>
</ul>
<h3 id="heading-2-semantic-chunking-the-smart-approach">2. Semantic Chunking: The Smart Approach</h3>
<p><strong>What it is</strong>: Split documents based on meaning and structure, like organising a library by topics.</p>
<p><strong>How it works</strong>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.text_splitter <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter
<span class="hljs-keyword">from</span> langchain.text_splitter <span class="hljs-keyword">import</span> MarkdownHeaderTextSplitter
<span class="hljs-keyword">import</span> re

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">semantic_chunking_with_langchain</span>(<span class="hljs-params">document, doc_type=<span class="hljs-string">"markdown"</span></span>):</span>
    chunks = []

    <span class="hljs-keyword">if</span> doc_type == <span class="hljs-string">"markdown"</span>:
        <span class="hljs-comment"># For markdown documents, split by headers</span>
        headers_to_split_on = [
            (<span class="hljs-string">"#"</span>, <span class="hljs-string">"Header 1"</span>),
            (<span class="hljs-string">"##"</span>, <span class="hljs-string">"Header 2"</span>),
            (<span class="hljs-string">"###"</span>, <span class="hljs-string">"Header 3"</span>),
        ]

        markdown_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=headers_to_split_on
        )
        md_header_splits = markdown_splitter.split_text(document)

        <span class="hljs-comment"># Further split large sections</span>
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=<span class="hljs-number">500</span>,
            chunk_overlap=<span class="hljs-number">50</span>
        )

        <span class="hljs-keyword">for</span> header_chunk <span class="hljs-keyword">in</span> md_header_splits:
            chunk_type = detect_section_type(header_chunk.page_content, header_chunk.metadata)

            <span class="hljs-comment"># Split large chunks further while preserving structure</span>
            <span class="hljs-keyword">if</span> len(header_chunk.page_content) &gt; <span class="hljs-number">800</span>:
                sub_chunks = text_splitter.split_text(header_chunk.page_content)
                <span class="hljs-keyword">for</span> i, sub_chunk <span class="hljs-keyword">in</span> enumerate(sub_chunks):
                    chunks.append({
                        <span class="hljs-string">'content'</span>: sub_chunk,
                        <span class="hljs-string">'type'</span>: chunk_type,
                        <span class="hljs-string">'metadata'</span>: {
                            **header_chunk.metadata,
                            <span class="hljs-string">'sub_chunk_index'</span>: i,
                            <span class="hljs-string">'is_split_chunk'</span>: <span class="hljs-literal">True</span>
                        }
                    })
            <span class="hljs-keyword">else</span>:
                chunks.append({
                    <span class="hljs-string">'content'</span>: header_chunk.page_content,
                    <span class="hljs-string">'type'</span>: chunk_type,
                    <span class="hljs-string">'metadata'</span>: header_chunk.metadata
                })

    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># For plain text, use recursive splitting with custom separators</span>
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=<span class="hljs-number">500</span>,
            chunk_overlap=<span class="hljs-number">50</span>,
            separators=[<span class="hljs-string">"\n\n\n"</span>, <span class="hljs-string">"\n\n"</span>, <span class="hljs-string">"\n"</span>, <span class="hljs-string">"."</span>, <span class="hljs-string">"!"</span>, <span class="hljs-string">"?"</span>, <span class="hljs-string">" "</span>, <span class="hljs-string">""</span>]
        )

        raw_chunks = text_splitter.split_text(document)

        <span class="hljs-keyword">for</span> i, chunk <span class="hljs-keyword">in</span> enumerate(raw_chunks):
            chunk_type = detect_section_type(chunk)
            chunks.append({
                <span class="hljs-string">'content'</span>: chunk,
                <span class="hljs-string">'type'</span>: chunk_type,
                <span class="hljs-string">'metadata'</span>: {<span class="hljs-string">'chunk_index'</span>: i}
            })

    <span class="hljs-keyword">return</span> chunks

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">detect_section_type</span>(<span class="hljs-params">text, existing_metadata=None</span>):</span>
    text_lower = text.lower()

    <span class="hljs-comment"># Use existing header metadata if available</span>
    <span class="hljs-keyword">if</span> existing_metadata:
        <span class="hljs-keyword">for</span> key, value <span class="hljs-keyword">in</span> existing_metadata.items():
            <span class="hljs-keyword">if</span> <span class="hljs-string">'header'</span> <span class="hljs-keyword">in</span> key.lower():
                header_text = value.lower()
                <span class="hljs-keyword">if</span> any(keyword <span class="hljs-keyword">in</span> header_text <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> [<span class="hljs-string">'summary'</span>, <span class="hljs-string">'abstract'</span>, <span class="hljs-string">'overview'</span>]):
                    <span class="hljs-keyword">return</span> <span class="hljs-string">'summary'</span>
                <span class="hljs-keyword">elif</span> any(keyword <span class="hljs-keyword">in</span> header_text <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> [<span class="hljs-string">'conclusion'</span>, <span class="hljs-string">'results'</span>]):
                    <span class="hljs-keyword">return</span> <span class="hljs-string">'conclusion'</span>
                <span class="hljs-keyword">elif</span> any(keyword <span class="hljs-keyword">in</span> header_text <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> [<span class="hljs-string">'introduction'</span>, <span class="hljs-string">'background'</span>]):
                    <span class="hljs-keyword">return</span> <span class="hljs-string">'introduction'</span>

    <span class="hljs-comment"># Fallback to content-based detection</span>
    <span class="hljs-keyword">if</span> len(text) &lt; <span class="hljs-number">100</span> <span class="hljs-keyword">and</span> <span class="hljs-string">':'</span> <span class="hljs-keyword">in</span> text:
        <span class="hljs-keyword">return</span> <span class="hljs-string">'title'</span>
    <span class="hljs-keyword">elif</span> any(keyword <span class="hljs-keyword">in</span> text_lower <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> [<span class="hljs-string">'summary'</span>, <span class="hljs-string">'abstract'</span>, <span class="hljs-string">'overview'</span>]):
        <span class="hljs-keyword">return</span> <span class="hljs-string">'summary'</span>
    <span class="hljs-keyword">elif</span> any(keyword <span class="hljs-keyword">in</span> text_lower <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> [<span class="hljs-string">'conclusion'</span>, <span class="hljs-string">'in conclusion'</span>, <span class="hljs-string">'to conclude'</span>]):
        <span class="hljs-keyword">return</span> <span class="hljs-string">'conclusion'</span>
    <span class="hljs-keyword">elif</span> any(keyword <span class="hljs-keyword">in</span> text_lower <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> [<span class="hljs-string">'introduction'</span>, <span class="hljs-string">'background'</span>]):
        <span class="hljs-keyword">return</span> <span class="hljs-string">'introduction'</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-string">'content'</span>
</code></pre>
<p><strong>Real-world example</strong>: A legal firm's document system where lawyers need to quickly find case summaries, detailed legal reasoning, or final judgments.</p>
<p><strong>Pros</strong>:</p>
<ul>
<li><p>✅ Preserves document structure</p>
</li>
<li><p>✅ Enables targeted search (search only in summaries)</p>
</li>
<li><p>✅ Better context preservation</p>
</li>
<li><p>✅ Widely used in production</p>
</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li><p>❌ More complex to implement</p>
</li>
<li><p>❌ Requires understanding of document structure</p>
</li>
<li><p>❌ May create uneven chunk sizes</p>
</li>
</ul>
<h3 id="heading-3-multi-vector-collections-the-advanced-approach">3. Multi-Vector Collections: The Advanced Approach</h3>
<p><strong>What it is</strong>: Create multiple different representations of the same content, like having multiple indexes for the same library book.</p>
<p><strong>How it works</strong>:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">multi_vector_approach</span>(<span class="hljs-params">document</span>):</span>
    <span class="hljs-comment"># Same document, multiple representations</span>
    representations = {}

    <span class="hljs-comment"># Semantic representation (for meaning)</span>
    representations[<span class="hljs-string">'semantic'</span>] = create_embedding(document, model=<span class="hljs-string">'semantic'</span>)

    <span class="hljs-comment"># Keyword representation (for exact matches)</span>
    keyword_enhanced = extract_keywords(document) + document
    representations[<span class="hljs-string">'keyword'</span>] = create_embedding(keyword_enhanced, model=<span class="hljs-string">'keyword'</span>)

    <span class="hljs-comment"># Summary representation (for high-level concepts)</span>
    summary = generate_summary(document)
    representations[<span class="hljs-string">'summary'</span>] = create_embedding(summary, model=<span class="hljs-string">'large'</span>)

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'document'</span>: document,
        <span class="hljs-string">'vectors'</span>: representations
    }

<span class="hljs-comment"># When searching, you can choose which representation to use</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">search_with_strategy</span>(<span class="hljs-params">query, search_type=<span class="hljs-string">'semantic'</span></span>):</span>
    <span class="hljs-keyword">if</span> search_type == <span class="hljs-string">'semantic'</span>:
        <span class="hljs-keyword">return</span> search_vector(query, vector_type=<span class="hljs-string">'semantic'</span>)
    <span class="hljs-keyword">elif</span> search_type == <span class="hljs-string">'keyword'</span>:
        <span class="hljs-keyword">return</span> search_vector(query, vector_type=<span class="hljs-string">'keyword'</span>)
    <span class="hljs-keyword">elif</span> search_type == <span class="hljs-string">'conceptual'</span>:
        <span class="hljs-keyword">return</span> search_vector(query, vector_type=<span class="hljs-string">'summary'</span>)
</code></pre>
<p><strong>Real-world example</strong>: A research platform where the same paper needs to be findable by exact technical terms, general concepts, and semantic similarity.</p>
<p><strong>Pros</strong>:</p>
<ul>
<li><p>✅ Multiple search strategies for the same content</p>
</li>
<li><p>✅ Can combine different AI models</p>
</li>
<li><p>✅ Handles diverse query types well</p>
</li>
</ul>
<p><strong>Cons</strong>:</p>
<ul>
<li><p>❌ Much more complex and expensive</p>
</li>
<li><p>❌ Requires multiple embedding API calls</p>
</li>
<li><p>❌ Higher storage costs</p>
</li>
<li><p>❌ Less commonly used in production</p>
</li>
</ul>
<hr />
<h2 id="heading-real-world-use-cases-which-approach-when">Real-World Use Cases: Which Approach When?</h2>
<h3 id="heading-e-commerce-platform-product-search">E-commerce Platform: Product Search</h3>
<p><strong>Scenario</strong>: Customers search for products using various terms</p>
<p><strong>Best approach</strong>: <strong>Semantic chunking</strong> with product attribute separation</p>
<pre><code class="lang-plaintext">Product chunks:
- Title: "iPhone 15 Pro Max"
- Features: "6.7-inch display, A17 Pro chip, titanium design"
- Reviews: "Great camera quality, excellent battery life"
- Specifications: "256GB storage, 5G connectivity"
</code></pre>
<p><strong>Why</strong>: Customers might want to search specifically in reviews ("battery life") or specifications ("storage"), making targeted search valuable.</p>
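<p>The attribute split above can be produced with a small helper that tags each chunk with a chunk_type payload, which is what makes the targeted searches possible. A minimal sketch (the product record and field names are illustrative, not a fixed schema):</p>
<pre><code class="lang-python"># Hypothetical product record; the field names are illustrative only
product = {
    "id": "sku-123",
    "title": "iPhone 15 Pro Max",
    "features": "6.7-inch display, A17 Pro chip, titanium design",
    "reviews": "Great camera quality, excellent battery life",
    "specifications": "256GB storage, 5G connectivity",
}

def product_to_chunks(product):
    """Give each attribute its own chunk, tagged with a chunk_type payload
    so searches can later be filtered to e.g. only reviews or specifications."""
    chunks = []
    for field in ("title", "features", "reviews", "specifications"):
        text = product.get(field)
        if text:
            chunks.append({
                "content": text,
                "payload": {"chunk_type": field, "parent_doc_id": product["id"]},
            })
    return chunks

chunks = product_to_chunks(product)  # one chunk per attribute, each independently embeddable
</code></pre>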
<h3 id="heading-legal-document-management">Legal Document Management</h3>
<p><strong>Scenario</strong>: Lawyers need to find specific information in thousands of legal documents</p>
<p><strong>Best approach</strong>: <strong>Semantic chunking</strong> with legal document structure</p>
<pre><code class="lang-plaintext">Legal document chunks:
- Case summary: "Plaintiff vs. Defendant regarding contract dispute"
- Facts: "On January 15, 2023, the parties entered into agreement..."
- Legal reasoning: "Under contract law precedent established in..."
- Judgment: "The court finds in favor of plaintiff and awards..."
</code></pre>
<p><strong>Why</strong>: Legal professionals have specific information needs - they might want only case summaries for quick review or only judgments for precedent research.</p>
<h3 id="heading-customer-support-knowledge-base">Customer Support Knowledge Base</h3>
<p><strong>Scenario</strong>: Support agents need quick answers to customer questions</p>
<p><strong>Best approach</strong>: <strong>Multi-vector collections</strong> for diverse query handling</p>
<pre><code class="lang-plaintext">Same article about "Password Reset" gets multiple representations:
- Semantic vector: Understands "I can't log in" → password reset
- Keyword vector: Finds exact matches for "forgot password"
- Summary vector: Matches high-level concepts like "account access issues"
</code></pre>
<p><strong>Why</strong>: Customer questions come in many forms - some use exact terminology, others describe problems in natural language.</p>
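<p>In Qdrant, this pattern maps onto named vectors: one collection can hold several vectors per point, and you pick which one to search at query time. A minimal sketch using an in-memory client and toy 4-dimensional placeholder vectors (real embeddings would come from your embedding models):</p>
<pre><code class="lang-python">from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, NamedVector, PointStruct

# Local in-memory instance for illustration; point at your server in production
client = QdrantClient(":memory:")

# One collection, several named vectors per point -- Qdrant's native way
# to store multiple representations of the same document
client.create_collection(
    collection_name="support_articles",
    vectors_config={
        "semantic": VectorParams(size=4, distance=Distance.COSINE),
        "keyword": VectorParams(size=4, distance=Distance.COSINE),
    },
)

# The vector values below are toy placeholders, not real embeddings
client.upsert(
    collection_name="support_articles",
    points=[PointStruct(
        id=1,
        vector={"semantic": [0.1, 0.2, 0.3, 0.4], "keyword": [0.4, 0.3, 0.2, 0.1]},
        payload={"title": "Password Reset"},
    )],
)

# Choose which representation to search at query time
hits = client.search(
    collection_name="support_articles",
    query_vector=NamedVector(name="semantic", vector=[0.1, 0.2, 0.3, 0.4]),
    limit=1,
)
</code></pre>
<p>Compared with running separate collections per representation, named vectors keep all views of a document under one point, so payload and lifecycle management stay in one place.</p>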
<h3 id="heading-academic-research-platform">Academic Research Platform</h3>
<p><strong>Scenario</strong>: Researchers search through millions of scientific papers</p>
<p><strong>Best approach</strong>: <strong>Semantic chunking</strong> with academic paper structure</p>
<pre><code class="lang-plaintext">Research paper chunks:
- Abstract: High-level research summary
- Introduction: Problem background and motivation
- Methodology: How the research was conducted
- Results: What was discovered
- Conclusion: Implications and future work
</code></pre>
<p><strong>Why</strong>: Researchers have different needs - some want quick overviews (abstracts), others need implementation details (methodology).</p>
<hr />
<h2 id="heading-best-practices-and-common-pitfalls">Best Practices and Common Pitfalls</h2>
<h3 id="heading-dos">Do's ✅</h3>
<ol>
<li><p><strong>Start simple</strong>: Begin with semantic chunking before considering multi-vector</p>
</li>
<li><p><strong>Test with real queries</strong>: Use actual user queries to evaluate effectiveness</p>
</li>
<li><p><strong>Monitor chunk sizes</strong>: Aim for 200-800 tokens per chunk for most embedding models</p>
</li>
<li><p><strong>Preserve context</strong>: Include some overlap between chunks to maintain context</p>
</li>
<li><p><strong>Use metadata effectively</strong>: Store document source, creation date, author, etc.</p>
</li>
</ol>
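<p>The overlap recommendation above can be implemented with a simple sliding window. A minimal sketch that counts words as a stand-in for tokens (swap in a real tokenizer, such as your embedding model's, for accurate sizing):</p>
<pre><code class="lang-python">def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks where each chunk repeats the last
    `overlap` words of the previous one, so context survives chunk boundaries.
    Word counts stand in for tokens here; use a real tokenizer in production."""
    words = text.split()
    step = chunk_size - overlap
    # Stop once a chunk reaches the end of the text; max(..., 1) handles short inputs
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[s:s + chunk_size]) for s in starts]

# 500 words with chunk_size=200 and overlap=50 yields 3 chunks,
# each sharing its first 50 words with the previous chunk's tail
chunks = chunk_with_overlap(" ".join(str(i) for i in range(500)))
</code></pre>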
<h3 id="heading-donts">Don'ts ❌</h3>
<ol>
<li><p><strong>Don't over-engineer</strong>: Multi-vector isn't always better than good semantic chunking</p>
</li>
<li><p><strong>Don't ignore document structure</strong>: Fixed chunking often loses important context</p>
</li>
<li><p><strong>Don't forget evaluation</strong>: Measure search quality with real user scenarios</p>
</li>
<li><p><strong>Don't chunk too small</strong>: Very small chunks lose context</p>
</li>
<li><p><strong>Don't chunk too large</strong>: Very large chunks dilute specific information</p>
</li>
</ol>
<h2 id="heading-common-pitfalls-and-how-to-avoid-them">Common Pitfalls and How to Avoid Them</h2>
<h3 id="heading-pitfall-1-more-vectors-better-results">Pitfall 1: "More Vectors = Better Results"</h3>
<p><strong>Problem</strong>: Assuming multi-vector always outperforms simpler approaches</p>
<p><strong>Solution</strong>: Start with semantic chunking and only add complexity if you have specific use cases that require it</p>
<h3 id="heading-pitfall-2-ignoring-document-structure">Pitfall 2: Ignoring Document Structure</h3>
<p><strong>Problem</strong>: Using fixed chunking on structured documents like research papers</p>
<p><strong>Solution</strong>: Analyze your document types and chunk according to their natural structure</p>
<h3 id="heading-pitfall-3-not-testing-with-real-queries">Pitfall 3: Not Testing with Real Queries</h3>
<p><strong>Problem</strong>: Optimizing for theoretical scenarios instead of actual user needs</p>
<p><strong>Solution</strong>: Collect real user queries and evaluate your chunking strategy against them</p>
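<p>Evaluation can start as simply as measuring recall@k: the fraction of collected user queries whose expected document shows up in the top-k results. A minimal sketch (the search function, queries, and document IDs here are toy stand-ins for your real system):</p>
<pre><code class="lang-python">def recall_at_k(search_fn, labeled_queries, k=5):
    """Fraction of queries whose expected document id appears in the
    top-k results returned by search_fn(query, k)."""
    hits = 0
    for query, expected_doc_id in labeled_queries.items():
        if expected_doc_id in search_fn(query, k):
            hits += 1
    return hits / len(labeled_queries)

# Toy stand-in for a vector search, just to show the evaluation loop
fake_index = {"reset my password": "kb-42", "billing question": "kb-7"}
def fake_search(query, k):
    return [fake_index.get(query, "kb-0")]

score = recall_at_k(
    fake_search,
    {"reset my password": "kb-42", "billing question": "kb-7", "cancel account": "kb-13"},
    k=1,
)
</code></pre>
<p>Running this before and after a chunking change gives a concrete number to compare, instead of judging search quality by a handful of hand-picked queries.</p>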
<hr />
<h3 id="heading-sample-code">Sample code</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> uuid
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List, Dict, Any, Optional
<span class="hljs-keyword">from</span> dataclasses <span class="hljs-keyword">import</span> dataclass
<span class="hljs-keyword">import</span> openai
<span class="hljs-keyword">from</span> qdrant_client <span class="hljs-keyword">import</span> QdrantClient
<span class="hljs-keyword">from</span> qdrant_client.models <span class="hljs-keyword">import</span> Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue, MatchAny
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

load_dotenv()

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


<span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DocumentChunk</span>:</span>
    <span class="hljs-string">"""Represents a chunk of a document with its metadata"""</span>
    chunk_id: str
    content: str
    chunk_type: str  <span class="hljs-comment"># 'summary', 'paragraph', 'title', 'conclusion', etc.</span>
    parent_doc_id: str
    chunk_index: int
    metadata: Dict[str, Any]


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OpenAIEmbeddingService</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, model: str = <span class="hljs-string">"text-embedding-3-small"</span></span>):</span>
        self.client = openai.OpenAI(api_key=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>))
        self.model = model
        self.embedding_dimension = <span class="hljs-number">1536</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"3-small"</span> <span class="hljs-keyword">in</span> model <span class="hljs-keyword">else</span> <span class="hljs-number">3072</span>
        logger.info(<span class="hljs-string">f"Initialized OpenAI embedding service with model: <span class="hljs-subst">{model}</span>"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_embeddings</span>(<span class="hljs-params">self, texts: List[str]</span>) -&gt; List[List[float]]:</span>
        <span class="hljs-keyword">try</span>:
            logger.info(<span class="hljs-string">f"Creating embeddings for <span class="hljs-subst">{len(texts)}</span> texts"</span>)
            response = self.client.embeddings.create(
                model=self.model,
                input=texts,
                encoding_format=<span class="hljs-string">"float"</span>
            )

            embeddings = [embedding.embedding <span class="hljs-keyword">for</span> embedding <span class="hljs-keyword">in</span> response.data]
            logger.info(<span class="hljs-string">f"Successfully created <span class="hljs-subst">{len(embeddings)}</span> embeddings"</span>)
            <span class="hljs-keyword">return</span> embeddings

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Failed to create embeddings: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_single_embedding</span>(<span class="hljs-params">self, text: str</span>) -&gt; List[float]:</span>
        <span class="hljs-string">"""Create embedding for a single text"""</span>
        <span class="hljs-keyword">return</span> self.create_embeddings([text])[<span class="hljs-number">0</span>]


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QdrantCollection</span>:</span>
    <span class="hljs-string">"""
    Multi-vector collection implementation using Qdrant.
    Each document is split into multiple chunks, with each chunk getting its own vector.
    """</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self,
                 collection_name: str,
                 qdrant_client: QdrantClient,
                 embedding_service: OpenAIEmbeddingService</span>):</span>
        self.collection_name = collection_name
        self.qdrant_client = qdrant_client
        self.embedding_service = embedding_service

        <span class="hljs-comment"># Create the collection if it doesn't exist</span>
        self._ensure_collection_exists()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_ensure_collection_exists</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-string">"""Create the Qdrant collection for multi-vector storage"""</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Check if collection exists</span>
            collections = self.qdrant_client.get_collections()
            existing_names = [col.name <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> collections.collections]

            <span class="hljs-keyword">if</span> self.collection_name <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> existing_names:
                logger.info(<span class="hljs-string">f"Creating new collection: <span class="hljs-subst">{self.collection_name}</span>"</span>)

                <span class="hljs-comment"># Create collection with vector configuration</span>
                self.qdrant_client.create_collection(
                    collection_name=self.collection_name,
                    vectors_config=VectorParams(
                        size=self.embedding_service.embedding_dimension,
                        distance=Distance.COSINE
                    )
                )
                logger.info(<span class="hljs-string">f"✅ Collection '<span class="hljs-subst">{self.collection_name}</span>' created successfully"</span>)
            <span class="hljs-keyword">else</span>:
                logger.info(<span class="hljs-string">f"✅ Collection '<span class="hljs-subst">{self.collection_name}</span>' already exists"</span>)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Failed to create/verify collection: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_create_document_chunks</span>(<span class="hljs-params">self, doc_id: str, content: str, metadata: Dict</span>) -&gt; List[DocumentChunk]:</span>
        <span class="hljs-string">"""
        Split document into multiple chunks of different types.
        This is where the 'multi-vector' concept comes into play.
        """</span>
        chunks = []
        lines = content.strip().split(<span class="hljs-string">'\n'</span>)
        paragraphs = [p.strip() <span class="hljs-keyword">for</span> p <span class="hljs-keyword">in</span> content.split(<span class="hljs-string">'\n\n'</span>) <span class="hljs-keyword">if</span> p.strip()]

        <span class="hljs-comment"># 1. Title chunk (first non-empty line if it looks like a title)</span>
        <span class="hljs-keyword">if</span> lines <span class="hljs-keyword">and</span> len(lines[<span class="hljs-number">0</span>].strip()) &lt; <span class="hljs-number">100</span>:
            title_chunk = DocumentChunk(
                chunk_id=<span class="hljs-string">f"<span class="hljs-subst">{doc_id}</span>_title"</span>,
                content=lines[<span class="hljs-number">0</span>].strip(),
                chunk_type=<span class="hljs-string">"title"</span>,
                parent_doc_id=doc_id,
                chunk_index=<span class="hljs-number">0</span>,
                metadata={**metadata, <span class="hljs-string">"is_title"</span>: <span class="hljs-literal">True</span>}
            )
            chunks.append(title_chunk)

        <span class="hljs-comment"># 2. Summary chunk (first paragraph as summary)</span>
        <span class="hljs-keyword">if</span> paragraphs:
            summary_content = paragraphs[<span class="hljs-number">0</span>]
            <span class="hljs-keyword">if</span> len(summary_content) &gt; <span class="hljs-number">50</span>:  <span class="hljs-comment"># Only if substantial</span>
                summary_chunk = DocumentChunk(
                    chunk_id=<span class="hljs-string">f"<span class="hljs-subst">{doc_id}</span>_summary"</span>,
                    content=<span class="hljs-string">f"Summary: <span class="hljs-subst">{summary_content}</span>"</span>,
                    chunk_type=<span class="hljs-string">"summary"</span>,
                    parent_doc_id=doc_id,
                    chunk_index=<span class="hljs-number">1</span>,
                    metadata={**metadata, <span class="hljs-string">"is_summary"</span>: <span class="hljs-literal">True</span>}
                )
                chunks.append(summary_chunk)

        <span class="hljs-comment"># 3. Individual paragraph chunks</span>
        <span class="hljs-keyword">for</span> i, paragraph <span class="hljs-keyword">in</span> enumerate(paragraphs):
            <span class="hljs-keyword">if</span> len(paragraph) &gt; <span class="hljs-number">100</span>:  <span class="hljs-comment"># Only substantial paragraphs</span>
                para_chunk = DocumentChunk(
                    chunk_id=<span class="hljs-string">f"<span class="hljs-subst">{doc_id}</span>_para_<span class="hljs-subst">{i}</span>"</span>,
                    content=paragraph,
                    chunk_type=<span class="hljs-string">"paragraph"</span>,
                    parent_doc_id=doc_id,
                    chunk_index=i + <span class="hljs-number">2</span>,  <span class="hljs-comment"># After title and summary</span>
                    metadata={**metadata, <span class="hljs-string">"paragraph_number"</span>: i}
                )
                chunks.append(para_chunk)

        <span class="hljs-comment"># 4. Conclusion chunk (last paragraph if it contains conclusion keywords)</span>
        <span class="hljs-keyword">if</span> len(paragraphs) &gt; <span class="hljs-number">1</span>:
            last_para = paragraphs[<span class="hljs-number">-1</span>].lower()
            conclusion_keywords = [<span class="hljs-string">'conclusion'</span>, <span class="hljs-string">'summary'</span>, <span class="hljs-string">'in conclusion'</span>, <span class="hljs-string">'to summarize'</span>, <span class="hljs-string">'finally'</span>]
            <span class="hljs-keyword">if</span> any(keyword <span class="hljs-keyword">in</span> last_para <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> conclusion_keywords):
                conclusion_chunk = DocumentChunk(
                    chunk_id=<span class="hljs-string">f"<span class="hljs-subst">{doc_id}</span>_conclusion"</span>,
                    content=paragraphs[<span class="hljs-number">-1</span>],
                    chunk_type=<span class="hljs-string">"conclusion"</span>,
                    parent_doc_id=doc_id,
                    chunk_index=len(chunks) + <span class="hljs-number">1</span>,
                    metadata={**metadata, <span class="hljs-string">"is_conclusion"</span>: <span class="hljs-literal">True</span>}
                )
                chunks.append(conclusion_chunk)

        logger.info(<span class="hljs-string">f"Created <span class="hljs-subst">{len(chunks)}</span> chunks for document '<span class="hljs-subst">{doc_id}</span>'"</span>)
        <span class="hljs-keyword">return</span> chunks

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_document</span>(<span class="hljs-params">self, doc_id: str, content: str, metadata: Dict = None</span>) -&gt; Dict[str, Any]:</span>
        <span class="hljs-string">"""
        Add a document to the multi-vector collection.
        This demonstrates the core workflow of multi-vector storage.
        """</span>
        metadata = metadata <span class="hljs-keyword">or</span> {}

        <span class="hljs-keyword">try</span>:
            logger.info(<span class="hljs-string">f"Adding document '<span class="hljs-subst">{doc_id}</span>' to collection"</span>)

            <span class="hljs-comment"># Step 1: Create multiple chunks from the document</span>
            chunks = self._create_document_chunks(doc_id, content, metadata)

            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> chunks:
                <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"No valid chunks created from document"</span>)

            <span class="hljs-comment"># Step 2: Generate embeddings for all chunks</span>
            chunk_contents = [chunk.content <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks]
            embeddings = self.embedding_service.create_embeddings(chunk_contents)

            <span class="hljs-comment"># Step 3: Create Qdrant points for each chunk</span>
            points = []
            <span class="hljs-keyword">for</span> chunk, embedding <span class="hljs-keyword">in</span> zip(chunks, embeddings):
                <span class="hljs-comment"># Prepare payload with chunk metadata</span>
                payload = {
                    <span class="hljs-string">"chunk_id"</span>: chunk.chunk_id,
                    <span class="hljs-string">"content"</span>: chunk.content,
                    <span class="hljs-string">"chunk_type"</span>: chunk.chunk_type,
                    <span class="hljs-string">"parent_doc_id"</span>: chunk.parent_doc_id,
                    <span class="hljs-string">"chunk_index"</span>: chunk.chunk_index,
                    **chunk.metadata  <span class="hljs-comment"># Include all custom metadata</span>
                }

                <span class="hljs-comment"># Create point</span>
                point = PointStruct(
                    id=str(uuid.uuid4()),  <span class="hljs-comment"># Unique point ID</span>
                    vector=embedding,
                    payload=payload
                )
                points.append(point)

            <span class="hljs-comment"># Step 4: Upload points to Qdrant</span>
            self.qdrant_client.upsert(
                collection_name=self.collection_name,
                points=points
            )

            result = {
                <span class="hljs-string">"success"</span>: <span class="hljs-literal">True</span>,
                <span class="hljs-string">"doc_id"</span>: doc_id,
                <span class="hljs-string">"chunks_created"</span>: len(chunks),
                <span class="hljs-string">"chunk_types"</span>: [chunk.chunk_type <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks],
                <span class="hljs-string">"points_uploaded"</span>: len(points)
            }

            logger.info(<span class="hljs-string">f"✅ Successfully added document '<span class="hljs-subst">{doc_id}</span>' with <span class="hljs-subst">{len(chunks)}</span> chunks"</span>)
            <span class="hljs-keyword">return</span> result

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Failed to add document '<span class="hljs-subst">{doc_id}</span>': <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">search</span>(<span class="hljs-params">self,
               query: str,
               limit: int = <span class="hljs-number">5</span>,
               chunk_types: Optional[List[str]] = None,
               doc_id_filter: Optional[str] = None</span>) -&gt; List[Dict[str, Any]]:</span>
        <span class="hljs-string">"""
        Search the multi-vector collection with optional filtering.
        This demonstrates the key advantage of multi-vector collections.
        """</span>
        <span class="hljs-keyword">try</span>:
            logger.info(<span class="hljs-string">f"Searching for: '<span class="hljs-subst">{query}</span>' with limit=<span class="hljs-subst">{limit}</span>"</span>)

            <span class="hljs-comment"># Step 1: Create query embedding</span>
            query_embedding = self.embedding_service.create_single_embedding(query)

            <span class="hljs-comment"># Step 2: Build filter conditions</span>
            filter_conditions = []

            <span class="hljs-keyword">if</span> chunk_types:
                <span class="hljs-comment"># Filter by chunk types</span>
                filter_conditions.append(
                    FieldCondition(
                        key=<span class="hljs-string">"chunk_type"</span>,
                        <span class="hljs-comment"># MatchValue matches a single value; MatchAny matches any of several</span>
                        match=MatchValue(value=chunk_types[<span class="hljs-number">0</span>]) <span class="hljs-keyword">if</span> len(chunk_types) == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> MatchAny(any=chunk_types)
                    )
                )
                logger.info(<span class="hljs-string">f"Filtering by chunk types: <span class="hljs-subst">{chunk_types}</span>"</span>)

            <span class="hljs-keyword">if</span> doc_id_filter:
                <span class="hljs-comment"># Filter by specific document</span>
                filter_conditions.append(
                    FieldCondition(
                        key=<span class="hljs-string">"parent_doc_id"</span>,
                        match=MatchValue(value=doc_id_filter)
                    )
                )
                logger.info(<span class="hljs-string">f"Filtering by document ID: <span class="hljs-subst">{doc_id_filter}</span>"</span>)

            <span class="hljs-comment"># Combine filters</span>
            search_filter = Filter(must=filter_conditions) <span class="hljs-keyword">if</span> filter_conditions <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>

            <span class="hljs-comment"># Step 3: Perform vector search</span>
            search_results = self.qdrant_client.search(
                collection_name=self.collection_name,
                query_vector=query_embedding,
                query_filter=search_filter,
                limit=limit,
                with_payload=<span class="hljs-literal">True</span>,
                with_vectors=<span class="hljs-literal">False</span>  <span class="hljs-comment"># Don't return vectors to save bandwidth</span>
            )

            <span class="hljs-comment"># Step 4: Format results</span>
            formatted_results = []
            <span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> search_results:
                formatted_result = {
                    <span class="hljs-string">"score"</span>: result.score,
                    <span class="hljs-string">"chunk_id"</span>: result.payload.get(<span class="hljs-string">"chunk_id"</span>),
                    <span class="hljs-string">"content"</span>: result.payload.get(<span class="hljs-string">"content"</span>),
                    <span class="hljs-string">"chunk_type"</span>: result.payload.get(<span class="hljs-string">"chunk_type"</span>),
                    <span class="hljs-string">"parent_doc_id"</span>: result.payload.get(<span class="hljs-string">"parent_doc_id"</span>),
                    <span class="hljs-string">"chunk_index"</span>: result.payload.get(<span class="hljs-string">"chunk_index"</span>),
                    <span class="hljs-string">"metadata"</span>: {k: v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> result.payload.items()
                                 <span class="hljs-keyword">if</span> k <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> [<span class="hljs-string">"chunk_id"</span>, <span class="hljs-string">"content"</span>, <span class="hljs-string">"chunk_type"</span>, <span class="hljs-string">"parent_doc_id"</span>, <span class="hljs-string">"chunk_index"</span>]}
                }
                formatted_results.append(formatted_result)

            logger.info(<span class="hljs-string">f"Found <span class="hljs-subst">{len(formatted_results)}</span> results"</span>)
            <span class="hljs-keyword">return</span> formatted_results

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Search failed: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_document_chunks</span>(<span class="hljs-params">self, doc_id: str</span>) -&gt; List[Dict[str, Any]]:</span>
        <span class="hljs-string">"""Retrieve all chunks for a specific document"""</span>
        <span class="hljs-keyword">try</span>:
            logger.info(<span class="hljs-string">f"Retrieving chunks for document: <span class="hljs-subst">{doc_id}</span>"</span>)

            <span class="hljs-comment"># Search with document filter</span>
            filter_condition = Filter(
                must=[
                    FieldCondition(
                        key=<span class="hljs-string">"parent_doc_id"</span>,
                        match=MatchValue(value=doc_id)
                    )
                ]
            )

            results = self.qdrant_client.scroll(
                collection_name=self.collection_name,
                scroll_filter=filter_condition,
                limit=<span class="hljs-number">100</span>,  <span class="hljs-comment"># Adjust based on expected chunks per document</span>
                with_payload=<span class="hljs-literal">True</span>,
                with_vectors=<span class="hljs-literal">False</span>
            )

            chunks = []
            <span class="hljs-keyword">for</span> point <span class="hljs-keyword">in</span> results[<span class="hljs-number">0</span>]:  <span class="hljs-comment"># results is a tuple (points, next_page_offset)</span>
                chunk_info = {
                    <span class="hljs-string">"chunk_id"</span>: point.payload.get(<span class="hljs-string">"chunk_id"</span>),
                    <span class="hljs-string">"content"</span>: point.payload.get(<span class="hljs-string">"content"</span>),
                    <span class="hljs-string">"chunk_type"</span>: point.payload.get(<span class="hljs-string">"chunk_type"</span>),
                    <span class="hljs-string">"chunk_index"</span>: point.payload.get(<span class="hljs-string">"chunk_index"</span>),
                    <span class="hljs-string">"metadata"</span>: {k: v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> point.payload.items()
                                 <span class="hljs-keyword">if</span> k <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> [<span class="hljs-string">"chunk_id"</span>, <span class="hljs-string">"content"</span>, <span class="hljs-string">"chunk_type"</span>, <span class="hljs-string">"parent_doc_id"</span>, <span class="hljs-string">"chunk_index"</span>]}
                }
                chunks.append(chunk_info)

            <span class="hljs-comment"># Sort by chunk index</span>
            chunks.sort(key=<span class="hljs-keyword">lambda</span> x: x.get(<span class="hljs-string">"chunk_index"</span>, <span class="hljs-number">0</span>))

            logger.info(<span class="hljs-string">f"Retrieved <span class="hljs-subst">{len(chunks)}</span> chunks for document '<span class="hljs-subst">{doc_id}</span>'"</span>)
            <span class="hljs-keyword">return</span> chunks

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Failed to retrieve chunks for document '<span class="hljs-subst">{doc_id}</span>': <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_collection_stats</span>(<span class="hljs-params">self</span>) -&gt; Dict[str, Any]:</span>
        <span class="hljs-string">"""Get statistics about the collection"""</span>
        <span class="hljs-keyword">try</span>:
            collection_info = self.qdrant_client.get_collection(self.collection_name)

            <span class="hljs-comment"># Get chunk type distribution</span>
            chunk_types_result = self.qdrant_client.scroll(
                collection_name=self.collection_name,
                limit=<span class="hljs-number">1000</span>,  <span class="hljs-comment"># Adjust based on your collection size</span>
                with_payload=<span class="hljs-literal">True</span>,
                with_vectors=<span class="hljs-literal">False</span>
            )

            chunk_type_counts = {}
            document_counts = {}

            <span class="hljs-keyword">for</span> point <span class="hljs-keyword">in</span> chunk_types_result[<span class="hljs-number">0</span>]:
                chunk_type = point.payload.get(<span class="hljs-string">"chunk_type"</span>, <span class="hljs-string">"unknown"</span>)
                doc_id = point.payload.get(<span class="hljs-string">"parent_doc_id"</span>, <span class="hljs-string">"unknown"</span>)

                chunk_type_counts[chunk_type] = chunk_type_counts.get(chunk_type, <span class="hljs-number">0</span>) + <span class="hljs-number">1</span>
                document_counts[doc_id] = document_counts.get(doc_id, <span class="hljs-number">0</span>) + <span class="hljs-number">1</span>

            stats = {
                <span class="hljs-string">"collection_name"</span>: self.collection_name,
                <span class="hljs-string">"total_points"</span>: collection_info.points_count,
                <span class="hljs-string">"vector_size"</span>: collection_info.config.params.vectors.size,
                <span class="hljs-string">"distance_metric"</span>: collection_info.config.params.vectors.distance.value,
                <span class="hljs-string">"total_documents"</span>: len(document_counts),
                <span class="hljs-string">"chunk_type_distribution"</span>: chunk_type_counts,
                <span class="hljs-string">"avg_chunks_per_document"</span>: round(collection_info.points_count / len(document_counts),
                                                 <span class="hljs-number">2</span>) <span class="hljs-keyword">if</span> document_counts <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>
            }

            <span class="hljs-keyword">return</span> stats

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Failed to get collection stats: <span class="hljs-subst">{str(e)}</span>"</span>)
            <span class="hljs-keyword">raise</span>


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    print(<span class="hljs-string">"=== Semantic search ===\n"</span>)

    <span class="hljs-keyword">try</span>:
        print(<span class="hljs-string">"1. Initializing services..."</span>)
        embedding_service = OpenAIEmbeddingService()

        <span class="hljs-keyword">from</span> qdrant_client <span class="hljs-keyword">import</span> QdrantClient
        qdrant_client = QdrantClient(url=<span class="hljs-string">"http://localhost:6333"</span>)

        collection = QdrantCollection(
            collection_name=<span class="hljs-string">"semantic_search_demo"</span>,
            qdrant_client=qdrant_client,
            embedding_service=embedding_service
        )
        print(<span class="hljs-string">"✅ Services initialized\n"</span>)

        print(<span class="hljs-string">"2. Adding documents to collection..."</span>)
        sample_docs = {
            <span class="hljs-string">"ai_overview"</span>: <span class="hljs-string">"""Artificial Intelligence: An Overview
            Artificial Intelligence (AI) represents one of the most transformative technologies of our time, fundamentally changing how we interact with machines and process information.
            AI encompasses machine learning, natural language processing, computer vision, and robotics. These technologies enable computers to perform tasks that typically require human intelligence.
            The applications are vast: from autonomous vehicles and medical diagnosis to financial trading and content recommendation systems.
            As AI continues to evolve, it presents both tremendous opportunities and significant challenges that society must carefully navigate."""</span>,

            <span class="hljs-string">"machine_learning"</span>: <span class="hljs-string">"""Machine Learning Fundamentals
            Machine learning is a subset of artificial intelligence that enables systems to automatically learn and improve from experience without being explicitly programmed.
            There are three primary types of machine learning: supervised learning uses labeled training data, unsupervised learning finds patterns in unlabeled data, and reinforcement learning learns through interaction with an environment.
            Common algorithms include linear regression, decision trees, neural networks, and support vector machines. Each has strengths for different types of problems.
            In conclusion, machine learning forms the backbone of modern AI applications and continues to drive innovation across industries."""</span>
        }

        <span class="hljs-keyword">for</span> doc_id, content <span class="hljs-keyword">in</span> sample_docs.items():
            result = collection.add_document(
                doc_id=doc_id,
                content=content,
                metadata={<span class="hljs-string">"topic"</span>: <span class="hljs-string">"artificial_intelligence"</span>, <span class="hljs-string">"language"</span>: <span class="hljs-string">"english"</span>}
            )
            print(<span class="hljs-string">f"   Added '<span class="hljs-subst">{doc_id}</span>': <span class="hljs-subst">{result[<span class="hljs-string">'chunks_created'</span>]}</span> chunks (<span class="hljs-subst">{<span class="hljs-string">', '</span>.join(result[<span class="hljs-string">'chunk_types'</span>])}</span>)"</span>)
        print()

        print(<span class="hljs-string">"3. Demonstrating search capabilities...\n"</span>)

        print(<span class="hljs-string">"🔍 General search for 'machine learning applications':"</span>)
        results = collection.search(<span class="hljs-string">"machine learning applications"</span>, limit=<span class="hljs-number">3</span>)
        <span class="hljs-keyword">for</span> i, result <span class="hljs-keyword">in</span> enumerate(results, <span class="hljs-number">1</span>):
            print(<span class="hljs-string">f"   <span class="hljs-subst">{i}</span>. [<span class="hljs-subst">{result[<span class="hljs-string">'chunk_type'</span>]}</span>] Score: <span class="hljs-subst">{result[<span class="hljs-string">'score'</span>]:<span class="hljs-number">.3</span>f}</span>"</span>)
            print(<span class="hljs-string">f"      From: <span class="hljs-subst">{result[<span class="hljs-string">'parent_doc_id'</span>]}</span>"</span>)
            print(<span class="hljs-string">f"      Content: <span class="hljs-subst">{result[<span class="hljs-string">'content'</span>][:<span class="hljs-number">80</span>]}</span>..."</span>)
        print()

        print(<span class="hljs-string">"🔍 Search only in summaries for 'artificial intelligence':"</span>)
        results = collection.search(<span class="hljs-string">"artificial intelligence"</span>, limit=<span class="hljs-number">2</span>, chunk_types=[<span class="hljs-string">'summary'</span>])
        <span class="hljs-keyword">for</span> i, result <span class="hljs-keyword">in</span> enumerate(results, <span class="hljs-number">1</span>):
            print(<span class="hljs-string">f"   <span class="hljs-subst">{i}</span>. [<span class="hljs-subst">{result[<span class="hljs-string">'chunk_type'</span>]}</span>] Score: <span class="hljs-subst">{result[<span class="hljs-string">'score'</span>]:<span class="hljs-number">.3</span>f}</span>"</span>)
            print(<span class="hljs-string">f"      Content: <span class="hljs-subst">{result[<span class="hljs-string">'content'</span>][:<span class="hljs-number">100</span>]}</span>..."</span>)
        print()

        print(<span class="hljs-string">"🔍 Search only in titles for 'machine learning':"</span>)
        results = collection.search(<span class="hljs-string">"machine learning"</span>, limit=<span class="hljs-number">2</span>, chunk_types=[<span class="hljs-string">'title'</span>])
        <span class="hljs-keyword">for</span> i, result <span class="hljs-keyword">in</span> enumerate(results, <span class="hljs-number">1</span>):
            print(<span class="hljs-string">f"   <span class="hljs-subst">{i}</span>. [<span class="hljs-subst">{result[<span class="hljs-string">'chunk_type'</span>]}</span>] Score: <span class="hljs-subst">{result[<span class="hljs-string">'score'</span>]:<span class="hljs-number">.3</span>f}</span>"</span>)
            print(<span class="hljs-string">f"      Content: <span class="hljs-subst">{result[<span class="hljs-string">'content'</span>]}</span>"</span>)
        print()

        print(<span class="hljs-string">"🔍 Search within specific document for 'algorithms':"</span>)
        results = collection.search(<span class="hljs-string">"algorithms"</span>, limit=<span class="hljs-number">3</span>, doc_id_filter=<span class="hljs-string">"machine_learning"</span>)
        <span class="hljs-keyword">for</span> i, result <span class="hljs-keyword">in</span> enumerate(results, <span class="hljs-number">1</span>):
            print(<span class="hljs-string">f"   <span class="hljs-subst">{i}</span>. [<span class="hljs-subst">{result[<span class="hljs-string">'chunk_type'</span>]}</span>] Score: <span class="hljs-subst">{result[<span class="hljs-string">'score'</span>]:<span class="hljs-number">.3</span>f}</span>"</span>)
            print(<span class="hljs-string">f"      Content: <span class="hljs-subst">{result[<span class="hljs-string">'content'</span>][:<span class="hljs-number">80</span>]}</span>..."</span>)
        print()

        print(<span class="hljs-string">"4. Document structure analysis...\n"</span>)
        <span class="hljs-keyword">for</span> doc_id <span class="hljs-keyword">in</span> sample_docs.keys():
            print(<span class="hljs-string">f"📄 Document: <span class="hljs-subst">{doc_id}</span>"</span>)
            chunks = collection.get_document_chunks(doc_id)
            <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks:
                print(<span class="hljs-string">f"   <span class="hljs-subst">{chunk[<span class="hljs-string">'chunk_index'</span>]}</span>. [<span class="hljs-subst">{chunk[<span class="hljs-string">'chunk_type'</span>]}</span>] <span class="hljs-subst">{chunk[<span class="hljs-string">'content'</span>][:<span class="hljs-number">60</span>]}</span>..."</span>)
            print()

        print(<span class="hljs-string">"5. Collection statistics..."</span>)
        stats = collection.get_collection_stats()
        print(<span class="hljs-string">f"   Collection: <span class="hljs-subst">{stats[<span class="hljs-string">'collection_name'</span>]}</span>"</span>)
        print(<span class="hljs-string">f"   Total points: <span class="hljs-subst">{stats[<span class="hljs-string">'total_points'</span>]}</span>"</span>)
        print(<span class="hljs-string">f"   Total documents: <span class="hljs-subst">{stats[<span class="hljs-string">'total_documents'</span>]}</span>"</span>)
        print(<span class="hljs-string">f"   Avg chunks per document: <span class="hljs-subst">{stats[<span class="hljs-string">'avg_chunks_per_document'</span>]}</span>"</span>)
        print(<span class="hljs-string">f"   Chunk type distribution: <span class="hljs-subst">{stats[<span class="hljs-string">'chunk_type_distribution'</span>]}</span>"</span>)
        print()

        print(<span class="hljs-string">"✅ Semantic search demonstration completed successfully!"</span>)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        logger.error(<span class="hljs-string">f"Demo failed: <span class="hljs-subst">{str(e)}</span>"</span>)
        print(<span class="hljs-string">f"❌ Demo failed: <span class="hljs-subst">{str(e)}</span>"</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
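<p>One caveat worth flagging: <code>scroll</code> returns a single page (the fixed <code>limit=100</code> and <code>limit=1000</code> calls above), so large collections would be truncated. A minimal pagination sketch that follows <code>next_page_offset</code> until exhaustion (same <code>scroll</code> signature as above; the helper name is my own):</p>

```python
# Sketch: drain every page from Qdrant's scroll API instead of relying on a
# single fixed-limit call. Assumes the qdrant-client scroll() signature used
# above, which returns a (points, next_page_offset) tuple.
def scroll_all(client, collection_name, scroll_filter=None, page_size=100):
    all_points = []
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            scroll_filter=scroll_filter,
            limit=page_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,
        )
        all_points.extend(points)
        if offset is None:  # no more pages left
            break
    return all_points
```

<p>Swapping this in for the bare <code>scroll</code> calls in <code>get_document_chunks</code> and <code>get_collection_stats</code> would make the chunk counts and statistics exact regardless of collection size.</p>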
]]></content:encoded></item><item><title><![CDATA[Kubernetes/Helm-charts commonly used commands]]></title><description><![CDATA[Uninstall Helm Release
helm uninstall <release> --namespace <namespace>

#example
helm uninstall qdrant --namespace qdrant

Delete PVCs (Persistent Volume Claims)
kubectl delete pvc --all -n <namespace>

#example
kubectl delete pvc --all -n qdrant

D...]]></description><link>https://blogs.ummerfarooq.dev/kuberneteshelm-charts-commonly-used-commands</link><guid isPermaLink="true">https://blogs.ummerfarooq.dev/kuberneteshelm-charts-commonly-used-commands</guid><category><![CDATA[helm chart]]></category><category><![CDATA[kubectl]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Helm]]></category><dc:creator><![CDATA[Ummer Farooq]]></dc:creator><pubDate>Tue, 05 Aug 2025 10:30:30 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-uninstall-helm-release">Uninstall Helm Release</h3>
<pre><code class="lang-bash">helm uninstall &lt;release&gt; --namespace &lt;namespace&gt;

<span class="hljs-comment">#example</span>
helm uninstall qdrant --namespace qdrant
</code></pre>
<h3 id="heading-delete-pvcs-persistent-volume-claims">Delete PVCs (Persistent Volume Claims)</h3>
<pre><code class="lang-bash">kubectl delete pvc --all -n &lt;namespace&gt;

<span class="hljs-comment">#example</span>
kubectl delete pvc --all -n qdrant
</code></pre>
<h3 id="heading-delete-namespace-optional-full-cleanup">Delete Namespace (Optional – full cleanup)</h3>
<pre><code class="lang-bash">kubectl delete namespace &lt;namespace&gt;

<span class="hljs-comment">#example</span>
kubectl delete namespace qdrant
</code></pre>
<h3 id="heading-check-pod-status">Check Pod Status</h3>
<pre><code class="lang-bash">kubectl get pods -n qdrant
</code></pre>
<h3 id="heading-check-events">Check Events</h3>
<pre><code class="lang-bash">kubectl get events -n qdrant --sort-by=.metadata.creationTimestamp
</code></pre>
<h3 id="heading-checking-nodeports">Checking NodePorts</h3>
<pre><code class="lang-bash">kubectl get svc qdrant -n qdrant-cluster
</code></pre>
<h3 id="heading-check-pod-endpoints">Check Pod Endpoints</h3>
<pre><code class="lang-bash">kubectl get endpoints -n qdrant-cluster qdrant
</code></pre>
]]></content:encoded></item><item><title><![CDATA[WSL Networking and Port Forwarding]]></title><description><![CDATA[Considering vLLM is running inside WSL, and we are trying to expose it to the Windows network.
Step 1: Confirm WSL Server Is Listening on 0.0.0.0
In your docker-compose or command, make sure vLLM is not bound to 127.0.0.1. It should be bound to all i...]]></description><link>https://blogs.ummerfarooq.dev/wsl-networking-and-port-forwarding</link><guid isPermaLink="true">https://blogs.ummerfarooq.dev/wsl-networking-and-port-forwarding</guid><category><![CDATA[WSL]]></category><dc:creator><![CDATA[Ummer Farooq]]></dc:creator><pubDate>Mon, 04 Aug 2025 08:18:46 GMT</pubDate><content:encoded><![CDATA[<p><strong>Considering vLLM is running inside WSL, and we are trying to expose it to the Windows network.</strong></p>
<h3 id="heading-step-1-confirm-wsl-server-is-listening-on-0000"><strong>Step 1: Confirm WSL Server Is Listening on 0.0.0.0</strong></h3>
<p>In your docker-compose or command, make sure vLLM is not bound to 127.0.0.1. It should be bound to all interfaces (0.0.0.0):</p>
<p><code>python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000</code></p>
<p>Also, confirm with:</p>
<p><code>netstat -tuln | grep 8000</code></p>
<p>Should show:</p>
<p><code>tcp        0      0 0.0.0.0:8000      0.0.0.0:*       LISTEN</code></p>
<h3 id="heading-step-2-identify-the-wsl-ip"><strong>Step 2: Identify the WSL IP</strong></h3>
<p>Inside WSL, run:</p>
<p><code>ip addr | grep inet</code></p>
<p>You’ll see something like inet 172.22.x.x — that's the internal IP of your WSL instance. <strong>But this is not accessible from the outside.</strong></p>
<h3 id="heading-step-3-set-up-port-forwarding-from-windows-host-to-wsl2"><strong>Step 3: Set up Port Forwarding (From Windows Host to WSL2)</strong></h3>
<p>Since WSL2 is on a different virtual network, you need to <strong>forward ports</strong> from the Windows host to WSL.</p>
<p>Use <strong>PowerShell as Administrator</strong> on Windows and run:</p>
<p><code>netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=8000 connectaddress=WSL-IP connectport=8000</code></p>
<p>Replace WSL-IP with the IP from Step 2. You can also use <code>localhost</code> if you're sure the service is bound to 0.0.0.0 in WSL.</p>

<p>Example:</p>
<p><code>netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=8000 connectaddress=172.22.64.1 connectport=8000</code></p>
<p>Then enable the firewall rule:</p>
<p><code>netsh advfirewall firewall add rule name="vLLM Port 8000" dir=in action=allow protocol=TCP localport=8000</code></p>
<h3 id="heading-step-4-use-your-windows-machines-private-ip"><strong>Step 4: Use Your Windows Machine’s Private IP</strong></h3>
<p>From another machine on your LAN, access your vLLM server using:</p>
<p><code>http://&lt;your-windows-private-ip&gt;:8000</code></p>
<p>You can find this by running on Windows CMD:</p>
<p><code>ipconfig</code></p>
<p>Look for the IPv4 Address under your active network adapter (e.g., 192.168.x.x).</p>
<h3 id="heading-step-5-verify-from-another-device"><strong>Step 5: Verify from Another Device</strong></h3>
<p>Try this in your browser or with curl:</p>
<p><code>curl http://&lt;windows-ip&gt;:8000/v1/models</code></p>
<p>You should get a response from the vLLM server.</p>
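<p>If curl isn’t available on the other device, a small Python check works just as well (a sketch: the host and port below are placeholders for your Windows private IP and forwarded port):</p>

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder IP): port_open("192.168.1.50", 8000)
```

<p>A <code>False</code> here usually means either the portproxy rule or the firewall rule from Step 3 is missing.</p>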
<h2 id="heading-optional-clean-up-forwarding-rules"><strong>(Optional) Clean Up Forwarding Rules</strong></h2>
<p>To remove the proxy:</p>
<p><code>netsh interface portproxy delete v4tov4 listenport=8000 listenaddress=0.0.0.0</code></p>
<p><strong>⚠️ Notes</strong></p>
<ul>
<li><p>If you're using Docker inside WSL2, ensure Docker is binding to 0.0.0.0, or use Docker’s ports section in docker-compose.</p>
</li>
<li><p>Windows Defender Firewall can block incoming traffic. Ensure the port is allowed.</p>
</li>
<li><p>WSL2 still doesn’t support host bridging, so this proxying is a stable workaround.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[QDrant Multi-node Cluster Deployment on AWS EC2 with Helm Charts]]></title><description><![CDATA[Prerequisites

AWS Account with appropriate permissions

Basic knowledge of Kubernetes and Helm

SSH key pair for EC2 access


Phase 1: AWS Infrastructure Setup
Step 1: Create VPC and Networking

Create VPC

Go to AWS Console → VPC → Create VPC

Name...]]></description><link>https://blogs.ummerfarooq.dev/qdrant-multi-node-cluster-deployment-on-aws-ec2-with-helm-charts</link><guid isPermaLink="true">https://blogs.ummerfarooq.dev/qdrant-multi-node-cluster-deployment-on-aws-ec2-with-helm-charts</guid><category><![CDATA[AWS]]></category><category><![CDATA[qdrant]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[vector database]]></category><category><![CDATA[Multi-Node Cluster]]></category><dc:creator><![CDATA[Ummer Farooq]]></dc:creator><pubDate>Mon, 04 Aug 2025 07:06:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754291134555/c2d5a770-86b2-4e99-b526-0465d6462b86.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>AWS Account with appropriate permissions</p>
</li>
<li><p>Basic knowledge of Kubernetes and Helm</p>
</li>
<li><p>SSH key pair for EC2 access</p>
</li>
</ul>
<h2 id="heading-phase-1-aws-infrastructure-setup">Phase 1: AWS Infrastructure Setup</h2>
<h3 id="heading-step-1-create-vpc-and-networking">Step 1: Create VPC and Networking</h3>
<ol>
<li><p><strong>Create VPC</strong></p>
<ul>
<li><p>Go to AWS Console → VPC → Create VPC</p>
</li>
<li><p>Name: <code>qdrant-vpc</code></p>
</li>
<li><p>IPv4 CIDR: <code>10.0.0.0/16</code></p>
</li>
<li><p>Enable DNS hostnames and DNS resolution</p>
</li>
</ul>
</li>
<li><p><strong>Create Subnets</strong></p>
<ul>
<li><p>Create 3 private subnets in different AZs:</p>
<ul>
<li><p><code>qdrant-subnet-1a</code>: <code>10.0.1.0/24</code> (ap-south-1a)</p>
</li>
<li><p><code>qdrant-subnet-1b</code>: <code>10.0.2.0/24</code> (ap-south-1b)</p>
</li>
<li><p><code>qdrant-subnet-1c</code>: <code>10.0.3.0/24</code> (ap-south-1c)</p>
</li>
</ul>
</li>
<li><p>Create 1 public subnet for NAT Gateway:</p>
<ul>
<li><code>qdrant-public-subnet</code>: <code>10.0.100.0/24</code> (ap-south-1a)</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Create Internet Gateway</strong></p>
<ul>
<li><p>Name: <code>qdrant-igw</code></p>
</li>
<li><p>Attach to <code>qdrant-vpc</code></p>
</li>
</ul>
</li>
<li><p><strong>Create NAT Gateway</strong></p>
<ul>
<li><p>Place in <code>qdrant-public-subnet</code></p>
</li>
<li><p>Allocate Elastic IP</p>
</li>
</ul>
</li>
<li><p><strong>Configure Route Tables</strong></p>
<ul>
<li><p>Public Route Table:</p>
<ul>
<li>Route: <code>0.0.0.0/0</code> → Internet Gateway</li>
</ul>
</li>
<li><p>Private Route Table:</p>
<ul>
<li><p>Route: <code>0.0.0.0/0</code> → NAT Gateway</p>
</li>
<li><p>Associate with all private subnets</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
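<p>Before creating the subnets, it’s worth a quick offline sanity check that all four CIDRs sit inside the VPC range and don’t overlap each other; Python’s standard <code>ipaddress</code> module is enough:</p>

```python
import ipaddress

# The VPC and subnet CIDRs from Step 1
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = [ipaddress.ip_network(c) for c in
           ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24", "10.0.100.0/24"]]

# Every subnet must fall inside the VPC range...
assert all(s.subnet_of(vpc) for s in subnets)
# ...and no two subnets may overlap
assert not any(a.overlaps(b) for i, a in enumerate(subnets)
               for b in subnets[i + 1:])
print("CIDR layout OK")
```

<p>The same check catches typos early if you later change the CIDR plan, before AWS rejects the subnet creation.</p>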
<h3 id="heading-step-2-security-groups">Step 2: Security Groups</h3>
<ol>
<li><p><strong>Create Security Group:</strong> <code>qdrant-cluster-sg</code></p>
<ul>
<li><p>VPC: <code>qdrant-vpc</code></p>
</li>
<li><p>Inbound Rules:</p>
<ul>
<li><p>SSH: Port 22 (Source: Your IP)</p>
</li>
<li><p>Kubernetes API: Port 6443 (Source: Security Group itself)</p>
</li>
<li><p>QDrant HTTP: Port 6333 (Source: Security Group itself)</p>
</li>
<li><p>QDrant gRPC: Port 6334 (Source: Security Group itself)</p>
</li>
<li><p>Etcd: Ports 2379-2380 (Source: Security Group itself)</p>
</li>
<li><p>Kubelet: Port 10250 (Source: Security Group itself)</p>
</li>
<li><p>NodePort Range: Ports 30000-32767 (Source: Security Group itself)</p>
</li>
<li><p>All Traffic: All ports (Source: Security Group itself)</p>
</li>
</ul>
</li>
<li><p>Outbound Rules: All traffic to 0.0.0.0/0</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-step-3-iam-roles-and-policies">Step 3: IAM Roles and Policies</h3>
<ol>
<li><p><strong>Create IAM Role:</strong> <code>qdrant-node-role</code></p>
<ul>
<li><p>Trusted entity: EC2</p>
</li>
<li><p>Attach policies:</p>
<ul>
<li><p><code>AmazonEC2FullAccess</code></p>
</li>
<li><p><code>AmazonEBSCSIDriverPolicy</code></p>
</li>
</ul>
</li>
<li><p>Create custom policy <code>QDrantEBSPolicy</code>:</p>
</li>
</ul>
</li>
</ol>
<pre><code class="lang-json">{
    <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
    <span class="hljs-attr">"Statement"</span>: [
        {
            <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
            <span class="hljs-attr">"Action"</span>: [
                <span class="hljs-string">"ec2:AttachVolume"</span>,
                <span class="hljs-string">"ec2:DetachVolume"</span>,
                <span class="hljs-string">"ec2:DescribeVolumes"</span>,
                <span class="hljs-string">"ec2:DescribeInstances"</span>,
                <span class="hljs-string">"ec2:CreateVolume"</span>,
                <span class="hljs-string">"ec2:DeleteVolume"</span>,
                <span class="hljs-string">"ec2:CreateSnapshot"</span>,
                <span class="hljs-string">"ec2:DeleteSnapshot"</span>,
                <span class="hljs-string">"ec2:DescribeSnapshots"</span>,
                <span class="hljs-string">"ec2:CreateTags"</span>
            ],
            <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"*"</span>
        }
    ]
}
</code></pre>
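<p>A malformed JSON body is the most common reason the console rejects a custom policy, so it can save a round trip to lint it locally first (a sketch with a trimmed copy of the statement above):</p>

```python
import json

# Lint the custom policy locally before pasting it into the console.
# The Action list here is trimmed for brevity; use the full list from above.
policy_text = """
{
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ec2:AttachVolume", "ec2:DetachVolume", "ec2:DescribeVolumes"],
         "Resource": "*"}
    ]
}
"""
policy = json.loads(policy_text)  # raises ValueError on malformed JSON
assert policy["Version"] == "2012-10-17"
print(f"{len(policy['Statement'][0]['Action'])} actions allowed")
```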
<ol start="2">
<li><p><strong>Create Instance Profile</strong></p>
<ul>
<li><p>Name: <code>qdrant-instance-profile</code></p>
</li>
<li><p>Add role: <code>qdrant-node-role</code></p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-phase-2-ec2-instances-setup">Phase 2: EC2 Instances Setup</h2>
<h3 id="heading-step-4-launch-ec2-instances">Step 4: Launch EC2 Instances</h3>
<p>Launch 3 EC2 instances with the following specifications:</p>
<p><strong>Instance Configuration:</strong></p>
<ul>
<li><p>AMI: Ubuntu 22.04 LTS</p>
</li>
<li><p>Instance Type: <code>t3.medium</code> (minimum) or <code>t3.large</code> (recommended)</p>
</li>
<li><p>Key Pair: Your SSH key</p>
</li>
<li><p>VPC: <code>qdrant-vpc</code></p>
</li>
<li><p>Subnets: Place each instance in different subnets</p>
</li>
<li><p>Security Group: <code>qdrant-cluster-sg</code></p>
</li>
<li><p>IAM Role: <code>qdrant-instance-profile</code></p>
</li>
<li><p>Storage: 20GB gp3 root volume + 50GB gp3 data volume for each instance</p>
</li>
</ul>
<p><strong>Instance Names:</strong></p>
<ul>
<li><p><code>qdrant-master-1</code> (in qdrant-subnet-1a)</p>
</li>
<li><p><code>qdrant-worker-1</code> (in qdrant-subnet-1b)</p>
</li>
<li><p><code>qdrant-worker-2</code> (in qdrant-subnet-1c)</p>
</li>
</ul>
<h3 id="heading-step-5-create-additional-ebs-volumes">Step 5: Create Additional EBS Volumes</h3>
<p>For each instance, create additional EBS volumes for persistent storage:</p>
<ol>
<li><p>Go to EC2 → Volumes → Create Volume</p>
</li>
<li><p>Create 3 volumes (one per instance):</p>
<ul>
<li><p>Volume Type: gp3</p>
</li>
<li><p>Size: 50GB each</p>
</li>
<li><p>Availability Zone: Match instance AZ</p>
</li>
<li><p>Tags: Name = <code>qdrant-data-volume-{1,2,3}</code></p>
</li>
</ul>
</li>
<li><p>Attach each volume to corresponding instance</p>
</li>
</ol>
<h2 id="heading-phase-3-kubernetes-cluster-setup">Phase 3: Kubernetes Cluster Setup</h2>
<h3 id="heading-step-6-install-prerequisites-on-all-nodes">Step 6: Install Prerequisites on All Nodes</h3>
<p>SSH into each instance and run:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment"># Update system</span>
sudo apt update &amp;&amp; sudo apt upgrade -y

<span class="hljs-comment"># Install Docker</span>
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
<span class="hljs-built_in">echo</span> <span class="hljs-string">"deb [arch=<span class="hljs-subst">$(dpkg --print-architecture)</span> signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu <span class="hljs-subst">$(lsb_release -cs)</span> stable"</span> | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io

<span class="hljs-comment"># Configure Docker</span>
sudo usermod -aG docker <span class="hljs-variable">$USER</span>
sudo systemctl <span class="hljs-built_in">enable</span> docker
sudo systemctl start docker

<span class="hljs-comment"># Install kubeadm, kubelet, kubectl</span>
<span class="hljs-comment"># Note: the legacy apt.kubernetes.io repository is deprecated; use pkgs.k8s.io</span>
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
<span class="hljs-built_in">echo</span> <span class="hljs-string">"deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /"</span> | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

<span class="hljs-comment"># Configure containerd</span>
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i <span class="hljs-string">'s/SystemdCgroup = false/SystemdCgroup = true/'</span> /etc/containerd/config.toml
sudo systemctl restart containerd

<span class="hljs-comment"># Disable swap</span>
sudo swapoff -a
sudo sed -i <span class="hljs-string">'/ swap / s/^\(.*\)$/#\1/g'</span> /etc/fstab

<span class="hljs-comment"># Load kernel modules</span>
sudo modprobe br_netfilter
<span class="hljs-built_in">echo</span> <span class="hljs-string">'br_netfilter'</span> | sudo tee /etc/modules-load.d/k8s.conf

<span class="hljs-comment"># Configure sysctl</span>
sudo tee /etc/sysctl.d/k8s.conf &lt;&lt;EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
</code></pre>
<h3 id="heading-step-7-initialize-master-node">Step 7: Initialize Master Node</h3>
<p>On the master node (<code>qdrant-master-1</code>):</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Initialize cluster</span>
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=&lt;MASTER_PRIVATE_IP&gt;

<span class="hljs-comment"># Configure kubectl</span>
mkdir -p <span class="hljs-variable">$HOME</span>/.kube
sudo cp -i /etc/kubernetes/admin.conf <span class="hljs-variable">$HOME</span>/.kube/config
sudo chown $(id -u):$(id -g) <span class="hljs-variable">$HOME</span>/.kube/config

<span class="hljs-comment"># Install Calico CNI</span>
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

<span class="hljs-comment"># Generate join command (save this output)</span>
kubeadm token create --print-join-command
</code></pre>
<h3 id="heading-step-8-join-worker-nodes">Step 8: Join Worker Nodes</h3>
<p>On both worker nodes, run the join command generated in the previous step:</p>
<pre><code class="lang-bash">sudo kubeadm join &lt;MASTER_IP&gt;:6443 --token &lt;TOKEN&gt; --discovery-token-ca-cert-hash &lt;HASH&gt;
</code></pre>
<h3 id="heading-step-9-verify-cluster">Step 9: Verify Cluster</h3>
<p>On master node:</p>
<pre><code class="lang-bash">kubectl get nodes
kubectl get pods -A
</code></pre>
<h2 id="heading-phase-4-storage-setup">Phase 4: Storage Setup</h2>
<h3 id="heading-step-10-install-ebs-csi-driver">Step 10: Install EBS CSI Driver</h3>
<p>The driver authenticates to AWS through the instance IAM role, so the worker nodes' role must allow EBS volume operations (see Troubleshooting if volumes fail to provision).</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install EBS CSI Driver</span>
kubectl apply -k <span class="hljs-string">"github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.23"</span>

<span class="hljs-comment"># Verify installation</span>
kubectl get pods -n kube-system | grep ebs-csi
</code></pre>
<h3 id="heading-step-11-create-storage-class">Step 11: Create Storage Class</h3>
<p>Create <code>ebs-storageclass.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">storage.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">StorageClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ebs-gp3</span>
<span class="hljs-attr">provisioner:</span> <span class="hljs-string">ebs.csi.aws.com</span>
<span class="hljs-attr">parameters:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">gp3</span>
  <span class="hljs-attr">iops:</span> <span class="hljs-string">"3000"</span>
  <span class="hljs-attr">throughput:</span> <span class="hljs-string">"125"</span>
  <span class="hljs-attr">encrypted:</span> <span class="hljs-string">"true"</span>
<span class="hljs-attr">volumeBindingMode:</span> <span class="hljs-string">WaitForFirstConsumer</span>
<span class="hljs-attr">allowVolumeExpansion:</span> <span class="hljs-literal">true</span>
<span class="hljs-attr">reclaimPolicy:</span> <span class="hljs-string">Retain</span>
</code></pre>
<p>Apply the storage class:</p>
<pre><code class="lang-bash">kubectl apply -f ebs-storageclass.yaml
</code></pre>
<h2 id="heading-phase-5-helm-and-qdrant-deployment">Phase 5: Helm and Qdrant Deployment</h2>
<h3 id="heading-step-12-install-helm">Step 12: Install Helm</h3>
<p>On master node:</p>
<pre><code class="lang-bash">curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg &gt; /dev/null
<span class="hljs-built_in">echo</span> <span class="hljs-string">"deb [arch=<span class="hljs-subst">$(dpkg --print-architecture)</span> signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main"</span> | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update
sudo apt install helm
</code></pre>
<h3 id="heading-step-13-add-qdrant-helm-repository">Step 13: Add the Qdrant Helm Repository</h3>
<pre><code class="lang-bash">helm repo add qdrant https://qdrant.github.io/qdrant-helm
helm repo update
</code></pre>
<h3 id="heading-step-14-create-qdrant-values-file">Step 14: Create the Qdrant Values File</h3>
<p>Create <code>qdrant-values.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># QDrant Cluster Configuration</span>
<span class="hljs-attr">replicaCount:</span> <span class="hljs-number">3</span>

<span class="hljs-attr">image:</span>
  <span class="hljs-attr">repository:</span> <span class="hljs-string">qdrant/qdrant</span>
  <span class="hljs-attr">tag:</span> <span class="hljs-string">"v1.7.4"</span>
  <span class="hljs-attr">pullPolicy:</span> <span class="hljs-string">IfNotPresent</span>

<span class="hljs-comment"># Service configuration</span>
<span class="hljs-attr">service:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">NodePort</span>
  <span class="hljs-attr">httpPort:</span> <span class="hljs-number">6333</span>
  <span class="hljs-attr">grpcPort:</span> <span class="hljs-number">6334</span>
  <span class="hljs-attr">httpNodePort:</span> <span class="hljs-number">30333</span>
  <span class="hljs-attr">grpcNodePort:</span> <span class="hljs-number">30334</span>

<span class="hljs-comment"># Persistent storage</span>
<span class="hljs-attr">persistence:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">storageClass:</span> <span class="hljs-string">"ebs-gp3"</span>
  <span class="hljs-attr">size:</span> <span class="hljs-string">50Gi</span>
  <span class="hljs-attr">accessMode:</span> <span class="hljs-string">ReadWriteOnce</span>

<span class="hljs-comment"># Resource limits</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">1000m</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">2Gi</span>
  <span class="hljs-attr">requests:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">500m</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">1Gi</span>

<span class="hljs-comment"># Pod disruption budget</span>
<span class="hljs-attr">podDisruptionBudget:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">minAvailable:</span> <span class="hljs-number">2</span>

<span class="hljs-comment"># Anti-affinity to spread pods across nodes</span>
<span class="hljs-attr">affinity:</span>
  <span class="hljs-attr">podAntiAffinity:</span>
    <span class="hljs-attr">preferredDuringSchedulingIgnoredDuringExecution:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">weight:</span> <span class="hljs-number">100</span>
      <span class="hljs-attr">podAffinityTerm:</span>
        <span class="hljs-attr">labelSelector:</span>
          <span class="hljs-attr">matchExpressions:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">app.kubernetes.io/name</span>
            <span class="hljs-attr">operator:</span> <span class="hljs-string">In</span>
            <span class="hljs-attr">values:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">qdrant</span>
        <span class="hljs-attr">topologyKey:</span> <span class="hljs-string">kubernetes.io/hostname</span>

<span class="hljs-comment"># QDrant specific configuration</span>
<span class="hljs-attr">config:</span>
  <span class="hljs-attr">cluster:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">p2p:</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">6335</span>
  <span class="hljs-attr">service:</span>
    <span class="hljs-comment"># Replace these placeholders with strong secrets; in production, prefer injecting keys from a Kubernetes Secret rather than committing them to the values file</span>
    <span class="hljs-attr">api_key:</span> <span class="hljs-string">"your_secret_master_api_key_here"</span>
    <span class="hljs-attr">read_only_api_key:</span> <span class="hljs-string">"your_secret_read_only_api_key_here"</span>
    <span class="hljs-attr">http_port:</span> <span class="hljs-number">6333</span>
    <span class="hljs-attr">grpc_port:</span> <span class="hljs-number">6334</span>
  <span class="hljs-attr">storage:</span>
    <span class="hljs-attr">storage_path:</span> <span class="hljs-string">"/qdrant/storage"</span>
    <span class="hljs-attr">snapshots_path:</span> <span class="hljs-string">"/qdrant/snapshots"</span>
    <span class="hljs-attr">on_disk_payload:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">log_level:</span> <span class="hljs-string">"INFO"</span>

<span class="hljs-comment"># Environment variables for clustering</span>
<span class="hljs-attr">env:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">QDRANT__CLUSTER__ENABLED</span>
    <span class="hljs-attr">value:</span> <span class="hljs-string">"true"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">QDRANT__CLUSTER__P2P__PORT</span>
    <span class="hljs-attr">value:</span> <span class="hljs-string">"6335"</span>

<span class="hljs-comment"># Security context</span>
<span class="hljs-attr">securityContext:</span>
  <span class="hljs-attr">runAsNonRoot:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>

<span class="hljs-comment"># Node selector to ensure pods are scheduled on our nodes</span>
<span class="hljs-attr">nodeSelector:</span> {}

<span class="hljs-comment"># Tolerations</span>
<span class="hljs-attr">tolerations:</span> []
</code></pre>
<h3 id="heading-step-15-deploy-qdrant-cluster">Step 15: Deploy the Qdrant Cluster</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Create namespace</span>
kubectl create namespace qdrant

<span class="hljs-comment"># Deploy QDrant</span>
helm install qdrant qdrant/qdrant \
  --namespace qdrant \
  --values qdrant-values.yaml \
  --<span class="hljs-built_in">wait</span>

<span class="hljs-comment"># Verify deployment</span>
kubectl get pods -n qdrant
kubectl get pvc -n qdrant
kubectl get svc -n qdrant
</code></pre>
<h3 id="heading-step-16-create-load-balancer-service-optional">Step 16: Create Load Balancer Service (Optional)</h3>
<p>For external access, create <code>qdrant-lb.yaml</code>. Note that on a self-managed kubeadm cluster, a <code>LoadBalancer</code> service only receives an external address if the AWS cloud controller manager (or AWS Load Balancer Controller) is installed; otherwise it stays in <code>Pending</code> and you should use the NodePort service from the Helm values instead.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">qdrant-loadbalancer</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">qdrant</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app.kubernetes.io/name:</span> <span class="hljs-string">qdrant</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">6333</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">6333</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">grpc</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">6334</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">6334</span>
</code></pre>
<p>Apply the load balancer:</p>
<pre><code class="lang-bash">kubectl apply -f qdrant-lb.yaml
</code></pre>
<h2 id="heading-phase-6-verification-and-testing">Phase 6: Verification and Testing</h2>
<h3 id="heading-step-17-verify-cluster-status">Step 17: Verify Cluster Status</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Check pods</span>
kubectl get pods -n qdrant -o wide

<span class="hljs-comment"># Check persistent volumes</span>
kubectl get pv
kubectl get pvc -n qdrant

<span class="hljs-comment"># Check services</span>
kubectl get svc -n qdrant

<span class="hljs-comment"># Check logs</span>
kubectl logs -n qdrant -l app.kubernetes.io/name=qdrant

<span class="hljs-comment"># Port forward for testing (run in background)</span>
kubectl port-forward -n qdrant svc/qdrant 6333:6333 &amp;
</code></pre>
<h3 id="heading-step-18-test-qdrant-api">Step 18: Test the Qdrant API</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># The values file configures an API key, so every request must send it</span>
API_KEY=<span class="hljs-string">"your_secret_master_api_key_here"</span>

<span class="hljs-comment"># Test cluster info</span>
curl -X GET <span class="hljs-string">"http://localhost:6333/cluster"</span> -H <span class="hljs-string">"api-key: <span class="hljs-variable">$API_KEY</span>"</span>

<span class="hljs-comment"># List collections</span>
curl -X GET <span class="hljs-string">"http://localhost:6333/collections"</span> -H <span class="hljs-string">"api-key: <span class="hljs-variable">$API_KEY</span>"</span>

<span class="hljs-comment"># Create a test collection</span>
curl -X PUT <span class="hljs-string">"http://localhost:6333/collections/test_collection"</span> \
  -H <span class="hljs-string">"api-key: <span class="hljs-variable">$API_KEY</span>"</span> \
  -H <span class="hljs-string">"Content-Type: application/json"</span> \
  -d <span class="hljs-string">'{
    "vectors": {
      "size": 100,
      "distance": "Cosine"
    }
  }'</span>
</code></pre>
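<p>Beyond creating a collection, an insert-and-search round trip confirms the cluster actually serves queries. The point ID, payload, and 100-dimensional dummy vector below are arbitrary examples sized to match the collection, and the header assumes the API key from the values file:</p>
<pre><code class="lang-bash">API_KEY="your_secret_master_api_key_here"

# Build a 100-dimensional dummy vector matching the collection's vector size
VEC=$(python3 -c "print([0.1]*100)")

# Upsert one test point
curl -X PUT "http://localhost:6333/collections/test_collection/points" \
  -H "api-key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"points\": [{\"id\": 1, \"vector\": $VEC, \"payload\": {\"tag\": \"demo\"}}]}"

# Search for its nearest neighbour; the stored point should come back first
curl -X POST "http://localhost:6333/collections/test_collection/points/search" \
  -H "api-key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"vector\": $VEC, \"limit\": 1}"
</code></pre>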
<h2 id="heading-phase-7-monitoring-and-maintenance">Phase 7: Monitoring and Maintenance</h2>
<h3 id="heading-step-19-set-up-basic-monitoring">Step 19: Set Up Basic Monitoring</h3>
<p>Create <code>monitoring-values.yaml</code> for Prometheus (optional):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">prometheus:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">serviceMonitor:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">namespace:</span> <span class="hljs-string">qdrant</span>
</code></pre>
<h3 id="heading-step-20-backup-strategy">Step 20: Backup Strategy</h3>
<p>Create backup script <code>backup-qdrant.sh</code>:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
API_KEY=<span class="hljs-string">"your_secret_master_api_key_here"</span>

<span class="hljs-comment"># Create a full snapshot on each pod via the API (the api-key header is required</span>
<span class="hljs-comment"># because the cluster is configured with one)</span>
<span class="hljs-keyword">for</span> pod <span class="hljs-keyword">in</span> $(kubectl get pods -n qdrant -l app.kubernetes.io/name=qdrant -o jsonpath=<span class="hljs-string">'{.items[*].metadata.name}'</span>); <span class="hljs-keyword">do</span>
  kubectl <span class="hljs-built_in">exec</span> -n qdrant <span class="hljs-variable">$pod</span> -- curl -s -X POST -H <span class="hljs-string">"api-key: <span class="hljs-variable">$API_KEY</span>"</span> <span class="hljs-string">"http://localhost:6333/snapshots"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-comment"># Archive the snapshots from the first pod's persistent volume and copy them locally</span>
kubectl <span class="hljs-built_in">exec</span> -n qdrant qdrant-0 -- tar -czf /tmp/qdrant-backup-<span class="hljs-variable">$TIMESTAMP</span>.tar.gz /qdrant/snapshots
kubectl cp qdrant/qdrant-0:/tmp/qdrant-backup-<span class="hljs-variable">$TIMESTAMP</span>.tar.gz ./qdrant-backup-<span class="hljs-variable">$TIMESTAMP</span>.tar.gz
</code></pre>
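<p>A matching restore is sketched below. It assumes a collection-level snapshot (created with <code>POST /collections/&lt;name&gt;/snapshots</code>) rather than the full-storage snapshot above, since collection snapshots can be recovered through the API without restarting the pod; the archive and snapshot names are placeholders:</p>
<pre><code class="lang-bash"># Copy the archive back into the pod and unpack it over the snapshots directory
kubectl cp ./qdrant-backup-&lt;TIMESTAMP&gt;.tar.gz qdrant/qdrant-0:/tmp/
kubectl exec -n qdrant qdrant-0 -- tar -xzf /tmp/qdrant-backup-&lt;TIMESTAMP&gt;.tar.gz -C /

# Recover a collection from a snapshot file on the pod's local disk
kubectl exec -n qdrant qdrant-0 -- curl -X PUT \
  -H "api-key: your_secret_master_api_key_here" \
  -H "Content-Type: application/json" \
  "http://localhost:6333/collections/test_collection/snapshots/recover" \
  -d '{"location": "file:///qdrant/snapshots/test_collection/&lt;SNAPSHOT_NAME&gt;.snapshot"}'
</code></pre>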
<h2 id="heading-troubleshooting">Troubleshooting</h2>
<h3 id="heading-common-issues-and-solutions">Common Issues and Solutions</h3>
<ol>
<li><p><strong>Pods stuck in Pending state:</strong></p>
<ul>
<li><p>Check node resources: <code>kubectl describe nodes</code></p>
</li>
<li><p>Check PVC status: <code>kubectl get pvc -n qdrant</code></p>
</li>
<li><p>Verify EBS CSI driver: <code>kubectl get pods -n kube-system | grep ebs-csi</code></p>
</li>
</ul>
</li>
<li><p><strong>Storage issues:</strong></p>
<ul>
<li><p>Verify IAM permissions for EBS operations</p>
</li>
<li><p>Check storage class: <code>kubectl get storageclass</code></p>
</li>
<li><p>Review EBS volume attachments in AWS Console</p>
</li>
</ul>
</li>
<li><p><strong>Network connectivity issues:</strong></p>
<ul>
<li><p>Verify security group rules</p>
</li>
<li><p>Check Calico pod status: <code>kubectl get pods -n kube-system | grep calico</code></p>
</li>
<li><p>Test pod-to-pod connectivity</p>
</li>
</ul>
</li>
<li><p><strong>Qdrant cluster formation issues:</strong></p>
<ul>
<li><p>Check cluster configuration in pod logs</p>
</li>
<li><p>Verify p2p port accessibility between pods</p>
</li>
<li><p>Query the Qdrant cluster status endpoint (<code>GET /cluster</code>) on each pod</p>
</li>
</ul>
</li>
</ol>
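<p>For the network checks above, a throwaway pod makes pod-to-pod connectivity easy to probe; the pod name and ports below are examples matching this deployment:</p>
<pre><code class="lang-bash"># IP of the first Qdrant pod
POD_IP=$(kubectl get pod -n qdrant qdrant-0 -o jsonpath='{.status.podIP}')

# Probe the HTTP, gRPC, and p2p ports from a temporary busybox pod
kubectl run netcheck --rm -it --image=busybox --restart=Never -- \
  sh -c "nc -zv $POD_IP 6333 &amp;&amp; nc -zv $POD_IP 6334 &amp;&amp; nc -zv $POD_IP 6335"
</code></pre>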
<h3 id="heading-maintenance-commands">Maintenance Commands</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Scale cluster</span>
helm upgrade qdrant qdrant/qdrant --namespace qdrant --<span class="hljs-built_in">set</span> replicaCount=5 --values qdrant-values.yaml

<span class="hljs-comment"># Update QDrant version</span>
helm upgrade qdrant qdrant/qdrant --namespace qdrant --<span class="hljs-built_in">set</span> image.tag=v1.8.0 --values qdrant-values.yaml

<span class="hljs-comment"># Run the backup script from Step 20</span>
./backup-qdrant.sh
</code></pre>
<h2 id="heading-security-considerations">Security Considerations</h2>
<ol>
<li><p><strong>Network Security:</strong></p>
<ul>
<li><p>Use private subnets for all worker nodes</p>
</li>
<li><p>Restrict security group access to minimum required ports</p>
</li>
<li><p>Consider using AWS PrivateLink for internal communication</p>
</li>
</ul>
</li>
<li><p><strong>Storage Security:</strong></p>
<ul>
<li><p>Enable EBS encryption</p>
</li>
<li><p>Use IAM roles with least privilege</p>
</li>
<li><p>Regular backup testing and restoration procedures</p>
</li>
</ul>
</li>
<li><p><strong>Access Control:</strong></p>
<ul>
<li><p>Implement RBAC in Kubernetes</p>
</li>
<li><p>Use network policies to restrict pod communication</p>
</li>
<li><p>Enable audit logging</p>
</li>
</ul>
</li>
</ol>
<p>This deployment provides a production-ready Qdrant cluster with high availability, persistent storage, and proper AWS integration.</p>
]]></content:encoded></item><item><title><![CDATA[Step-by-Step Guide to Linking an SSH Key with Your GitHub Repositories]]></title><description><![CDATA[1. Generate a New SSH Key
Run the following command to generate a new SSH key specifically for GitHub:
ssh-keygen -t ed25519 -C "your_email@example.com" -f ~/.ssh/id_github

Explanation:

-t ed25519: Specifies the key type.

-C "your_email@example.co...]]></description><link>https://blogs.ummerfarooq.dev/step-by-step-guide-to-linking-an-ssh-key-with-your-github-repositories</link><guid isPermaLink="true">https://blogs.ummerfarooq.dev/step-by-step-guide-to-linking-an-ssh-key-with-your-github-repositories</guid><category><![CDATA[GitHub]]></category><category><![CDATA[ssh-keys]]></category><category><![CDATA[ssh-keygen]]></category><dc:creator><![CDATA[Ummer Farooq]]></dc:creator><pubDate>Thu, 19 Dec 2024 08:42:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734597564045/7aea6b14-f69f-40b3-86fb-b2cc5023d9f0.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-1-generate-a-new-ssh-key">1. <strong>Generate a New SSH Key</strong></h3>
<p>Run the following command to generate a new SSH key specifically for GitHub:</p>
<pre><code class="lang-bash">ssh-keygen -t ed25519 -C <span class="hljs-string">"your_email@example.com"</span> -f ~/.ssh/id_github
</code></pre>
<p>Explanation:</p>
<ul>
<li><p><code>-t ed25519</code>: Specifies the key type.</p>
</li>
<li><p><code>-C "your_email@example.com"</code>: Adds a comment to the key (typically your email).</p>
</li>
<li><p><code>-f ~/.ssh/id_github</code>: Specifies the file name for the new key.</p>
</li>
</ul>
<p>Because <code>-f</code> already sets the file location, the only prompt during generation is for a passphrase (optional, but recommended for added security).</p>
<h3 id="heading-2-verify-the-key-files">2. <strong>Verify the Key Files</strong></h3>
<p>Once the SSH key is created, confirm the existence of the key files using:</p>
<pre><code class="lang-plaintext">ls -l ~/.ssh/id_github*
</code></pre>
<p>You should see:</p>
<ul>
<li><p><strong>~/.ssh/id_github</strong>: The private key (keep this secure and never share it).</p>
</li>
<li><p><strong>~/.ssh/id_github.pub</strong>: The public key (this will be added to GitHub).</p>
</li>
</ul>
<h3 id="heading-3-add-the-key-to-your-ssh-agent">3. <strong>Add the Key to Your SSH Agent</strong></h3>
<p>Add the private key to your SSH agent to manage it easily:</p>
<ol>
<li><p>Start the SSH agent:</p>
<pre><code class="lang-bash"> <span class="hljs-built_in">eval</span> <span class="hljs-string">"<span class="hljs-subst">$(ssh-agent -s)</span>"</span>
</code></pre>
</li>
<li><p>Add the key to the agent:</p>
<pre><code class="lang-bash"> ssh-add ~/.ssh/id_github
</code></pre>
</li>
</ol>
<h3 id="heading-4-copy-the-public-key-to-your-clipboard">4. Copy the Public Key to Your Clipboard</h3>
<p>To add the key to GitHub, first print the public key and copy its output:</p>
<pre><code class="lang-bash">cat ~/.ssh/id_github.pub
</code></pre>
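<p>The <code>cat</code> output can be copied manually, or piped straight to the clipboard with your platform's clipboard tool:</p>
<pre><code class="lang-bash"># macOS
pbcopy &lt; ~/.ssh/id_github.pub

# Linux with X11 (requires xclip)
xclip -selection clipboard &lt; ~/.ssh/id_github.pub

# Windows (Git Bash)
clip &lt; ~/.ssh/id_github.pub
</code></pre>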
<h3 id="heading-5-add-the-ssh-key-to-your-github-account">5. Add the SSH Key to Your GitHub Account</h3>
<ol>
<li><p>Log in to your GitHub account.</p>
</li>
<li><p>Navigate to <strong>Settings &gt; SSH and GPG keys</strong>.</p>
</li>
<li><p>Click <strong>New SSH key</strong>.</p>
</li>
<li><p>Paste your public key into the provided field.</p>
</li>
<li><p>Add a title to identify this key (e.g., "Server").</p>
</li>
<li><p>Click <strong>Add SSH key</strong>.</p>
</li>
</ol>
<h3 id="heading-6-test-the-ssh-connection">6. Test the SSH Connection</h3>
<p>Verify the connection to GitHub using:</p>
<pre><code class="lang-bash">ssh -i ~/.ssh/id_github -T git@github.com
</code></pre>
<p>If successful, you should see a message like:</p>
<pre><code class="lang-plaintext">Hi &lt;username&gt;! You've successfully authenticated, but GitHub does not provide shell access.
</code></pre>
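<p>Because the key uses a non-default filename, Git only finds it while the agent holds it. An entry in <code>~/.ssh/config</code> makes the association permanent, so neither <code>-i</code> nor the agent is required afterwards:</p>
<pre><code class="lang-plaintext">Host github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_github
  IdentitiesOnly yes
</code></pre>
<p>With this in place, <code>ssh -T git@github.com</code> works without the <code>-i</code> flag.</p>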
<h3 id="heading-7-using-the-ssh-key-for-repository-management">7. Using the SSH Key for Repository Management</h3>
<h4 id="heading-clone-a-repository">Clone a Repository</h4>
<p>Use the SSH URL to clone a repository:</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> git@github.com:&lt;username&gt;/&lt;repository&gt;.git
</code></pre>
<h4 id="heading-push-changes">Push Changes</h4>
<p>When pushing changes, Git will automatically use the SSH key for authentication:</p>
<pre><code class="lang-bash">git push origin main
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>By setting up an SSH key with GitHub, you’ve simplified your authentication process and secured your repository management workflow. No more password prompts—just smooth, secure Git operations!</p>
]]></content:encoded></item></channel></rss>