2026-01-22
RAG Data Preparation: The Foundation That Makes or Breaks Your AI System
Comprehensive guide to preparing data for RAG systems covering document parsing, chunking strategies, contextual enrichment, and embedding optimization
Most RAG implementation failures trace back to data preparation, not retrieval architecture. Teams spend weeks tuning retrieval parameters when the real problem is poorly parsed documents or inappropriate chunking. This guide covers the critical foundation that determines the quality ceiling of your RAG system.
Why Data Preparation Is the Most Critical RAG Step
There is a common pattern in RAG implementations: sophisticated retrieval architectures (hybrid search, reranking, CRAG) that still produce poor results. The root cause is almost always upstream in the data preparation layer.
The key insight: data preparation sets a ceiling. If it delivers only 60% quality, no amount of architectural sophistication can push retrieval quality above that level. Teams report 40-60% quality improvements from fixing data preparation alone, often without touching retrieval logic.
Document Parsing: Extracting Clean Text from Messy Sources
Real-world documents are messy. PDFs store text as positioned glyphs, not logical sequences. Tables get mangled. Multi-column layouts require layout analysis. Scanned documents need OCR, which can introduce error rates of 5-15%.
Parsing Tool Selection
| Tool | Table Accuracy | Text Fidelity | Speed | Best For |
|---|---|---|---|---|
| Docling | 97.9% | Excellent | ~10s/page | Complex documents |
| LlamaParse | 75-90% | Good | ~6s/doc | Speed-critical |
| Unstructured | 75-100%* | Good | Variable | OCR-heavy |
| PyMuPDF/PyPDF | 60-70% | Fair | Fast | Simple PDFs |
*Unstructured achieves 100% on simple tables, 75% on complex structures
Practical PDF Parsing
from docling.document_converter import DocumentConverter
from docling_core.types.doc import SectionHeaderItem, TableItem, TextItem

converter = DocumentConverter()

# Parse PDF with layout analysis
result = converter.convert("technical-manual.pdf")

# Walk the document in reading order; iterate_items() yields (item, level) pairs
current_section = None
for element, _level in result.document.iterate_items():
    if isinstance(element, SectionHeaderItem):
        # Remember the latest heading so following text keeps its section context
        current_section = element.text
    elif isinstance(element, TableItem):
        # Table extracted with structure preserved
        markdown_table = element.export_to_markdown(doc=result.document)
    elif isinstance(element, TextItem):
        # Text with section context
        text = element.text
        section = current_section

# Export as markdown for RAG ingestion
markdown_output = result.document.export_to_markdown()
HTML Content Extraction
from bs4 import BeautifulSoup
from readability import Document

def extract_html_content(html: str) -> dict:
    """Extract meaningful content from HTML, handling diverse structures."""
    # Use readability for main content extraction
    doc = Document(html)
    main_content = doc.summary()
    title = doc.title()

    # Parse with BeautifulSoup for structure
    soup = BeautifulSoup(main_content, 'html.parser')

    # Remove navigation, ads, footers
    for element in soup.find_all(['nav', 'footer', 'aside', 'script', 'style']):
        element.decompose()

    # Extract text with structure preservation
    text_blocks = []
    for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'li', 'td']):
        text = element.get_text(strip=True)
        if text:
            text_blocks.append({
                'type': element.name,
                'text': text,
                'level': int(element.name[1]) if element.name.startswith('h') else 0
            })

    return {
        'title': title,
        'blocks': text_blocks,
        'full_text': soup.get_text(separator='\n', strip=True)
    }
Tip: Start with rule-based parsing before resorting to LLM-based parsing. Use hybrid approaches: heuristics for structure combined with Vision-Language Models only for the most challenging elements.
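One way to apply this tip is to score the rule-based output per page and escalate only the pages that look broken. The sketch below is illustrative, not a standard recipe: `PageText`, `text_quality`, `route_pages`, and both thresholds are hypothetical names and values chosen for the example, and the quality score (share of alphanumeric or whitespace characters) is only a rough proxy.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    page_number: int
    text: str  # output of the rule-based parser for this page


def text_quality(text: str) -> float:
    """Share of characters that are letters, digits, or whitespace."""
    if not text:
        return 0.0
    clean = sum(1 for ch in text if ch.isalnum() or ch.isspace())
    return clean / len(text)


def route_pages(pages: list[PageText], min_quality: float = 0.85,
                min_chars: int = 40) -> dict[str, list[int]]:
    """Keep rule-based output for clean pages, flag the rest for a VLM pass."""
    routes: dict[str, list[int]] = {"rule_based": [], "needs_vlm": []}
    for page in pages:
        stripped = page.text.strip()
        ok = len(stripped) >= min_chars and text_quality(stripped) >= min_quality
        routes["rule_based" if ok else "needs_vlm"].append(page.page_number)
    return routes


pages = [
    PageText(1, "The pump must be primed before first use. Check the valve."),
    PageText(2, "\ufffd#@!%^&*()__++||~~"),  # garbled extraction
    PageText(3, ""),                          # likely a scanned page
]
routes = route_pages(pages)
```

Only pages 2 and 3 would be sent to the more expensive parser here; the thresholds should be tuned on your own documents.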
Text Preprocessing: Cleaning for Embedding Quality
Embeddings encode noise along with signal. Inconsistent formatting creates spurious similarity. PII in embeddings creates security risks. Preprocessing removes these issues before they propagate through the pipeline.
import re
import unicodedata
from typing import List, Optional

class TextPreprocessor:
    def __init__(self):
        self.pii_patterns = {
            'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            'phone': r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b',
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
            'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
        }

    def normalize_whitespace(self, text: str) -> str:
        """Collapse runs of spaces/tabs and extra blank lines, keeping paragraph breaks."""
        text = re.sub(r'[^\S\n]+', ' ', text)   # spaces, tabs, etc. -> one space
        text = re.sub(r' ?\n ?', '\n', text)    # trim spaces around newlines
        text = re.sub(r'\n{3,}', '\n\n', text)  # at most one blank line, so chunkers can still split on "\n\n"
        return text.strip()

    def normalize_unicode(self, text: str) -> str:
        """Normalize unicode characters to consistent form."""
        return unicodedata.normalize('NFKC', text)

    def redact_pii(self, text: str) -> str:
        """Detect and redact PII patterns."""
        for pii_type, pattern in self.pii_patterns.items():
            text = re.sub(pattern, f'[REDACTED_{pii_type.upper()}]', text)
        return text

    def remove_boilerplate(self, text: str, patterns: Optional[List[str]] = None) -> str:
        """Remove known boilerplate text patterns."""
        default_patterns = [
            r'Page \d+ of \d+',
            r'Copyright \d{4}.*?(?=\n|$)',
            r'All rights reserved\.?',
        ]
        patterns = patterns or default_patterns
        for pattern in patterns:
            text = re.sub(pattern, '', text, flags=re.IGNORECASE)
        return text

    def process(self, text: str, redact_pii: bool = True) -> str:
        """Run full preprocessing pipeline."""
        text = self.normalize_unicode(text)
        text = self.remove_boilerplate(text)
        text = self.normalize_whitespace(text)
        if redact_pii:
            text = self.redact_pii(text)
        return text
Deduplication
Near-duplicate content wastes storage and skews retrieval results. MinHash LSH provides efficient near-duplicate detection:
import hashlib
from typing import List, Set

from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold: float = 0.8, num_perm: int = 128):
        self.threshold = threshold
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        self.exact_hashes: Set[str] = set()
        self.doc_id = 0

    def _compute_minhash(self, text: str) -> MinHash:
        """Compute MinHash signature for text from 3-word shingles."""
        minhash = MinHash(num_perm=self.num_perm)
        words = text.lower().split()
        # max(1, ...) gives texts shorter than three words one shingle, so they
        # do not all collapse to the same empty signature
        for i in range(max(1, len(words) - 2)):
            shingle = ' '.join(words[i:i+3])
            minhash.update(shingle.encode('utf-8'))
        return minhash

    def is_duplicate(self, text: str) -> bool:
        """Check for exact or near duplicates."""
        # Exact duplicate check
        text_hash = hashlib.md5(text.encode()).hexdigest()
        if text_hash in self.exact_hashes:
            return True
        self.exact_hashes.add(text_hash)

        # Near-duplicate check
        minhash = self._compute_minhash(text)
        if self.lsh.query(minhash):
            return True
        self.lsh.insert(f"doc_{self.doc_id}", minhash)
        self.doc_id += 1
        return False

    def deduplicate(self, documents: List[str]) -> List[str]:
        """Remove duplicates from document list."""
        return [doc for doc in documents if not self.is_duplicate(doc)]
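To build intuition for the `threshold` parameter, it can help to compute the exact quantity that MinHash estimates: the Jaccard similarity between shingle sets. The following is a minimal standard-library sketch; the helper names `shingles` and `jaccard` are illustrative and not part of datasketch.

```python
def shingles(text: str, size: int = 3) -> set[str]:
    """Set of overlapping word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Exact Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(a, b)  # 4 shared shingles out of 10 distinct ones = 0.4
```

A single changed word in a nine-word sentence already drops the similarity to 0.4, well under a 0.8 threshold. Near-duplicate detection with word shingles therefore works best on longer passages, which is worth keeping in mind when deduplicating short chunks.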
Chunking Strategies: The Art of Splitting Documents
Chunking determines how information is organized for retrieval. The core tension: too small loses context, too large dilutes relevance signal.
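The trade-off is easiest to see with the simplest possible splitter. The sketch below is a plain fixed-size sliding window over a list of words, written only for illustration (`sliding_window_chunks` is a hypothetical helper, not a library function): a smaller `chunk_size` yields more, narrower chunks, and `overlap` repeats the tail of one chunk at the head of the next.

```python
def sliding_window_chunks(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Fixed-size windows; consecutive windows share `overlap` items."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = list("abcdefghij")  # stand-in for ten tokens
chunks = sliding_window_chunks(words, chunk_size=4, overlap=1)
```

Here `chunks` holds three windows (`abcd`, `defg`, `ghij`), each sharing one item with its neighbor. The strategies below improve on this baseline by choosing boundaries that respect the text's structure.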
Strategy Comparison
Recursive Character Splitting (Recommended Default)
from langchain_text_splitters import RecursiveCharacterTextSplitter

def recursive_chunking(text: str, chunk_size: int = 512, overlap: int = 50) -> list:
    """
    Split text recursively using hierarchy of separators.
    Tries to keep paragraphs together, then sentences, then words.
    Sizes are measured in characters here because length_function=len; use
    RecursiveCharacterTextSplitter.from_tiktoken_encoder() for token-based sizes.
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=[
            "\n\n",  # Paragraphs first
            "\n",    # Then line breaks
            ". ",    # Then sentences
            ", ",    # Then clauses
            " ",     # Finally words
        ],
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    return splitter.split_text(text)
Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

def semantic_chunking(text: str) -> list:
    """
    Split based on semantic similarity between sentences.
    Groups semantically related content together.
    """
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    splitter = SemanticChunker(
        embeddings=embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=95,  # Split at top 5% dissimilarity
        min_chunk_size=100
    )
    return splitter.split_text(text)

# Performance: Up to 70% accuracy improvement in retrieval (varies by content type)
# Trade-off: Requires embedding calls during chunking
Hierarchical Parent-Child Chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma

def setup_hierarchical_chunking(documents: list, embeddings):
    """
    Create parent-child hierarchy for precision + context.
    Search on small child chunks, return large parent chunks.
    """
    # Parent splitter: large chunks for context
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=200
    )
    # Child splitter: small chunks for precise retrieval
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=50
    )
    # Storage for parent documents
    docstore = InMemoryStore()
    # Vector store indexes child chunks
    vectorstore = Chroma(
        collection_name="child_chunks",
        embedding_function=embeddings
    )
    # Retriever searches children, returns parents
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter
    )
    retriever.add_documents(documents)
    return retriever

# Performance: Improved relevance on structured documents by preserving context
# Trade-off: 2-3x storage overhead
Chunk Size Guidelines
| Content Type | Recommended Size | Overlap (same unit as size) | Rationale |
|---|---|---|---|
| Technical docs | 512 tokens | 50-100 | Balance detail with context |
| Conversational | 256 tokens | 25-50 | Shorter exchanges |
| Legal/contracts | 1024 tokens | 100-150 | Preserve clause context |
| Code | 1000 chars | 100 | Keep functions intact |
| Q&A pairs | 128-256 tokens | 0 | Each Q&A is self-contained |
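The guidelines in this table can be encoded as a small lookup so that each content type is routed to its own chunking parameters. This is a sketch under assumptions: `ChunkConfig`, `CHUNK_CONFIGS`, and `config_for` are hypothetical names, the content-type keys are invented labels, and where the table gives a range the example picks the lower overlap bound and the upper Q&A size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkConfig:
    size: int
    overlap: int
    unit: str  # "tokens" or "chars", matching the table above


CHUNK_CONFIGS = {
    "technical": ChunkConfig(512, 50, "tokens"),
    "conversational": ChunkConfig(256, 25, "tokens"),
    "legal": ChunkConfig(1024, 100, "tokens"),
    "code": ChunkConfig(1000, 100, "chars"),
    "qa": ChunkConfig(256, 0, "tokens"),
}


def config_for(content_type: str) -> ChunkConfig:
    """Look up chunking parameters, falling back to the technical-docs default."""
    return CHUNK_CONFIGS.get(content_type, CHUNK_CONFIGS["technical"])


legal = config_for("legal")
fallback = config_for("press_release")  # unknown type uses the default
```

Treat these numbers as starting points and adjust them based on retrieval measurements for your own corpus.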
Contextual Chunking: Solving the Lost Context Problem
Traditional chunking destroys document context. A chunk saying “This approach reduces latency by 40%” is useless without knowing which approach. Contextual chunking addresses this.
Anthropic’s Contextual Retrieval Technique
from typing import List

from anthropic import Anthropic

def add_contextual_headers(
    document: str,
    chunks: List[str],
    model: str = "claude-3-5-haiku-latest"
) -> List[str]:
    """
    Prepend chunk-specific context using Claude.
    Anthropic reports 35% fewer retrieval failures with contextual embeddings alone,
    49% when combined with BM25 hybrid search, and 67% with reranking added.
    """
    client = Anthropic()
    contextualized_chunks = []
    document_prompt = """Here is the full document:
<document>
{document}
</document>"""
    chunk_prompt = """Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Provide a short context (2-3 sentences) to situate this chunk within the document. Focus on:
1. What section/topic this chunk belongs to
2. Key entities or concepts being discussed
3. How it relates to the document's main subject
Context:"""
    # The document block is identical for every chunk, so mark it cacheable.
    # Documents below the model's minimum cacheable length are simply not cached.
    document_block = {
        "type": "text",
        "text": document_prompt.format(document=document),
        "cache_control": {"type": "ephemeral"},
    }
    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": [
                    document_block,
                    {"type": "text", "text": chunk_prompt.format(chunk=chunk)},
                ],
            }],
        )
        context = response.content[0].text
        contextualized_chunks.append(f"{context}\n\n{chunk}")
    return contextualized_chunks

# Cost with prompt caching: ~$1.02 per million document tokens
Rule-Based Context (Zero-Cost Alternative)
from typing import List

def add_structural_context(chunks: List[dict]) -> List[dict]:
    """
    Add context based on document structure without LLM calls.
    Uses metadata from structure-aware chunking.
    """
    contextualized = []
    for chunk in chunks:
        metadata = chunk.get('metadata', {})
        content = chunk['content']
        context_parts = []
        if 'document_title' in metadata:
            context_parts.append(f"From: {metadata['document_title']}")
        if 'header_1' in metadata:
            context_parts.append(f"Section: {metadata['header_1']}")
        if 'header_2' in metadata:
            context_parts.append(f"Subsection: {metadata['header_2']}")
        context = " | ".join(context_parts)
        contextualized.append({
            'content': f"{context}\n\n{content}" if context else content,
            'metadata': metadata
        })
    return contextualized
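The rule-based approach relies on chunks that already carry `document_title`, `header_1`, and `header_2` metadata. One way to produce them, assuming the parser emits Markdown with `#` and `##` headings, is a small header-aware splitter. The sketch below uses only the standard library; `split_by_headers` is a hypothetical helper, and deeper heading levels are deliberately treated as body text.

```python
import re

HEADER_RE = re.compile(r"^(#{1,2})\s+(.*\S)\s*$")


def split_by_headers(markdown: str, document_title: str) -> list[dict]:
    """Emit one chunk per section, tagged with the headings above it."""
    chunks: list[dict] = []
    buffer: list[str] = []
    headers: dict[str, str] = {}

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({
                "content": content,
                "metadata": {"document_title": document_title, **headers},
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if not match:
            buffer.append(line)
            continue
        flush()  # close the section that ended at this heading
        if len(match.group(1)) == 1:
            headers.clear()  # a new top-level section resets the subsection
            headers["header_1"] = match.group(2)
        else:
            headers["header_2"] = match.group(2)
    flush()
    return chunks


md = "# Setup\nInstall the tool.\n## Linux\nUse apt.\n# Usage\nRun it."
sections = split_by_headers(md, "Tool Manual")
```

Each resulting dict has the `content` and `metadata` keys that a structural-context step can turn into a prefix such as `From: Tool Manual | Section: Setup | Subsection: Linux`.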
Embedding Model Selection
Choosing the right embedding model depends on content type, chunk size, and deployment constraints. Note that MTEB scores change frequently as models are updated and new benchmarks are added, so always verify current scores before making decisions.
| Model | MTEB Score | Dimensions | Cost/1M tokens | Best For |
|---|---|---|---|---|
| Cohere embed-v4 | 65.2 | 1024 | $0.10 | Multilingual, production |
| text-embedding-3-large | 64.6 | 3072 | $0.13 | General purpose |
| text-embedding-3-small | 62.3 | 1536 | $0.02 | Cost-sensitive |
| Voyage voyage-3-large | 63.8 | 1536 | $0.12 | RAG-optimized |
| BGE-M3 | 63.0 | 1024 | Self-hosted | Privacy-critical |
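Dimensions matter for deployment because they drive index size directly. As a rough back-of-the-envelope sketch, assuming vectors are stored as 32-bit floats and ignoring index structures and payload overhead (the function name `index_size_gb` is illustrative):

```python
def index_size_gb(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in gigabytes (10^9 bytes), excluding index overhead."""
    return num_chunks * dimensions * bytes_per_value / 1e9


one_million = 1_000_000
size_3072 = index_size_gb(one_million, 3072)  # 12.288 GB
size_1536 = index_size_gb(one_million, 1536)  # 6.144 GB
size_1024 = index_size_gb(one_million, 1024)  # 4.096 GB
```

For one million chunks, moving from 3072 to 1024 dimensions cuts raw vector storage from about 12.3 GB to about 4.1 GB, which is often the difference between fitting in memory on one node and not.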
Embedding Optimization
from typing import List

import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embeddings(
    texts: List[str],
    dimensions: int = 1024
) -> List[List[float]]:
    """
    Get OpenAI embeddings with dimension reduction.
    256-dim text-embedding-3-large outperforms full ada-002.
    """
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dimensions  # Matryoshka truncation
    )
    return [item.embedding for item in response.data]

def batch_embed_with_normalization(
    texts: List[str],
    batch_size: int = 100,
    dimensions: int = 1024
) -> np.ndarray:
    """
    Embed texts in batches with L2 normalization.
    Normalization enables cosine similarity via dot product.
    """
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = get_embeddings(batch, dimensions)
        all_embeddings.extend(embeddings)
    if not all_embeddings:
        # Nothing to embed: return an empty (0, dimensions) array
        return np.empty((0, dimensions))
    embeddings_array = np.array(all_embeddings)
    # L2 normalize for cosine similarity
    norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
    return embeddings_array / norms
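The claim that normalization lets a dot product stand in for cosine similarity can be checked on a tiny example. This standard-library sketch uses two hand-picked 2-dimensional vectors; the helper names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Textbook cosine similarity on unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
full_cosine = cosine(a, b)                               # 24 / 25 = 0.96
dot_of_units = dot(l2_normalize(a), l2_normalize(b))     # same value
```

Both computations give 0.96. Normalizing once at ingestion time means every later similarity check is a plain dot product, with no norms to recompute at query time.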
Metadata Extraction and Enrichment
Metadata enables filtering before semantic search, provides ranking signals, and supports source attribution.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class ChunkMetadata:
    # Content-based
    keywords: List[str]
    entities: List[str]
    content_type: str
    # Structural
    document_title: str
    section_header: Optional[str]
    chunk_index: int
    # Contextual
    source_url: Optional[str]
    ingestion_date: datetime
    language: str
    # Technical
    word_count: int
    has_code: bool
    has_table: bool
Automated Extraction
from collections import Counter
from typing import Dict, List

import spacy

class MetadataExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")

    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        """Extract named entities using spaCy."""
        doc = self.nlp(text)
        entities = {}
        for ent in doc.ents:
            if ent.label_ not in entities:
                entities[ent.label_] = []
            entities[ent.label_].append(ent.text)
        return entities

    def extract_keywords(self, text: str, top_n: int = 10) -> List[str]:
        """Extract keywords using noun chunks."""
        doc = self.nlp(text)
        noun_chunks = [chunk.text.lower() for chunk in doc.noun_chunks]
        chunk_counts = Counter(noun_chunks)
        return [word for word, _ in chunk_counts.most_common(top_n)]

    def detect_content_type(self, text: str) -> str:
        """Heuristic content type detection."""
        code_patterns = ['def ', 'function ', 'class ', 'import ', '```']
        if any(pattern in text for pattern in code_patterns):
            return 'code'
        tech_indicators = ['API', 'database', 'server', 'deployment']
        if sum(1 for ind in tech_indicators if ind.lower() in text.lower()) >= 2:
            return 'technical'
        return 'general'
Storing with Vector Database
from typing import List, Optional

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

def store_chunks_with_metadata(
    chunks: List[str],
    embeddings: List[List[float]],
    metadata_list: List[dict],
    collection_name: str = "documents"
):
    """Store chunks with rich metadata in Qdrant.

    Note: this drops and recreates the collection on every call.
    """
    client = QdrantClient(host="localhost", port=6333)
    # recreate_collection() is deprecated; drop and create explicitly instead
    if client.collection_exists(collection_name):
        client.delete_collection(collection_name)
    client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=len(embeddings[0]),
            distance=Distance.COSINE
        )
    )
    points = [
        PointStruct(
            id=idx,
            vector=embedding,
            payload={"text": chunk, **metadata}
        )
        for idx, (chunk, embedding, metadata)
        in enumerate(zip(chunks, embeddings, metadata_list))
    ]
    client.upsert(collection_name=collection_name, points=points)

def search_with_filter(
    query_embedding: List[float],
    collection_name: str,
    content_type: Optional[str] = None,
    top_k: int = 10
) -> List[dict]:
    """Search with optional metadata filter."""
    client = QdrantClient(host="localhost", port=6333)
    query_filter = None
    if content_type:
        query_filter = Filter(
            must=[FieldCondition(
                key="content_type",
                match=MatchValue(value=content_type)
            )]
        )
    # query_points() replaces the deprecated search() method
    results = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        query_filter=query_filter,
        limit=top_k
    ).points
    return [{"text": hit.payload["text"], "score": hit.score} for hit in results]
Quality Metrics for Data Preparation
Measuring data quality enables data-driven optimization and early issue detection.
from typing import List

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# embedding_model is any object exposing embed(texts) -> list of vectors

def evaluate_chunk_coherence(chunks: List[str], embedding_model) -> dict:
    """
    Measure semantic coherence within chunks.
    High coherence = chunk discusses single topic.
    """
    coherence_scores = []
    for chunk in chunks:
        sentences = chunk.split('. ')
        if len(sentences) < 2:
            coherence_scores.append(1.0)
            continue
        embeddings = np.array(embedding_model.embed(sentences))
        similarities = cosine_similarity(embeddings)
        n = len(sentences)
        # Mean of the off-diagonal entries (the diagonal contributes n ones)
        coherence = (similarities.sum() - n) / (n * (n - 1))
        coherence_scores.append(coherence)
    return {
        'mean_coherence': np.mean(coherence_scores),
        'min_coherence': np.min(coherence_scores),
        'low_coherence_count': sum(1 for s in coherence_scores if s < 0.5)
    }

def evaluate_boundary_quality(chunks: List[str]) -> dict:
    """Check if chunks have clean boundaries."""
    bad_starts = 0
    bad_ends = 0
    lowercase_starters = ['and', 'but', 'or', 'so', 'because', 'however']
    for chunk in chunks:
        stripped = chunk.strip()
        words = stripped.split()
        # Drop trailing punctuation so "However," still counts as a bad start
        first_word = words[0].lower().strip('.,;:') if words else ''
        if first_word in lowercase_starters:
            bad_starts += 1
        # Whitespace-only chunks are skipped instead of raising IndexError
        if stripped and stripped[-1] not in '.!?:':
            bad_ends += 1
    total = max(len(chunks), 1)  # avoid division by zero on an empty list
    return {
        'bad_start_ratio': bad_starts / total,
        'bad_end_ratio': bad_ends / total,
        'clean_boundary_ratio': 1 - (bad_starts + bad_ends) / (2 * total)
    }

def evaluate_retrieval_quality(
    embeddings: np.ndarray,
    test_queries: List[str],
    relevant_chunk_ids: List[List[int]],
    embedding_model
) -> dict:
    """Evaluate embedding quality using retrieval tests."""
    query_embeddings = np.array(embedding_model.embed(test_queries))
    similarities = cosine_similarity(query_embeddings, embeddings)
    hits_at_k = {1: 0, 5: 0, 10: 0}
    mrr_sum = 0
    for i, relevant_ids in enumerate(relevant_chunk_ids):
        ranked_indices = np.argsort(similarities[i])[::-1]
        for rank, idx in enumerate(ranked_indices):
            if idx in relevant_ids:
                # Only the first relevant hit counts toward MRR and hit rate
                mrr_sum += 1 / (rank + 1)
                for k in hits_at_k:
                    if rank < k:
                        hits_at_k[k] += 1
                break
    n_queries = len(test_queries)
    return {
        'mrr': mrr_sum / n_queries,
        'hit_rate@1': hits_at_k[1] / n_queries,
        'hit_rate@5': hits_at_k[5] / n_queries,
        'hit_rate@10': hits_at_k[10] / n_queries
    }
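To make the retrieval metrics concrete, here is a worked example on three toy queries, using only the standard library. The chunk ids and rankings are made up for illustration, and `first_relevant_rank` and `mrr_and_hit_rate` are hypothetical helpers rather than part of any library.

```python
from typing import Optional


def first_relevant_rank(ranked_ids: list[int], relevant_ids: set[int]) -> Optional[int]:
    """1-based rank of the first relevant chunk, or None if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return rank
    return None


def mrr_and_hit_rate(rankings: list[list[int]], relevant: list[list[int]],
                     k: int) -> tuple[float, float]:
    """Mean reciprocal rank and hit rate at k over all queries."""
    ranks = [first_relevant_rank(r, set(rel)) for r, rel in zip(rankings, relevant)]
    mrr = sum(1 / r for r in ranks if r is not None) / len(ranks)
    hit_rate = sum(1 for r in ranks if r is not None and r <= k) / len(ranks)
    return mrr, hit_rate


rankings = [[7, 2, 9], [4, 1, 3], [5, 6, 8]]  # retrieved chunk ids per query
relevant = [[2], [4], [0]]                    # ground-truth chunk ids per query
mrr, hit_at_1 = mrr_and_hit_rate(rankings, relevant, k=1)
_, hit_at_2 = mrr_and_hit_rate(rankings, relevant, k=2)
```

The first relevant chunks appear at ranks 2 and 1, and the third query misses entirely, so MRR is (1/2 + 1 + 0) / 3 = 0.5, hit rate@1 is 1/3, and hit rate@2 is 2/3.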
Common Pitfalls and Solutions
Pitfall 1: Skipping Parsing Validation
Assuming parsing tools work perfectly on all documents leads to missing content and mangled tables in retrieval results. Always validate parsing output on representative samples before full ingestion.
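A few cheap checks catch many parsing failures before ingestion. The sketch below is a heuristic, not a complete validator: `parsing_report`, `needs_review`, and all thresholds are assumptions chosen for illustration and should be tuned on your own samples.

```python
def parsing_report(text: str) -> dict:
    """Cheap signals that parsed text may be damaged."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    total = len(text)
    return {
        "chars": total,
        # U+FFFD marks bytes the decoder could not interpret
        "replacement_char_ratio": text.count("\ufffd") / total if total else 0.0,
        # Many 1-2 character lines suggest glyph-per-line extraction
        "short_line_ratio": (sum(1 for ln in lines if len(ln) <= 2) / len(lines)) if lines else 0.0,
        "table_rows": sum(1 for ln in lines if ln.startswith("|")),
    }


def needs_review(report: dict, min_chars: int = 200,
                 max_replacement: float = 0.01, max_short: float = 0.3) -> bool:
    """Flag documents whose parsed output looks empty or garbled."""
    return (
        report["chars"] < min_chars
        or report["replacement_char_ratio"] > max_replacement
        or report["short_line_ratio"] > max_short
    )


good = "Intro paragraph with real content.\n" * 10
garbled = "\n".join("x" for _ in range(300))  # one glyph per line
good_flag = needs_review(parsing_report(good))
garbled_flag = needs_review(parsing_report(garbled))
```

Flagged documents still need a manual look. The report only narrows down where to look, and comparing `table_rows` against the source document is a quick way to spot dropped tables.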
Pitfall 2: One-Size-Fits-All Chunking
Using the same chunk size for all content types results in code split mid-function and tables losing context. Match chunking strategy to content structure.
Pitfall 3: Ignoring Lost Context
Chunks that reference “it”, “this method”, “as mentioned” become meaningless in isolation. Implement contextual chunking (LLM or rule-based) to make chunks self-contained.
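Before paying for contextual enrichment, it can be useful to estimate how many chunks actually open with a dangling reference. The following sketch uses a small, assumed phrase list; `DANGLING_START` and `flag_chunks` are hypothetical names, and the pattern will produce false positives (for example, a document that legitimately begins with "This guide").

```python
import re

# Openers that usually point at something outside the chunk (assumed list)
DANGLING_START = re.compile(
    r"^\s*(it|this|that|these|those|they|as mentioned|as described|as noted)\b",
    re.IGNORECASE,
)


def has_dangling_reference(chunk: str) -> bool:
    """True if the chunk opens with a pronoun or back-reference."""
    return bool(DANGLING_START.match(chunk))


def flag_chunks(chunks: list[str]) -> list[int]:
    """Indices of chunks that likely depend on missing context."""
    return [i for i, chunk in enumerate(chunks) if has_dangling_reference(chunk)]


chunks = [
    "This approach reduces latency by 40%.",
    "Redis caching reduces latency by 40%.",
    "As mentioned earlier, retries are capped.",
    "Italy is covered in the appendix.",
]
flagged = flag_chunks(chunks)
```

Here the first and third chunks are flagged. A high share of flagged chunks suggests that contextual headers are likely to pay off for this corpus.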
Pitfall 4: Choosing Models by Benchmark Alone
MTEB scores do not reflect performance on specific content. A high-benchmark model can perform poorly on domain-specific queries. Evaluate embedding models on your own test queries.
Pitfall 5: Processing in Wrong Order
Chunking before cleaning or embedding before deduplication creates noisy results. Follow the pipeline: parse -> clean -> dedupe -> chunk -> enrich -> embed.
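A toy pipeline shows why the order matters. In this sketch every stage is a deliberately trivial stand-in (whitespace trimming, exact-match deduplication, paragraph splitting), and the names `clean`, `dedupe`, `chunk`, and `run_pipeline` are illustrative. The point is only that running the same stages in a different order changes the result.

```python
from typing import Callable

Stage = Callable[[list[str]], list[str]]


def clean(docs: list[str]) -> list[str]:
    """Toy cleaning stage: trim surrounding whitespace."""
    return [doc.strip() for doc in docs]


def dedupe(docs: list[str]) -> list[str]:
    """Toy exact-match deduplication that keeps first occurrences."""
    return list(dict.fromkeys(docs))


def chunk(docs: list[str]) -> list[str]:
    """Toy chunking stage: one chunk per paragraph."""
    return [part for doc in docs for part in doc.split("\n\n") if part]


def run_pipeline(docs: list[str],
                 stages: list[tuple[str, Stage]]) -> tuple[list[str], dict[str, int]]:
    """Apply stages in order and record how many items each stage emits."""
    counts: dict[str, int] = {}
    for name, stage in stages:
        docs = stage(docs)
        counts[name] = len(docs)
    return docs, counts


docs = ["  Alpha.\n\nBeta. ", "Alpha.\n\nBeta.", "Gamma."]
right_order = [("clean", clean), ("dedupe", dedupe), ("chunk", chunk)]
wrong_order = [("dedupe", dedupe), ("clean", clean), ("chunk", chunk)]
chunks, counts = run_pipeline(docs, right_order)
noisy_chunks, _ = run_pipeline(docs, wrong_order)
```

Cleaning first lets deduplication see that the first two documents are identical, leaving three chunks. Deduplicating first misses the match because of stray whitespace, and the duplicate paragraphs survive as five chunks.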
Building Your Pipeline: A Practical Approach
The order in which you tackle data preparation matters. Here’s how to think about building your pipeline.
Start with parsing validation. Before writing any pipeline code, manually inspect parsing output for 10-20 representative documents. Look for mangled tables, missing sections, and garbled text. If your parser fails on 30% of samples, no downstream optimization will save you.
Next, establish your preprocessing baseline. Run your text through normalization, PII detection, and boilerplate removal. Compare before/after samples. The goal is clean, consistent text without losing meaningful content.
Then choose your chunking strategy based on what you learned from parsing. If your documents have clear hierarchical structure (headers, sections), leverage it with structure-aware chunking. If they’re dense technical prose, recursive splitting is your friend. If you’re dealing with mixed content types, consider routing different document types to different strategies.
Add context only after chunking works well. Contextual enrichment is powerful but adds cost and complexity. Get your basic pipeline producing reasonable results first, then measure whether contextual chunking improves your specific retrieval scenarios.
Finally, close the loop with metrics. Implement coherence and boundary quality checks. Create a small test set of queries with known relevant chunks. Run retrieval evaluations weekly as you tune parameters. Without measurement, you’re guessing.
The key insight: each step depends on the previous one working correctly. Resist the urge to implement everything at once. A simple pipeline you understand beats a complex one you can’t debug.
Key Takeaways
Data preparation sets the quality ceiling: The most sophisticated RAG architecture cannot compensate for poorly prepared data.
Parsing determines everything downstream: Invest in quality parsing tools and validate output before proceeding.
Context matters more than chunk size: In Anthropic's benchmarks, contextual embeddings reduce retrieval failures by 35% on their own, by 49% when combined with BM25, and by 67% with reranking added.
Quality metrics are non-negotiable: Measure parsing accuracy, chunk coherence, and retrieval quality throughout the pipeline.
Start simple, measure, enhance: Begin with RecursiveCharacterTextSplitter and quality parsing. Add complexity only when metrics justify it.
Sources
- Anthropic: Introducing Contextual Retrieval - Research on contextual chunking with performance benchmarks
- LangChain Text Splitters Documentation - Official documentation for chunking strategies
- Docling: Document Parsing Library - High-accuracy document parsing tool
- MTEB Leaderboard - Massive Text Embedding Benchmark for model comparison
- Qdrant Documentation - Vector database with metadata filtering
- datasketch: MinHash LSH - Near-duplicate detection library
- spaCy: Industrial NLP - Named entity recognition and text processing
- OpenAI Embeddings Guide - Embedding model documentation with Matryoshka truncation