RAG Ingestion Pipeline
Document loading · Chunking strategies · Embeddings · Vector store indexing
What is the Ingestion Pipeline?
The ingestion pipeline converts raw source documents (PDFs, DOCX, HTML, plain text) into searchable vector chunks stored in a vector database. It runs offline — either once or incrementally — and is the foundation of any RAG system. Poor ingestion causes poor retrieval regardless of how good the LLM is.
The quality of your RAG system is bounded by the quality of your ingestion pipeline. Garbage in, garbage out.
Pipeline Stages
| Stage | What happens | Key decisions |
|---|---|---|
| 1. Loading | Read raw files from disk, S3, URLs | File format handling, encoding, error quarantine |
| 2. Parsing | Extract clean text, preserve structure | PDF parser choice, table extraction, OCR fallback |
| 3. Chunking | Split into indexed units | Strategy, size, overlap |
| 4. Metadata | Attach source info to each chunk | What to store, how to filter later |
| 5. Embedding | Convert text to dense vectors | Model choice, dimensionality, normalisation |
| 6. Indexing | Store vectors + metadata in DB | Batch size, upsert strategy, deduplication |
Document Loading by Format
PyMuPDF (fitz) — best for text-heavy PDFs, preserves layout. Fall back to AWS Textract for scanned/image-only PDFs. Detect scanned pages by checking whether the extracted text is under 100 chars per page.
```python
import fitz  # PyMuPDF

def load_pdf(path: str) -> list[str]:
    doc = fitz.open(path)
    pages = []
    for page in doc:
        text = page.get_text()
        if len(text.strip()) < 100:  # likely a scanned/image-only page
            text = ocr_fallback(page)  # AWS Textract
        pages.append(text)
    return pages
```
DOCX / HTML / plain text
Use python-docx for DOCX (preserves heading hierarchy), BeautifulSoup for HTML (strip nav/footer boilerplate), unstructured.io as a catch-all for 20+ formats.
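The boilerplate-stripping idea for HTML can be sketched with the stdlib `html.parser` (BeautifulSoup's `decompose()` does the same more robustly; the set of tags to strip is an assumption):

```python
from html.parser import HTMLParser

class BoilerplateStripper(HTMLParser):
    """Collect visible text, skipping anything nested inside
    nav/footer/script/style/aside elements."""
    SKIP = {"nav", "footer", "script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return "\n".join(parser.parts)
```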
Chunking Strategies
| Strategy | How it works | Best for | Risk |
|---|---|---|---|
| Fixed-size | Split every N characters | Uniform docs | Breaks mid-sentence |
| Recursive character | Split on \n\n → \n → space, prefer paragraph breaks | General prose | Inconsistent chunk sizes |
| Token-based | Split on token count (tiktoken) | LLM context window control | Semantic breaks ignored |
| Semantic | Embed sentences, split where similarity drops below threshold | Complex documents | Slow, expensive |
| Markdown/header | Split at H1/H2/H3 boundaries | Structured docs, policies | Requires clean markdown |
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(raw_text)
```
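The token-based strategy from the table is just a sliding window over token counts. A sketch with a pluggable tokenizer — whitespace splitting stands in for a real tokenizer such as `tiktoken` here:

```python
def split_by_tokens(text, chunk_size=512, overlap=64,
                    encode=str.split, decode=" ".join):
    """Slide a fixed-size token window across the text, stepping
    forward by (chunk_size - overlap) tokens each time."""
    tokens = encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(decode(window))
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail
    return chunks
```

Swapping in `tiktoken` means passing its `encode`/`decode` for the defaults, which gives exact control over how many LLM tokens each chunk consumes.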
LoanIQ: Semantic Chunking for Policy Documents
Policy documents have dense, clause-based structure. Splitting mid-clause loses the rule. LoanIQ uses semantic chunking with a similarity threshold of 0.85 — adjacent sentences are grouped until semantic similarity to the next sentence drops, signalling a topic boundary.
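The grouping logic described above can be sketched as follows — `embed` is an injected sentence-embedding function, and the 0.85 threshold matches the figure quoted for LoanIQ (this is an illustrative reconstruction, not LoanIQ's actual code):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.85):
    """Group consecutive sentences into chunks; start a new chunk
    when similarity to the next sentence drops below the threshold,
    signalling a topic boundary."""
    if not sentences:
        return []
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```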
Metadata Design
Metadata is stored alongside each chunk and used for pre-filtering before vector search — much cheaper than searching all vectors.
```python
chunk = {
    "page_content": "DTI ratio must not exceed 43% for conventional loans...",
    "metadata": {
        "source": "underwriting_guidelines_v3.pdf",
        "section": "Debt-to-Income",
        "product_type": "conventional",
        "effective_date": "2024-01-01",
        "page": 14,
        "chunk_id": "ug-v3-p14-c2",
    },
}
```
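Pre-filtering can be sketched against an in-memory list of such chunks — apply metadata equality filters first, then score only the survivors (real vector stores push the filter into the index; `prefiltered_search` is a hypothetical helper):

```python
def prefiltered_search(chunks, query_vec, filters, top_k=3):
    """Metadata pre-filter, then brute-force similarity over the
    survivors. Dot product == cosine if vectors are L2-normalised."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in filters.items())
    ]
    scored = sorted(
        candidates,
        key=lambda c: sum(a * b for a, b in zip(c["embedding"], query_vec)),
        reverse=True,
    )
    return scored[:top_k]
```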
Embedding Models
| Model | Dims | Best for |
|---|---|---|
| text-embedding-3-small | 1536 | General RAG, low cost |
| text-embedding-3-large | 3072 | High-accuracy retrieval |
| BGE-large-en | 1024 | Open-source, strong on technical docs |
| Amazon Titan Embeddings | 1536 | AWS Bedrock stack |
Always normalise embeddings (L2 norm) before storing. Use the same model for ingestion and query — a mismatch produces garbage retrieval.
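L2 normalisation is a one-liner worth spelling out — after it, dot product and cosine similarity are interchangeable:

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # zero vector: nothing to normalise
    return [x / norm for x in vec]
```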
Indexing Strategy
```python
# Batch upsert with deduplication by chunk_id
def index_chunks(chunks: list[dict], vector_store):
    existing_ids = vector_store.get_existing_ids()
    new_chunks = [
        c for c in chunks
        if c["metadata"]["chunk_id"] not in existing_ids
    ]
    if new_chunks:
        vector_store.add_documents(new_chunks, batch_size=100)
    return len(new_chunks)
```
Common Interview Questions
Q: How do you handle duplicate documents?
Hash the document content (SHA-256) and store in a seen-hashes set before ingestion. If hash exists, skip. For near-duplicates (updated versions), use a version metadata field and upsert by source + version.
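The exact-duplicate check can be sketched with `hashlib` from the stdlib:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def dedupe(docs, seen=None):
    """Yield only documents whose content hash hasn't been seen yet."""
    seen = set() if seen is None else seen
    for doc in docs:
        h = content_hash(doc)
        if h not in seen:
            seen.add(h)
            yield doc
```

Persisting the `seen` set between runs is what makes incremental ingestion skip already-indexed documents.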
Q: How do you decide chunk size?
Rule of thumb: chunk size should match the retrieval granularity you need. For Q&A over dense policy text, 256–512 tokens with 64 overlap. For long-form document summarisation, 1024 tokens. Always test retrieval quality with RAGAS context recall at different sizes.
Q: What if a table is split across chunks?
Extract tables separately as markdown strings, keeping the entire table as one chunk regardless of size. Tag with content_type: table metadata. This prevents a cell from landing in one chunk with no column header context.
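The table-as-one-chunk idea might look like this — `rows` is assumed to come from whatever table extractor the parser uses, and the helper name is illustrative:

```python
def table_to_chunk(rows, source, page):
    """Render a parsed table (list of rows, header first) as a single
    markdown chunk, tagged so retrieval knows it's a table."""
    header, *body = rows
    md = "| " + " | ".join(header) + " |\n"
    md += "|" + "---|" * len(header) + "\n"
    for row in body:
        md += "| " + " | ".join(row) + " |\n"
    return {
        "page_content": md,
        "metadata": {"content_type": "table", "source": source, "page": page},
    }
```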