RAG Ingestion Pipeline

Document loading · Chunking strategies · Embeddings · Vector store indexing


What is the Ingestion Pipeline?

The ingestion pipeline converts raw source documents (PDFs, DOCX, HTML, plain text) into searchable vector chunks stored in a vector database. It runs offline — either once or incrementally — and is the foundation of any RAG system. Poor ingestion causes poor retrieval regardless of how good the LLM is.

The quality of your RAG system is bounded by the quality of your ingestion pipeline. Garbage in, garbage out.

Pipeline Stages

Stage | What happens | Key decisions
1. Loading | Read raw files from disk, S3, URLs | File format handling, encoding, error quarantine
2. Parsing | Extract clean text, preserve structure | PDF parser choice, table extraction, OCR fallback
3. Chunking | Split into indexed units | Strategy, size, overlap
4. Metadata | Attach source info to each chunk | What to store, how to filter later
5. Embedding | Convert text to dense vectors | Model choice, dimensionality, normalisation
6. Indexing | Store vectors + metadata in DB | Batch size, upsert strategy, deduplication
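
The six stages compose into one loop. A minimal end-to-end sketch, where parser, chunker, embedder, and vector_store are hypothetical stand-ins for real components (the names are illustrative, not a library API):

```python
def ingest(paths, parser, chunker, embedder, vector_store):
    """Run the six pipeline stages for each source file."""
    indexed = 0
    for path in paths:
        try:
            raw = parser(path)                      # stages 1-2: load + parse
        except OSError:
            continue                                # quarantine unreadable files
        for i, text in enumerate(chunker(raw)):     # stage 3: chunk
            chunk = {
                "page_content": text,
                "metadata": {"source": path, "chunk_id": f"{path}-c{i}"},  # stage 4
                "embedding": embedder(text),        # stage 5: embed
            }
            vector_store.append(chunk)              # stage 6: index
            indexed += 1
    return indexed
```

In a real pipeline each stand-in is a concrete component (a PyMuPDF parser, a text splitter, an embedding client, a vector DB upsert), but the control flow stays this simple.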

Document Loading by Format

PDF

PyMuPDF (fitz) — best for text-heavy PDFs, preserves layout. Fall back to AWS Textract for scanned/image-only PDFs. Detect scanned pages by checking whether the extracted text is under 100 characters per page.

import fitz  # PyMuPDF

def load_pdf(path: str) -> list[str]:
    doc = fitz.open(path)
    pages = []
    for page in doc:
        text = page.get_text()
        if len(text.strip()) < 100:
            text = ocr_fallback(page)  # AWS Textract
        pages.append(text)
    return pages

DOCX / HTML / plain text

Use python-docx for DOCX (preserves heading hierarchy), BeautifulSoup for HTML (strip nav/footer boilerplate), unstructured.io as a catch-all for 20+ formats.
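
The HTML boilerplate-stripping step can be sketched with the standard library alone; BeautifulSoup's decompose() achieves the same with less code. TextExtractor and the BOILERPLATE set are illustrative names:

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "footer", "header", "aside", "script", "style"}

class TextExtractor(HTMLParser):
    """Collect text, skipping anything inside boilerplate elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # how many boilerplate elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def load_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```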

Chunking Strategies

Strategy | How it works | Best for | Risk
Fixed-size | Split every N characters | Uniform docs | Breaks mid-sentence
Recursive character | Split on \n\n → \n → space, prefer paragraph breaks | General prose | Inconsistent chunk sizes
Token-based | Split on token count (tiktoken) | LLM context window control | Semantic breaks ignored
Semantic | Embed sentences, split where similarity drops below threshold | Complex documents | Slow, expensive
Markdown/header | Split at H1/H2/H3 boundaries | Structured docs, policies | Requires clean markdown

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(raw_text)

LoanIQ: Semantic Chunking for Policy Documents

Policy documents have dense, clause-based structure. Splitting mid-clause loses the rule. LoanIQ uses semantic chunking with a similarity threshold of 0.85 — adjacent sentences are grouped until semantic similarity to the next sentence drops, signalling a topic boundary.
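
A sketch of that grouping logic, assuming sentence embeddings are already computed (the embedding call itself is elided; cosine and semantic_chunk are illustrative names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunk(sentences, embeddings, threshold=0.85):
    """Group adjacent sentences; start a new chunk when similarity to the
    next sentence falls below the threshold (a likely topic boundary)."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```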

Metadata Design

Metadata is stored alongside each chunk and used for pre-filtering before vector search — much cheaper than searching all vectors.

chunk = {
    "page_content": "DTI ratio must not exceed 43% for conventional loans...",
    "metadata": {
        "source": "underwriting_guidelines_v3.pdf",
        "section": "Debt-to-Income",
        "product_type": "conventional",
        "effective_date": "2024-01-01",
        "page": 14,
        "chunk_id": "ug-v3-p14-c2"
    }
}
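
A metadata pre-filter is just a predicate applied before any vector maths; a minimal sketch (prefilter is an illustrative name — real vector DBs expose this as a filter parameter on the query):

```python
def prefilter(chunks, **filters):
    """Keep only chunks whose metadata matches every filter key/value."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in filters.items())
    ]
```

For example, prefilter(chunks, product_type="conventional") narrows the candidate set before any similarity scoring happens.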

Embedding Models

Model | Dims | Best for
text-embedding-3-small | 1536 | General RAG, low cost
text-embedding-3-large | 3072 | High-accuracy retrieval
BGE-large-en | 1024 | Open-source, strong on technical docs
Amazon Titan Embeddings | 1536 | AWS Bedrock stack

Always normalise embeddings (L2 norm) before storing. Use the same model for ingestion and query — a mismatch produces garbage retrieval.
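
Normalisation is one line of maths; a dependency-free sketch (production pipelines typically apply this with numpy over a whole batch):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)
```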

Indexing Strategy

# Batch upsert with deduplication by chunk_id
def index_chunks(chunks: list[dict], vector_store):
    existing_ids = vector_store.get_existing_ids()
    new_chunks = [c for c in chunks if c["metadata"]["chunk_id"] not in existing_ids]
    if new_chunks:
        vector_store.add_documents(new_chunks, batch_size=100)
    return len(new_chunks)

Common Interview Questions

Q: How do you handle duplicate documents?

Hash the document content (SHA-256) and store in a seen-hashes set before ingestion. If hash exists, skip. For near-duplicates (updated versions), use a version metadata field and upsert by source + version.
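
A sketch of the exact-duplicate path (content_hash and dedupe are illustrative names):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def dedupe(documents, seen_hashes):
    """Skip any document whose content hash was ingested before."""
    fresh = []
    for doc in documents:
        h = content_hash(doc)
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(doc)
    return fresh
```

The seen_hashes set would live in a persistent store (a DB table or the vector store's metadata) so incremental runs skip already-ingested content.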

Q: How do you decide chunk size?

Rule of thumb: chunk size should match the retrieval granularity you need. For Q&A over dense policy text, 256–512 tokens with 64 overlap. For long-form document summarisation, 1024 tokens. Always test retrieval quality with RAGAS context recall at different sizes.

Q: What if a table is split across chunks?

Extract tables separately as markdown strings, keeping the entire table as one chunk regardless of size. Tag with content_type: table metadata. This prevents a cell from landing in one chunk with no column header context.
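
Keeping a table atomic amounts to routing it around the text splitter; a minimal sketch with illustrative field names:

```python
def table_chunk(markdown_table: str, source: str, chunk_id: str) -> dict:
    """Wrap a whole markdown table as one chunk, bypassing the splitter,
    so column headers and cells are never separated."""
    return {
        "page_content": markdown_table,
        "metadata": {
            "source": source,
            "chunk_id": chunk_id,
            "content_type": "table",
        },
    }
```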