RAG Ingestion Pipeline

Document loading · Chunking strategies · Embeddings · Vector store indexing


What is the Ingestion Pipeline?

The ingestion pipeline converts raw source documents (PDFs, DOCX, HTML, plain text) into searchable vector chunks stored in a vector database. It runs offline — either once or incrementally — and is the foundation of any RAG system. Poor ingestion causes poor retrieval regardless of how good the LLM is.

The quality of your RAG system is bounded by the quality of your ingestion pipeline. Garbage in, garbage out.

Pipeline Stages

Stage | What happens | Key decisions
1. Loading | Read raw files from disk, S3, URLs | File format handling, encoding, error quarantine
2. Parsing | Extract clean text, preserve structure | PDF parser choice, table extraction, OCR fallback
3. Chunking | Split into indexed units | Strategy, size, overlap
4. Metadata | Attach source info to each chunk | What to store, how to filter later
5. Embedding | Convert text to dense vectors | Model choice, dimensionality, normalisation
6. Indexing | Store vectors + metadata in DB | Batch size, upsert strategy, deduplication
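
The six stages compose into one loop. A minimal end-to-end sketch, where parser, chunker, embedder, and vector_store are hypothetical stand-ins for real components (the names are illustrative, not a library API):

```python
def ingest(paths, parser, chunker, embedder, vector_store):
    """Run the six pipeline stages for each source file."""
    indexed = 0
    for path in paths:
        try:
            raw = parser(path)                      # stages 1-2: load + parse
        except OSError:
            continue                                # quarantine unreadable files
        for i, text in enumerate(chunker(raw)):     # stage 3: chunk
            chunk = {
                "page_content": text,
                "metadata": {"source": path, "chunk_id": f"{path}-c{i}"},  # stage 4
                "embedding": embedder(text),        # stage 5: embed
            }
            vector_store.append(chunk)              # stage 6: index
            indexed += 1
    return indexed
```

In a real pipeline each stand-in is a concrete component (a PyMuPDF parser, a text splitter, an embedding client, a vector DB upsert), but the control flow stays this simple.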

Document Loading by Format

PDF

PyMuPDF (fitz) — best for text-heavy PDFs, preserves layout. Fall back to AWS Textract for scanned/image-only PDFs. Detect scanned pages by checking whether the extracted text is under 100 characters per page.

import fitz  # PyMuPDF

def load_pdf(path: str) -> list[str]:
    doc = fitz.open(path)
    pages = []
    for page in doc:
        text = page.get_text()
        if len(text.strip()) < 100:
            text = ocr_fallback(page)  # AWS Textract
        pages.append(text)
    return pages

DOCX / HTML / plain text

Use python-docx for DOCX (preserves heading hierarchy), BeautifulSoup for HTML (strip nav/footer boilerplate), unstructured.io as a catch-all for 20+ formats.
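
The HTML boilerplate-stripping step can be sketched with the standard library alone; BeautifulSoup's decompose() achieves the same with less code. TextExtractor and the BOILERPLATE set are illustrative names:

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "footer", "header", "aside", "script", "style"}

class TextExtractor(HTMLParser):
    """Collect text, skipping anything inside boilerplate elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # how many boilerplate elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def load_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```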

Chunking Strategies

Strategy | How it works | Best for | Risk
Fixed-size | Split every N characters | Uniform docs | Breaks mid-sentence
Recursive character | Split on \n\n → \n → space, prefer paragraph breaks | General prose | Inconsistent chunk sizes
Token-based | Split on token count (tiktoken) | LLM context window control | Semantic breaks ignored
Semantic | Embed sentences, split where similarity drops below threshold | Complex documents | Slow, expensive
Markdown/header | Split at H1/H2/H3 boundaries | Structured docs, policies | Requires clean markdown

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(raw_text)

LoanIQ: Semantic Chunking for Policy Documents

Policy documents have dense, clause-based structure. Splitting mid-clause loses the rule. LoanIQ uses semantic chunking with a similarity threshold of 0.85 — adjacent sentences are grouped until semantic similarity to the next sentence drops, signalling a topic boundary.
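
A sketch of that grouping logic, assuming sentence embeddings are already computed (the embedding call itself is elided; cosine and semantic_chunk are illustrative names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunk(sentences, embeddings, threshold=0.85):
    """Group adjacent sentences; start a new chunk when similarity to the
    next sentence falls below the threshold (a likely topic boundary)."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```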

Metadata Design

Metadata is stored alongside each chunk and used for pre-filtering before vector search — much cheaper than searching all vectors.

chunk = {
    "page_content": "DTI ratio must not exceed 43% for conventional loans...",
    "metadata": {
        "source": "underwriting_guidelines_v3.pdf",
        "section": "Debt-to-Income",
        "product_type": "conventional",
        "effective_date": "2024-01-01",
        "page": 14,
        "chunk_id": "ug-v3-p14-c2"
    }
}
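
A metadata pre-filter is just a predicate applied before any vector maths; a minimal sketch (prefilter is an illustrative name — real vector DBs expose this as a filter parameter on the query):

```python
def prefilter(chunks, **filters):
    """Keep only chunks whose metadata matches every filter key/value."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in filters.items())
    ]
```

For example, prefilter(chunks, product_type="conventional") narrows the candidate set before any similarity scoring happens.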

Embedding Models

Model | Dims | Best for
text-embedding-3-small | 1536 | General RAG, low cost
text-embedding-3-large | 3072 | High-accuracy retrieval
BGE-large-en | 1024 | Open-source, strong on technical docs
Amazon Titan Embeddings | 1536 | AWS Bedrock stack

Always normalise embeddings (L2 norm) before storing. Use the same model for ingestion and query — a mismatch produces garbage retrieval.
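
Normalisation is one line of maths; a dependency-free sketch (production pipelines typically apply this with numpy over a whole batch):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)
```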

Indexing Strategy

# Batch upsert with deduplication by chunk_id
def index_chunks(chunks: list[dict], vector_store):
    existing_ids = vector_store.get_existing_ids()
    new_chunks = [c for c in chunks if c["metadata"]["chunk_id"] not in existing_ids]
    if new_chunks:
        vector_store.add_documents(new_chunks, batch_size=100)
    return len(new_chunks)

Common Interview Questions

Q: How do you handle duplicate documents?

Hash the document content (SHA-256) and store in a seen-hashes set before ingestion. If hash exists, skip. For near-duplicates (updated versions), use a version metadata field and upsert by source + version.
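
A sketch of the exact-duplicate path (content_hash and dedupe are illustrative names):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def dedupe(documents, seen_hashes):
    """Skip any document whose content hash was ingested before."""
    fresh = []
    for doc in documents:
        h = content_hash(doc)
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(doc)
    return fresh
```

The seen_hashes set would live in a persistent store (a DB table or the vector store's metadata) so incremental runs skip already-ingested content.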

Q: How do you decide chunk size?

Rule of thumb: chunk size should match the retrieval granularity you need. For Q&A over dense policy text, 256–512 tokens with 64 overlap. For long-form document summarisation, 1024 tokens. Always test retrieval quality with RAGAS context recall at different sizes.

Q: What if a table is split across chunks?

Extract tables separately as markdown strings, keeping the entire table as one chunk regardless of size. Tag with content_type: table metadata. This prevents a cell from landing in one chunk with no column header context.
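
Keeping a table atomic amounts to routing it around the text splitter; a minimal sketch with illustrative field names:

```python
def table_chunk(markdown_table: str, source: str, chunk_id: str) -> dict:
    """Wrap a whole markdown table as one chunk, bypassing the splitter,
    so column headers and cells are never separated."""
    return {
        "page_content": markdown_table,
        "metadata": {
            "source": source,
            "chunk_id": chunk_id,
            "content_type": "table",
        },
    }
```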