RAG Development That Survives Real Data

Production RAG breaks on three things: bad chunking, missing query rewriting, and an eval set of ten queries written by the founder. We build the other 80% of the system that makes retrieval actually reliable on your real corpus: messy PDFs, stale wikis, domain jargon, multi-hop questions, and the data that updates while your agent is still pulling yesterday's snapshot.

Pixelfield is for CTOs, Heads of Product and VPs of Engineering at companies that need their AI to reason over their own data, not generic training data. The seniors who scope the work also write the code. 50+ AI features in production, fastest deployment two months, and a reputation for telling you when fine-tuning is the more expensive mistake.

  • The 80% past the vector search
  • Chunking, reranking, hybrid retrieval
  • Evaluation as CI/CD gate, not vibes
  • Freshness SLAs and drift monitoring
Veolia, Universal studios, Mercedes, Vienna insurance group, Raiffeisen Bank, Geometry, Wagestream, Cinestar, WMC | GREY, NOAH, Ogilvy, Ameli
4.9/5 on Google
4.8/5 on Trustpilot
5.0/5 on Clutch

Shipping AI inside production products for scaleups and enterprises across the UK, Europe and the US since 2013.

A London-based engineering team that treats RAG as a data platform problem, not a vector search trick.

/ Deliverables
What We Actually Build (the 80% Past the Vector Search)
01
Production RAG Pipeline (Past the Prototype)
Prototype RAG is chunk, embed, store, retrieve, send to LLM. Production RAG is metadata enrichment, query rewriting, hybrid retrieval, cross-encoder reranking, citation tracking, freshness SLAs, evaluation in CI and fallback chains when retrieval misses.
We build the whole pipeline as engineering, not configuration. The ML is maybe 20% of the system.
End-to-end pipeline
Query rewriting
Citation tracking
Production gates
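One way to picture a fallback chain for retrieval misses is the minimal sketch below. It assumes each retriever is a plain function returning (document id, score) pairs with scores between 0 and 1; the names `retrieve_with_fallback`, `vector_search`, `keyword_search` and the 0.5 threshold are hypothetical, chosen for illustration rather than taken from any real system.

```python
from typing import Callable, List, Optional, Tuple

Hit = Tuple[str, float]  # (document id, relevance score in [0, 1])
Retriever = Callable[[str], List[Hit]]


def retrieve_with_fallback(
    query: str,
    retrievers: List[Tuple[str, Retriever]],
    min_score: float = 0.5,  # assumed confidence threshold
) -> Tuple[Optional[str], List[Hit]]:
    """Try each retriever in order and return the first confident result set."""
    for name, retriever in retrievers:
        confident = [hit for hit in retriever(query) if hit[1] >= min_score]
        if confident:
            return name, confident
    # Every stage missed: the caller should decline to answer rather than guess.
    return None, []


# Stand-in retrievers with hard-coded results, for illustration only.
def vector_search(query: str) -> List[Hit]:
    return [("doc-17", 0.31)]


def keyword_search(query: str) -> List[Hit]:
    return [("doc-42", 0.88), ("doc-03", 0.12)]


stage, hits = retrieve_with_fallback(
    "error code E-1043",
    [("vector", vector_search), ("keyword", keyword_search)],
)
print(stage, hits)
```

Returning the name of the stage that answered makes it possible to log how often the primary retriever misses, which is itself a useful production signal.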
02
Chunking Strategy Per Data Type
Most RAG systems fail at chunking, not retrieval. The best embedding model in the world won't save you if your chunks split a thought in half. We design chunking per data type: semantic units for prose docs, structural for code and config, table-aware for PDFs and spreadsheets, conversation-aware for tickets and chat logs.
200-500 tokens with 10% overlap is a starting point, not a recipe. Real chunking is data-shape specific.
Semantic chunking
Table-aware parsing
Code chunking
Per-data-type strategy
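For reference, the fixed-window baseline that the "starting point" figure describes could look roughly like the sketch below. It assumes tokens are already available as a list (whitespace splitting stands in for a real tokenizer), and `chunk_tokens` is a hypothetical helper name. It is the naive baseline, not the per-data-type strategy.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    chunk_size: int = 300,       # somewhere in the 200-500 token range
    overlap_ratio: float = 0.10,  # 10% of each chunk repeats in the next
) -> List[List[str]]:
    """Split tokens into fixed-size windows that overlap by a fixed ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size

    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


text = "Refunds are processed within five working days of approval by finance"
for chunk in chunk_tokens(text.split(), chunk_size=5, overlap_ratio=0.2):
    print(" ".join(chunk))
```

Note how the windows cut wherever the count runs out, regardless of sentence or table boundaries. That is exactly the failure a data-shape-specific chunker is meant to avoid.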
03
Hybrid Retrieval and Reranking
Vector search alone gives you recall on semantic similarity but misses exact terms, identifiers and rare keywords. BM25 plus vector plus optional graph covers both kinds of query. A cross-encoder reranker then gives you precision. Teams that skip reranking and tune embedding models instead are optimising the wrong layer.
We tune the retrieval mix to your corpus and your query distribution, not a benchmark dataset.
Hybrid search
BM25 + vector
Cross-encoder reranking
Latency profiling
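As an illustration of how a BM25 ranking and a vector ranking can be merged, the sketch below uses reciprocal rank fusion, one common fusion method. It is an assumption for the example, not a statement of the method used on any given project. The document ids are made up, and `k=60` is the constant conventionally used with this technique.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) per list it appears in, so a
    document ranked well by both retrievers beats one ranked first by
    only one of them.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query containing an exact SKU.
bm25_ranking = ["doc_sku", "doc_a", "doc_b"]    # keyword match finds the SKU
vector_ranking = ["doc_a", "doc_c", "doc_sku"]  # embedding search ranks it last

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

The fused list would then be passed to a reranker, which scores each query and document pair directly and trims the list for precision.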
04
Graph RAG and Agentic RAG (When It's Worth It)
Vector search hits 32-75% accuracy on multi-hop queries. Graph RAG hits 85%+ when the data is genuinely relational: org charts, invoice chains, code dependencies, compliance lineage, customer-to-account-to-contract. Agentic RAG fits when the system needs to choose how to retrieve (vector, graph, SQL, API).
We make the call with you in writing. The complexity cost is real and we'll tell you when standard RAG with hybrid search is enough.
Graph RAG
Agentic RAG
Multi-hop retrieval
Architecture decision
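To make "multi-hop" concrete, here is a toy sketch of a two-hop lookup over relational data, answering "who reports to the person who approved this invoice". The edge list, the names in it and the helper functions are all invented for illustration; a real graph store would replace the list scan.

```python
from typing import List, Tuple

# Hypothetical (subject, relation, object) edges.
EDGES: List[Tuple[str, str, str]] = [
    ("invoice_X", "approved_by", "dana"),
    ("lee", "reports_to", "dana"),
    ("sam", "reports_to", "dana"),
    ("kim", "reports_to", "lee"),
]


def targets(subject: str, relation: str) -> List[str]:
    """Follow a relation forwards: subject -> objects."""
    return [o for s, r, o in EDGES if s == subject and r == relation]


def sources(relation: str, obj: str) -> List[str]:
    """Follow a relation backwards: object -> subjects."""
    return [s for s, r, o in EDGES if r == relation and o == obj]


def who_reports_to_approver(invoice: str) -> List[str]:
    """Hop 1: invoice -> approver. Hop 2: approver -> direct reports."""
    reports: List[str] = []
    for approver in targets(invoice, "approved_by"):
        reports.extend(sources("reports_to", approver))
    return sorted(reports)


print(who_reports_to_approver("invoice_X"))
```

No single text chunk is likely to contain both the approval and the reporting line, which is why similarity search alone tends to struggle with this shape of question.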
05
Ingestion for Messy Real Data
Your corpus isn't a clean Confluence export. Half the docs are outdated. PDFs are scanned sideways. Tables are merged. Wikis have stale pages. CRMs have duplicate records. We build ingestion pipelines that handle the reality: OCR for scanned PDFs (with handwriting where needed), structure-aware table extraction, deduplication, freshness flags, source-of-truth precedence.
For specialised OCR / vision work, see our NLP & Computer Vision page.
OCR + tables
Multi-format corpora
Deduplication
Source-of-truth precedence
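Deduplication with source-of-truth precedence can be sketched as follows. This version assumes records are dictionaries with `source` and `text` fields and only catches duplicates that match exactly after whitespace and case normalisation. The precedence table and field names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical precedence: lower number wins when two sources hold the same content.
SOURCE_PRECEDENCE = {"crm": 0, "wiki": 1, "shared_drive": 2}


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(records: List[dict]) -> List[dict]:
    """Keep one record per distinct content, preferring the most trusted source."""
    best: Dict[str, Tuple[int, dict]] = {}
    for record in records:
        key = hashlib.sha256(normalise(record["text"]).encode("utf-8")).hexdigest()
        # Unknown sources rank below every known one.
        rank = SOURCE_PRECEDENCE.get(record["source"], len(SOURCE_PRECEDENCE))
        current = best.get(key)
        if current is None or rank < current[0]:
            best[key] = (rank, record)
    return [record for _, record in best.values()]


records = [
    {"source": "wiki", "text": "Acme Ltd  renewal date: 1 March"},
    {"source": "crm", "text": "acme ltd renewal date: 1 march"},
    {"source": "shared_drive", "text": "Onboarding checklist v3"},
]
for record in deduplicate(records):
    print(record["source"], "|", record["text"])
```

Near-duplicates that differ by a few words would need a similarity technique such as shingling with MinHash rather than an exact hash.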
06
RAG Evaluation as CI/CD Gate
Hit rate is a vanity metric. What matters: retrieval precision, answer faithfulness, citation accuracy, context relevance. We use RAGAS and DeepEval as CI/CD gates so prompt and embedding-model changes go through the same release process as code. Regression on a real eval set blocks the merge.
We build the eval set from real production queries, not ten questions written by the founder.
RAGAS / DeepEval
Faithfulness scoring
Citation accuracy
CI gate on regression
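The gate itself can be very small. The sketch below assumes the evaluation tool has already produced a score per metric for the main branch and for the candidate change, and it deliberately does not call RAGAS or DeepEval. The metric names, the scores and the 0.02 tolerance are placeholders.

```python
from typing import Dict, Optional, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # assumed allowance for run-to-run noise
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics where the candidate is missing or worse than baseline - tolerance."""
    regressions: Dict[str, Tuple[float, Optional[float]]] = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or new_score < base_score - tolerance:
            regressions[metric] = (base_score, new_score)
    return regressions


def gate_exit_code(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """0 lets the merge through; 1 fails the CI job."""
    regressions = find_regressions(baseline, candidate)
    for metric, (old, new) in regressions.items():
        print(f"REGRESSION {metric}: {old} -> {new}")
    return 1 if regressions else 0


# Placeholder scores for the main branch and for a prompt change under review.
baseline_scores = {"faithfulness": 0.90, "citation_accuracy": 0.85}
candidate_scores = {"faithfulness": 0.91, "citation_accuracy": 0.80}

exit_code = gate_exit_code(baseline_scores, candidate_scores)
```

In a pipeline, the job would end with `sys.exit(exit_code)` so that a regression on any tracked metric blocks the merge.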
07
Freshness, Drift, Monitoring
Your Postgres updated five minutes ago. Your data warehouse synced last hour. If your agent is pulling yesterday's snapshot, the answers are confidently wrong. We design freshness SLAs into the ingestion (CDC where possible, scheduled where not), backfill protocols, idempotent reprocessing and lineage from answer back to source doc.
At launch we wire in monitoring on retrieval quality, drift detection on query distribution, and alerting on faithfulness regression.
Freshness SLAs
CDC ingestion
Lineage to source
Drift monitoring
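A freshness SLA check reduces to comparing each source's last successful sync against its allowed lag. The sketch below assumes sync timestamps are recorded somewhere the monitor can read. The source names and SLA values are illustrative, not recommendations.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def stale_sources(
    last_synced: Dict[str, datetime],
    slas: Dict[str, timedelta],
    now: datetime,
) -> Dict[str, Optional[timedelta]]:
    """Return each source whose lag exceeds its SLA, mapped to that lag.

    A source with an SLA but no recorded sync is reported with a lag of None.
    """
    stale: Dict[str, Optional[timedelta]] = {}
    for source, allowed_lag in slas.items():
        synced_at = last_synced.get(source)
        lag = None if synced_at is None else now - synced_at
        if lag is None or lag > allowed_lag:
            stale[source] = lag
    return stale


# Illustrative SLAs: CDC-fed sources get tight limits, scheduled ones looser.
slas = {
    "postgres": timedelta(minutes=15),
    "warehouse": timedelta(hours=24),
    "wiki": timedelta(days=7),
}
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_synced = {
    "postgres": now - timedelta(minutes=5),
    "warehouse": now - timedelta(hours=26),
    # "wiki" has never synced in this example.
}

for source, lag in stale_sources(last_synced, slas, now).items():
    print(f"{source}: lag={lag}")
```

Passing `now` in as an argument keeps the check deterministic in tests. The same lag values can also be attached to answers, so a response built on a stale source can be flagged rather than served as current.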
08
IP Ownership and Handover
You own everything we deliver. Source code, prompts, prompt registry, ingestion pipelines, eval datasets, infrastructure-as-code, monitoring dashboards, runbooks. No rented layer we hold back. No vendor lock-in on us. At the end of the engagement we hand it over with documentation and a training session, or continue with monthly support. Your call.
Full IP transfer
No lock-in
Eval datasets included
Runbooks

Why Most Production RAG Breaks (and What We Do About It)

It Works on 10 Test Documents, Breaks on Yours

Every RAG demo works on a handful of clean docs. Production has millions of pages, scanned PDFs, stale wikis, near-duplicates, multi-hop questions and users who ask in ways your test set never anticipated. We design for that reality from week one: corpus profiling, real query collection, eval set built from production data, fail-fast on the patterns that usually kill RAG.

The 80% That Isn't the Vector Search

Vector search is the famous 20% of a RAG system. The 80% that decides whether it ships is chunking strategy, query rewriting, hybrid retrieval, reranking, citation tracking, freshness SLAs, evaluation pipelines and monitoring. Most agencies sell the 20%. We build the other 80%.

Your Eval Set Decides Whether It Works

An eval set of ten questions written by the founder is the most common RAG failure pattern we see. We build the eval set from real production queries, expand it with adversarial cases (multi-hop, near-duplicate, stale-data traps) and run it as a CI/CD gate. RAGAS or DeepEval, gating prompt and embedding-model changes. If you can't measure it, you can't ship it.

Honest About When Fine-Tuning Wins (Rarely)

Around 51% of enterprise AI ships as RAG. RAG gives 80% of the value with 20% of the operational overhead. Fine-tuning is the right call for narrow, stable, high-volume behavioural change (tone, output format) where prompt engineering can't enforce it reliably. We've recommended classical RAG with better chunking after teams burned two months on fine-tuning. We'll do the same for you if it's the right call.

How the Engagement Runs, Week by Week

01

Corpus Audit and Architecture Decision (Weeks 1-2, fixed-fee from £2K)

We sample your real corpus, classify the data shapes (structured, semi-structured, scanned, conversational), profile query patterns from real users where possible, and identify what should be deleted, restructured, or excluded before retrieval. Discovery includes an evaluation baseline on your data using a small eval set, so you know where you're starting from.

You receive: corpus profile, retrieval architecture (standard vs Graph vs Agentic, with rationale), ingestion plan, evaluation framework, fixed-scope build quote and projected monthly run rate.

02

Build Ingestion and Retrieval (Weeks 3-6+)

Build runs as a small senior team led by Michal Vavra. We build the ingestion pipeline (chunking per data type, OCR / table extraction where needed, deduplication, freshness CDC), the retrieval layer (hybrid BM25 plus vector plus optional graph) and the reranking pass.

Each iteration is gated on a fixed eval set sliced by query type and data shape. Retrieval precision and faithfulness are tracked from day one, not after launch.

03

Evaluation, Reranking, Hardening (Weeks 5-9+)

We expand the eval set from real production queries plus adversarial cases (multi-hop, near-duplicates, stale-data traps). RAGAS / DeepEval as a CI/CD gate. Cross-encoder reranking tuned for precision. Citation tracking, source lineage, fallback chains for retrieval misses.

We run the system in shadow mode against real traffic before any user sees output. The build ships when it survives the hardening pass, not when the demo works.

04

Launch, Monitor, Iterate (optional retainer)

At launch we wire in retrieval-quality monitoring, faithfulness drift alerts, freshness lag tracking, query-distribution drift, citation-accuracy regression and a feedback loop from human review back to the eval set. RAG systems age. We size the retainer to keep yours from quietly degrading.

Monthly retainer for monitoring, prompt and chunking tuning, embedding-model updates, on-call. Take it in-house whenever you're ready.

RAG DEVELOPMENT INVESTMENT

Cost depends on corpus size, data messiness, retrieval architecture (standard vs Graph vs Agentic) and SLA. Discovery is fixed-fee from £2K and produces a defensible build quote plus a projected monthly run rate before you commit. Small focused RAG systems start around £10K. Mid-market production builds with hybrid retrieval, evaluation pipelines and Graph or Agentic components typically land between £25K and £100K. Run rate is flat monthly plus inference and embedding cost.
Corpus Audit and Architecture (from £2K)
Weeks 1-2, fixed-fee. Corpus profile, retrieval architecture decision, evaluation framework, fixed-scope quote.
Small Focused RAG
From £10K. Single corpus, single retrieval mode, modest eval set, embedded into one product surface.
Mid-Market Production RAG
Typically £25K-£100K. Hybrid retrieval, reranking, RAGAS / DeepEval CI gate, freshness SLAs, monitoring stack.
Monitoring and Iteration Retainer
Monthly. Retrieval quality, faithfulness drift, embedding updates, eval-set expansion, on-call. Optional, priced in bands.

Frequently Asked Questions

Direct answers to the questions CTOs and Heads of Product ask us on every scoping call.

A LangChain tutorial gets you from zero to a working demo on ten clean documents. Production RAG is a data platform problem: chunking strategy per data type, query rewriting, hybrid retrieval (BM25 plus vector plus optional graph), cross-encoder reranking, citation tracking, freshness SLAs, RAGAS / DeepEval as a CI/CD gate, drift monitoring and fallback chains. We build the 80% past the demo. If your use case fits in a tutorial, wire it up yourselves and skip us.

Around 51% of enterprise AI ships as RAG. RAG gives 80% of the value with 20% of the operational overhead. Fine-tuning is the right call for narrow, stable, high-volume behavioural change (tone, output format) where prompt engineering can't enforce it reliably. We've seen teams burn two months on fine-tuning when better chunking plus hybrid search would have got them there in two weeks. Ship RAG first; we'll tell you in discovery if your use case is one of the rare ones where fine-tuning wins.

Usually, yes. The ingestion pipeline is half the work: OCR for scanned PDFs (with handwriting where the document type needs it), structure-aware table extraction, deduplication, source-of-truth precedence across systems holding the same entity differently, freshness flags for stale wiki pages. We profile your corpus in discovery and tell you honestly when the answer is 'fix the data layer first, then build RAG'. We've recommended that more than once.

Standard RAG with hybrid search handles single-hop semantic queries on most corpora. Graph RAG when the data is genuinely relational and queries are multi-hop: org charts, invoice chains, code dependencies, compliance lineage, 'who reports to the person who approved invoice X'. Agentic RAG when the system needs to decide how to retrieve (vector vs graph vs SQL vs API). The complexity cost is real. We make the call with you in writing during discovery.

Hit rate is a vanity metric. What matters: retrieval precision, answer faithfulness, citation accuracy, context relevance. We use RAGAS and DeepEval, build the eval set from real production queries plus adversarial cases (multi-hop, near-duplicate, stale-data traps), and gate prompt and embedding-model changes on regression. RAG ships when it survives the eval gate, not when the demo works.

Discovery: fixed-fee from £2K. Small focused RAG (single corpus, single retrieval mode, embedded into one product surface): from £10K. Mid-market production builds with hybrid retrieval, reranking, evaluation pipelines and Graph or Agentic components typically land £25K-£100K. Run rate is flat monthly plus inference and embedding cost, which we engineer to minimise from day one.

Yes. Standard contract is full IP ownership: source code, prompts, prompt registry, ingestion pipelines, eval datasets, infrastructure-as-code, monitoring dashboards, runbooks. No rented layer. We document so it can be handed to your team or another vendor at any point. The only reason you stay is because the work is good.