The problem
Most RAG systems are demos. They break under real workloads: irrelevant chunks, hallucinated citations, no observability, no evaluation, and no path to agents. Quarry is built as a production knowledge infrastructure layer — the substrate teams need to move AI from prototype to product.
How it's built
Retrieval-Augmented Generation Pipeline
What powers it
What was hard
- Chunking strategies that preserve semantic units across document types
- Hybrid retrieval (dense + BM25) with a re-ranker to fight relevance drift
- Streaming responses with grounded citations and safe fallbacks
- Multi-tenant isolation and rate limiting at the API boundary
- Evaluation harness that can score both retrieval quality and end-to-end answer quality
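To make the hybrid retrieval point concrete, here is a minimal sketch of merging a dense ranking and a BM25 ranking before the re-ranker sees them. It uses reciprocal rank fusion (RRF), which is an assumption on my part for illustration: the list above does not say which fusion method Quarry uses. The function name, the document IDs, and k=60 (a commonly used default) are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Larger k flattens the gap between ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense (embedding) results
bm25_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers agree on rise to the top, and the fused list is what a re-ranker would then reorder. Because RRF only uses ranks, it avoids having to normalize dense similarity scores against BM25 scores.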
Why it's built this way
Postgres + pgvector over a dedicated vector DB
One system to operate, transactional guarantees, and mature tooling. Vector DBs can be added later if scale demands it.
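One practical consequence of keeping vectors in Postgres is that a tenant filter and a similarity search can live in the same SQL statement. The sketch below only builds the query and its parameters; it does not open a connection. The table name `chunks`, its columns, and the function name are hypothetical, and the `%s` placeholders assume a psycopg-style driver. `<=>` is pgvector's cosine distance operator.

```python
def build_chunk_query(tenant_id, query_embedding, limit=5):
    """Return (sql, params) for a tenant-scoped nearest-neighbour search."""
    # pgvector accepts a text literal of the form '[0.1,0.2,...]' cast to vector.
    vector_literal = "[" + ",".join(repr(float(x)) for x in query_embedding) + "]"
    sql = (
        "SELECT id, content, embedding <=> %s::vector AS distance "
        "FROM chunks "              # hypothetical table name
        "WHERE tenant_id = %s "     # relational filter in the same statement
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (vector_literal, tenant_id, limit)


sql, params = build_chunk_query("tenant-a", [0.1, 0.2, 0.3])
print(params)  # ('[0.1,0.2,0.3]', 'tenant-a', 5)
```

With a driver such as psycopg, the pair would be passed to `cursor.execute(sql, params)`, so the values are sent as parameters and never interpolated into the SQL string.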
FastAPI + async everywhere
Non-blocking IO for embedding and LLM calls; simple to reason about; easy to instrument.
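As a rough illustration of why async helps here, the sketch below fans out several embedding calls concurrently while capping how many are in flight. `fake_embed` is a stand-in for a real network call, and `embed_all` and `max_concurrency` are hypothetical names, not part of Quarry's API.

```python
import asyncio


async def fake_embed(text: str) -> list[float]:
    """Stand-in for a network-bound embedding call (hypothetical)."""
    await asyncio.sleep(0.05)  # simulates IO latency without blocking the event loop
    return [float(len(text))]


async def embed_all(texts: list[str], max_concurrency: int = 8) -> list[list[float]]:
    """Embed all texts concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> list[float]:
        async with semaphore:
            return await fake_embed(text)

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(t) for t in texts))


vectors = asyncio.run(embed_all(["alpha", "be", "gam"]))
print(vectors)  # [[5.0], [2.0], [3.0]]
```

The three calls overlap, so the total wait is close to one call's latency rather than three. The semaphore is there so a large batch does not overwhelm an upstream provider.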
LangGraph for agent orchestration
Explicit state machine over ad-hoc chains — makes reliability, retries, and observability first-class.
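The idea of an explicit state machine can be sketched in plain Python. This is deliberately not LangGraph's API: it is a hand-rolled stand-in showing the pattern, in which each node returns the new state plus the name of the next node. The node names, the state keys, and the simulated flaky retrieval are all hypothetical.

```python
def retrieve(state):
    """Hypothetical flaky step: fails on the first attempt, succeeds on the second."""
    state = {**state, "attempts": state.get("attempts", 0) + 1}
    if state["attempts"] < 2:
        return state, "retrieve"  # explicit retry edge back to the same node
    return {**state, "chunks": ["chunk-1"]}, "generate"


def generate(state):
    """Produce an answer from whatever the retrieve node stored."""
    return {**state, "answer": f"answer grounded in {state['chunks']}"}, "END"


def run_graph(nodes, start, state, max_steps=10):
    """Walk the graph until END, recording every node visited."""
    trace = []
    current = start
    while current != "END":
        if len(trace) >= max_steps:
            raise RuntimeError("step budget exhausted")  # guards against endless retries
        trace.append(current)
        state, current = nodes[current](state)
    return state, trace


final_state, trace = run_graph({"retrieve": retrieve, "generate": generate}, "retrieve", {})
print(trace)  # ['retrieve', 'retrieve', 'generate']
```

Because every transition is explicit, the retry shows up in the trace and the step budget bounds it. That visibility is what an ad-hoc chain of function calls tends to hide.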
Redis for hot paths
Semantic cache, rate limits, and short-term memory sit in Redis to keep latency low.
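For the rate-limit piece, a fixed-window counter is one simple scheme, sketched below as an assumption rather than a description of what Quarry actually does. The limiter only needs two operations, `incr` and `expire`, which mirror the Redis INCR and EXPIRE commands. `InMemoryCounterStore` is a hypothetical test double so the example runs without a Redis server, and the key format is also hypothetical.

```python
import time


class InMemoryCounterStore:
    """Test double exposing the two calls used below: incr and expire."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._values = {}
        self._deadlines = {}

    def incr(self, key):
        # Drop the counter if its window has expired, then increment.
        deadline = self._deadlines.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._values.pop(key, None)
            self._deadlines.pop(key, None)
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def expire(self, key, seconds):
        self._deadlines[key] = self._clock() + seconds


def allow_request(store, tenant_id, limit, window_seconds=60):
    """Fixed-window limiter: at most `limit` requests per tenant per window."""
    key = f"ratelimit:{tenant_id}"  # hypothetical key format
    count = store.incr(key)
    if count == 1:
        store.expire(key, window_seconds)  # the first hit starts the window
    return count <= limit


store = InMemoryCounterStore()
decisions = [allow_request(store, "tenant-a", limit=2) for _ in range(3)]
print(decisions)  # [True, True, False]
```

Against a real Redis, INCR and EXPIRE are two separate commands, so a production version would likely need to make the pair atomic, for example with a Lua script or a transaction.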
Docker-first deployment
Reproducible environments across dev, CI, and prod; low-friction onboarding for contributors.
What I'd tell my past self
- Retrieval quality, not model choice, dominates end-to-end answer quality in most cases.
- Every RAG system needs an eval harness on day one, not day one hundred.
- Observability (traces, spans, token counts, retrieval hits) is worth building before you scale.
- The interesting engineering is at the boundaries: chunking, re-ranking, and orchestration.
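A day-one eval harness does not need to be elaborate. The sketch below scores only the retrieval side using two standard metrics, recall@k and mean reciprocal rank (MRR). The case format, the function names, and the sample data are hypothetical, and scoring end-to-end answers would need more than this.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases, k=3):
    """Average both metrics over a list of labelled retrieval cases."""
    n = len(cases)
    return {
        "recall@k": sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / n,
        "mrr": sum(reciprocal_rank(c["retrieved"], c["relevant"]) for c in cases) / n,
    }


# Hypothetical labelled cases: what the retriever returned vs. what it should have found.
cases = [
    {"retrieved": ["a", "b", "c"], "relevant": ["b"]},
    {"retrieved": ["x", "y", "z"], "relevant": ["z", "q"]},
]
report = evaluate(cases, k=3)
print(report["recall@k"])  # 0.75
```

Even a few dozen labelled cases like these, run on every change to chunking or retrieval, turn "it feels better" into a number that can regress visibly.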