DocSignal: Hybrid RAG over Dutch Immigration Policy
DocSignal is a retrieval-augmented Q&A system over a real, messy corpus: ~40 public pages of Dutch work-visa and immigration policy. I built it while researching my own relocation, and used it as a chance to build retrieval properly instead of wiring up a tutorial.
This is a technical demo, not immigration advice. The corpus is a dated snapshot; salary thresholds and policy change constantly. For current rules, go to ind.nl.
At a glance
- Problem: Immigration policy is dense, cross-referenced, and full of exact tokens (salary thresholds, permit names) that pure semantic search mishandles.
- Approach: Structure-aware chunking, hybrid dense + BM25 retrieval fused with RRF, Claude Haiku re-ranking, grounded generation with citations.
- Stack: Next.js 14 · TypeScript · bge-small-en-v1.5 embeddings · BM25 · Claude API.
- Headline result: Hybrid retrieval hits 96.7% Recall@5 vs 93.3% dense-only and 86.7% BM25-only on 30 hand-labelled questions.
Architecture & logic
Government policy pages are deep heading trees with the highest-value facts in tables. DocSignal chunks along document structure, keeps table rows attached to their headers, and prepends every chunk's breadcrumb to its text before embedding.
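A minimal sketch of that chunking idea, under stated assumptions: the `Section` and `Chunk` types, the `" > "` breadcrumb separator, and both function names are hypothetical illustrations, not the actual contents of chunker.ts.

```typescript
// Hypothetical shapes: the real chunker.ts may model sections differently.
interface Section {
  heading: string;
  body: string; // prose directly under this heading
  children: Section[];
}

interface Chunk {
  breadcrumb: string; // heading path from the page root down to this section
  text: string; // breadcrumb + body, i.e. the string that gets embedded
}

// Walk the heading tree depth-first and emit one chunk per section that has body text.
function chunkSection(section: Section, trail: string[] = []): Chunk[] {
  const path = [...trail, section.heading];
  const breadcrumb = path.join(" > ");
  const chunks: Chunk[] = [];
  const body = section.body.trim();
  if (body.length > 0) {
    chunks.push({ breadcrumb, text: `${breadcrumb}\n\n${body}` });
  }
  for (const child of section.children) {
    chunks.push(...chunkSection(child, path));
  }
  return chunks;
}

// Keep a table row attached to its headers by serialising it as "header: cell" pairs.
function tableRowToText(headers: string[], row: string[]): string {
  return headers.map((h, i) => `${h}: ${row[i] ?? ""}`).join("; ");
}
```

Because the breadcrumb is part of the embedded text, a chunk that only says "the threshold is higher for applicants over 30" still carries the permit name from its parent headings.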
Immigration terminology is a worst case for pure semantic search: embeddings map "EU Blue Card" and "highly skilled migrant" close together, which usually helps but is catastrophic when the user asks about one specifically. Exact tokens like "TWV", "€ 4,357.00" and "150 kilometres" are where keyword search earns its keep.
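To combine the two signals, the dense and BM25 rankings are fused with Reciprocal Rank Fusion (RRF). The sketch below shows the standard unweighted form; the function name `rrfFuse`, the chunk ids, and the constant `k = 60` (a commonly used default) are assumptions, and DocSignal's tuned fusion parameters are not shown here.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// where rank starts at 1. Chunks missing from a ranking contribute nothing for it.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical chunk ids, ranked by each retriever independently.
const denseRanking = ["c1", "c2", "c3"];
const bm25Ranking = ["c3", "c1", "c4"];
const fused = rrfFuse([denseRanking, bm25Ranking]);
// fused is ["c1", "c3", "c2", "c4"]: chunks found by both retrievers rise to the top.
```

RRF only looks at ranks, so cosine similarities and BM25 scores never need to be normalised onto a common scale. A weighted variant would multiply each `1 / (k + rank)` term by a per-retriever weight.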
collect_corpus.ts → chunker.ts → build_index.ts
│
▼
retrieval.ts (dense + BM25 → RRF fusion)
│
▼
rerank.ts (Claude Haiku → top 5)
│
▼
generate.ts (grounded answer + [n] citations)
Benchmarks & results
From npm run eval — 30 hand-labelled questions, 310 chunks, bge-small-en-v1.5 embeddings:
| Retrieval mode | Recall@5 | MRR@10 |
| --- | ---: | ---: |
| Dense-only | 93.3% | 0.837 |
| BM25-only | 86.7% | 0.751 |
| Hybrid (RRF fusion) | 96.7% | 0.851 |
Recall@5 means: for each question, did at least one of the hand-identified relevant chunks appear in the top 5 results? MRR@10 is the mean reciprocal rank of the first relevant chunk (top 10, else 0).
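Both metrics, as defined above, fit in a few lines. This is a sketch of the definitions rather than the actual eval script; the `EvalCase` shape and the function names are assumptions.

```typescript
// Hypothetical shape of one labelled question after retrieval.
interface EvalCase {
  retrieved: string[]; // ranked chunk ids returned by one retrieval mode
  relevant: Set<string>; // hand-identified relevant chunk ids
}

// Fraction of questions with at least one relevant chunk in the top k.
function recallAtK(cases: EvalCase[], k: number): number {
  const hits = cases.filter((c) =>
    c.retrieved.slice(0, k).some((id) => c.relevant.has(id))
  ).length;
  return hits / cases.length;
}

// Mean of 1 / rank of the first relevant chunk within the top k, or 0 if none appears.
function mrrAtK(cases: EvalCase[], k: number): number {
  const total = cases.reduce((sum, c) => {
    const index = c.retrieved.slice(0, k).findIndex((id) => c.relevant.has(id));
    return sum + (index === -1 ? 0 : 1 / (index + 1));
  }, 0);
  return total / cases.length;
}
```

With 30 questions, each additional hit moves Recall@5 by about 3.3 percentage points, which is why 28 and 29 hits appear as 93.3% and 96.7%.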
Honest caveats: fusion parameters were swept against this same 30-question set, so hybrid figures are slightly optimistic; dense vs hybrid is 28 vs 29 hits out of 30 — one question, not a statistically robust margin. The per-mode failure analysis on /eval is more trustworthy than the headline percentages.
What I'd do differently
Hold out a query set. Thirty labelled questions is enough to expose failure modes, not enough to tune confidently. I would split into dev/test before sweeping fusion weights.
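One way to make that split reproducible is a seeded shuffle, so the test half stays fixed across runs and is never seen during tuning. The sketch below is an assumption about how this could be done, not existing DocSignal code; `mulberry32` is a small, well-known seeded PRNG, and `splitDevTest`, the 50/50 fraction, and the seed are illustrative choices.

```typescript
// Small seeded PRNG (mulberry32) so the split is identical on every run.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle with the seeded PRNG, then cut into test and dev portions.
function splitDevTest<T>(
  items: T[],
  testFraction = 0.5,
  seed = 42
): { dev: T[]; test: T[] } {
  const rand = mulberry32(seed);
  const shuffled = [...items];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = shuffled[i] as T;
    shuffled[i] = shuffled[j] as T;
    shuffled[j] = tmp;
  }
  const testSize = Math.round(shuffled.length * testFraction);
  return { test: shuffled.slice(0, testSize), dev: shuffled.slice(testSize) };
}
```

Fusion parameters would then be swept on `dev` only, with `test` scored once at the end.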
Cross-encoder re-ranking. Haiku re-ranking helps, but a dedicated cross-encoder (or a smaller bi-encoder fine-tuned on immigration terminology) would be the next step before scaling corpus size.
Freshness pipeline. Policy pages change. Production would need scheduled re-fetch, diff detection, and chunk invalidation — not a one-time corpus snapshot.
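The diff-detection step could be as small as comparing per-page content hashes between two fetches. This is a sketch under assumptions: the `Snapshot` type (URL to content hash), the `CorpusDiff` shape, and `diffSnapshots` are hypothetical, and it presumes each page's hash has already been computed from its normalised text.

```typescript
// Hypothetical snapshot: page URL -> hash of the page's normalised content.
type Snapshot = Map<string, string>;

interface CorpusDiff {
  added: string[]; // pages to chunk and index for the first time
  changed: string[]; // pages whose existing chunks must be invalidated and rebuilt
  removed: string[]; // pages whose chunks must be dropped from the index
}

function diffSnapshots(previous: Snapshot, current: Snapshot): CorpusDiff {
  const diff: CorpusDiff = { added: [], changed: [], removed: [] };
  current.forEach((hash, url) => {
    const oldHash = previous.get(url);
    if (oldHash === undefined) {
      diff.added.push(url);
    } else if (oldHash !== hash) {
      diff.changed.push(url);
    }
  });
  previous.forEach((_hash, url) => {
    if (!current.has(url)) {
      diff.removed.push(url);
    }
  });
  return diff;
}
```

Only pages listed in `added` and `changed` need re-chunking and re-embedding, so a scheduled run over an unchanged corpus does no indexing work.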
Stack & links
- Next.js 14 (App Router) · TypeScript · Tailwind CSS
- bge-small-en-v1.5 (local ONNX) · BM25 · Claude API
- GitHub: DocSignal source
Related portfolio work: LeadSignal (rules + Claude enrichment for RevOps), SignalVision (ONNX int8 quantization for browser inference).