RAGTAG · how we got here

i. the constraint

It started with a list.

Before any code, the Taxxa team sat us down on 23.05. Seven points, each a hard constraint.

#01 €60 per user per month. Queries have to be cheap.
#02 Connect Finlex, Vero, case law. EU-lex is out of scope.
#03 Case law refers to Finlex. Vero is just an interpreter. Case law can override Vero.
#04 Can’t bring 1,000 chunks per question. 25M-page DB.
#05 Timeline matters. Active now, not then, not later.
#06 Run RAG locally first. DeepSeek is good and cheap.
#07 Reference extraction by regex / NLP is a good idea. They aren’t doing it.

ii. the cathedral

We tried to build too much.

Our first sketch: a bitemporal knowledge graph on Neo4j with the SAT-Graph RAG schema (arXiv:2505.00039). BGE-M3 hybrid retrieval, ColBERT (SIGIR 2020), Reciprocal Rank Fusion (SIGIR 2009). A courtroom debate from AgenticSimLaw (arXiv:2601.21936). A 63k-node GPU constellation. SPARQL fallback, CRAG-style (arXiv:2401.15884), with Self-RAG reflection tokens (ICLR 2024).

Two days in, zero answers. The cathedral failed chat #01 (cost), #03 (authority order) and #04 (chunk budget) before it ever resolved a conflict.

iii. the pivot

Chat #07 was the unlock.

Taxxa said reference extraction by regex was a good idea and they weren’t doing it. We flipped the build order. Three deterministic passes over HTML: structural (heading tree), anchor (<a href>) and regex (text citations). The graph fell out automatically; one Verifier comparing an integer rank replaced three agents arguing.
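A minimal sketch of the three rules, folded into one traversal for brevity and using only the standard library. The class name, the edge labels and the statute-number pattern (digits/year, as in 1535/1992) are assumptions for illustration, not the project's actual parser or regex set.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: statute numbers written as "1535/1992".
STATUTE_RE = re.compile(r"\b(\d{1,4}/\d{4})\b")
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ThreePassParser(HTMLParser):
    """Collects (kind, source_heading, target) edges from one HTML document."""

    def __init__(self):
        super().__init__()
        self.edges = []
        self._stack = []          # open headings as (level, title)
        self._in_heading = None   # heading level while inside <h1>..<h6>
        self._buf = []

    def _current(self):
        return self._stack[-1][1] if self._stack else None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = int(tag[1])
            self._buf = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchor rule: every <a href> is an explicit link
                self.edges.append(("anchor", self._current(), href))

    def handle_endtag(self, tag):
        if self._in_heading is not None and tag == f"h{self._in_heading}":
            level, title = self._in_heading, "".join(self._buf).strip()
            # structural rule: the nearest shallower heading is the parent
            while self._stack and self._stack[-1][0] >= level:
                self._stack.pop()
            if self._stack:
                self.edges.append(("parent_of", self._stack[-1][1], title))
            self._stack.append((level, title))
            self._in_heading = None

    def handle_data(self, data):
        if self._in_heading is not None:
            self._buf.append(data)
            return
        # regex rule: citations that appear only as plain text
        for match in STATUTE_RE.finditer(data):
            self.edges.append(("cites", self._current(), match.group(1)))


def extract_edges(html):
    parser = ThreePassParser()
    parser.feed(html)
    parser.close()
    return parser.edges
```

No model is involved, so the same HTML always yields the same edges, and each edge records which rule produced it.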

iv. ragtag

Ten small pieces.

Graph in SQLite (1.97M nodes, 2.18M edges). Chunks in LanceDB on the filesystem, embedded by Voyage voyage-3-large (1024-dim multilingual). Section-anchored chunking, six-preset strategy router, bounded BFS with hub-skip caps, bge-reranker-v2-m3 cross-encoder. Temporal correctness is a per-section version_chain plus a deterministic check_temporal_mismatches built on difflib. Authority is one integer: Finlex 100, Treaty 90, KHO 80, Vero 60. Generation runs on DeepSeek-V4-Pro via Featherless, with query rewrites cached in-process.
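The integer-rank idea can be sketched in a few lines. The four ranks are the ones listed above; the Claim type, the resolve_conflict name and the log layout are hypothetical and only illustrate how a single comparison can stand in for a debate.

```python
from dataclasses import dataclass

# Ranks as stated in the text; a higher number wins.
AUTHORITY_RANK = {"finlex": 100, "treaty": 90, "kho": 80, "vero": 60}


@dataclass(frozen=True)
class Claim:
    source: str       # key into AUTHORITY_RANK
    section_id: str
    text: str


def resolve_conflict(claims):
    """Return (winner, log) for a list of conflicting claims.

    The log holds one entry per claim so the decision can be audited later.
    Ties keep input order because sorted() is stable.
    """
    ranked = sorted(claims, key=lambda c: AUTHORITY_RANK[c.source], reverse=True)
    winner = ranked[0]
    log = [
        {
            "section_id": c.section_id,
            "source": c.source,
            "rank": AUTHORITY_RANK[c.source],
            "won": c is winner,
        }
        for c in ranked
    ]
    return winner, log
```

Under these assumptions a KHO ruling (80) beats a Vero guideline (60) on the same point, which matches the constraint that case law can override Vero.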

v. the architecture

How it actually fits together.

Built from scratch on purpose. The unique part is the architecture, not any one model. Each layer is small enough to debug and replace.

vi. ai act ready, by accident

The graph is auditable by construction.

A future EU AI Act review asks “how did the system reach this answer?” RAGTAG answers that without extra work. Every cite ships a RetrievalPath. Every stale chunk ships an AmendmentCaveat. Every conflict ships an integer-rank resolution. None of this is live on Taxxa today; the demo is what a compliance-ready future could look like.

Mechanism 1 · Reference extraction is deterministic (chat #07). Three rule-based passes, no model in the batch path.
Mechanism 2 · Two-file architecture. One SQLite graph + one LanceDB store, joined by section_id. The whole audit surface is two files.
Mechanism 3 · Conflicts resolve by integer (chat #03). Authority rank is logged in AnswerResult.conflicts.
Mechanism 4 · Temporal correctness. version_chain plays amendments forward; AmendmentCaveat flags every stale cite.
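A rough sketch of how Mechanism 4 could work with difflib. It assumes a version_chain is a date-sorted list of (effective_from, text) pairs, with None as the text once a section is repealed. The function name and both thresholds are made up for illustration and are not the project's check_temporal_mismatches.

```python
import difflib
from datetime import date


def classify_chunk(chunk_text, version_chain, as_of,
                   suspect_below=0.98, stale_below=0.90):
    """Compare an embedded chunk with the section text in force on `as_of`.

    version_chain: list of (effective_from: date, text: str | None),
    sorted ascending. Returns "ok", "suspect", "stale" or "repealed".
    """
    # Play amendments forward: the last version that has taken effect wins.
    in_force = [text for start, text in version_chain if start <= as_of]
    if not in_force:
        raise ValueError("no version of this section is in force on that date")
    current = in_force[-1]
    if current is None:
        return "repealed"
    ratio = difflib.SequenceMatcher(
        None, chunk_text, current, autojunk=False
    ).ratio()
    if ratio < stale_below:
        return "stale"
    if ratio < suspect_below:
        return "suspect"
    return "ok"


chain = [
    (date(2020, 1, 1), "Veroprosentti on 20."),
    (date(2024, 1, 1), "Veroprosentti on 25,5 ja sitä sovelletaan kaikkiin tuloihin."),
    (date(2025, 1, 1), None),
]
print(classify_chunk("Veroprosentti on 20.", chain, date(2021, 6, 1)))
```

The check is deterministic: the same chunk, chain and date always give the same label, so a caveat attached to a cite can be reproduced during an audit.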

EU-lex itself is out of scope today (chat #02), but adding it later means a new corpus, not new infrastructure. The transposes edge type is already in the schema, the authority lattice extends with one number, and the strategy router treats it as another cross_source route.

vii. two things we learned

The graph paid off twice.

Mojibake recovered through the graph

About 1.7% of chunks were double-encoded; the HTML sniffer mis-read UTF-8 as Latin-1. We traced RAG hits back to source files, forced UTF-8 at the parse layer, and re-embedded only the affected slice.
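One way to find the affected slice is to test whether a chunk survives a Latin-1 to UTF-8 round trip. This is a sketch of that check only, with a hypothetical function name; it is not the contents of the re-ingest script, and it assumes the mis-read was plain Latin-1 rather than Windows-1252.

```python
def repair_mojibake(text):
    """Return (text, changed) after undoing a UTF-8-read-as-Latin-1 mistake.

    If the text cannot be encoded as Latin-1, or the resulting bytes are not
    valid UTF-8, the chunk was not double-encoded and is returned unchanged.
    """
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text, False
    return repaired, repaired != text


# Simulate the bug: UTF-8 bytes decoded with the wrong codec.
broken = "lisätään".encode("utf-8").decode("latin-1")
print(repair_mojibake(broken))
```

Clean Finnish text fails the UTF-8 decode step and pure ASCII round-trips unchanged, so both are left alone and only chunks flagged as changed need re-embedding.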

scripts/reingest_corrupted_chunks.py

Not every tax question is in the law

Eval question N49 asks for the common account-number range for myyntisaamiset (accounts receivable) and ostovelat (accounts payable). Our system returned the correct legal answer: no universally binding range exists. The question-bank reference traces to KILA (Finnish Accounting Board) practice, not Finlex.

eval/questions.json · question N49

viii. burning questions

What we get asked most.

Relations · From the content. Three deterministic passes: structural (heading tree), anchor (<a href>), regex.
Densities · Tuned from the corpus. interprets_in > 30, cites_out > 15, parent_of_in > 50.
Finnish · Voyage and bge-reranker are multilingual. Strategy regex carries Finnish vocab; amendment verbs are muutetaan (is amended), kumotaan (is repealed), lisätään (is added).
Model picks · Voyage was already in LanceDB. bge-reranker is the standard multilingual cross-encoder. DeepSeek per chat #06.
Edge cases · Mojibake recovered, stale citations caveated (suspect / stale / repealed), conflicts resolved by rank.
Accuracy · No single number. 60-question eval; demo covers Q1, Q12, Q41, plus the N49 honest miss.
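The density numbers above could gate a bounded BFS along these lines. This is a sketch under two assumptions that the text does not confirm: that the three thresholds are the hub-skip caps, and that a hub is still returned as a hit but never expanded. All names, the depth limit and the node budget are hypothetical.

```python
from collections import deque

# Thresholds quoted in the text, read here as "above this, the node is a hub".
HUB_CAPS = {"interprets_in": 30, "cites_out": 15, "parent_of_in": 50}


def is_hub(degrees):
    """degrees: dict of edge-type counts for one node."""
    return any(degrees.get(kind, 0) > cap for kind, cap in HUB_CAPS.items())


def bounded_bfs(start, neighbors, degrees, max_depth=2, max_nodes=50):
    """Breadth-first walk that stops at max_depth and max_nodes.

    neighbors: dict node -> list of adjacent nodes
    degrees:   dict node -> dict of edge-type counts
    Hubs are kept in the result but their neighbors are not enqueued,
    so one heavily cited section cannot flood the context.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue and len(order) < max_nodes:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        # The start node is always expanded, even if it is itself a hub.
        if node != start and is_hub(degrees.get(node, {})):
            continue
        for nxt in neighbors.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            order.append(nxt)
            queue.append((nxt, depth + 1))
            if len(order) >= max_nodes:
                break
    return order
```

Both bounds are hard limits, which is one way to respect the constraint that a question cannot pull in 1,000 chunks.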