Architecture for AI in regulated environments
I help engineering teams ship AI that survives Legal review. Production AI in regulated environments has different architecture problems than greenfield consumer AI. Here's how I think about it — what I look at first, what most teams skip, where the failure modes actually live.
The 7-layer mental model
Most production AI systems built today have the first two layers right — ingestion and corpus indexing. The interesting failure modes live in layers 3 through 7.
In a recent engagement for a regulatory-intelligence platform, the existing vendor proposal covered Layers 1–2 well — scraping, hybrid retrieval, change detection. Solid for per-document factual extraction. Incomplete for the larger problem: a meaningful share of the questions required multi-document reasoning — answers assembled across articles spread over multiple laws — which per-document RAG cannot produce.
┌─────────────────────────────────────────────────────────┐
│ LAYER 7 Eval Harness │
│ LAYER 6 Validation Queue (analyst review) │
│ LAYER 5 Answer Composition + Citation Tracking │
│ LAYER 4 Orchestration / Multi-hop Reasoning │
│ LAYER 3 Question Classifier / Router │
├─────────────────────────────────────────────────────────┤
│ LAYER 2 Corpus Indexing (hybrid retrieval) │
│ LAYER 1 Ingestion + Normalization │
└─────────────────────────────────────────────────────────┘
Layer 1 — Ingestion + Normalization. Scrape and parse source documents into a canonical model that preserves hierarchy: Title → Chapter → Article → Section → Paragraph. Capture cross-references and metadata (jurisdiction, date in force, language, version).
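A minimal sketch of what that canonical model can look like. The class and field names here are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Provision:
    """One addressable unit of a legal text (article, section, paragraph)."""
    doc_id: str            # stable identifier for the source document
    path: tuple[str, ...]  # hierarchy, e.g. ("Title I", "Chapter 2", "Article 5", "(3)")
    text: str
    jurisdiction: str
    in_force_from: str     # ISO date the provision entered into force
    language: str
    version: str
    cross_refs: list[str] = field(default_factory=list)  # doc_ids of cited provisions
```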
Layer 2 — Corpus Indexing. Hybrid retrieval — vector for semantic similarity, BM25 for exact references and domain jargon, neural reranker to refine top candidates. Optionally augmented with a knowledge graph for structural queries (amendment chains, citation relationships).
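One common way to merge the vector and BM25 result lists is reciprocal rank fusion; the sketch below shows the fusion step in plain Python (the scheme is one reasonable choice, not the only one), with the reranker handoff noted in a comment:

```python
def rrf_fuse(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge two ranked lists of doc IDs into one."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first; the top N then go to the neural reranker.
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["art5", "art12"], ["art12", "annex3"])
# -> ["art12", "art5", "annex3"]: art12 appears in both lists, so it wins
```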
Layer 3 — Question Classifier / Router. Classifies each question by type — direct factual, multi-document interpretive, cross-jurisdictional comparison, synthesis. Routes each to the appropriate reasoning strategy. Without this layer the system treats all questions identically and most multi-hop answers are wrong.
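The routing shape itself is small. A sketch, where the classifier call is a placeholder (in practice a small LLM call with structured output, or a fine-tuned classifier trained on labeled question history):

```python
from enum import Enum

class QuestionType(Enum):
    DIRECT_FACTUAL = "direct_factual"
    MULTI_DOC_INTERPRETIVE = "multi_doc_interpretive"
    CROSS_JURISDICTIONAL = "cross_jurisdictional"
    SYNTHESIS = "synthesis"

def classify_question(question: str) -> QuestionType:
    # Placeholder: a small LLM call with structured output,
    # or a fine-tuned classifier.
    raise NotImplementedError

def answer(question: str, strategies: dict) -> str:
    qtype = classify_question(question)
    # Each question type gets its own reasoning pipeline;
    # DIRECT_FACTUAL can stop at Layers 1-2, the rest need Layer 4.
    return strategies[qtype](question)
```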
Layer 4 — Orchestration / Multi-hop Reasoning. For complex questions, decomposes into atomic steps: identify relevant documents → extract per-document positions → check restrictive conditions → synthesize. Built on LangGraph — currently the strongest production framework for stateful reasoning, with explicit graph control, persistence, retries, and human-in-the-loop checkpoints. Includes Chain-of-Verification for high-stakes answers.
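A minimal sketch of that decomposition on LangGraph's StateGraph API; the node bodies are stubs and the state fields are illustrative, not a production schema:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class HopState(TypedDict):
    question: str
    documents: list[str]
    positions: list[dict]
    answer: str
    needs_verification: bool

def identify_documents(state: HopState) -> dict:
    return {"documents": []}    # stub: hybrid retrieval from Layer 2

def extract_positions(state: HopState) -> dict:
    return {"positions": []}    # stub: per-document extraction

def check_conditions(state: HopState) -> dict:
    return {}                   # stub: restrictive-condition checks

def synthesize(state: HopState) -> dict:
    return {"answer": "", "needs_verification": True}  # stub

def verify(state: HopState) -> dict:
    return {}                   # stub: Chain-of-Verification pass

graph = StateGraph(HopState)
for name, fn in [("identify", identify_documents), ("extract", extract_positions),
                 ("check", check_conditions), ("synthesize", synthesize),
                 ("verify", verify)]:
    graph.add_node(name, fn)

graph.set_entry_point("identify")
graph.add_edge("identify", "extract")
graph.add_edge("extract", "check")
graph.add_edge("check", "synthesize")
# High-stakes answers take the verification detour; the rest go straight out.
graph.add_conditional_edges("synthesize",
    lambda s: "verify" if s["needs_verification"] else END)
graph.add_edge("verify", END)

app = graph.compile()  # attaching a checkpointer here adds persistence and retries
```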
Layer 5 — Answer Composition + Citation Tracking. Synthesizes the final answer with confidence scoring. Every claim traces to specific sections of specific source documents. Output is structured JSON for downstream use. In a regulated product, an answer without traceable citations is worthless.
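One way to make that contract concrete is a Pydantic schema. The field names here are illustrative, not a fixed spec; the point is that a claim without citations is structurally impossible to miss:

```python
from pydantic import BaseModel, Field

class Citation(BaseModel):
    doc_id: str
    section: str               # e.g. "Article 12(3)"
    quote: str                 # exact supporting span from the source

class Claim(BaseModel):
    text: str
    citations: list[Citation]  # a claim with an empty list here is a bug

class ComposedAnswer(BaseModel):
    question: str
    claims: list[Claim]
    confidence: float = Field(ge=0.0, le=1.0)
```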
Layer 6 — Validation Queue. Routes low-confidence and high-stakes answers to analyst review. Captures analyst corrections as feedback for system improvement. This is where the "structured reviewable format" requirement most regulated buyers articulate actually lives.
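The routing rule itself can stay small. A sketch, reusing the types from the sketches above; the threshold and the high-stakes set are placeholders to tune against the eval harness, not by feel:

```python
HIGH_STAKES = {QuestionType.MULTI_DOC_INTERPRETIVE, QuestionType.CROSS_JURISDICTIONAL}
CONFIDENCE_FLOOR = 0.8  # illustrative; calibrate against Layer 7 results

def needs_review(answer: ComposedAnswer, qtype: QuestionType) -> bool:
    uncited = any(not c.citations for c in answer.claims)
    return answer.confidence < CONFIDENCE_FLOOR or qtype in HIGH_STAKES or uncited
```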
Layer 7 — Eval Harness. Runs the pipeline against a golden dataset of human-verified answers. Measures accuracy per question type. Tracks improvement over time. The eval layer matters disproportionately. Most teams skip it and regret it within three months — when answers start to drift, there's no way to debug whether the issue is retrieval, classification, or generation.
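A minimal harness sketch: run the pipeline over the golden set and report accuracy per question type. The grade function is a placeholder (exact match for extraction questions, LLM-as-judge for interpretive ones):

```python
from collections import defaultdict

def run_eval(golden: list[dict], pipeline) -> dict[str, float]:
    """golden: [{"question": ..., "expected": ..., "qtype": ...}, ...]"""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for case in golden:
        totals[case["qtype"]] += 1
        if grade(pipeline(case["question"]), case["expected"]):
            hits[case["qtype"]] += 1
    # Per-type accuracy is what localizes drift to retrieval,
    # classification, or generation.
    return {qt: hits[qt] / totals[qt] for qt in totals}

def grade(predicted, expected) -> bool:
    # Placeholder grading function.
    raise NotImplementedError
```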
Three sovereignty paths (and how to pick)
As of 2026, there are three viable approaches, each leading to a meaningfully different architecture.
| Approach | Setup | Sovereignty | Cost shape | Capability |
|---|---|---|---|---|
| Strict | Self-hosted open-weight models (Llama 3.x, Mistral, Qwen) on EU GPU cloud (OVH, Scaleway, Regolo, LUMI) | Full — data stays on EU infra under your control | Higher fixed — GPU infra + ops | Capability gap to frontier models is narrowing but real for hardest reasoning |
| Pragmatic | Frontier APIs in EU regions (AWS Bedrock EU, Azure OpenAI EU) + DPA + zero retention | Partial — US CLOUD Act still reaches US-parented providers | Lowest — pay-per-use, no infra | Best — full frontier model quality |
| Hybrid | Self-hosted open-weight for sensitive / bulk; EU-region frontier APIs for hard reasoning only | Sovereign where it matters; pragmatic where it helps | Moderate — partial infra | Frontier-equivalent for the workloads that need it |
The hybrid approach usually threads the needle for regulated mid-market: self-hosted open-weight for bulk operations (extraction, classification, retrieval pre-processing) where capability is sufficient and volume is high; EU-region frontier APIs for the hardest multi-hop reasoning where the capability gap actually matters; routing logic separates the two. It's more complex to set up than either pure option, but it costs less than full strict and protects more than full pragmatic.
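A sketch of what that routing logic can reduce to. Endpoints and task names are illustrative placeholders; the split, not the vendors, is the point:

```python
SELF_HOSTED = "https://llm.internal.example/v1"  # open-weight on EU GPU cloud
FRONTIER_EU = "https://frontier.example-eu/v1"   # EU-region frontier API
BULK_TASKS = {"extract", "classify", "preprocess"}

def pick_endpoint(task: str, touches_subscriber_data: bool) -> str:
    # Sensitive or bulk work stays on the self-hosted open-weight stack;
    # only the hardest multi-hop reasoning crosses to the frontier API.
    if touches_subscriber_data or task in BULK_TASKS:
        return SELF_HOSTED
    return FRONTIER_EU
```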
What changes when Legal is in the room
Three facts that should sit in front of any decision:
- EU AI Act enters full enforcement in August 2026. Penalties up to 7% of global turnover. For a regulated product, getting AI sovereignty visibly right carries reputational weight on top of the legal exposure.
- CLOUD Act exposure for managed APIs is real. Confirmed publicly by Microsoft France VP testimony to the French Senate in 2025: EU regions of US-headquartered providers cannot guarantee EU data sovereignty against US government data requests. DPAs reduce the surface but don't eliminate it.
- Open-weight capability is rising fast. As of mid-2026, Llama 3.x, Mistral Large, and Qwen 2.5 close most of the gap to frontier models for extraction, classification, and structured output. The remaining gap is concentrated in the hardest multi-step reasoning — which makes the hybrid split natural.
In most regulated mid-market engagements, Legal flags that calling OpenAI directly under default terms is not acceptable (subscriber data flowing to a US-headquartered provider). The pragmatic path is workable only with care: explicit DPAs, zero retention, EU residency, and probably supplementary measures (encryption, pseudonymisation) that Legal will want to define given the Schrems II direction of travel. With the EU AI Act landing in August 2026 and a regulated product positioned as "independent intelligence," the optics matter beyond legal compliance — subscribers in regulated industries are increasingly asking whether their analyst is using AI, and where the data goes when it is.
Common failure patterns
Four patterns I see repeatedly in regulated AI builds:
1. Per-document RAG for multi-hop questions. Hybrid retrieval is great for factual extraction from one document. The moment a real-world question requires assembling evidence across multiple documents, per-doc RAG produces wrong answers — and most teams don't realize until production. The fix is an orchestration layer above retrieval (Layer 4).
2. Missing eval harness. Most teams skip the eval harness because it doesn't ship features. Six months later, when answers start drifting, there's no way to debug whether the issue is retrieval, classification, or generation. In regulated environments where teams already have human-verified historical answers (an old database, a review log), that dataset is usually the most valuable thing the company owns for the AI build — and almost no one treats it as such from day one.
3. Vendor lock-in in the LLM observability stack. A copilot serving 5,000+ users at Roche Diagnostics had its observability tightly coupled to a single vendor through their frontend integration. If that vendor changed pricing, deprecated an API, or shut down, the entire monitoring and tracing pipeline would break and the team would be scrambling for an emergency replacement. The fix was an observability abstraction layer — a thin adapter that decouples the application from any specific provider (sketched after this list) — and a recommendation for Langfuse as the replacement. (Full case study.)
4. Monolithic agent layers. When an offshore vendor's agent layer is built around a per-document assumption — two agents handling structured-field extraction — the model selection is often under-specified for the multi-hop reasoning the actual product needs. This is the layer where vendor proposals most need redesign. Often the right move is to keep the vendor on Layers 1–2 where their architecture fits, and own Layers 3–7 in-house.
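For pattern 3, the abstraction layer is just an interface the application owns, plus one adapter per vendor. A sketch with a trivial stdout sink standing in for a real backend; a Langfuse-backed sink would implement the same three methods by delegating to the vendor SDK (the method names here are illustrative assumptions, not any vendor's API):

```python
import uuid
from typing import Any, Protocol

class TraceSink(Protocol):
    """The only observability surface the application is allowed to see."""
    def start_trace(self, name: str, metadata: dict[str, Any]) -> str: ...
    def log_generation(self, trace_id: str, prompt: str, completion: str) -> None: ...
    def end_trace(self, trace_id: str) -> None: ...

class StdoutSink:
    """Trivial reference implementation. Swapping vendors means writing
    another class like this; application code never imports a vendor SDK."""
    def start_trace(self, name: str, metadata: dict[str, Any]) -> str:
        trace_id = str(uuid.uuid4())
        print(f"trace {trace_id} start: {name} {metadata}")
        return trace_id

    def log_generation(self, trace_id: str, prompt: str, completion: str) -> None:
        print(f"trace {trace_id} generation: {len(prompt)} -> {len(completion)} chars")

    def end_trace(self, trace_id: str) -> None:
        print(f"trace {trace_id} end")
```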
Engagement options
Two engagement shapes work well for advisory work in this lane, plus a lighter paid entry point:
- Scoping engagement — €8–10K, 2 weeks, fixed. Front-loaded diagnostic: stakeholder interviews, architecture document, sovereignty decision record, build plan with phasing, cost estimate, final readout. Best for teams facing a build/buy decision on a meaningful AI investment.
- Advisor retainer — €7–11K/mo, 3-month minimum. Weekly architecture call plus async availability. Your tech team owns delivery; this engagement keeps them aimed at the right target. Architecture decision support, vendor management input, eval discipline, sovereignty/regulatory decisions as they come up. Three tiers depending on hours: €7K (5–6 h/wk), €9K (7–8 h/wk), €11K (10+ h/wk).
- Technical assessment — $750, 90 minutes, written take in 48 hours. Bring an architecture diagram, a vendor proposal, an eval question, or a specific stuck point. Best for validating fit before a larger commitment.
Full engagement details on the Services page →
Recent work in regulated environments
Roche Diagnostics — Owned architecture decisions on a production RAG copilot serving 5,000+ users internally. Identified that the LangChain abstraction layer had become a liability for debuggability as the system matured; proposed and led the migration to LangGraph. Also identified vendor lock-in risk in the observability stack and designed an abstraction adapter so another developer could cleanly migrate to Langfuse without touching application code.
EuroPharma Alliance — Brought in by leadership to define AI strategy from scratch and build the in-house team to execute it. Hired and mentored a 2-person AI development team after screening ~20 candidates. Defined the technical stack and architecture for the broader AI product roadmap. Shipped a RAG-based knowledge chatbot (regulated-pharma context) and an internal email-writing tool before handing the rest of the roadmap off to the team I'd hired.
Teamwork.com — Owned two production AI features end-to-end inside an existing SaaS platform: an AI Meeting Bot (transcription, summary, automatic task creation) and AI Plan My Week (calendar optimisation combining LLM reasoning with rule-based scheduling). Also led a structured-outputs migration across legacy AI features — converting freeform LLM outputs to schema-validated structured responses, reducing hallucination and improving consistency.
Further reading
Posts that go deeper on specific layers or patterns:
- Why I migrated from LangChain to LangGraph (and what I learned) — the migration story behind Layer 4 in production
- How I use Claude Code + MCP to ship faster — the daily AI engineering workflow
If your team is staring at a build/buy decision in this lane — or you have a vendor proposal that pattern-matches on the failures above — I'm available for scoping or retainer engagements.