Part 3 — How RegIntel Works: The Knowledge Engineering Layer
Where the KE layer fits between raw documents and user queries; a before/after comparison showing exactly what it changes; and a plain-English glossary of every technical term you will encounter.
Where the Knowledge Engineering Layer Fits
Three nodes. One layer between raw documents and AI applications.
The Knowledge Engineering layer is what sits between the raw regulatory corpus (PDFs, Gazettes, Circulars) and the AI applications that serve each user. Without it, the AI reads raw text directly. With it, the AI reads structured, classified, versioned knowledge.
- **Regulatory Corpus** — 200+ documents, 6 regulators, PDFs and Gazettes, no structure
- **Knowledge Engineering Layer** — document registry, authority hierarchy, extraction pipeline, graph + embeddings. *This is what RegIntel builds.*
- **AI Applications** — compliance Q&A, dev rule engines, policy dashboards, CA advisory tools
Without the KE layer, the AI application connects directly to the raw corpus — and produces the 6 failure modes described in Part 2. With the KE layer, every document is classified, versioned, de-duplicated, and schema-tagged before the AI ever sees it.
Document Registry
Every document classified by authority tier, issuing body, legal status, effective dates, and supersession chain. The registry is queryable — not a file system.
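A minimal sketch of what "queryable, not a file system" means in practice. The field names (`authority_tier`, `legal_status`, `superseded_by`, and so on) are illustrative assumptions, not RegIntel's actual schema; the point is that the supersession chain is walkable data, not folder structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RegistryEntry:
    doc_id: str
    title: str
    issuing_body: str          # e.g. "RBI", "SEBI"
    authority_tier: str        # "T1" (enacted law) down to "T5" (no legal force)
    legal_status: str          # "active" | "superseded" | "proposed"
    effective_from: str        # ISO date the document became operative
    superseded_by: Optional[str] = None  # doc_id of the replacing document

def current_version(registry: dict[str, RegistryEntry], doc_id: str) -> RegistryEntry:
    """Follow the supersession chain until the latest active document."""
    entry = registry[doc_id]
    while entry.superseded_by is not None:
        entry = registry[entry.superseded_by]
    return entry
```

Given an old circular whose `superseded_by` points at a Master Direction, `current_version` always resolves to the Master Direction, regardless of which document a similarity search happened to surface first.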
Extraction Pipeline
An AI pipeline reads regulatory prose and extracts structured obligations — obligation_type, subject, action, trigger, threshold, effective_from — stored as rows, not text.
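To make "stored as rows, not text" concrete, here is a hedged sketch of what one extracted obligation might look like, plus a validator that rejects incomplete extractions. The schema and the sample values are illustrative assumptions based on the fields named above, not the pipeline's real output.

```python
# Target schema for one extracted obligation (field names and values are illustrative).
obligation = {
    "obligation_type": "MANDATORY",
    "subject": "NBFC",
    "action": "report digital lending metrics",
    "trigger": "monthly",
    "threshold": None,
    "effective_from": "2022-09-02",
    "source_doc": "RBI/2022-23/111",  # hypothetical citation
}

REQUIRED = {"obligation_type", "subject", "action", "trigger", "effective_from"}

def validate(row: dict) -> bool:
    """A row missing a required field is not an obligation -- it is half of one."""
    return REQUIRED.issubset(row) and \
        row["obligation_type"] in {"MANDATORY", "PROHIBITED", "CONDITIONAL"}
```

The validator is the cheap half of the pipeline; the expensive half is the LLM pass that turns prose into these rows. But without the validation gate, a half-extracted obligation flows silently into the rule engine.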
Persona-Aware Retrieval
GraphRAG retrieval routes queries differently by persona. CA: enacted law only, current FY. Policy team: proposed changes only. Dev team: structured JSON rules. Compliance officer: all domains, cross-referenced.
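The persona routing described above can be sketched as a lookup of pre-search filters. The filter keys and values here are assumptions for illustration; the real router presumably carries richer constraints (fiscal year windows, domain scoping), but the shape is the same: the persona decides the filter before any similarity search runs.

```python
# Illustrative persona -> metadata-filter mapping (keys/values are assumptions).
PERSONA_FILTERS = {
    "ca":         {"legal_status": "active", "exclude_tiers": ["T5"]},  # enacted law only
    "policy":     {"legal_status": "proposed"},                         # proposed changes only
    "dev":        {"legal_status": "active", "format": "structured_json"},
    "compliance": {"legal_status": "active"},  # all domains; cross-referencing happens downstream
}

def route(query: str, persona: str) -> dict:
    """Attach the persona's pre-search filters to the query before retrieval."""
    if persona not in PERSONA_FILTERS:
        raise ValueError(f"unknown persona: {persona}")
    return {"query": query, "filters": PERSONA_FILTERS[persona]}
```

The design point: the same query string produces different candidate sets per persona, by construction rather than by prompt engineering.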
Provenance Layer
Every answer traces back to its source — document, section, version, operative date. The user sees the citation. The audit log records it. Every answer is verifiable.
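A provenance record can be as simple as a frozen tuple of the four attributes listed above, attached to every answer and written verbatim to the audit log. This is a sketch with assumed field names, not RegIntel's actual record format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class Provenance:
    doc_id: str          # which document
    section: str         # which section
    version: str         # which version
    operative_date: str  # date the cited text was operative

def cite(answer: str, sources: list[Provenance]) -> dict:
    """Bundle an answer with its citations; the same dict goes to the audit log."""
    return {"answer": answer, "citations": [asdict(s) for s in sources]}
```

Because the record is frozen and JSON-serializable, the citation the user sees and the entry the auditor reads are guaranteed to be the same object.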
Before and After — The Same Query, Two Different Systems
What retrieval looks like with and without the KE layer
Query: "What RBI rules apply to digital lending for NBFCs?"
Without the KE layer (generic RAG) vs. with the RegIntel KE layer:
The KE layer does not change the AI model. It changes what the model is allowed to see. Superseded documents are excluded by pre-filter. T5 documents are excluded for CA persona. Cross-domain results are assembled by the query router — not left to chance similarity scoring.
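The "allowed to see" idea reduces to a pre-filter that runs before the similarity search. A minimal sketch, assuming the illustrative `legal_status` and `authority_tier` fields used above:

```python
def prefilter(docs: list[dict], persona: str) -> list[dict]:
    """Apply metadata filters BEFORE similarity scoring: superseded documents
    never enter the candidate set, and T5 material is excluded for the CA persona."""
    candidates = [d for d in docs if d["legal_status"] == "active"]
    if persona == "ca":
        candidates = [d for d in candidates if d["authority_tier"] != "T5"]
    return candidates
```

Similarity scoring then runs only over `candidates`. An 8-year-old superseded circular can be arbitrarily similar to the query; it is simply not in the set the model ranks.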
Key Terms in Plain English
Every technical term used in this series — what it means and how it is used in RegIntel
If you are new to AI engineering or to Indian financial regulation, this table explains every term. Read it once — these terms appear throughout Parts 4 and 5.
| Term | Full Name | Plain English | How It Is Used in RegIntel |
|---|---|---|---|
| RAG | Retrieval-Augmented Generation | A technique where an AI first searches a document library to find relevant passages, then uses those passages to write an answer. Like an open-book exam — the AI "reads" before answering. | The starting point before adding a Knowledge Engineering layer. Fails in 6 ways on regulated domains without metadata. |
| Supersession | Document Supersession | When a newer document replaces an older one and the older one is no longer valid law. The RBI Master Direction on Digital Lending superseded 23 earlier circulars — but both still exist in the document corpus. | One of the 6 core failure modes. A system without KE returns superseded documents as if they are current law. |
| Knowledge Graph | Graph-structured Knowledge Base | A database that stores not just facts, but the relationships between facts. "Document A supersedes Document B" is a relationship, not just a fact. A graph can traverse these. A flat vector store cannot. | Used in GraphRAG — the recommended retrieval architecture for cross-domain regulatory queries. |
| Vector Embedding | Numerical Vector Representation | A way of turning text into numbers so a computer can measure how "similar" two pieces of text are. A question about KYC will be numerically close to a KYC circular — even if that circular is 8 years old and superseded. | The core mechanism behind RAG. Its strength (finds similar text) is also its weakness (cannot see legal hierarchy). |
| Authority Tier | Document Authority Classification | A ranking system for how much legal weight a document carries. T1 (enacted law like the IT Act) is highest. T5 (CBDT FAQ, Budget Speech) has zero legal authority. A system without tiers treats all documents equally. | The fix for Failure Mode 02 (Proposal/Enactment Gap). All retrieval pre-filters on authority tier. |
| GraphRAG | Graph-enhanced Retrieval-Augmented Generation | RAG combined with a knowledge graph. Instead of only finding similar text, the system can also traverse explicit relationships — "what documents does Finance Act 2025 amend?" is a graph traversal, not a text search. | Recommended production architecture. Handles all 6 failure modes including the cross-domain problem. |
| Metadata Filter | Pre-search Attribute Filter | A condition applied before the similarity search runs. WHERE status = active removes superseded documents before the AI even looks at them. Like filtering a spreadsheet before running a formula. | The practical implementation of most KE fixes. Fast, cheap, handles 5 of the 6 failure modes when schema is correct. |
| Provenance | Answer Source Attribution | Knowing exactly where an answer came from — which document, which section, which version, on what date it was operative. Without provenance, an AI answer cannot be audited or verified by a lawyer or CA. | Required for compliance-grade AI systems. Full-Stack team owns the provenance display UI. |
| Chunking | Document Segmentation Strategy | How a large document is split into smaller pieces before embedding. A bad strategy (e.g. fixed 512 tokens) can split a legal obligation mid-sentence. The AI then retrieves half an obligation and misses the condition that triggers it. | AI/LLM Engineering team owns chunking strategy. Section-boundary chunking recommended for regulatory documents. |
| Obligation Extraction | Structured Rule Extraction from Text | Pulling structured rules out of legal prose — "NBFC must report digital lending metrics monthly" becomes a JSON object with obligation_type: MANDATORY, subject: NBFC, action: report, trigger: monthly. | Core deliverable of the extraction pipeline. This is what the Dev Team consumes to build rule engines. |
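The chunking entry in the table deserves one concrete illustration. A minimal sketch of section-boundary chunking, assuming the common Indian regulatory numbering style ("12.", "12.1."); real documents will need a more robust heading detector:

```python
import re

def chunk_by_section(text: str) -> list[str]:
    """Split on numbered section headings rather than fixed token windows,
    so an obligation and the condition that triggers it stay in one chunk."""
    parts = re.split(r"\n(?=\d+(?:\.\d+)*\.\s)", text)
    return [p.strip() for p in parts if p.strip()]
```

Compared with a fixed 512-token window, this can produce uneven chunk sizes, but it never slices a legal obligation away from its trigger mid-sentence.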