Sumit Arora

Full-Stack Architect

Brisbane, Australia
March 2026
7 min read · Business Solution · RegIntel — Part 3 of 6

Part 3 — How RegIntel Works: The Knowledge Engineering Layer

Where the KE layer fits between raw documents and user queries. A before/after comparison shows exactly what it changes. Plain-English glossary of every technical term you will encounter.

1. Where the Knowledge Engineering Layer Fits

Three nodes. One layer between raw documents and AI applications.

The Knowledge Engineering layer is what sits between the raw regulatory corpus (PDFs, Gazettes, Circulars) and the AI applications that serve each user. Without it, the AI reads raw text directly. With it, the AI reads structured, classified, versioned knowledge.

Regulatory Corpus
200+ documents · 6 regulators · PDFs, Gazettes · no structure

        ↓

Knowledge Engineering Layer (this is what RegIntel builds)
Document registry · Authority hierarchy · Extraction pipeline · Graph + embeddings

        ↓

AI Applications
Compliance Q&A · Dev rule engines · Policy dashboards · CA advisory tools

Without the KE layer, the AI application connects directly to the raw corpus — and produces the 6 failure modes described in Part 2. With the KE layer, every document is classified, versioned, de-duplicated, and schema-tagged before the AI ever sees it.

Document Registry

Every document classified by authority tier, issuing body, legal status, effective dates, and supersession chain. The registry is queryable — not a file system.
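To make "queryable — not a file system" concrete, here is a minimal sketch of a registry. The field names, document IDs, dates, and tier assignments are illustrative assumptions, not the real RegIntel schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RegistryEntry:
    doc_id: str
    title: str
    authority_tier: str   # "T1" (enacted law) .. "T5" (no legal authority)
    issuing_body: str
    legal_status: str     # "active" or "superseded"
    effective_from: date
    supersedes: list      # doc_ids this document replaces (supersession chain)

# Two illustrative entries; dates and tiers are invented for the example.
REGISTRY = [
    RegistryEntry("rbi-2017-dl", "RBI Circular on Digital Lending (2017)",
                  "T3", "RBI", "superseded", date(2017, 6, 1), []),
    RegistryEntry("rbi-2022-md-dl", "RBI Master Direction on Digital Lending",
                  "T2", "RBI", "active", date(2022, 9, 2), ["rbi-2017-dl"]),
]

def query(registry, status=None, max_tier=None):
    # Filter like a database table, not a file-system walk.
    results = registry
    if status is not None:
        results = [d for d in results if d.legal_status == status]
    if max_tier is not None:
        # "T1" < "T2" < ... < "T5" compares correctly as strings.
        results = [d for d in results if d.authority_tier <= max_tier]
    return results

current_law = query(REGISTRY, status="active", max_tier="T2")
```

A query like `query(REGISTRY, status="active", max_tier="T2")` returns only operative, high-authority documents; the superseded 2017 circular never surfaces.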

Extraction Pipeline

An AI pipeline reads regulatory prose and extracts structured obligations — obligation_type, subject, action, trigger, threshold, effective_from — stored as rows, not text.
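The fields listed above suggest what one extracted row might look like. The values and the source pointer below are made up for illustration:

```python
# One extracted obligation, stored as structured data rather than prose.
# Field names follow the list above; every value here is invented.
obligation = {
    "obligation_type": "MANDATORY",   # vs. e.g. CONDITIONAL, PROHIBITED
    "subject": "NBFC",
    "action": "report digital lending metrics",
    "trigger": "monthly",
    "threshold": None,                # no monetary threshold applies
    "effective_from": "2022-09-02",
    "source": {"doc_id": "rbi-2022-md-dl", "section": "X.2"},
}

REQUIRED_FIELDS = {"obligation_type", "subject", "action",
                   "trigger", "threshold", "effective_from"}

def is_valid(row):
    # A row is only loadable if every schema field is present.
    return REQUIRED_FIELDS <= row.keys()
```

Because the obligation is a row, not a paragraph, a downstream rule engine can filter on `subject` or `trigger` without re-reading the legal prose.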

Persona-Aware Retrieval

GraphRAG retrieval routes queries differently by persona. CA: enacted law only, current FY. Policy team: proposed changes only. Dev team: structured JSON rules. Compliance officer: all domains, cross-referenced.
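One way to sketch that routing: each persona maps to a filter applied before retrieval. The persona names, tier sets, and status values below are assumptions for illustration, not the actual RegIntel configuration:

```python
# Each persona maps to a retrieval filter applied BEFORE similarity search.
PERSONA_FILTERS = {
    "ca":         {"tiers": {"T1", "T2"}, "status": "active"},    # enacted, operative law only
    "policy":     {"tiers": {"T3", "T4"}, "status": "proposed"},  # proposed changes only
    "dev":        {"tiers": {"T1", "T2"}, "status": "active"},    # feeds structured rule engines
    "compliance": {"tiers": {"T1", "T2", "T3", "T4", "T5"}, "status": "active"},
}

def route(persona, documents):
    # Apply the persona's filter; only surviving documents reach retrieval.
    f = PERSONA_FILTERS[persona]
    return [d for d in documents
            if d["tier"] in f["tiers"] and d["status"] == f["status"]]

docs = [
    {"id": "a", "tier": "T2", "status": "active"},
    {"id": "b", "tier": "T5", "status": "active"},
    {"id": "c", "tier": "T3", "status": "proposed"},
]
```

Here `route("ca", docs)` returns only document `a`; the T5 FAQ and the proposal never reach the CA's answer.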

Provenance Layer

Every answer traces back to its source — document, section, version, operative date. The user sees the citation. The audit log records it. Every answer is verifiable.
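A sketch of the idea: the answer object never travels without its citations, and the same record serves both the UI and the audit log. The class shapes are assumptions for illustration:

```python
from dataclasses import dataclass
import json

@dataclass
class Citation:
    doc_id: str
    section: str
    version: str
    operative_date: str   # ISO date the cited version was in force

@dataclass
class Answer:
    text: str
    citations: list       # every claim traces back to at least one Citation

ans = Answer(
    text="NBFCs must report digital lending metrics monthly.",
    citations=[Citation("rbi-2022-md-dl", "X.2", "2022-09", "2022-09-02")],
)

# One serialization feeds the rendered citation AND the audit log entry.
audit_entry = json.dumps({
    "answer": ans.text,
    "citations": [vars(c) for c in ans.citations],
})
```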

2. Before and After — The Same Query, Two Different Systems

What retrieval looks like with and without the KE layer

Query: "What RBI rules apply to digital lending for NBFCs?"

Without KE Layer — Generic RAG

1. RBI Circular on Digital Lending (2017) (superseded)
2. RBI FAQ on Fintech (2019) (T5 — no authority)
3. RBI Press Release on P2P Lending (2020) (not applicable)
4. NBFC Circular (2018) (superseded)
5. RBI Master Direction — Digital Lending (2022) (correct — but buried at #5)

With RegIntel KE Layer

1. RBI Master Direction on Digital Lending (2022, active) (T2 operative)
2. RBI NBFC Regulations — applicable sections (T2 operative)
3. IT Act 1961 §194A — TDS on interest (T1 enacted)
4. CGST Act — GST on processing fees (T1 enacted)
5. PMLA 2002 + FIU-IND KYC Guidelines (cross-domain)

The KE layer does not change the AI model. It changes what the model is allowed to see. Superseded documents are excluded by pre-filter. T5 documents are excluded for CA persona. Cross-domain results are assembled by the query router — not left to chance similarity scoring.
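The filter-then-score order described above can be sketched as follows. The scoring function is a toy stand-in for a real vector search, and the status and tier values are assumptions:

```python
def prefilter(docs, persona):
    # Superseded documents never reach scoring; T5 is hidden from the CA persona.
    tiers = {"T1", "T2"} if persona == "ca" else {"T1", "T2", "T3", "T4", "T5"}
    return [d for d in docs if d["status"] == "active" and d["tier"] in tiers]

def retrieve(query_terms, docs, persona):
    candidates = prefilter(docs, persona)   # metadata filter runs FIRST
    def score(d):                           # toy stand-in for vector similarity
        return len(query_terms & set(d["text"].split()))
    return sorted(candidates, key=score, reverse=True)

docs = [
    {"id": "old", "tier": "T3", "status": "superseded",
     "text": "digital lending circular"},
    {"id": "md",  "tier": "T2", "status": "active",
     "text": "digital lending master direction"},
    {"id": "faq", "tier": "T5", "status": "active",
     "text": "digital lending faq"},
]
hits = retrieve({"digital", "lending"}, docs, persona="ca")
```

All three documents are "similar" to the query, but for the CA persona only the active Master Direction survives the pre-filter, so similarity scoring can no longer resurrect superseded or zero-authority text.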

3. Key Terms in Plain English

Every technical term used in this series — what it means and how it is used in RegIntel

If you are new to AI engineering or to Indian financial regulation, this table explains every term. Read it once — these terms appear throughout Parts 4 and 5.

RAG (Retrieval-Augmented Generation)
A technique where an AI first searches a document library to find relevant passages, then uses those passages to write an answer. Like an open-book exam — the AI "reads" before answering.
In RegIntel: the starting point before adding a Knowledge Engineering layer. Fails in 6 ways on regulated domains without metadata.

Supersession (Document Supersession)
When a newer document replaces an older one and the older one is no longer valid law. The RBI Master Direction on Digital Lending superseded 23 earlier circulars — but those superseded circulars still exist in the document corpus.
In RegIntel: one of the 6 core failure modes. A system without KE returns superseded documents as if they were current law.

Knowledge Graph (Graph-structured Knowledge Base)
A database that stores not just facts, but the relationships between facts. "Document A supersedes Document B" is a relationship, not just a fact. A graph can traverse these relationships; a flat vector store cannot.
In RegIntel: used in GraphRAG — the recommended retrieval architecture for cross-domain regulatory queries.

Vector Embedding (Numerical Vector Representation)
A way of turning text into numbers so a computer can measure how "similar" two pieces of text are. A question about KYC will be numerically close to a KYC circular — even if that circular is 8 years old and superseded.
In RegIntel: the core mechanism behind RAG. Its strength (finds similar text) is also its weakness (cannot see legal hierarchy).

Authority Tier (Document Authority Classification)
A ranking of how much legal weight a document carries. T1 (enacted law like the IT Act) is highest. T5 (a CBDT FAQ, a Budget Speech) has zero legal authority. A system without tiers treats all documents equally.
In RegIntel: the fix for Failure Mode 02 (the proposal/enactment gap). All retrieval pre-filters on authority tier.

GraphRAG (Graph-enhanced Retrieval-Augmented Generation)
RAG combined with a knowledge graph. Instead of only finding similar text, the system can also traverse explicit relationships — "what documents does the Finance Act 2025 amend?" is a graph traversal, not a text search.
In RegIntel: the recommended production architecture. Handles all 6 failure modes, including the cross-domain problem.

Metadata Filter (Pre-search Attribute Filter)
A condition applied before the similarity search runs. WHERE status = active removes superseded documents before the AI even looks at them. Like filtering a spreadsheet before running a formula.
In RegIntel: the practical implementation of most KE fixes. Fast, cheap, and handles 5 of the 6 failure modes when the schema is correct.

Provenance (Answer Source Attribution)
Knowing exactly where an answer came from — which document, which section, which version, and on what date it was operative. Without provenance, an AI answer cannot be audited or verified by a lawyer or CA.
In RegIntel: required for compliance-grade AI systems. The Full-Stack team owns the provenance display UI.

Chunking (Document Segmentation Strategy)
How a large document is split into smaller pieces before embedding. A bad strategy (e.g. fixed 512-token windows) can split a legal obligation mid-sentence; the AI then retrieves half an obligation and misses the condition that triggers it.
In RegIntel: the AI/LLM Engineering team owns the chunking strategy. Section-boundary chunking is recommended for regulatory documents.

Obligation Extraction (Structured Rule Extraction from Text)
Pulling structured rules out of legal prose — "NBFC must report digital lending metrics monthly" becomes a JSON object with obligation_type: MANDATORY, subject: NBFC, action: report, trigger: monthly.
In RegIntel: the core deliverable of the extraction pipeline. This is what the Dev Team consumes to build rule engines.
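The chunking entry above contrasts fixed-size windows with section boundaries. A minimal section-boundary splitter might look like this — the heading regex and the sample text are assumptions for illustration:

```python
import re

def chunk_by_section(text):
    # Split BEFORE each section heading instead of every N tokens, so an
    # obligation and the condition that triggers it stay in one chunk.
    parts = re.split(r"\n(?=Section\s+\d)", text)
    return [p.strip() for p in parts if p.strip()]

doc = ("Section 1. NBFCs must report digital lending metrics monthly.\n"
       "Section 2. The reporting threshold applies above the notified limit.")
chunks = chunk_by_section(doc)
```

Each chunk now begins at a section heading, so an embedded chunk carries a complete obligation rather than an arbitrary 512-token slice of one.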

Want to explore what you can build or achieve?

Whether it is a product idea, a compliance challenge, or an engineering question — let's talk through it.