Part 2 — The Document Universe: What You Are Actually Working With

200+ document types. 6 regulators. Each document type carries different legal weight, and generic AI fails on each one in a specific way.


The Document Inventory — All 6 Regulators


Most teams skip this step: they grab a few PDFs, embed them, and start building. The "RAG Danger" column of the inventory shows why that fails. Each entry is a class of query your system will receive in production.

What This Inventory Tells You

Every "RAG Danger" entry is a class of query your system will receive from real users. The CBDT FAQ problem is not rare — FAQs rank first on Google. The supersession problem hits every RBI query. These are the default failure modes, not exceptions. The knowledge layer must handle all of them.

You Try the Obvious Thing — and Here Is What Happens

6 structural failure modes. Each one has a problem, a reason RAG misses it, and a schema fix.

The natural first move is RAG. For many domains this works. For this document universe it fails in six distinct ways. None are the model's fault. The failures are structural properties of the corpus that RAG alone cannot see.

What RAG Does and Does Not Know

RAG knows: which passages are semantically similar to the query. RAG does not know: whether the passage is superseded, whether it is proposed or enacted, whether the same text is on three portals, whether the guidance binds the taxpayer or only the department, whether the answer needs five regulators combined. These are knowledge structure failures — not model failures.
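The point above can be shown with a toy example. This is a minimal sketch, not a real retrieval stack: word-count cosine similarity stands in for an embedding model, and the two passages, their ids, and their wording are hypothetical. The two passages differ only in legal status, which lives in metadata that the similarity score never reads.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


query = "maximum loan to value ratio for gold loans"

# Hypothetical passages: near-identical wording, different legal status.
passages = [
    {"id": "old-circular", "status": "superseded",
     "text": "the maximum loan to value ratio for gold loans is set out in this circular"},
    {"id": "new-direction", "status": "active",
     "text": "the maximum loan to value ratio for gold loans is set out in this direction"},
]

# The score is computed from text alone; the "status" field plays no part.
scores = {p["id"]: cosine(query, p["text"]) for p in passages}
print(scores)
```

Both passages receive the same score, so a similarity-only retriever has no basis for preferring the active document over the superseded one.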

Why the Failures Compound

A single compliance query can trigger several failure modes at once. The fixes therefore compose into one document schema and one layered retrieval architecture: you design the schema that encodes all six, then build the retrieval layer on top. That is the knowledge engineering layer, covered in Part 3.
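One way to picture a layered retrieval step is as metadata filters applied to similarity candidates before ranking. The sketch below is illustrative only and is not the Part 3 design: the function name `layered_retrieve`, the candidate dictionaries, the document ids, and the assumption that T1 is the highest authority tier are all hypothetical. The `binding_on` field is not filtered here; in this sketch it would be left to the answer-generation step.

```python
from datetime import date

# Assumption: lower tier number means higher authority.
TIER_RANK = {"T1": 1, "T2": 2, "T3": 3, "T4": 4, "T5": 5}


def layered_retrieve(candidates, as_of, domains):
    """Filter similarity candidates on metadata, then rank the survivors."""
    kept = {}
    for c in candidates:
        if c["status"] != "active":        # 01: drop superseded documents
            continue
        if c["effective_from"] > as_of:    # 03: drop versions not yet in force
            continue
        if c["domain"] not in domains:     # 06: keep only regulators in scope
            continue
        key = c["duplicate_of"] or c["doc_id"]  # 04: collapse portal copies
        best = kept.get(key)
        if best is None or c["score"] > best["score"]:
            kept[key] = c
    # 02: higher authority first, similarity score as the tie-breaker
    return sorted(
        kept.values(),
        key=lambda c: (TIER_RANK[c["authority_tier"]], -c["score"]),
    )


def doc(doc_id, status, eff, domain, tier, score, dup=None):
    """Build a hypothetical candidate record."""
    return {"doc_id": doc_id, "status": status, "effective_from": eff,
            "domain": domain, "authority_tier": tier, "score": score,
            "duplicate_of": dup}


candidates = [
    doc("rbi-old", "superseded", date(2020, 1, 1), "RBI", "T1", 0.95),
    doc("rbi-current", "active", date(2022, 1, 1), "RBI", "T1", 0.90),
    doc("rbi-current-mirror", "active", date(2022, 1, 1), "RBI", "T1", 0.89,
        dup="rbi-current"),
    doc("cbdt-faq", "active", date(2025, 4, 1), "CBDT", "T4", 0.97),
    doc("rbi-future", "active", date(2026, 4, 1), "RBI", "T1", 0.99),
]

results = layered_retrieve(candidates, as_of=date(2025, 6, 1),
                           domains={"RBI", "CBDT"})
ids = [r["doc_id"] for r in results]
print(ids)
```

In this sample the three highest-scoring candidates by similarity alone are a not-yet-effective version, a low-tier FAQ, and a superseded circular. After filtering, the active high-authority document ranks first even though its raw score is lower.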

The Document Schema — What Fixes All Six Problems

Every failure mode maps to one or more schema fields

Each failure mode adds fields to the document schema. Together they form the metadata layer that makes retrieval accurate. This schema is what Data Science engineers design and own.

Field	Type	Fixes Failure Mode	Example Value
status	enum	01 — Supersession	active | superseded
superseded_by	doc_id	01 — Supersession	RBI/2022/MD-DL-001
authority_tier	enum	02 — Proposal/Enactment Gap	T1 | T2 | T3 | T4 | T5
effective_from	date	03 — Version Problem	2025-04-01
effective_for_fy	string	03 — Version Problem	FY2025-26
canonical_url	url	04 — Duplication	https://rbi.org.in/...
duplicate_of	doc_id	04 — Duplication	null or parent_id
binding_on	enum	05 — Hierarchy Problem	all | department_only | applicant_only | persuasive
domain	enum	06 — Cross-Domain Problem	CBDT | CBIC | RBI | SEBI | IRDAI | MCA | FIU
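As a rough illustration, the fields in the table could be expressed as a typed record. This is a sketch under stated assumptions, not the project's actual schema code: the class name `RegDocument`, the `doc_id` field, the sample id `RBI/EXAMPLE-OLD`, and the rule that `superseded_by` is set exactly when `status` is `superseded` are all hypothetical. The field names and allowed values come from the table.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

TIERS = {"T1", "T2", "T3", "T4", "T5"}
DOMAINS = {"CBDT", "CBIC", "RBI", "SEBI", "IRDAI", "MCA", "FIU"}


class Status(Enum):
    ACTIVE = "active"
    SUPERSEDED = "superseded"


class BindingOn(Enum):
    ALL = "all"
    DEPARTMENT_ONLY = "department_only"
    APPLICANT_ONLY = "applicant_only"
    PERSUASIVE = "persuasive"


@dataclass(frozen=True)
class RegDocument:
    doc_id: str                 # hypothetical primary key, not in the table
    status: Status              # 01: supersession
    authority_tier: str         # 02: proposal/enactment gap
    effective_from: date        # 03: version problem
    effective_for_fy: str       # 03: version problem
    canonical_url: str          # 04: duplication
    binding_on: BindingOn       # 05: hierarchy problem
    domain: str                 # 06: cross-domain problem
    superseded_by: Optional[str] = None  # 01: supersession
    duplicate_of: Optional[str] = None   # 04: duplication

    def __post_init__(self):
        if self.authority_tier not in TIERS:
            raise ValueError(f"unknown authority_tier: {self.authority_tier}")
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        # Assumed invariant: a successor is recorded only for superseded docs.
        if (self.status is Status.SUPERSEDED) != (self.superseded_by is not None):
            raise ValueError("superseded_by must be set exactly when superseded")


old = RegDocument(
    doc_id="RBI/EXAMPLE-OLD",  # hypothetical id
    status=Status.SUPERSEDED,
    authority_tier="T1",
    effective_from=date(2025, 4, 1),
    effective_for_fy="FY2025-26",
    canonical_url="https://rbi.org.in/...",  # elided in the source table
    binding_on=BindingOn.ALL,
    domain="RBI",
    superseded_by="RBI/2022/MD-DL-001",
)
```

Validating at construction time means a document with an unknown tier or a missing successor is rejected at ingestion rather than surfacing later as a wrong answer.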

This schema is designed in a planning sprint before any code is written. It is not a database detail — it is an architectural decision that determines what questions the system can and cannot answer accurately.

