Part 2 — The Document Universe: What You Are Actually Working With
200+ document types. 6 regulators. Each document carries different legal weight, and generic AI fails on each document type in its own specific way.
The Document Inventory — All 6 Regulators
Most teams skip this step: they grab a few PDFs, embed them, and start building. The "RAG Danger" column explains why that approach fails. Each entry is a class of query your system will receive in production.
What This Inventory Tells You
Every "RAG Danger" entry is a class of query your system will receive from real users. The CBDT FAQ problem is not rare — FAQs rank first on Google. The supersession problem hits every RBI query. These are the default failure modes, not exceptions. The knowledge layer must handle all of them.
You Try the Obvious Thing — and Here Is What Happens
6 structural failure modes. Each one comes with the problem, the reason RAG misses it, and the schema fix.
The natural first move is RAG. For many domains this works. For this document universe it fails in six distinct ways. None are the model's fault. The failures are structural properties of the corpus that RAG alone cannot see.
What RAG Does and Does Not Know
RAG knows: which passages are semantically similar to the query. RAG does not know: whether the passage is superseded, whether it is proposed or enacted, whether the same text is on three portals, whether the guidance binds the taxpayer or only the department, whether the answer needs five regulators combined. These are knowledge structure failures — not model failures.
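A minimal sketch of this blind spot, assuming a toy bag-of-words cosine similarity as a stand-in for a real embedding model. The two passages and their `status` labels are hypothetical; the point is only that the score is computed from text alone, so metadata never enters it.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity; a stand-in for an embedding model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical passages: identical wording, different legal standing.
passages = [
    {"text": "lenders must disclose the annual percentage rate", "status": "superseded"},
    {"text": "lenders must disclose the annual percentage rate", "status": "active"},
]

query = "must lenders disclose the annual percentage rate"
scores = [cosine_similarity(query, p["text"]) for p in passages]

# Similarity is computed from text only, so the status field
# has no way to influence the ranking of these two passages.
for passage, score in zip(passages, scores):
    print(f"{passage['status']:<11} score={score:.3f}")
```

Both passages receive the same score, so a retriever that ranks on similarity alone has no basis for preferring the active one.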
Why the Failures Compound
A single compliance query simultaneously triggers multiple failure modes. The fixes compose into one document schema and one layered retrieval architecture. You design the schema that encodes all six, then build the retrieval layer on top. That is the knowledge engineering layer — covered in Part 3.
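One way the fixes might compose, offered as a sketch rather than the Part 3 design: each layer is a plain metadata filter applied before similarity ranking. The field names follow the schema fields used in this part; the `Doc` class, the `retrieve` function, the sample records, and the `score` values are all hypothetical, and which authority tiers count as enacted is left to the caller because the tiers are not defined here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    domain: str
    status: str                  # "active" or "superseded"
    binding_on: str              # e.g. "all", "department_only"
    authority_tier: str          # "T1" .. "T5"
    effective_from: date
    duplicate_of: Optional[str]  # None for the canonical copy
    score: float                 # similarity score, assumed to come from the retriever


def retrieve(docs, domains, as_of, allowed_tiers, accepted_binding=("all",), k=3):
    """Apply the metadata layers first, then rank the survivors by similarity."""
    candidates = [
        d for d in docs
        if d.status == "active"                # 01: drop superseded documents
        and d.authority_tier in allowed_tiers  # 02: caller decides which tiers are enacted
        and d.effective_from <= as_of          # 03: only versions already in force
        and d.duplicate_of is None             # 04: keep canonical copies only
        and d.binding_on in accepted_binding   # 05: guidance that binds the asker
        and d.domain in domains                # 06: every regulator the query needs
    ]
    return sorted(candidates, key=lambda d: d.score, reverse=True)[:k]


# Hypothetical corpus: each record trips a different layer.
docs = [
    Doc("A-old", "RBI", "superseded", "all", "T1", date(2020, 1, 1), None, 0.95),
    Doc("A-new", "RBI", "active", "all", "T1", date(2022, 1, 1), None, 0.90),
    Doc("A-new-mirror", "RBI", "active", "all", "T1", date(2022, 1, 1), "A-new", 0.90),
    Doc("B-faq", "CBDT", "active", "department_only", "T4", date(2021, 1, 1), None, 0.97),
    Doc("C-future", "SEBI", "active", "all", "T1", date(2030, 1, 1), None, 0.93),
    Doc("D-sebi", "SEBI", "active", "all", "T2", date(2023, 1, 1), None, 0.80),
]

results = retrieve(
    docs,
    domains={"RBI", "SEBI"},
    as_of=date(2025, 4, 1),
    allowed_tiers={"T1", "T2"},
)
print([d.doc_id for d in results])
```

In this toy corpus the three highest-scoring records are all filtered out before ranking, which is the compounding effect described above: similarity alone would have surfaced them first.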
The Document Schema — What Fixes All Six Problems
Every failure mode maps to one or more schema fields
Each failure mode adds fields to the document schema. Together they form the metadata layer that makes retrieval accurate. This schema is what Data Science engineers design and own.
| Field | Type | Fixes Failure Mode | Example Value |
|---|---|---|---|
| status | enum | 01 — Supersession | active / superseded |
| superseded_by | doc_id | 01 — Supersession | RBI/2022/MD-DL-001 |
| authority_tier | enum | 02 — Proposal/Enactment Gap | T1 / T2 / T3 / T4 / T5 |
| effective_from | date | 03 — Version Problem | 2025-04-01 |
| effective_for_fy | string | 03 — Version Problem | FY2025-26 |
| canonical_url | url | 04 — Duplication | https://rbi.org.in/... |
| duplicate_of | doc_id | 04 — Duplication | null or parent_id |
| binding_on | enum | 05 — Hierarchy Problem | all / department_only / applicant_only / persuasive |
| domain | enum | 06 — Cross-Domain Problem | CBDT / CBIC / RBI / SEBI / IRDAI / MCA / FIU |
This schema is designed in a planning sprint before any code is written. It is not a database detail — it is an architectural decision that determines what questions the system can and cannot answer accurately.
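As a sketch, the schema table above could be expressed as a typed record so that invalid metadata is rejected at ingestion. The class and enum names are hypothetical, and the one consistency rule shown (a superseded document must name its successor) is an assumption about how `status` and `superseded_by` relate, not something the table states.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    ACTIVE = "active"
    SUPERSEDED = "superseded"


class BindingOn(Enum):
    ALL = "all"
    DEPARTMENT_ONLY = "department_only"
    APPLICANT_ONLY = "applicant_only"
    PERSUASIVE = "persuasive"


# Allowed values copied from the table's example column.
AUTHORITY_TIERS = {"T1", "T2", "T3", "T4", "T5"}
DOMAINS = {"CBDT", "CBIC", "RBI", "SEBI", "IRDAI", "MCA", "FIU"}


@dataclass(frozen=True)
class DocumentMetadata:
    status: Status
    authority_tier: str
    effective_from: date
    effective_for_fy: str
    canonical_url: str
    binding_on: BindingOn
    domain: str
    superseded_by: Optional[str] = None  # doc_id of the replacing document
    duplicate_of: Optional[str] = None   # doc_id of the canonical copy, if any

    def __post_init__(self):
        if self.authority_tier not in AUTHORITY_TIERS:
            raise ValueError(f"unknown authority_tier: {self.authority_tier}")
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        # Assumed rule: a superseded document must point at its successor.
        if self.status is Status.SUPERSEDED and not self.superseded_by:
            raise ValueError("superseded document must set superseded_by")


# Example built from the table's example values.
meta = DocumentMetadata(
    status=Status.ACTIVE,
    authority_tier="T1",
    effective_from=date(2025, 4, 1),
    effective_for_fy="FY2025-26",
    canonical_url="https://rbi.org.in/...",
    binding_on=BindingOn.ALL,
    domain="RBI",
)
```

Validating at construction time means a record with a missing or misspelled field fails at ingestion, before it can surface in a retrieved answer.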