The RAG tutorial tells you to split your document into 512-token chunks with 50 tokens of overlap, embed them, and call it done. That works for FAQs and knowledge bases where each paragraph is self-contained. It does not work for documents where understanding section B requires context from section A, and section C only makes sense once you have both.
I hit this building a RAG-based HLD document generator on OCI. The corpus was Oracle's internal architecture library: structured documents with explicit cross-references, sections that build on each other, and meaning that exists at the document level rather than the paragraph level. Fixed-size chunking destroyed that structure. Here is what I replaced it with, and where the replacement still has gaps.
Why fixed-size chunking fails for hierarchical documents
Fixed-size chunking treats a document as a flat sequence of token windows. The assumption is that any 512 tokens will contain enough signal to answer a query against it. This holds for flat documents where paragraphs stand alone.
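To make the failure mode concrete, here is a minimal sketch of the sliding-window split. The function name `fixed_size_chunks` and the toy whitespace "tokens" are assumptions for illustration; a real pipeline would use the embedding model's tokenizer and the 512/50 settings.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens over the input, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Toy document: section A (A1..A5) followed by section B (B1..B5).
doc = "A1 A2 A3 A4 A5 B1 B2 B3 B4 B5".split()
chunks = fixed_size_chunks(doc, size=4, overlap=1)
# The second window is ['A4', 'A5', 'B1', 'B2']: the tail of one section
# glued to the head of the next, with no record of where the boundary was.
```

The splitter never looks at headings, so whether a window respects a section boundary is a matter of luck.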
Oracle HLD documents are not flat. A section on network architecture references security controls defined in a different section. The security section assumes compute topology context from yet another section. A chunk that cuts mid-section either includes half the context for a concept or buries a cross-reference whose definition lives in a different chunk entirely.
The failure I saw in practice: queries about OCI network patterns for PCI-DSS environments returned chunks that scored well on "network" and "PCI-DSS" but were sliced halfway through the section explaining the control requirements. The generated HLD sections were syntactically correct but structurally incomplete. Oracle architects were catching the gaps in review rather than at generation time, which partially defeated the purpose of the system.
Section-level chunking
The approach I switched to: chunk at section boundaries rather than token count boundaries. Each chunk is one logical section of the source document. If a section is large enough to cause problems in the context window, it splits at sub-section boundaries rather than at an arbitrary token count.
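A minimal sketch of that rule, assuming the section tree already exists. `Node`, `count_tokens`, and `max_tokens` are hypothetical names, and the whitespace token count is a crude stand-in for a real tokenizer. This is one way to read the rule, not the production code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    section_id: str
    heading: str
    content: str
    children: list["Node"] = field(default_factory=list)


def count_tokens(text: str) -> int:
    # Assumption: whitespace split approximates the embedding model's tokenizer.
    return len(text.split())


def subtree_text(node: Node) -> str:
    """Heading and body of a section followed by all of its sub-sections."""
    parts = [node.heading, node.content] + [subtree_text(c) for c in node.children]
    return "\n".join(p for p in parts if p)


def chunk_tree(node: Node, max_tokens: int) -> list[tuple[str, str]]:
    """Return (section_id, text) chunks, splitting only at sub-section boundaries."""
    whole = subtree_text(node)
    if count_tokens(whole) <= max_tokens or not node.children:
        # Fits, or there is no sub-heading to split on: keep it as one chunk.
        return [(node.section_id, whole)]
    # Too large: the section's own text is one chunk, each child is chunked recursively.
    own = "\n".join(p for p in (node.heading, node.content) if p)
    chunks = [(node.section_id, own)]
    for child in node.children:
        chunks.extend(chunk_tree(child, max_tokens))
    return chunks
```

Note the second branch of the first condition: a section with no sub-headings is returned whole even when it exceeds the budget, which is exactly the case that needs a separate fallback.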
This requires parsing the document structure rather than treating it as a flat string. For the Oracle architecture library, documents followed a consistent heading hierarchy. I extracted the structure using heading detection and built a section tree before chunking.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    doc_id: str
    section_id: str
    parent_section_id: Optional[str]
    sequence_index: int
    heading: str
    content: str
    level: int  # 1=H1, 2=H2, 3=H3


def extract_sections(doc_id: str, raw_text: str) -> list[Section]:
    """Split markdown-style text into one Section per H1-H3 heading."""
    sections: list[Section] = []
    current_heading: Optional[str] = None
    current_content: list[str] = []
    current_level = 0
    sequence = 0
    # Most recent section_id seen at each heading level on the current branch
    parent_by_level: dict[int, str] = {}

    def flush(heading: str, content: list[str], level: int, seq: int) -> Section:
        # Forget headings at this level or deeper: they belong to an earlier
        # branch and must not become parents of later sections.
        for stale in [lvl for lvl in parent_by_level if lvl >= level]:
            del parent_by_level[stale]
        # The nearest shallower heading is the parent (None for a top-level section).
        parent_id = parent_by_level[max(parent_by_level)] if parent_by_level else None
        section_id = f"{doc_id}::{seq}"
        parent_by_level[level] = section_id
        return Section(
            doc_id=doc_id,
            section_id=section_id,
            parent_section_id=parent_id,
            sequence_index=seq,
            heading=heading,
            content="\n".join(content).strip(),
            level=level,
        )

    for line in raw_text.splitlines():
        stripped = line.lstrip("#")
        level = len(line) - len(stripped)
        # A heading is 1-3 leading '#' characters; deeper headings stay in the content.
        if 0 < level <= 3:
            if current_heading is not None:
                sections.append(flush(current_heading, current_content, current_level, sequence))
                sequence += 1
            current_heading = stripped.strip()
            current_content = []
            current_level = level
        elif current_heading is not None:
            # Text before the first heading is dropped.
            current_content.append(line)

    if current_heading is not None:
        sections.append(flush(current_heading, current_content, current_level, sequence))
    return sections
The parent_section_id and sequence_index stored alongside each embedding are what make the approach useful beyond standard vector similarity search.
Two retrieval patterns that section metadata enables
Proximity retrieval. When a section scores highly on vector similarity, its immediate neighbors in document order -- sequence_index + 1 and sequence_index - 1 -- are fetched from the database and included in the context. These neighbors often contain the setup or the follow-through for the retrieved section without themselves scoring highly enough to appear in a standard top-K result.
For the PCI-DSS query: vector search returns the network architecture section. The proximity fetch pulls in the security controls section immediately before it. The generator receives the network pattern and its control requirements together rather than one without the other.
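The neighbor expansion itself is small. A minimal in-memory sketch, assuming hits and known sections are identified by `(doc_id, sequence_index)` pairs; the function name `expand_with_neighbors` is hypothetical.

```python
def expand_with_neighbors(
    hits: list[tuple[str, int]],
    known: set[tuple[str, int]],
) -> list[tuple[str, int]]:
    """Add sequence_index - 1 and + 1 for each hit, deduplicated, in document order."""
    selected: set[tuple[str, int]] = set()
    for doc_id, seq in hits:
        for candidate in (seq - 1, seq, seq + 1):
            # Membership check skips neighbors that fall off either end of the document.
            if (doc_id, candidate) in known:
                selected.add((doc_id, candidate))
    return sorted(selected)


known = {("hld", 0), ("hld", 1), ("hld", 2), ("hld", 3)}
context_keys = expand_with_neighbors([("hld", 3)], known)
# Section 3 is the last one in the document, so only its predecessor is added.
```

Sorting the result puts the sections back in document order, so the generator reads the setup before the section that depends on it.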
Parent context injection. When the retrieval hit is a sub-section (an H3 under an H2, for example), its parent section is fetched and prepended before generation. This gives the LLM section-level framing even when the hit itself is at the sub-section level.
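As a rough illustration of the prepend step, assuming sections are available as plain dicts keyed by `section_id`. The function name `build_context` and the bracketed labels are hypothetical; the real prompt layout is not shown in this post.

```python
from typing import Optional


def build_context(hit: dict, sections_by_id: dict[Optional[str], dict]) -> str:
    """Prepend the parent section's text to a retrieved sub-section."""
    # A top-level hit has parent_section_id None, which is simply not found here.
    parent = sections_by_id.get(hit.get("parent_section_id"))
    parts = []
    if parent is not None:
        parts.append(f"[Parent section: {parent['heading']}]\n{parent['content']}")
    parts.append(f"[Retrieved section: {hit['heading']}]\n{hit['content']}")
    return "\n\n".join(parts)


sections_by_id = {
    "hld::4": {"heading": "Network Architecture", "content": "Overview of the VCN layout.",
               "parent_section_id": None},
}
hit = {"heading": "Subnet Isolation", "content": "Cardholder data subnets are private.",
       "parent_section_id": "hld::4"}
context = build_context(hit, sections_by_id)
```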
Both patterns run in a single query against Oracle DB 23ai. The vector similarity search and the relational joins for proximity and parent context happen together rather than as sequential database calls:
WITH ranked AS (
    SELECT
        s.section_id,
        s.doc_id,
        s.parent_section_id,
        s.sequence_index,
        s.content,
        VECTOR_DISTANCE(s.embedding, :query_embedding, COSINE) AS score
    FROM sections s
    ORDER BY score ASC
    FETCH FIRST :top_k ROWS ONLY
),
parents AS (
    SELECT p.section_id, p.content AS parent_content
    FROM sections p
    WHERE p.section_id IN (
        SELECT r.parent_section_id FROM ranked r WHERE r.parent_section_id IS NOT NULL
    )
),
neighbors AS (
    SELECT n.section_id, n.sequence_index, n.content AS neighbor_content, n.doc_id
    FROM sections n
    WHERE (n.doc_id, n.sequence_index) IN (
        SELECT r.doc_id, r.sequence_index + 1 FROM ranked r
        UNION ALL
        SELECT r.doc_id, r.sequence_index - 1 FROM ranked r
    )
)
SELECT
    r.section_id,
    r.doc_id,
    r.sequence_index,
    r.content,
    r.score,
    p.parent_content,
    n.neighbor_content
FROM ranked r
LEFT JOIN parents p ON p.section_id = r.parent_section_id
LEFT JOIN neighbors n
    ON n.doc_id = r.doc_id
    AND ABS(n.sequence_index - r.sequence_index) = 1
ORDER BY r.score ASC
Co-locating the vector index with the relational section metadata in Oracle DB 23ai is what makes this single-query pattern possible. With a separate vector store and a relational database, the same retrieval requires at least two round-trips: one for the vector search and one for the parent and neighbor lookups.
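One consequence of the neighbor join is that a hit with two neighbors comes back as two rows. A minimal sketch of collapsing them on the client, assuming each row arrives as a tuple in the SELECT-list order with text columns already read into plain strings; `collapse_rows` is a hypothetical helper, not part of any driver.

```python
def collapse_rows(rows: list[tuple]) -> list[dict]:
    """Merge the one-row-per-neighbor result into one record per retrieved section."""
    hits: dict[str, dict] = {}
    for section_id, doc_id, seq, content, score, parent_content, neighbor_content in rows:
        hit = hits.setdefault(section_id, {
            "section_id": section_id,
            "doc_id": doc_id,
            "sequence_index": seq,
            "content": content,
            "score": score,
            "parent_content": parent_content,
            "neighbors": [],
        })
        # LEFT JOIN yields None when a section has no neighbor on that side.
        if neighbor_content is not None and neighbor_content not in hit["neighbors"]:
            hit["neighbors"].append(neighbor_content)
    # Lower cosine distance means a closer match, so sort ascending.
    return sorted(hits.values(), key=lambda h: h["score"])
```

The query as written does not select the neighbor's sequence_index, so this sketch cannot tell the preceding neighbor from the following one. Adding `n.sequence_index` to the final SELECT would let the client restore document order.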
What the numbers looked like
After switching from fixed-size to section-level chunking, the rate of architect-caught gaps in review dropped significantly enough that it changed how architects used the system: they went from treating generated output as a draft requiring structural correction to treating it as a draft requiring content refinement. That is a meaningful difference in trust level.
The 40% improvement in retrieval latency came partly from the single-query pattern replacing sequential calls and partly from section-level chunks being longer and more semantically complete. The retriever needs fewer chunks to provide useful context, which means fewer total embeddings fetched and fewer LLM tokens consumed per generation pass.
The near-zero hallucination rate I cite comes from two mechanisms that section-level chunking supports rather than from chunking alone: strict source grounding (every generated section cites the retrieved chunk it was based on) and a validation pass using a separate LLM call that checks generated output against Oracle HLD standards. Sections that fail validation are flagged for human review rather than silently included. The system does not suppress hallucinations -- it surfaces them.
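The source-grounding half of that can be checked mechanically. A minimal sketch, assuming each generated section carries the IDs of the chunks it cites; `GeneratedSection` and `review_flags` are hypothetical names, and the LLM validation pass against HLD standards is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class GeneratedSection:
    heading: str
    text: str
    cited_section_ids: list[str]


def review_flags(generated: GeneratedSection, retrieved_ids: set[str]) -> list[str]:
    """Return reasons to route a generated section to human review (empty if none)."""
    flags = []
    if not generated.cited_section_ids:
        flags.append("no source cited")
    unknown = [sid for sid in generated.cited_section_ids if sid not in retrieved_ids]
    if unknown:
        flags.append(f"cites sections that were not retrieved: {unknown}")
    return flags
```

A non-empty list means the section is flagged rather than dropped, which matches the surface-do-not-suppress behavior described above.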
Where this approach still has gaps
Section-level chunking is better than fixed-size for structured documents. It is not a complete solution.
Cross-document references. Oracle's architecture library has references across documents: "see the Security Reference Architecture for the control mapping." Section-level chunking within documents does not capture those relationships. A document graph index would be the right fix but adds significant complexity, and it was not in scope for the initial delivery.
Flat-prose sections that are genuinely too long. Some sections in the library run 3,000-5,000 tokens with no sub-headings -- solid paragraphs with no structural markers to split on. The current fallback for those is fixed-size splitting within the section. It is the approach I was trying to avoid, applied to the cases where there is no better option. I have not found a clean solution for those sections that does not involve manual editorial work to add structure.
Static index. The architecture library changes whenever Oracle updates its reference architectures, which happens regularly, but the vector index does not update automatically. It is currently refreshed manually on a weekly schedule. A webhook-triggered incremental update on document change would be the right design.
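The core of an incremental update is deciding which sections actually changed. A minimal sketch using content hashes, assuming the index stores one hash per `section_id`; `plan_refresh` and the stored-hash layout are hypothetical, and the webhook wiring is left out.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(
    current: dict[str, str],         # section_id -> current section text
    indexed_hashes: dict[str, str],  # section_id -> hash stored at last embedding
) -> tuple[list[str], list[str]]:
    """Return (section_ids to re-embed, section_ids to delete from the index)."""
    to_embed = sorted(
        sid for sid, text in current.items()
        if indexed_hashes.get(sid) != content_hash(text)  # new or changed
    )
    to_delete = sorted(sid for sid in indexed_hashes if sid not in current)
    return to_embed, to_delete
```

One caveat: with IDs of the form `doc_id::sequence`, inserting a section near the top of a document shifts every later ID, so most of that document would be re-embedded. IDs derived from the heading path would avoid that, at the cost of handling renamed headings.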
The correction-capture loop I would add first if rebuilding: when an architect corrects a generated section, that correction is a direct signal that retrieval produced insufficient context for that query. Capturing those corrections and using them to improve chunking boundaries and retrieval weights over time would make the system self-improving rather than static. It was the highest-leverage improvement I chose not to build in the initial version, and that choice was wrong. The system's quality ceiling is the quality of its initial chunking decisions, which are now fixed.
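The capture half of that loop does not need much. A minimal sketch, assuming each architect correction is logged with the query and the section IDs retrieval supplied; `Correction` and `sections_needing_attention` are hypothetical, and how the counts would feed back into chunk boundaries or retrieval weights is left open.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Correction:
    query: str
    retrieved_section_ids: list[str]  # what retrieval supplied for the corrected output
    note: str                         # the architect's description of what was missing


def sections_needing_attention(
    corrections: list[Correction], min_count: int = 2
) -> list[tuple[str, int]]:
    """Count how often each retrieved section was involved in a corrected generation."""
    counts = Counter(
        sid for c in corrections for sid in set(c.retrieved_section_ids)
    )
    return [(sid, n) for sid, n in counts.most_common() if n >= min_count]
```

Sections that keep appearing in corrected generations are the first candidates for a boundary review or for always fetching a wider neighborhood.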
The retrieval work described here informed the architecture I use for pipeline pattern matching in Pipeshift. The case study with full results is on the case studies page.