Section-Level vs Paragraph Chunking: I Benchmarked Both on 15k Technical Docs

The answer to "what chunk size should I use?" is always "it depends," and I've always found that answer useless. It depends on what, specifically, is the question. I ran a benchmark to find out.

The corpus: approximately 15,000 technical documents -- architecture guides, API references, runbooks, and internal engineering wikis -- accumulated across a few client engagements and my own infrastructure work. The embedding model: text-embedding-3-small from OpenAI. The vector database: Qdrant, running on a dedicated VM, not serverless. The two strategies under comparison: section-level chunking (splitting at heading boundaries, H1 through H3) and fixed-token paragraph chunking (512 tokens, 64-token overlap, no structural awareness).

My prior assumption going in: section-level would win on both recall and precision because it preserves semantic coherence. That assumption was half right. The part where I was wrong is the part worth talking about.

What the benchmark actually measured

I'll be upfront about the methodology so you can decide how much to trust the numbers.

I built an evaluation set of 320 queries by hand, sampling from document types proportionally. Each query was tagged with one of three query types: factual lookup (e.g., "what's the default timeout for the ingestion worker?"), conceptual explanation (e.g., "how does the auth service handle token refresh?"), or procedural (e.g., "what are the steps to rotate the signing key?"). I also tagged each by expected answer length: short (a single value or sentence), medium (a paragraph), or long (a multi-step process or a full section).

For each query, I ran retrieval against both chunking strategies and evaluated two metrics:

Recall@5: does the correct source passage appear in the top 5 retrieved chunks?
MRR (Mean Reciprocal Rank): the reciprocal of the rank at which the first relevant chunk appears in the top 5, averaged over all queries.
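
For concreteness, here is a minimal sketch of how these two metrics can be computed. It is not the scoring script used for the benchmark. It assumes each query comes with a ranked list of retrieved chunk ids and a set of chunk ids labelled relevant, and it assumes a query with no relevant chunk in the top 5 contributes 0 to MRR. The function names are illustrative.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 5) -> float:
    """1.0 if any relevant chunk id appears in the top k results, else 0.0."""
    return 1.0 if any(cid in relevant for cid in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str], k: int = 5) -> float:
    """1/rank of the first relevant chunk within the top k; 0.0 if none (assumption)."""
    for rank, cid in enumerate(retrieved[:k], start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Sequence[tuple[Sequence[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    n = len(runs)
    if n == 0:
        return {"recall": 0.0, "mrr": 0.0}
    return {
        "recall": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel, k) for r, rel in runs) / n,
    }
```

Because the two strategies produce different chunk ids for the same passage, the relevant set has to be built per collection; the metric code itself stays the same.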

Labelling "correct" was manual for the first 100 queries and semi-automated for the rest -- I generated candidate relevance labels with GPT-4o and reviewed cases where the two chunking strategies disagreed. That disagreement set is where the interesting data is.

The numbers

| Metric | Section-level | Paragraph (512t / 64t overlap) |
|--------|--------------|-------------------------------|
| Recall@5 (all queries) | 0.81 | 0.73 |
| MRR (all queries) | 0.67 | 0.61 |
| Recall@5 -- factual/short | 0.71 | 0.79 |
| MRR -- factual/short | 0.63 | 0.72 |
| Recall@5 -- conceptual/medium | 0.87 | 0.74 |
| MRR -- conceptual/medium | 0.73 | 0.62 |
| Recall@5 -- procedural/long | 0.85 | 0.68 |
| MRR -- procedural/long | 0.69 | 0.57 |

Aggregate headline: section-level wins. If you stop there, you conclude section-level chunking is better and move on.

But look at the factual/short row.

Paragraph chunking beats section-level on factual short-answer queries by about 8 points on Recall@5 and 9 points on MRR. That surprised me. I expected section-level to be at worst competitive on these queries, since a well-structured document should have the answer localized in a specific section. It turns out the problem is the opposite: sections are too large for point-lookup queries. A section explaining a configuration parameter in detail will score lower against a query like "what is the default TTL for session tokens?" than a 512-token paragraph that leads with the exact sentence answering that question. The larger section buries the signal.

Why this happens

text-embedding-3-small encodes meaning across the full token sequence. A 200-token paragraph with a tight semantic focus produces an embedding that sits close to that topic. A 1,200-token section covering the same topic plus surrounding context, setup paragraphs, and related caveats produces an embedding that has to represent all of it: semantically richer, but less precisely aligned to a short, specific query.

This is a consequence of how dense vector similarity works rather than a weakness of section-level chunking specifically. A section is semantically correct -- it contains the answer -- but the embedding distances are less discriminative for queries that map precisely to a small surface area within that section. The paragraph chunk, which is that small surface area, wins on cosine similarity.
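
A toy illustration of that dilution effect, under a loud assumption: it models a section embedding as the plain average of its paragraph vectors. Real embedding models do not work this way, and the 3-dimensional vectors below are invented for the example. The sketch only shows the geometry of why a focused chunk can outscore a broader one on cosine similarity.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean: the toy stand-in for 'embedding of a whole section'."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical vectors: axis 0 = "session token TTL", axes 1 and 2 = other topics
query = [1.0, 0.0, 0.0]
answer_para = [0.9, 0.1, 0.0]    # paragraph that states the TTL directly
setup_para = [0.1, 0.9, 0.0]     # surrounding setup context
caveats_para = [0.0, 0.2, 0.9]   # related caveats

section = mean_vector([answer_para, setup_para, caveats_para])

para_score = cosine(query, answer_para)
section_score = cosine(query, section)
# The focused paragraph scores higher than the section that contains it
```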

For longer, contextual queries -- "explain how the auth service handles refresh token rotation" -- the section-level approach wins because the conceptual answer requires the full context that a section preserves. Paragraph chunking splits that context across multiple chunks, none of which has the full picture, and the retriever returns fragments that each require the others to make sense.

The practical framing: paragraph chunking is better at finding needles. Section-level chunking is better at finding the right haystack.

The code I used for section extraction

The section extractor is similar to work I've done before on structured document corpora, but this version is tuned for the mixed-format nature of this corpus: mostly Markdown, plus documents converted from Word and HTML with inconsistent heading hierarchies.

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    heading: str
    content: str
    level: int
    char_start: int
    char_end: int
    parent_chunk_id: Optional[str] = None
    token_estimate: int = 0

def estimate_tokens(text: str) -> int:
    # rough estimate: 1 token ~ 4 chars for English technical prose
    return len(text) // 4

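# Matches ATX headings (#, ##, ###). Caveat: it also matches '#' comment lines
# inside fenced code blocks, so code-heavy docs may need fences masked first.
HEADING_RE = re.compile(r'^(#{1,3})\s+(.+)$', re.MULTILINE)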

def extract_section_chunks(
    doc_id: str,
    text: str,
    max_tokens: int = 1500,
    fallback_overlap: int = 64,
) -> list[Chunk]:
    """
    Split text at H1/H2/H3 boundaries.
    Sections exceeding max_tokens are sub-split by paragraph with overlap.
    Returns flat list of Chunk objects with parent linkage preserved.
    """
    chunks: list[Chunk] = []
    matches = list(HEADING_RE.finditer(text))

    if not matches:
        # No headings -- paragraph-split the whole doc; the id is namespaced by
        # doc_id so chunk ids stay unique across documents
        return _paragraph_fallback(doc_id, f"{doc_id}::root", text, max_tokens, fallback_overlap)

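    parent_by_level: dict[int, str] = {}
    seq = 0

    # Keep any text that precedes the first heading instead of silently dropping it
    preamble = text[:matches[0].start()].strip()
    if preamble:
        chunks.extend(_paragraph_fallback(
            doc_id, f"{doc_id}::preamble", preamble, max_tokens, fallback_overlap,
        ))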

    for i, match in enumerate(matches):
        level = len(match.group(1))
        heading = match.group(2).strip()
        body_start = match.end()
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = text[body_start:body_end].strip()

        chunk_id = f"{doc_id}::sec::{seq}"
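        # the nearest shallower heading is the parent (tolerates skipped levels, e.g. H1 -> H3)
        parent_id = next(
            (parent_by_level[lvl] for lvl in range(level - 1, 0, -1) if lvl in parent_by_level),
            None,
        )
        parent_by_level[level] = chunk_id
        # drop stale entries for deeper levels from the parent map
        for lvl in [k for k in parent_by_level if k > level]:
            del parent_by_level[lvl]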

        token_est = estimate_tokens(content)
        if token_est > max_tokens:
            # oversized section: sub-split by paragraph, carrying the real heading text
            sub_chunks = _paragraph_fallback(
                doc_id, chunk_id, content, max_tokens, fallback_overlap,
                base_seq=seq, parent_id=parent_id, level=level, heading=heading,
            )
            chunks.extend(sub_chunks)
            seq += len(sub_chunks)
        else:
            chunks.append(Chunk(
                doc_id=doc_id,
                chunk_id=chunk_id,
                heading=heading,
                content=content,
                level=level,
                char_start=body_start,
                char_end=body_end,
                parent_chunk_id=parent_id,
                token_estimate=token_est,
            ))
            seq += 1

    return chunks


def _paragraph_fallback(
    doc_id: str,
    section_id: str,
    text: str,
    max_tokens: int,
    overlap_tokens: int,
    base_seq: int = 0,
    parent_id: Optional[str] = None,
    level: int = 0,
    heading: Optional[str] = None,
) -> list[Chunk]:
    """
    Paragraph-packing fallback for docs without headings or sections that are too long.
    Paragraphs are packed greedily up to max_tokens; a single paragraph longer than
    max_tokens is kept whole, so a chunk can exceed the limit.
    """
    paragraphs = [p.strip() for p in re.split(r'\n{2,}', text) if p.strip()]
    chunks: list[Chunk] = []
    current_tokens = 0
    current_parts: list[str] = []
    seq = base_seq

    def emit() -> None:
        # Build one chunk from the paragraphs accumulated so far
        content = '\n\n'.join(current_parts)
        chunks.append(Chunk(
            doc_id=doc_id,
            chunk_id=f"{section_id}::para::{seq}",
            # fall back to the section id when no heading text is available
            heading=heading if heading is not None else section_id,
            content=content,
            level=level,
            # offsets are relative to this chunk's own content, not the source document
            char_start=0,
            char_end=len(content),
            parent_chunk_id=parent_id,
            token_estimate=current_tokens,
        ))

    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        if current_tokens + para_tokens > max_tokens and current_parts:
            emit()
            seq += 1
            # overlap: overlap_tokens only acts as an on/off switch here; when it is
            # positive, the last paragraph is repeated at the start of the next chunk
            current_parts = current_parts[-1:] if overlap_tokens > 0 else []
            current_tokens = estimate_tokens(current_parts[0]) if current_parts else 0

        current_parts.append(para)
        current_tokens += para_tokens

    if current_parts:
        emit()

    return chunks

A few notes on the implementation choices. The max_tokens=1500 limit is not a universal recommendation -- it is what worked for this corpus where sections are typically 300--800 tokens and the outliers cluster around documentation overview sections that run 2,000+ tokens with no sub-headings. The paragraph fallback is intentional, not a hack: for those flat-prose sections, there is no better structural boundary to split on.

The token estimator is deliberately crude (1 token ~= 4 characters). For production work I would use tiktoken with the actual encoding for the embedding model, but for the chunker itself the estimate is close enough that the error is within one or two paragraphs of the max boundary. It does not affect embedding quality -- the actual embeddings are produced by the text-embedding-3-small API call, which does its own tokenization.
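
A sketch of what swapping in tiktoken could look like. It assumes the cl100k_base encoding, which is the one tiktoken maps text-embedding-3-small to. The helper name make_token_counter is illustrative. If tiktoken is not installed, or its vocabulary file cannot be loaded, the sketch falls back to the same 4-characters-per-token estimate used above.

```python
from typing import Callable


def make_token_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Return an exact token counter if tiktoken is usable, else the rough heuristic."""
    try:
        import tiktoken  # third-party: pip install tiktoken
        encoding = tiktoken.get_encoding(encoding_name)
    except Exception:
        # tiktoken missing, or its vocabulary file unavailable (e.g. offline):
        # fall back to the 1 token ~ 4 chars estimate
        return lambda text: len(text) // 4
    # disallowed_special=() treats strings like "<|endoftext|>" as ordinary text
    # instead of raising, which matters for docs that mention special tokens
    return lambda text: len(encoding.encode(text, disallowed_special=()))


count_tokens = make_token_counter()
```

count_tokens has the same signature as estimate_tokens, so it can replace it in the chunker without other changes.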

Qdrant setup

I ran two collections in the same Qdrant instance -- one per chunking strategy -- to keep the comparison clean. Both collections used the same vector size (1536 dimensions for text-embedding-3-small) and the same HNSW config (m=16, ef_construct=100). The only difference was the source chunks.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")

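for collection_name in ["chunks_section", "chunks_paragraph"]:
    # recreate_collection is deprecated in recent qdrant-client releases;
    # drop and create explicitly so each run starts from an empty collection
    if client.collection_exists(collection_name):
        client.delete_collection(collection_name)
    client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=1536,
            distance=Distance.COSINE,
        ),
        hnsw_config=HnswConfigDiff(
            m=16,
            ef_construct=100,
        ),
    )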

Ingestion used batched upserts of 256 chunks per request. Total indexing time for the section-level collection (approximately 94,000 chunks across 15,000 documents) was 38 minutes on a 4 vCPU VM. The paragraph collection produced 213,000 chunks and took 71 minutes. That size difference has downstream implications for memory and query latency that I'll come back to.
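
The batching itself is simple; a minimal sketch, not the ingestion script used here, looks like this. The upsert_batch callable is a placeholder for whatever sends one batch to the vector store, so the helper stays independent of any client library.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int = 256) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items; the last one may be shorter."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def ingest(points: Iterable[T], upsert_batch: Callable[[list[T]], object], size: int = 256) -> int:
    """Send points in batches through upsert_batch and return how many were sent."""
    sent = 0
    for batch in batched(points, size):
        upsert_batch(batch)
        sent += len(batch)
    return sent
```

With qdrant-client, upsert_batch could be something like `lambda batch: client.upsert(collection_name="chunks_section", points=batch)`, where each item is a PointStruct carrying the chunk's vector and payload; that wiring is an assumption and is not shown in the original setup.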

Query latency at hnsw_ef=50 (the search-time parameter, not the construction-time ef_construct):

Section-level collection: p50 = 4.2ms, p99 = 11.8ms
Paragraph collection: p50 = 6.9ms, p99 = 19.3ms
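
For anyone reproducing numbers like these, here is a small timing harness sketch. It assumes nearest-rank percentiles and wall-clock timing around a single callable; the harness actually used for the figures above may have differed. run_query is a placeholder for the function that issues one search.

```python
import math
import time
from typing import Callable, Iterable


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def measure_latency_ms(run_query: Callable[[str], object], queries: Iterable[str]) -> dict[str, float]:
    """Time each query with a monotonic clock and report p50 and p99 in milliseconds."""
    timings: list[float] = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {"p50": percentile(timings, 50), "p99": percentile(timings, 99)}
```

With qdrant-client, run_query would typically embed the query and call the client's search with `SearchParams(hnsw_ef=50)`. Timing the embedding call separately from the vector search keeps the comparison between collections clean.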

The paragraph collection is slower at query time because the index is larger and HNSW traversal covers more nodes. At the scale of this benchmark -- a single developer VM, not a production cluster -- the difference is not meaningful. At production query volumes with tight latency budgets, it becomes relevant.

The decision rule I now apply

I went into this benchmark expecting to confirm section-level chunking as the clear winner and come away with a unified recommendation. That is not what the data supports.

The decision rule I now use:

Use section-level chunking when: the majority of your queries are conceptual, procedural, or require synthesizing information across a document section. Technical documentation, architecture guides, runbooks, onboarding wikis. Long-form content where the answers live in paragraphs that build on each other.

Use paragraph chunking (or a hybrid) when: the majority of your queries are factual point-lookups -- API parameter values, configuration defaults, error codes, single-field lookups. Structured reference content where the answer is a sentence or a short list. In these cases, large sections dilute the embedding, and that dilution actively hurts you.

Use both with a router when: your query distribution is genuinely mixed. I have not deployed this in production yet, but the approach I am working toward for a client with a mixed documentation corpus is a lightweight query classifier that predicts query type and routes to the appropriate collection. The classifier does not need to be a large model -- a fine-tuned small classifier or even a regex + embedding distance heuristic against a labeled query set would likely cover 80% of cases.
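
To make the routing idea concrete, here is a deliberately naive regex-only sketch. It is not deployed anywhere. The patterns are hypothetical and would need tuning against a labelled query set. The collection names match the Qdrant setup above, and unmatched queries default to the section collection because it won on aggregate.

```python
import re

SECTION_COLLECTION = "chunks_section"
PARAGRAPH_COLLECTION = "chunks_paragraph"

# Hypothetical cue words: conceptual/procedural phrasing vs point-lookup phrasing
CONCEPTUAL_RE = re.compile(
    r"\b(how (does|do|to|can)|explain|why|steps|walk me through|overview|architecture)\b",
    re.IGNORECASE,
)
FACTUAL_RE = re.compile(
    r"\b(default|timeout|ttl|port|limit|version|error code|max(imum)?|min(imum)?)\b",
    re.IGNORECASE,
)


def route_query(query: str) -> str:
    """Pick a collection name for a query using cue-word heuristics."""
    # Conceptual cues win ties: "how does the timeout work" needs context
    if CONCEPTUAL_RE.search(query):
        return SECTION_COLLECTION
    if FACTUAL_RE.search(query):
        return PARAGRAPH_COLLECTION
    # No signal either way: fall back to the aggregate winner
    return SECTION_COLLECTION
```

A heuristic like this is only a starting point; the embedding-distance half of the idea, comparing a query against labelled examples of each type, is not shown here.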

The operational overhead of two collections is not negligible. Two indexing pipelines, two query paths, double the storage. Whether that is worth it depends on your query distribution. If 90% of your queries are conceptual, section-level alone is fine and the overhead of the hybrid approach is not justified. If you have a genuine 50/50 split between factual lookups and conceptual questions, the routing approach is probably worth it.

I might be wrong about the router approach -- I have not run it in production and the evaluation set it would be trained on is the same 320-query set I used here, which is not large enough to be confident about generalization. I will update this post when I have production data.

What this corpus is not

15,000 documents is not a universal sample. This is technical documentation in English, mostly Markdown, mostly well-structured. The results would look different for:

Legal or compliance documents: dense prose, minimal headings, very long sections. Section-level chunking would produce fewer, larger chunks and the penalty on factual retrieval would be even higher.
Conversational or support content: Q&A format, short exchanges, no heading structure. Section-level chunking degrades to full-document chunking and provides no benefit. Paragraph chunking is the right default here.
Code-heavy documentation: function references, SDK docs, API specs. Chunking at code block + surrounding prose boundaries is a third strategy I did not evaluate here. Both section and paragraph chunking do worse on code-heavy content than they do on prose-heavy content.

Hamel Husain and the LlamaIndex team have published evaluation frameworks that cover some of these corpus types more systematically -- particularly around code documentation retrieval. If your corpus is primarily code-adjacent, their work is worth reading before arriving at a chunking decision.

The finding I did not expect to care about

The indexing size difference -- 94k section chunks vs 213k paragraph chunks -- ends up mattering more than I thought it would, not for query latency (manageable) but for embedding cost.

At text-embedding-3-small pricing of $0.02 per 1M tokens: the section collection cost approximately $1.40 to embed. The paragraph collection cost $3.10. That is not a significant cost at one-time indexing. It becomes relevant if you are re-embedding frequently -- for a corpus that updates daily or if you are iterating on chunking strategies during development. I went through about six indexing iterations during this benchmark. The paragraph approach accumulated roughly $19 in embedding costs across those iterations vs $8 for section-level. Neither is expensive. The ratio is the point: paragraph chunking produces 2.3x the chunks, and since embedding is billed per token rather than per chunk, what matters is total token volume. Here that worked out to roughly 2.2x the embedding cost per full re-index ($3.10 vs $1.40).
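
The arithmetic behind those figures, as a small sketch. The per-index costs and the price are the ones quoted above. The 70M-token figure in the last line is backed out from the $1.40 cost, not a measured number.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small price quoted above


def embedding_cost(total_tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost in USD of embedding total_tokens tokens; billing is per token, not per chunk."""
    return total_tokens / 1_000_000 * price_per_million


section_cost = 1.40     # USD per full index, from the text
paragraph_cost = 3.10   # USD per full index, from the text
iterations = 6          # indexing iterations during the benchmark

section_total = section_cost * iterations       # about 8.4 USD
paragraph_total = paragraph_cost * iterations   # about 18.6 USD
cost_ratio = paragraph_cost / section_cost      # about 2.2

# Implied token volume for the section collection (derived, not measured)
implied_section_cost = embedding_cost(70_000_000)
```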

For production systems where the corpus changes continuously and re-indexing happens on a schedule, this cost multiplier is worth factoring in alongside the retrieval quality difference.

The retrieval evaluation patterns here informed the pipeline evaluation layer I'm building into Pipeshift -- I'm the founder, so take that mention with appropriate skepticism. The 320-query evaluation set and scoring scripts are something I intend to publish; I'll link from here when that happens.