How I Actually Version Prompts in Production (Not the Ideal System)

There is a version of this post where I describe a clean prompt management architecture: a dedicated prompt store, semantic versioning, A/B deployment, automatic rollback on degradation, beautiful dashboards. That system exists. It lives inside Pipeshift -- I am the founder, so I should be clear about that -- and it is genuinely useful.

This post is about what I use outside Pipeshift, when I am building a client's LLM pipeline or a standalone agentic system and I need prompt versioning that works in the time I have. The honest version, with the things I tried that didn't hold up.

The failure that made me stop improvising

Six months ago I was on a consulting engagement building a document processing pipeline. The core task was extracting structured fields from unstructured legal contracts -- contract parties, effective dates, termination clauses, liability caps. The prompts went through maybe fifteen iterations over three weeks of tuning.

At some point in week four, output quality on the client's reviewer sample dropped noticeably. Accuracy on termination clause extraction fell from around 88% to 71% on the same document set. Nobody had touched the application logic. What had changed was the prompt.

The problem: I could not reconstruct exactly which version of the prompt was running when quality was good. The prompt text lived in a Python constant inside the module. The git history had commits like "refine extraction prompt" and "tweak prompt for edge cases" -- useful when I wrote them, useless for diffing actual quality impact. I had no mapping from prompt version to eval score. I was staring at a regression with no clean way to bisect it.

I rolled back to the previous commit, ran the evaluation set, confirmed the score recovered, and spent the next two days building the thing I should have built in week one.

What I actually use: git plus a thin registry

The architecture is not clever. I want to be clear about that before I describe it.

Prompt templates live in git, in their own directory. Not embedded in application code. Not in a database. In a /prompts directory in the repo, versioned exactly the same way as everything else.

prompts/
  contract_extraction/
    extract_parties_v1.txt
    extract_parties_v2.txt
    extract_termination_v1.txt
    extract_termination_v2.txt
  summarization/
    executive_summary_v1.txt
    executive_summary_v2.txt

Each prompt file is pure text -- the template with {{variable}} slots. No logic, no conditionals, no Jinja templating. I tried more expressive templating early on and it created a second category of bugs where the prompt you intended to send was not the prompt that got rendered because a filter or inheritance rule was wrong. Flat text with simple string substitution is easier to read in a diff, easier to test, and has never surprised me.
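
As a sketch of what "simple string substitution" can mean for {{variable}} slots: the function below is an illustration under my own assumptions, not the registry's actual renderer. The name render and the strict handling of missing or unused variables are hypothetical choices.

```python
import re

# Matches {{name}} slots; whitespace inside the braces is tolerated.
_SLOT = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render(template_text: str, **variables: str) -> str:
    """Substitute {{name}} slots, failing loudly on missing or unused variables."""
    slots = set(_SLOT.findall(template_text))
    passed = set(variables)

    missing = slots - passed
    if missing:
        raise KeyError(f"template needs variables that were not passed: {sorted(missing)}")

    unused = passed - slots
    if unused:
        raise ValueError(f"variables passed but not used by the template: {sorted(unused)}")

    # One regex pass over the template, so slot-like text inside a substituted
    # value is never expanded a second time.
    return _SLOT.sub(lambda m: str(variables[m.group(1)]), template_text)
```

Because there is no logic in the template, the rendered prompt is the file text with the slots filled in and nothing else, which keeps diffs readable.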

A registry maps prompt_id + version to eval score. This is a SQLite database (or Postgres if the project already has one) with a schema that looks like this:

CREATE TABLE prompt_registry (
    prompt_id       TEXT NOT NULL,
    version         TEXT NOT NULL,
    git_sha         TEXT NOT NULL,
    file_path       TEXT NOT NULL,
    eval_score      REAL,
    eval_dataset    TEXT,
    eval_run_at     TIMESTAMP,
    deployed_at     TIMESTAMP,
    deprecated_at   TIMESTAMP,
    notes           TEXT,
    PRIMARY KEY (prompt_id, version)
);

The git_sha column is the piece that connects the registry to git history. Every time a prompt is evaluated, the registry record is updated with the eval score and the git SHA of the commit that contains that prompt file. If I need to find the exact text of extract_parties at its peak performance, I query the registry for the highest eval score, get the git SHA, and git show the file at that commit. The registry does not store prompt text -- git does.
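
A minimal sketch of that round trip, assuming the schema above. The function names record_eval, best_version, and prompt_text_at are hypothetical, and the upsert syntax assumes SQLite 3.24 or newer. Note that `git show <sha>:<path>` expects a path relative to the repository root, so if file_path is stored relative to the prompts directory it has to be prefixed first.

```python
import sqlite3
import subprocess
from datetime import datetime, timezone


def record_eval(conn, prompt_id, version, git_sha, file_path, score, dataset):
    """Insert or update the registry row for one evaluated prompt version."""
    conn.execute(
        """
        INSERT INTO prompt_registry
            (prompt_id, version, git_sha, file_path,
             eval_score, eval_dataset, eval_run_at)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT (prompt_id, version) DO UPDATE SET
            git_sha = excluded.git_sha,
            file_path = excluded.file_path,
            eval_score = excluded.eval_score,
            eval_dataset = excluded.eval_dataset,
            eval_run_at = excluded.eval_run_at
        """,
        (prompt_id, version, git_sha, file_path, score, dataset,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def best_version(conn, prompt_id):
    """Return (version, git_sha, file_path, eval_score) for the top score, or None."""
    return conn.execute(
        """
        SELECT version, git_sha, file_path, eval_score FROM prompt_registry
        WHERE prompt_id = ? AND eval_score IS NOT NULL
        ORDER BY eval_score DESC, eval_run_at DESC
        LIMIT 1
        """,
        (prompt_id,),
    ).fetchone()


def prompt_text_at(git_sha, repo_relative_path):
    """Read a file as it existed at a commit, via `git show <sha>:<path>`."""
    result = subprocess.run(
        ["git", "show", f"{git_sha}:{repo_relative_path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Usage would look roughly like: take the row from best_version(conn, "contract_extraction.extract_parties"), then call prompt_text_at(row[1], "prompts/" + row[2]) from inside the repo.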

The Python interface to the registry is about 80 lines. Loading a prompt for production use looks like this:

from prompt_registry import PromptRegistry

registry = PromptRegistry(db_path="prompts.db")

# loads the highest-scored evaluated version of this prompt
prompt_template = registry.load(
    prompt_id="contract_extraction.extract_parties",
    version="latest_evaluated",
)

rendered = prompt_template.render(
    contract_text=raw_text,
    jurisdiction="US",
)

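latest_evaluated is a sentinel that resolves to the highest-scoring non-deprecated version with a non-null eval_score, with ties broken by the most recent eval run. Despite the name, it does not simply pick the newest version. A version with no eval score cannot be loaded via this sentinel -- that prevents me from accidentally deploying a prompt I edited but never evaluated.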

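# Method of PromptRegistry. Needs `from pathlib import Path` at module level,
# and self._db opened with row_factory = sqlite3.Row so rows can be indexed by name.
def load(self, prompt_id: str, version: str) -> PromptTemplate: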
    if version == "latest_evaluated":
        row = self._db.execute(
            """
            SELECT file_path, version FROM prompt_registry
            WHERE prompt_id = ?
              AND eval_score IS NOT NULL
              AND deprecated_at IS NULL
            ORDER BY eval_score DESC, eval_run_at DESC
            LIMIT 1
            """,
            (prompt_id,),
        ).fetchone()
        if row is None:
            raise ValueError(
                f"No evaluated version found for prompt_id={prompt_id!r}. "
                "Run the eval suite before deploying."
            )
        file_path, version = row["file_path"], row["version"]
    else:
        row = self._db.execute(
            "SELECT file_path FROM prompt_registry WHERE prompt_id = ? AND version = ?",
            (prompt_id, version),
        ).fetchone()
        if row is None:
            raise ValueError(f"prompt_id={prompt_id!r} version={version!r} not found in registry")
        file_path = row["file_path"]

    full_path = Path(self._prompts_dir) / file_path
    template_text = full_path.read_text(encoding="utf-8")
    return PromptTemplate(prompt_id=prompt_id, version=version, text=template_text)

The CI gate

This is the part that makes the rest actually work in a team context or on a long engagement.

A CI step runs on any PR that touches the /prompts directory. The gate evaluates the changed prompt against a golden dataset and blocks merge if the score degrades by more than 3% relative to the current deployed version.

The golden dataset is the other piece of work that is easy to skip and consequential to skip. For the contract extraction system, it is 47 hand-labeled documents covering the distribution of contracts the client actually processes -- standard MSAs, SOWs, SaaS subscription agreements, a few non-standard agreements with atypical termination clauses that tripped up early prompt versions. The labels are ground truth for each extraction field. This set lives in the repo under /eval/golden/.

The CI script, simplified:

import sqlite3
import sys
from pathlib import Path
from eval_runner import run_eval  # project-specific eval logic

DEGRADATION_THRESHOLD = 0.03  # 3% relative

def get_deployed_score(prompt_id: str, db_path: str) -> float | None:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    row = conn.execute(
        """
        SELECT eval_score FROM prompt_registry
        WHERE prompt_id = ?
          AND deployed_at IS NOT NULL
          AND deprecated_at IS NULL
        ORDER BY deployed_at DESC
        LIMIT 1
        """,
        (prompt_id,),
    ).fetchone()
    conn.close()
    return row["eval_score"] if row else None

def gate(prompt_id: str, changed_file: str, db_path: str, golden_dir: str) -> bool:
    new_score = run_eval(
        prompt_file=changed_file,
        golden_dir=golden_dir,
        prompt_id=prompt_id,
    )

    baseline = get_deployed_score(prompt_id, db_path)

    if baseline is None:
        print(f"No deployed baseline for {prompt_id} -- treating as new prompt, passing gate")
        return True

    threshold = baseline * (1 - DEGRADATION_THRESHOLD)
    passed = new_score >= threshold

    print(
        f"prompt_id={prompt_id} baseline={baseline:.4f} new={new_score:.4f} "
        f"threshold={threshold:.4f} passed={passed}"
    )
    return passed

if __name__ == "__main__":
    prompt_id = sys.argv[1]
    changed_file = sys.argv[2]
    db_path = sys.argv[3]
    golden_dir = sys.argv[4]

    if not gate(prompt_id, changed_file, db_path, golden_dir):
        sys.exit(1)
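
The script takes prompt_id and changed_file as separate arguments. One way the CI job could derive the first from the second, assuming the directory layout and the dotted prompt_id shown earlier; parse_prompt_path is a hypothetical helper, not part of the script above:

```python
import re
from pathlib import PurePosixPath

# File stems look like "extract_parties_v2": a name, an underscore, then vN.
_VERSIONED = re.compile(r"^(?P<name>.+)_(?P<version>v\d+)$")


def parse_prompt_path(changed_file: str, prompts_dir: str = "prompts") -> tuple[str, str]:
    """Map prompts/<group>/<name>_vN.txt to ("<group>.<name>", "vN")."""
    # relative_to raises ValueError if the file is outside prompts_dir.
    rel = PurePosixPath(changed_file).relative_to(prompts_dir)
    match = _VERSIONED.match(rel.stem)
    if match is None:
        raise ValueError(f"{changed_file!r} does not end in _v<number>")
    # Directory components plus the base name, joined with dots.
    parts = list(rel.parent.parts) + [match["name"]]
    return ".".join(parts), match["version"]
```

For prompts/contract_extraction/extract_parties_v2.txt this yields the prompt_id contract_extraction.extract_parties and the version v2.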

The run_eval function is project-specific. For extraction tasks it returns something like an F1 score against the labeled fields. For generation tasks where the output is not directly comparable to a label, I use an LLM-as-judge setup with a separate evaluation prompt -- which creates a somewhat circular dependency, but in practice the judge prompt is stable and changes much less often than the task prompts.
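
For the extraction case, a hedged sketch of the kind of scoring a run_eval could do: micro-averaged F1 over fields with exact match after light normalization. The function name field_f1, the dict-per-document shape, and the normalization rule are all assumptions for illustration; a real eval for legal fields would likely need fuzzier matching.

```python
def _norm(value) -> str:
    """Lowercase and collapse whitespace; missing values become an empty string."""
    if value is None:
        return ""
    return " ".join(str(value).lower().split())


def field_f1(predictions: list[dict], labels: list[dict]) -> float:
    """Micro-averaged F1 across all fields of all documents."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must cover the same documents")

    tp = fp = fn = 0
    for pred, gold in zip(predictions, labels):
        for field in set(pred) | set(gold):
            p, g = _norm(pred.get(field)), _norm(gold.get(field))
            if p and g:
                if p == g:
                    tp += 1      # extracted and correct
                else:
                    fp += 1      # extracted the wrong value...
                    fn += 1      # ...and therefore missed the right one
            elif p:
                fp += 1          # extracted a field the label does not have
            elif g:
                fn += 1          # missed a labeled field

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```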

The 3% threshold is not principled. I picked it because 2% felt too sensitive for the noise in my eval datasets and 5% felt too permissive to catch real regressions. If your eval dataset is larger and lower-variance you can probably tighten it. If it is small and noisy, 3% may still fire on noise. I might be wrong about this being the right default -- treat it as a starting point.
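
To make the arithmetic concrete, here is the gate's comparison worked through with the numbers from the regression story as illustrative inputs. The per-document granularity check at the end is a hypothetical framing: it only applies if the score behaves like a per-document accuracy, which an F1 over fields does not exactly do.

```python
DEGRADATION_THRESHOLD = 0.03  # same 3% relative tolerance as the gate


def min_passing_score(baseline: float, tolerance: float = DEGRADATION_THRESHOLD) -> float:
    """Lowest new score the gate accepts for a given baseline."""
    return baseline * (1 - tolerance)


baseline = 0.88
floor = min_passing_score(baseline)  # 0.88 * 0.97 = 0.8536

for candidate in (0.87, 0.85, 0.71):
    verdict = "pass" if candidate >= floor else "block"
    print(f"baseline={baseline:.4f} floor={floor:.4f} new={candidate:.4f} -> {verdict}")

# Hypothetical granularity check: with 47 documents, one document flipping
# moves a per-document accuracy by 1/47.
one_document = 1 / 47          # about 0.0213
allowance = baseline - floor   # about 0.0264
print(f"allowance is about {allowance / one_document:.2f} documents")
```

Under that assumption, the 3% allowance at a 0.88 baseline is a little more than one document out of 47, which is one way to see why a small golden set can trip the gate on noise.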

What I tried that did not work

A separate prompt database, not git. The first iteration used a Postgres table to store prompt text directly, with a version integer and is_active flag. The idea was a proper admin interface to promote versions. The problem: prompts fell out of sync with the application code that called them. A code change that changed the variables a prompt expected -- renaming {{contract_text}} to {{document_text}} -- was invisible to the prompt database. The rendered call broke at runtime with a KeyError. With git, that kind of change shows up in the same diff as the code change. The coupling is a feature.
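
With templates and calling code in the same repo, that mismatch can also be caught before runtime. A sketch of a static check a unit test could run, assuming the {{variable}} slot syntax; template_variables and check_template are hypothetical names, and the set of variables the caller passes is supplied by hand here:

```python
import re

# Matches {{name}} slots; whitespace inside the braces is tolerated.
_SLOT = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def template_variables(template_text: str) -> set[str]:
    """Names of all {{slot}} placeholders that appear in the template."""
    return set(_SLOT.findall(template_text))


def check_template(template_text: str, passed_by_caller: set[str]) -> list[str]:
    """Return a list of mismatches; an empty list means template and caller agree."""
    slots = template_variables(template_text)
    problems = []
    for name in sorted(slots - passed_by_caller):
        problems.append(f"template slot {name!r} is not passed by the caller")
    for name in sorted(passed_by_caller - slots):
        problems.append(f"caller passes {name!r} but the template has no such slot")
    return problems
```

Renaming a slot in the template without updating the caller then fails in CI, in the same PR as the rename.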

Overly clever templating with Jinja2. I spent about a week building a system where prompts could inherit from base templates, include conditional blocks, and reference shared instructions via {% include %}. The intent was to reduce duplication across prompt variants. The result was that reading any individual prompt required mentally rendering the inheritance chain to understand what would actually be sent to the model. Prompt text is not code. The duplication that felt wrong was actually signal -- prompts that look similar but have subtly different instructions for different tasks should be different files, not a shared template whose conditional blocks swap out the one part that differs.

Version numbers as dates. For about two months I used YYYYMMDD as the version string (e.g., extract_parties_20260415). This seemed sensible until I needed two iterations on the same day, which required 20260415b, then 20260415c. And comparing 20260415 to 20260501 tells you the date but not which one performed better. Now I use v1, v2, v3 -- boring, works, sortable, and the registry maps each to an eval score so I do not need the date baked into the name.
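
One caveat on "sortable": compared as plain strings, v10 lands before v2. A small sketch of a numeric sort key, assuming versions always look like v followed by digits; version_key is a hypothetical helper:

```python
import re


def version_key(version: str) -> int:
    """Turn "v12" into 12 so versions sort numerically, not alphabetically."""
    match = re.fullmatch(r"v(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected version string: {version!r}")
    return int(match.group(1))


versions = ["v10", "v2", "v1"]
print(sorted(versions))                   # ['v1', 'v10', 'v2']
print(sorted(versions, key=version_key))  # ['v1', 'v2', 'v10']
```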

What this does not solve

Prompt versioning at this level handles the regression gate and the audit trail. It does not handle:

A/B deployment. Routing a fraction of production traffic to a new prompt version while the old one handles the rest. This requires infrastructure I do not build into every project. When a client needs it I build it, but it is not in the baseline system.

Model-specific versioning. If you change the underlying model (GPT-4o to GPT-4o mini, or Claude 3.5 Sonnet to Claude 3.7 Sonnet), a prompt version that scored well on the old model may score differently on the new one. The registry schema has room to add a model_id column, but I have not needed it on current projects. This is a gap worth being aware of.

Multi-turn prompts. System prompt, user turns, injected context. The flat-text-file-per-prompt model starts getting awkward when a "prompt" is really a conversation template with multiple roles. I handle those as structured YAML files with a turns array, loaded separately. The registry can track them -- they just need a different template renderer.
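
A sketch of what a separate renderer for a turns array might look like. The file shape shown in the comment is an assumed one, not a documented format, and the function works on the already-parsed structure so it does not depend on any particular YAML library; render_turns is a hypothetical name.

```python
import re

# Assumed shape of the parsed YAML file (hypothetical):
#   turns:
#     - role: system
#       content: "You extract fields from {{jurisdiction}} contracts."
#     - role: user
#       content: "{{contract_text}}"

_SLOT = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_turns(template: dict, **variables: str) -> list[dict]:
    """Fill {{slots}} in every turn and return a list of role/content messages."""
    messages = []
    for turn in template["turns"]:
        # variables[...] raises KeyError if a slot has no matching variable.
        content = _SLOT.sub(lambda m: str(variables[m.group(1)]), turn["content"])
        messages.append({"role": turn["role"], "content": content})
    return messages
```

Each turn is still flat text with slots, so the diff-friendliness of the single-file prompts carries over; only the container changes.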

Pipeshift handles all three of these more completely -- again, I built it, so I'm biased -- but the above system is what I use when I need something running in a day on a standalone engagement. It has survived four production deployments without the regression I had in week four of the contracts project. The gate has fired three times, twice correctly blocking a change that would have degraded output quality, once on a false positive from a noisy eval dataset that I then fixed by adding more labeled examples.

The whole thing -- registry, CI gate, eval runner scaffold -- is around 350 lines of Python. That size is not accidental. A prompt versioning system that requires significant maintenance of its own tooling is not worth the complexity it introduces. The value is in the gate and the audit trail, and those do not require much code to implement.

The prompt registry pattern here is a stripped-down version of the eval gating Pipeshift runs on CI/CD pipeline templates -- different problem, same underlying principle. If you are building LLM pipelines and want to talk through the eval setup, my calendar is open.