I Fine-Tuned a Model on Synthetic Data. Here Is What the Numbers Actually Said.

The popular claim is that synthetic data closes the gap between a capable base model and a task-specific one -- no labeling budget, no data collection headaches, just generate-and-fine-tune. That claim is half right. I ran a real experiment over three weeks in April and the results split cleanly along a line that nobody in the synthetic data hype cycle talks about enough: the line between tasks with verifiable correctness and tasks without it.

Short version: synthetic data worked well for structured extraction. It failed for subjective quality. The contamination check I initially skipped almost invalidated the structured extraction result entirely. Here is the full methodology, the before/after numbers, and the generation pipeline I used.

The task setup

The use case was a document processing pipeline for contract review: a client needed a model that could extract structured fields from commercial contracts (parties, effective date, termination clauses, payment terms, governing law) and also flag "problematic clauses" for attorney review.

Two distinct subtasks lived inside that problem description. I did not realize until after the experiment how different they were:

Structured extraction -- given a contract section, pull the field values into a JSON schema. Correctness is binary: the date is right or wrong, the party name matches or it doesn't.
Clause flagging -- given a clause, classify it as problematic or not and provide a short explanation. Correctness is ambiguous: attorneys disagree on what "problematic" means across deal types and client risk tolerances.

I used claude-sonnet-4-5 (the model available to me at the time) to generate training data for both tasks. The fine-tuning target was a smaller open-weights model hosted on a client-owned GPU node -- I cannot name the specific model for confidentiality reasons, but it was a 7B-parameter instruction-tuned variant. Fine-tuning ran via Hugging Face trl 0.8.6 with QLoRA (4-bit quantization, r=16, lora_alpha=32, lora_dropout=0.05).

The generation pipeline

This is the part most posts skip. Synthetic data quality is almost entirely a function of generation pipeline design. Here is mine.

The pipeline had four stages: seed collection, generation, filtering, and deduplication.

Stage 1 -- Seed collection. I pulled 80 real contracts from a public contracts dataset (CUAD -- Contract Understanding Atticus Dataset, the 2021 version from Atticusproject.org). I did not use these for training. They were seeds: I used them to extract realistic clause patterns, language styles, and jurisdictional variants. The seed pool was stratified by contract type (NDA, SaaS, services, employment) to avoid the generated data over-indexing on one genre.
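A minimal sketch of stratified seed selection, for readers who want to reproduce the idea. The equal allocation of 20 seeds per contract type, the `(contract_id, contract_type)` input shape, and the function name are all assumptions for illustration; the post does not say how the 80 seeds were split across types.

```python
import random
from collections import defaultdict

def stratified_seed_sample(contracts, per_type, seed=0):
    """Pick `per_type` contracts from each contract type.

    `contracts` is a list of (contract_id, contract_type) pairs.
    """
    by_type = defaultdict(list)
    for contract_id, contract_type in contracts:
        by_type[contract_type].append(contract_id)

    rng = random.Random(seed)  # fixed seed so the seed pool is reproducible
    sample = {}
    for contract_type, ids in sorted(by_type.items()):
        if len(ids) < per_type:
            raise ValueError(f"only {len(ids)} contracts of type {contract_type!r}")
        sample[contract_type] = rng.sample(ids, per_type)
    return sample

# Toy pool: 30 placeholder contracts for each of the four types named above.
TYPES = ["NDA", "SaaS", "services", "employment"]
pool = [(f"{t}-{i}", t) for t in TYPES for i in range(30)]
seeds = stratified_seed_sample(pool, per_type=20)
```

Fixing the random seed matters more than it looks: if the seed pool changes between runs, the contamination check against the evaluation set has to be redone.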

Stage 2 -- Generation. For each of the 80 seed contracts, I generated 15 synthetic contract sections using Sonnet with a structured prompt:

GENERATION_PROMPT = """You are generating synthetic training data for a contract extraction model.

Given the following real contract section as a style reference, generate a NEW synthetic contract section
that covers the same legal concept but uses different parties, dates, jurisdictions, and dollar amounts.
The synthetic section must NOT reproduce verbatim phrases longer than 8 words from the reference.

Reference section:
{seed_section}

Contract type: {contract_type}
Target jurisdiction: {jurisdiction}

Output a JSON object with:
- "section_text": the synthetic contract section (150-400 words)
- "ground_truth": the extracted fields as a JSON object matching this schema: {schema}
- "confidence": your confidence that ground_truth is accurate (0.0-1.0)

Only output the JSON object, no preamble."""

The confidence field was a filter handle, not a calibrated signal: it is the model's self-report, and I used it only to drop samples the model itself doubted. Total generation: 80 seeds x 15 samples per seed per contract type x 4 contract types = 4,800 synthetic examples before filtering.

Stage 3 -- Filtering. I dropped anything with confidence < 0.85. That removed 312 samples. I also ran a schema validation check on ground_truth -- samples where the extracted fields did not conform to the schema were dropped. Another 89 samples gone. Remaining: 4,399.
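The two filters in this stage can be sketched as a single function. The `REQUIRED_FIELDS` schema below is hypothetical (the real schema is not shown in this post), and the type-check approach is a stand-in for whatever schema validator was actually used.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {
    "parties": list,
    "effective_date": str,
    "governing_law": str,
}

def keep_sample(raw_output, min_confidence=0.85):
    """Return the parsed sample if it passes both filters, else None."""
    try:
        sample = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(sample, dict):
        return None

    # Filter 1: self-reported confidence below the cutoff.
    confidence = sample.get("confidence")
    if not isinstance(confidence, (int, float)) or confidence < min_confidence:
        return None

    # Filter 2: ground_truth must contain every field with the expected type.
    truth = sample.get("ground_truth")
    if not isinstance(truth, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(truth.get(field), expected_type):
            return None
    return sample
```

Counting how many samples each filter removes, separately, is worth the extra few lines: a sudden jump in schema failures usually points at a prompt problem rather than a data problem.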

Stage 4 -- Deduplication. I used MinHash LSH (via the datasketch library, threshold 0.8) to detect near-duplicates in section_text. Removed 203 samples. Final training set: 4,196 examples, split 90/10 train/validation.
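To make the near-duplicate criterion concrete, here is a dependency-free sketch that uses exact Jaccard similarity over word shingles. This is a swapped-in technique, not the datasketch pipeline described above: MinHash approximates this same Jaccard score, and LSH avoids the pairwise comparison. The 5-word shingle size is an assumption.

```python
def shingles(text, k=5):
    """Set of k-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dedupe(texts, threshold=0.8, k=5):
    """Keep the first text of each near-duplicate group.

    Pairwise and O(n^2), so only suitable for small sets or spot checks.
    """
    kept, kept_shingles = [], []
    for text in texts:
        current = shingles(text, k)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept
```

Running the exact version on a few hundred samples is a cheap way to sanity-check what a given MinHash threshold is actually catching before trusting it on the full set.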

For clause flagging, I ran the same pipeline but added a second generation pass: after generating a section, I prompted Sonnet to generate an attorney annotation explaining why a clause was or was not problematic, along with a simulated "deal context" (buyer vs. seller perspective, industry vertical, risk tolerance).

The contamination check I almost missed

This is the part that nearly invalidated the whole experiment.

Before fine-tuning, a colleague asked whether I had checked the synthetic data for contamination against the evaluation set. I had not. The evaluation set was 200 real contracts I had set aside from the CUAD dataset -- the same dataset the seeds came from.

The contamination risk: if seed sections from the 80 CUAD contracts had structural or phrase-level overlap with the 200 held-out evaluation contracts, the model could have "seen" evaluation examples during training in a transformed but recognizable form. The benchmark numbers would look better than they were.

I ran a check using the same MinHash LSH approach, this time comparing training section_text against evaluation contract sections (raw text, not the synthetic transformation). At threshold 0.7, I found 47 training examples with high overlap with evaluation contracts. I removed them. Lowering the threshold to 0.65 surfaced another 91.

The 0.65 cutoff was a judgment call. At that level, the matching pairs were genuinely similar in structure (same clause type, same jurisdictional pattern) but not identical. I pulled those 91 matches from training anyway, erring on the conservative side. Final clean training set: 4,058 examples (4,196 - 47 - 91).
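Choosing a cutoff is easier with a threshold sweep in hand. The sketch below assumes you have already computed, for each training example, its best similarity score against any evaluation section; the `scores` list is illustrative and is not the experiment's data.

```python
def removals_by_threshold(max_similarity, thresholds):
    """For each threshold, count training examples whose best match
    against the evaluation set is at or above it (cumulative counts)."""
    return {
        t: sum(1 for score in max_similarity if score >= t)
        for t in sorted(thresholds, reverse=True)
    }

# Illustrative scores only.
scores = [0.12, 0.31, 0.66, 0.68, 0.72, 0.81, 0.40, 0.69]
counts = removals_by_threshold(scores, [0.7, 0.65])

# Extra examples caught only by the looser threshold.
newly_caught = counts[0.65] - counts[0.7]

# Bookkeeping for the counts reported in the text above.
clean_size = 4196 - 47 - 91
```

The counts are cumulative, so the number of examples a looser threshold adds is the difference between adjacent rows, which is how "another 91" should be read.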

Had I not done this check, the structured extraction F1 on the evaluation set would have read 0.91 instead of the 0.87 I measured after cleaning. Not catastrophic, but enough to mislead a client decision about whether to deploy.

Before/after: structured extraction

Baseline was the same 7B model with no fine-tuning, prompted with three in-context examples (3-shot). I measured field-level F1 on the 200-contract evaluation set. A field extraction counts as correct if the extracted value exactly matches the CUAD-annotated ground truth after normalization (lowercasing and whitespace collapse).
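For concreteness, here is one way the field-level metric could be computed. Two choices are assumptions on my part: a wrong value counts as both a false positive and a false negative (a common convention for extraction tasks), and leading/trailing whitespace is stripped in addition to the normalization described above.

```python
import re

def normalize(value):
    """Lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", value.strip().lower())

def field_f1(predictions, golds):
    """F1 for one field across documents; None means "no value"."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golds):
        pred_n = normalize(pred) if pred is not None else None
        gold_n = normalize(gold) if gold is not None else None
        if pred_n is not None and pred_n == gold_n:
            tp += 1
            continue
        if pred_n is not None:
            fp += 1  # predicted a value that is wrong or not annotated
        if gold_n is not None:
            fn += 1  # annotated value was missed or extracted wrongly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: two exact matches, one wrong value, one missed value.
preds = ["Acme  Corp", "beta llc", None, "Gamma Inc"]
golds = ["acme corp", "Beta, LLC", "Delta Ltd", "Gamma Inc"]
party_f1 = field_f1(preds, golds)
```

Exact match is strict on purpose: "beta llc" against "Beta, LLC" fails on the comma, which is the kind of near miss a looser metric would hide.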

| Field | Baseline F1 | Fine-tuned F1 | Delta |
|---|---|---|---|
| Party names | 0.71 | 0.88 | +0.17 |
| Effective date | 0.83 | 0.94 | +0.11 |
| Termination clause type | 0.62 | 0.84 | +0.22 |
| Payment terms (amount) | 0.74 | 0.89 | +0.15 |
| Governing law jurisdiction | 0.79 | 0.93 | +0.14 |
| Macro average | 0.74 | 0.90 | +0.16 |
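The macro average row is the unweighted mean of the five per-field scores, which can be checked directly. The dictionary keys below are my own labels for the table rows.

```python
from statistics import mean

# Per-field F1 values copied from the table above.
baseline = {
    "party_names": 0.71,
    "effective_date": 0.83,
    "termination_clause_type": 0.62,
    "payment_terms_amount": 0.74,
    "governing_law": 0.79,
}
fine_tuned = {
    "party_names": 0.88,
    "effective_date": 0.94,
    "termination_clause_type": 0.84,
    "payment_terms_amount": 0.89,
    "governing_law": 0.93,
}

macro_baseline = mean(baseline.values())
macro_fine_tuned = mean(fine_tuned.values())
macro_delta = macro_fine_tuned - macro_baseline
```

Macro averaging weights every field equally regardless of how often it appears, so a rare field with a large gain moves this number as much as a common one.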

The fine-tuned model also improved on JSON conformance: the baseline produced malformed JSON on 8.3% of extractions (requiring post-processing fallbacks), the fine-tuned model on 1.1%. That matters for a downstream pipeline that expects structured output -- you are not just measuring accuracy on the happy path, you are measuring reliability on the failure path.
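A conformance rate like this is cheap to measure. A minimal sketch, assuming "malformed" means the raw output either fails to parse or parses to something other than a JSON object (the post does not define it more precisely):

```python
import json

def malformed_rate(outputs):
    """Fraction of raw model outputs that do not parse into a JSON object."""
    if not outputs:
        return 0.0
    bad = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            bad += 1
            continue
        if not isinstance(parsed, dict):
            bad += 1  # valid JSON, but not the object the pipeline expects
    return bad / len(outputs)

# Toy outputs: one good, one truncated, one with preamble, one wrong type.
sample_outputs = ['{"a": 1}', '{"a": 1', 'Sure! {"a": 1}', "[1, 2]"]
rate = malformed_rate(sample_outputs)
```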

Latency at inference: the fine-tuned 7B averaged 340ms per section on the client's A10 node. Calling Sonnet through the API averaged 1.2s per section. The fine-tuned model is roughly 3.5x faster in this deployment context, and at 50,000+ contract sections per month that speedup is the actual business case, not just the accuracy improvement.

Before/after: clause flagging

This is where the experiment fell apart.

The evaluation setup: two attorneys (both with commercial contract experience) independently annotated 150 clauses as problematic or not, and I tracked inter-annotator agreement. They agreed on 118 of 150 (78.7% agreement). For the 32 disagreements, I took what I am calling the majority label, but with only two annotators there is no real majority: one of them won arbitrarily, which is a limitation.
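Raw agreement overstates consistency when one label dominates, so it is worth computing a chance-corrected score alongside it. The sketch below computes both raw agreement and Cohen's kappa for two annotators. The post reports only the raw figure; kappa needs each attorney's individual labels, so the lists here are toy data.

```python
def agreement_and_kappa(labels_a, labels_b):
    """Raw agreement and Cohen's kappa for two annotators, labels in {0, 1}."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two equal-length, non-empty label lists")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n  # annotator A's flag rate
    p_b = sum(labels_b) / n  # annotator B's flag rate
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Toy labels (1 = problematic), not the attorneys' real annotations.
attorney_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
attorney_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
observed, kappa = agreement_and_kappa(attorney_a, attorney_b)

# The raw agreement reported above.
reported_agreement = 118 / 150
```

In the toy data the annotators agree on 8 of 10 clauses, yet kappa is noticeably lower than 0.8, because two annotators who each flag 40% of clauses would agree about half the time by chance alone.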

| Metric | Baseline | Fine-tuned | Delta |
|---|---|---|---|
| Accuracy vs. attorney majority label | 0.68 | 0.67 | -0.01 |
| Precision (problematic class) | 0.61 | 0.59 | -0.02 |
| Recall (problematic class) | 0.72 | 0.74 | +0.02 |
| F1 (problematic class) | 0.66 | 0.66 | 0.00 |

The fine-tuned model was statistically indistinguishable from the baseline on attorney-agreement metrics. That flat result is not a fine-tuning failure in the implementation sense -- the training ran cleanly, loss curves looked normal. It is a data-quality failure.
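A rough sense of the noise floor at this sample size: the sketch below computes a normal-approximation 95% interval half-width for an accuracy measured on 150 examples. It is a back-of-envelope check, not the test used in the experiment (none is reported). Because both models were scored on the same 150 clauses, a paired test such as McNemar's would be the more appropriate tool.

```python
import math

def accuracy_ci_halfwidth(accuracy, n, z=1.96):
    """Half-width of a normal-approximation interval for an accuracy."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# Baseline accuracy and evaluation set size from the table above.
half_width = accuracy_ci_halfwidth(0.68, 150)
observed_delta = abs(0.67 - 0.68)
within_noise = observed_delta < half_width
```

The half-width comes out at roughly 7 points, so a 1-point difference in accuracy on 150 clauses cannot be distinguished from sampling noise.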

The synthetic clause annotations I generated were Sonnet's interpretation of what attorneys think is problematic. They were plausible but wrong in a specific way: the synthetic annotations were systematically more conservative than the real attorneys. The synthetic data had an implicit prior that favored flagging (Sonnet errs toward "flag this" when uncertain), while the two real attorneys had a deal-oriented prior that favored not flagging clauses that were unusual but defensible.

The fine-tuned model learned the synthetic distribution. That distribution did not match production.

I ran an analysis after the fact to confirm this. The synthetic training data had a 61% flag rate (share of clauses labeled problematic). The real attorney evaluation set had a 38% flag rate. That 23-point gap in base rate is the most likely reason the fine-tuned model's precision dipped while its recall rose: it learned to flag more aggressively than real attorneys do. Both shifts are small, so read the direction as suggestive rather than proven.

What this tells me about synthetic data for fine-tuning

The structured extraction task has a hard correctness criterion: the party name in the contract is a fact. You can verify it without human judgment. Synthetic data generation with a capable model produces training examples that are correct in the same factual sense. The fine-tuned model learned a sharper extraction behavior and generalized it.

The clause flagging task does not have a hard correctness criterion. "Problematic" is a judgment call that varies by attorney, client, deal type, risk appetite, and the legal jurisdiction you're operating in. When I asked Sonnet to generate "correct" annotations for that task, I was not generating ground truth -- I was generating Sonnet's opinion at temperature 0.2. The fine-tuned model learned Sonnet's opinion. That is not the same as learning to agree with attorneys.

The practical heuristic I now use before reaching for synthetic fine-tuning data: can you write a deterministic or near-deterministic function that checks correctness? If yes, synthetic data is probably safe. If correctness requires a human judgment call -- a quality score, a preference between two responses, a subjective classification -- synthetic data will embed the generating model's biases into your training set, and you will not catch it until you measure against real human labels.
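As an illustration of what a "near-deterministic correctness function" can look like for extraction, the sketch below checks that every ground-truth value actually appears in the section text. The function name and the substring-match rule are assumptions for illustration. The rule only works for fields copied verbatim from the text; a date reformatted to ISO form would need its own parser.

```python
import re

def _norm(text):
    """Lowercase and collapse whitespace so matching ignores layout."""
    return re.sub(r"\s+", " ", text.strip().lower())

def unsupported_fields(section_text, ground_truth):
    """Return the fields whose values do not appear in the section text."""
    haystack = _norm(section_text)
    missing = []
    for field, value in ground_truth.items():
        values = value if isinstance(value, list) else [value]
        if any(_norm(str(v)) not in haystack for v in values):
            missing.append(field)
    return missing

section = (
    "This Agreement is made between Acme Corp and Beta LLC, effective "
    "January 5, 2023, and is governed by the laws of Delaware."
)
truth = {
    "parties": ["Acme Corp", "Beta LLC"],
    "effective_date": "January 5, 2023",
    "governing_law": "Texas",  # deliberately wrong to show a failure
}
problems = unsupported_fields(section, truth)
```

No equivalent function exists for "is this clause problematic", and that asymmetry is the whole point of the heuristic.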

Hamel Husain has written about this distinction in the context of LLM evaluation (his "Your AI Product Needs Evals" post is worth reading here). The same logic applies to training data: evals with no human grounding are measuring model-vs-model agreement, not model-vs-reality performance.

What I would change

The structured extraction result held up. The clause flagging result was a waste of three weeks of fine-tuning compute and attorney evaluation time.

The change I should have made before starting the clause flagging experiment: annotate 500 real clauses before generating any synthetic data. Use those real annotations to measure the generating model's agreement with human raters. If Sonnet-vs-attorney agreement on clause flagging is 0.68 (which is roughly what I measured post-hoc), that is roughly the ceiling to expect from any fine-tuned model trained entirely on Sonnet's synthetic annotations. I would have known immediately that synthetic data alone was not going to work.

The fix that would have helped: use real annotations as calibration data, generate synthetic data that samples from the distribution implied by real annotations (not the unconstrained generating model prior), and validate distribution match before fine-tuning rather than after. That is more work than just running the generation pipeline, but it is less work than running a fine-tuning experiment that teaches you nothing.
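One piece of that validation, the base-rate comparison, is simple enough to run as a pre-flight check. In the sketch below the 5-point tolerance is an arbitrary assumption, and the label lists are toy data built to mirror the 61% and 38% flag rates reported earlier.

```python
def flag_rate(labels):
    """Share of examples labeled problematic (labels are 0 or 1)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)

def check_base_rate(synthetic_labels, real_labels, tolerance=0.05):
    """Return (ok, gap), where gap is the synthetic minus real flag rate."""
    gap = flag_rate(synthetic_labels) - flag_rate(real_labels)
    return abs(gap) <= tolerance, gap

# Toy lists mirroring the reported rates: 61% synthetic, 38% real.
synthetic = [1] * 61 + [0] * 39
real = [1] * 38 + [0] * 62
ok, gap = check_base_rate(synthetic, real)
```

Matching the flag rate is necessary but not sufficient: two datasets can share a base rate and still disagree on which clauses get flagged, so this check belongs next to a per-example agreement measurement, not in place of it.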

For the structured extraction work, I'd use the same approach again. The generation pipeline is fast, the filtering and contamination checks add a day of work, and the improvement -- 0.74 to 0.90 macro F1 -- is real and reproducible.

The evaluation and pipeline infrastructure I built for this experiment shares architectural patterns with the ML observability layer I'm building for Pipeshift. I'm the founder there, so read that link with appropriate skepticism. The contract extraction work described here was for a client engagement -- details anonymized.