The Four-Week Pattern I Use to Get Skeptical Engineers Actually Shipping AI Features

Most engineering teams that bring me in for AI consulting have the same latent problem: one or two people understand how LLMs actually work, the rest have absorbed enough LinkedIn content to be confused, and management has already promised a roadmap with "AI-powered" in the feature names.

The goal everyone gives me is "get the team up to speed on AI." What that usually means in practice is: get ten engineers from zero to shipping something real, in a context where half of them are skeptical this is worth their time and the other half think they're about to become prompt engineers.

Here is the pattern that works, and the things I stopped trying.

What does not work

I want to be direct about this because the alternatives are what most teams try first.

LLM theory sessions. Attention mechanisms, transformer architecture, token probabilities -- this content has a place, but it is not week one of an engineering onboarding. I have sat through onboardings where a team spent three hours understanding how next-token prediction works and then had no idea how to write an eval. The theory provides context for decisions you have already made. It does not help you make the first decision.

Vendor demos. Every major cloud provider will send someone to show your team how to build a RAG app in 20 minutes on their platform. The demo works flawlessly. Then your engineers try to adapt it to your actual data schema and actual latency requirements and it falls apart in ways the demo never showed. Vendor demos are optimized for impressiveness, not for transferable understanding.

"Let's add AI to this feature." This is the mandate that arrives from product or leadership with no specifics attached. I have seen engineers told to "add AI to the onboarding flow" with no further guidance. The result is always either analysis paralysis or a feature that bolts a completion onto something that did not need it. Without a concrete task that has a measurable success condition, the engineers are flying blind.

What all three of these have in common: they defer the moment of productive confusion. Every engineer learning to build with LLMs needs to encounter a specific failure -- a hallucination, a wrong retrieval result, a prompt that works on test cases and fails on production data -- and understand why it happened. The failure is the learning. Anything that delays it delays actual competence.

The four-week structure

This is not a training curriculum. It is a set of interventions across four weeks that move a team from skeptical-but-willing to shipping a production-grade AI feature. I have run versions of this across three engagements now. The specifics change; the structure holds.

Week 1: A real task they care about, and the eval harness before the model

The first thing I do is pick a task the team already cares about solving. Not a tutorial task, not a synthetic exercise -- something from their actual backlog. It needs to meet two criteria: it has inputs and outputs you can describe precisely, and there is already a human process producing the outputs so you have something to compare against.

The second thing I do is build the eval harness before touching a model.

This is the intervention that skeptical engineers respond to most. The framing is: "We are not going to trust a model to do anything useful until we can measure whether it is doing the right thing. So let's build the measurement first." That is an argument engineers find immediately credible because it maps to how they think about software in general. Write the test before the implementation.

In practice this means: define the output structure, write 20-30 labeled examples of input/expected output from the existing human process, and build a simple evaluation script that compares model output to those labels on whatever metrics matter -- exact match, BLEU if it is a text generation task, a rubric-based LLM-as-judge setup if the outputs are long-form.
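A minimal sketch of what such a harness can look like, assuming a classification-style task scored by exact match. The example tickets, the label names, and `placeholder_predict` are all hypothetical; the point is only that the harness runs end to end before any model exists.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    input_text: str
    expected: str


def normalize(text: str) -> str:
    # Collapse case and whitespace so formatting noise is not scored as an error.
    return " ".join(text.lower().split())


def run_eval(examples: list[Example], predict: Callable[[str], str]) -> dict:
    """Score predict() against labeled examples using normalized exact match."""
    failures = []
    for ex in examples:
        output = predict(ex.input_text)
        if normalize(output) != normalize(ex.expected):
            failures.append(
                {"input": ex.input_text, "expected": ex.expected, "got": output}
            )
    total = len(examples)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "failures": failures,
    }


# Hypothetical labeled examples, copied from the existing human process.
EXAMPLES = [
    Example("Refund request, order never arrived", "billing"),
    Example("App crashes when I open settings", "bug"),
    Example("How do I export my data?", "how-to"),
]


def placeholder_predict(text: str) -> str:
    # Stand-in for the model call that gets wired up in week 2.
    return "bug"


report = run_eval(EXAMPLES, placeholder_predict)
print(f"{report['passed']}/{report['total']} correct")
for failure in report["failures"]:
    print(failure)
```

Keeping the failures in the report, not just the score, is deliberate: the week 2 discussion depends on engineers reading individual failure cases.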

Week 1 ends with a working eval harness and no model integration. That is intentional. The engineers have spent a week thinking carefully about what "correct" looks like. They are now ready to be wrong about it.

Week 2: Let them see a hallucination before they care about preventing it

In week 2, I connect the simplest possible model integration and run it against the eval harness. For most tasks this means a straightforward prompt and the best-available hosted model, with no retrieval, no chain-of-thought, and no structured output. Just raw output.

The eval scores are usually bad. Not catastrophically bad -- somewhere in the range of 40-60% on the initial rubric. But bad enough to be interesting.

Here is what happens in the room: the engineers look at the failures. They read specific cases where the model produced confident, syntactically correct, completely wrong output. The skeptics are vindicated -- this is not magic. The optimists are humbled. Both groups are now curious, which is the state you need to do real work.

I do not intervene at this point to explain why the hallucination happened. I let them read the failure cases and form hypotheses. "It seems to not know about X context." "It's ignoring the constraint in the prompt." "The output format is wrong here, which is causing the metric to score it zero even when it's close." These observations are usually right, and they come from the engineers themselves.

What this does is convert hallucination from an abstract risk into a concrete, measured phenomenon. It is no longer "LLMs hallucinate" as a fact they read somewhere. It is "this model hallucinated on this specific input type and we can see it in the eval results." That is a solvable engineering problem, not a philosophical concern.
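One way to make "this specific input type" visible is to tag each eval example with a category and report failure rates per category. The sketch below assumes each result row carries a hypothetical `input_type` tag and a `passed` flag; the category names and data are made up for illustration.

```python
from collections import defaultdict


def failure_rate_by_type(results: list[dict]) -> dict[str, float]:
    """Return the fraction of failed examples for each input_type tag."""
    totals: dict[str, int] = defaultdict(int)
    failed: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["input_type"]] += 1
        if not row["passed"]:
            failed[row["input_type"]] += 1
    return {input_type: failed[input_type] / totals[input_type] for input_type in totals}


# Hypothetical eval results; in practice these come from the harness run.
RESULTS = [
    {"input_type": "short_ticket", "passed": True},
    {"input_type": "short_ticket", "passed": True},
    {"input_type": "multi_issue_ticket", "passed": False},
    {"input_type": "multi_issue_ticket", "passed": True},
    {"input_type": "non_english_ticket", "passed": False},
]

rates = failure_rate_by_type(RESULTS)
# Worst categories first, so the room looks at the most broken input type.
for input_type, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{input_type}: {rate:.0%} failed")
```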

Week 3: One intervention at a time

Week 3 is iteration. The engineers each pick one hypothesis from week 2 and implement it. One person tries adding context via retrieval. One tries a more constrained output schema with response_format. One tries few-shot examples in the system prompt. One tries breaking the task into two separate model calls.

The rule is: one change per iteration, run the eval before and after, and report the delta. This discipline matters. Without it, engineers layer multiple changes at once and cannot attribute score improvements to any specific decision. Layering changes is fine when you are debugging production, but it does not build intuition.
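The before/after report can be as small as the sketch below. It assumes exact-match scoring and that the baseline and the single-change variant were run on the same labeled examples; the labels, predictions, and the change name are hypothetical.

```python
def accuracy(labels: list[str], predictions: list[str]) -> float:
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


def report_delta(
    change_name: str, labels: list[str], baseline: list[str], variant: list[str]
) -> dict:
    """Compare one named change against the baseline on the same examples."""
    before = accuracy(labels, baseline)
    after = accuracy(labels, variant)
    rows = list(zip(labels, baseline, variant))
    # Track which examples flipped, not just the aggregate score.
    fixed = [i for i, (label, b, v) in enumerate(rows) if b != label and v == label]
    regressed = [i for i, (label, b, v) in enumerate(rows) if b == label and v != label]
    return {
        "change": change_name,
        "before": before,
        "after": after,
        "delta": after - before,
        "fixed": fixed,
        "regressed": regressed,
    }


# Hypothetical data: five labeled examples, baseline run, and one-change variant run.
LABELS = ["billing", "bug", "how-to", "bug", "billing"]
BASELINE = ["billing", "billing", "how-to", "how-to", "bug"]
VARIANT = ["billing", "bug", "bug", "bug", "bug"]

delta = report_delta("few-shot examples", LABELS, BASELINE, VARIANT)
print(
    f"{delta['change']}: {delta['before']:.0%} -> {delta['after']:.0%} "
    f"(fixed {delta['fixed']}, regressed {delta['regressed']})"
)
```

Listing regressed examples alongside fixed ones matters because a change can raise the aggregate score while breaking cases that previously passed.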

By the end of week 3, most teams have moved from 40-60% to 65-80% on their eval rubric, with each improvement attached to a specific named intervention. More importantly, they have a mental model for what kind of problems each intervention addresses. Retrieval helps with factual grounding. Structured output helps with schema compliance. Few-shot examples help with tone and format. Chain-of-thought and splitting the task into separate calls help on multi-step reasoning tasks.

I should be honest: 65-80% is not a shipped product for most use cases. It is a team that knows how to iterate toward one.

Week 4: Production concerns -- latency, cost, failure modes

Week 4 is where I run the team through the things that do not show up in demo conditions. Latency. Token cost at the transaction level. What happens when the model API returns a rate limit error. What happens when the structured output schema fails to parse. What the blast radius is if the model produces confidently wrong output and downstream code acts on it.
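As an illustration of the first two failure modes, here is a sketch of a wrapper that retries on rate limits with exponential backoff and reports unparseable output explicitly instead of raising. `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises, and `call_model` is any function that takes a prompt and returns a string; neither is a real vendor API.

```python
import json
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate limit exception your client raises."""


def call_with_retry(call_model, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call the model, backing off on rate limits and flagging unparseable output."""
    for attempt in range(max_attempts):
        try:
            raw = call_model(prompt)
        except RateLimitError:
            if attempt == max_attempts - 1:
                break
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
            continue
        try:
            return {"ok": True, "data": json.loads(raw)}
        except json.JSONDecodeError:
            # Surface the raw text so the caller can log it or fall back.
            return {"ok": False, "reason": "unparseable_output", "raw": raw}
    return {"ok": False, "reason": "rate_limited"}


# Fake model for demonstration: rate limited once, then returns valid JSON.
calls = {"n": 0}


def flaky_model(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimitError()
    return '{"label": "bug"}'


result = call_with_retry(flaky_model, "classify this ticket", sleep=lambda seconds: None)
print(result)
```

Returning a result dictionary instead of raising forces the calling code to decide what happens on each failure path, which is the conversation this week is meant to provoke.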

This week is deliberately concrete. I pull up the actual pricing page and we calculate cost-per-1000-calls at the current prompt and completion token counts. For most teams this produces one of two reactions: relief (it is cheaper than they expected) or a scramble to reduce prompt size (it is more expensive per unit than their product margin allows). Either reaction is useful.
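The arithmetic is simple enough to keep in a script next to the eval harness. In the sketch below, the token counts and per-million-token prices are made-up placeholders, not any vendor's real rates; substitute the numbers from the pricing page and your own measured token counts.

```python
def cost_per_1000_calls(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated cost of 1000 calls, given per-call token counts and per-million prices."""
    per_call = (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    ) / 1_000_000
    return per_call * 1000


# Hypothetical numbers: 1200 prompt tokens, 300 completion tokens,
# 3.00 per million input tokens, 15.00 per million output tokens.
estimate = cost_per_1000_calls(1200, 300, 3.00, 15.00)
print(f"Estimated cost per 1000 calls: {estimate:.2f}")
```

With these placeholder numbers the estimate comes to 8.10 per 1000 calls, with more than half of it coming from the completion tokens even though the prompt is four times longer.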

For the failure mode discussion, I use actual incidents. Not hypotheticals -- I use things that happened in my work building Pipeshift or with clients, with details changed where appropriate. The conversation that has the most impact is always some version of: "Here is a case where the model returned a valid JSON structure that passed schema validation, but the values were inverted, and the downstream code made a decision based on inverted values for four hours before anyone noticed." That is the conversation that gets teams to build validation beyond schema compliance.
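To make "validation beyond schema compliance" concrete, here is a sketch with an invented payload shape. It is not the incident described above; the `min_price` and `max_price` fields are hypothetical. The payload has the right keys and types, so a schema check passes, but a domain invariant catches the inversion.

```python
def validate_price_range(payload: dict) -> list[str]:
    """Check domain invariants that a JSON schema alone would not catch."""
    low = payload.get("min_price")
    high = payload.get("max_price")
    if not isinstance(low, (int, float)) or not isinstance(high, (int, float)):
        return ["min_price and max_price must both be numbers"]
    problems = []
    if low > high:
        problems.append("min_price exceeds max_price (values may be inverted)")
    if low < 0:
        problems.append("min_price is negative")
    return problems


# Schema-valid output with inverted values: correct keys, correct types, wrong meaning.
model_output = {"min_price": 120, "max_price": 80}

problems = validate_price_range(model_output)
if problems:
    # Refuse to act on the output; log it and route to a fallback or human review.
    print("Rejected model output:", problems)
```

The check returns a list of problems so that downstream code can log every violated invariant, not only the first one it hits.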

By the end of week 4, the team has a working feature in staging with a passing eval suite, explicit cost projections, and a documented set of failure modes with mitigations for the highest-severity ones. That is a shippable scope.

The thing I keep getting asked about

Every engagement, someone asks when they should fine-tune. Usually it is framed as "at what point do we need to train our own model."

My answer: almost never, for the use cases that come up in a typical product engineering team. Fine-tuning addresses a specific set of problems -- domain vocabulary that the base model does not know, a consistent output format that prompting cannot enforce reliably, latency budgets that force a smaller model to match a larger one's quality. Those problems are real. They are also not the first problem most teams have.

The first problem is almost always eval infrastructure and prompt discipline. Teams that are measuring their outputs and iterating systematically on prompts and retrieval get to surprisingly high quality without touching fine-tuning. Teams that jump to fine-tuning because they hit a quality ceiling at week two are usually trying to paper over a measurement gap, not a training gap.

I might be wrong about this for teams in specialized domains -- medical, legal, very narrow technical fields -- where the base model's vocabulary coverage is genuinely insufficient. But for product feature development, I have not hit that ceiling in any engagement yet.

What carries over after I leave

The only thing that matters is whether the team keeps running evals after the engagement ends. Everything else decays. Prompts get edited without measuring the impact. New model versions get adopted because a blog post says they are better. Context gets added to the prompt opportunistically until the token cost doubles.

The teams that maintain quality are the ones where running the eval suite before and after any AI-related change is habitual, not optional. Getting that habit established in week 1, before there is a model to get excited about, is the reason I sequence the harness before the integration.
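One way to make the habit hard to skip is a regression gate that compares the eval score after a change to a stored baseline and fails the build on a meaningful drop. This is a sketch under assumptions: the scores, the `max_drop` tolerance, and the function names are hypothetical, and how the exit code is wired into a given CI system is left out.

```python
def check_regression(baseline: float, candidate: float, max_drop: float = 0.02):
    """Return (passed, message) comparing the candidate eval score to the baseline."""
    drop = baseline - candidate
    passed = drop <= max_drop
    verdict = "OK" if passed else "REGRESSION"
    message = (
        f"{verdict}: baseline {baseline:.2%}, candidate {candidate:.2%}, "
        f"allowed drop {max_drop:.2%}"
    )
    return passed, message


def gate(baseline: float, candidate: float) -> int:
    """Return a process exit code: 0 to allow the change, 1 to block it."""
    passed, message = check_regression(baseline, candidate)
    print(message)
    return 0 if passed else 1


# Hypothetical scores: a prompt edit dropped accuracy from 78% to 74%.
exit_code = gate(0.78, 0.74)
```

A small tolerance is there because eval scores on a few dozen examples are noisy; a gate that blocks on any drop at all tends to get disabled.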

The eval infrastructure is boring. It is the right thing to start with.

Some of this work informs how I think about evaluation pipelines at Pipeshift, where eval runs are a first-class step in the CI/CD pipeline for AI features. The AI onboarding and feature development consulting is offered through Optivulnix -- I'm the founder of both, so take those mentions with appropriate skepticism.