What the First 90 Days of Building Pipeshift Actually Looked Like

Full disclosure: I am the founder of Pipeshift. Everything I write about Pipeshift is written from that position. I will try to be honest about what it is, what it is not, and what I got wrong -- but you should weight my perspective accordingly.

The problem I wanted to solve was not subtle. I kept watching teams at Optivulnix clients -- I co-founded Optivulnix, an AI and DevOps consulting firm -- try to deploy fine-tuned or updated models using the same CI/CD tooling they used to deploy web services. GitHub Actions workflows that treated a PyTorch model artifact the same way they treated a Node.js build. Jenkins pipelines with a deployment stage that uploaded weights to S3 and called it done. It mostly worked, in the same way that duct tape mostly works.

The specific failure mode I saw most often: a model update ships to production, something regresses, and there is no structured rollback path because the pipeline was never designed to understand what "rollback" means for a model. With software, rollback means reverting to a previous build. With a model, rollback is more complicated -- the model artifact, the serving configuration, the preprocessing pipeline, and sometimes the feature extraction code are all separately versioned and need to be coordinated. Existing CI/CD tools do not model that relationship. They treat the model artifact as a file and ship the file.
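To make the coordination problem concrete, here is a minimal sketch of a release record that pins all four pieces together. It is an illustration only, not Pipeshift's data model: the `ModelRelease` class, its field names, and the example URIs and version strings are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ModelRelease:
    """One deployable unit: every piece that has to move together (hypothetical)."""
    model_artifact: str       # e.g. an object-store URI for the weights
    serving_config: str       # version of the serving configuration
    preprocessing: str        # version of the preprocessing pipeline
    feature_extraction: str   # version of the feature extraction code


def rollback_plan(current: ModelRelease, previous: ModelRelease) -> dict:
    """Return the components that differ, mapped to the value to restore."""
    cur, prev = asdict(current), asdict(previous)
    return {name: prev[name] for name in cur if cur[name] != prev[name]}


# Made-up example values.
previous = ModelRelease("s3://example-bucket/model-v1", "serve-1", "prep-3", "feat-2")
current = ModelRelease("s3://example-bucket/model-v2", "serve-2", "prep-3", "feat-2")

plan = rollback_plan(current, previous)
print(plan)
```

The point of the sketch is that a rollback is computed from the whole record, not from the weights file alone: here the serving config has to revert along with the artifact.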

That is the problem I set out to build against. What I built in the first two months and what actually happened are instructively different things.

The original prototype (March 2026)

I had a rough prototype running after about three weeks. It was a Python service -- FastAPI backend, a simple React frontend I cobbled together from a template -- that took a GitHub repository URL, inspected the CI/CD configuration, identified model deployment steps, and generated a wrapper pipeline that added rollback checkpoints and a basic evaluation gate.

The evaluation gate was the core idea: before a model update was promoted to production, the pipeline would run a suite of held-out evaluation prompts against both the new model and the current production model, compare outputs on a set of defined metrics, and block promotion if the new model regressed past a configurable threshold. The rollback checkpoint stored a reference to the current production artifact and serving config so that a failed deploy could revert deterministically.
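A threshold gate of this kind can be sketched in a few lines. This is an assumed shape, not the prototype's code: the function name, the `max_drop` parameter, the metric names, and the scores are invented for illustration, and it assumes higher is better for every metric.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    promote: bool
    regressions: dict = field(default_factory=dict)  # metric -> drop vs. production


def evaluation_gate(prod_scores: dict, cand_scores: dict, max_drop: float) -> GateResult:
    """Block promotion if any production metric drops by more than max_drop.

    Assumes higher is better. A metric missing from the candidate's scores
    is treated as a regression rather than silently skipped.
    """
    regressions = {}
    for metric, prod_value in prod_scores.items():
        if metric not in cand_scores:
            regressions[metric] = float("inf")
            continue
        drop = prod_value - cand_scores[metric]
        if drop > max_drop:
            regressions[metric] = drop
    return GateResult(promote=not regressions, regressions=regressions)


# Made-up scores: the candidate improves one metric and regresses another.
prod = {"exact_match": 0.82, "json_valid": 0.99}
cand = {"exact_match": 0.84, "json_valid": 0.93}

result = evaluation_gate(prod, cand, max_drop=0.02)
print(result.promote, sorted(result.regressions))
```

Note that the gate is only as good as the scores fed into it, which is where the trouble described later in this post comes from.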

It worked. It also required several hours of manual configuration for any given model deployment, produced YAML output that was often incompatible with the target CI/CD system without hand-editing, and ran the evaluation step so slowly on anything larger than a 7B-parameter model that engineers were disabling it immediately.

That last part is important. An evaluation gate that gets disabled is not a safety mechanism. It is a usability failure.

What beta users from Optivulnix clients actually needed

I brought the prototype to five teams that Optivulnix had worked with or was actively working with. This was not cold outreach -- I had pre-existing relationships, and I disclosed upfront that I was asking them to spend time on something I was building. Three teams agreed to look at it.

The feedback was not what I expected.

Two of the three teams said their biggest problem was not rollback. It was knowing whether a candidate model was worth promoting at all -- not a binary pass/fail on held-out metrics, but visibility into how the model was behaving differently from the current production model across the specific query distributions that mattered for their use case. They had evaluation sets, but the evaluation sets had been assembled quickly and no one trusted them to capture the real distribution. They wanted something that could show them "here are the classes of queries where this model behaves differently, and here is whether that difference looks like regression or improvement."

The third team had a different problem entirely: their model deployment was blocked by an internal compliance requirement that any model serving PII needed to pass a data-flow audit before promotion. The bottleneck was not technical -- it was getting the audit scheduled and completed on a timeline that did not delay every model update by three weeks. They wanted automation for the audit artifact generation, not a better rollback mechanism.

Neither of these was the problem I had built against.

Pivot one: from rollback tooling to pre-promotion analysis (weeks 4-6)

I spent two weeks rebuilding the core of the prototype around behavioral diff analysis rather than rollback checkpoints. The idea: run both the candidate model and the current production model against the same query set, cluster the queries where outputs diverged significantly, and surface those clusters to the engineer with enough context to make a promotion decision.

The clustering step was the technically interesting part. I used embedding similarity on the diverged outputs, not on the queries, to find cases where the models were systematically disagreeing about a class of inputs rather than randomly varying. A model that handles factual questions about product features differently from the production model shows up as a cluster. A model that produces longer outputs on ambiguous queries shows up as a cluster. Random variation distributes evenly across clusters and does not concentrate.
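The intuition can be shown with a toy example. The sketch below uses a greedy, single-pass cosine-similarity clustering as a stand-in; it is not the algorithm Pipeshift uses, and the vectors are hand-written placeholders for what would really be embeddings of diverged outputs. The `threshold` and `min_share` values are arbitrary.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_divergences(vectors, threshold=0.8):
    """Greedy clustering: each vector joins the first cluster whose seed
    vector it resembles, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, [member indices])
    for idx, vec in enumerate(vectors):
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((vec, [idx]))
    return [members for _, members in clusters]


def concentrated(clusters, total, min_share=0.3):
    """Keep only clusters holding at least min_share of all diverged pairs."""
    return [c for c in clusters if len(c) / total >= min_share]


# Toy stand-ins for embeddings of five diverged outputs.
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0],
]

clusters = cluster_divergences(vectors)
systematic = concentrated(clusters, total=len(vectors))
print(clusters, systematic)
```

Three of the five toy vectors point the same way and land in one cluster, which is the "systematic disagreement" signal; the other two end up as singletons, which is what unstructured variation looks like.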

This was a better version of the tool. It was also significantly more complex to explain than "add an evaluation gate to your deployment pipeline." The five-minute demo that had been easy to give for the prototype now required fifteen minutes to get to the point where an engineer understood why the clustering mattered.

I shipped it to the two teams who had asked for behavioral visibility. Both liked it. One used it. The other found it too expensive to run on their query volume -- they were processing north of 50k queries per day and the cost of running two model inference passes on a representative sample was not trivial, even with batching.

Pivot two: separating the analysis from the gate (weeks 6-9)

The problem I had created was a tool that tried to do analysis and enforcement simultaneously. The enforcement (blocking promotion) required running analysis on every deploy, which made the cost unavoidable. But for teams with established evaluation confidence, the analysis was overhead they did not need every time.

I separated the two. Analysis became an on-demand operation -- run it when you want visibility into a candidate model, not as a mandatory gate on every deploy. The deployment pipeline integration became lighter: a CLI that takes a candidate model identifier and a production model identifier, produces an analysis artifact, and returns an exit code based on whether the analysis met a configurable threshold. Teams that trusted their evaluation sets could automate the threshold check. Teams that did not trust their evaluation sets could use the analysis artifact manually and decide.
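As a rough sketch of that split, the entry point below always writes an analysis artifact and only gates when a threshold is passed. Everything here is hypothetical: the flag names, the `run_analysis` stub with its canned divergence score, and the output file layout are illustrative and are not the actual Pipeshift CLI.

```python
import argparse
import json
import os
import tempfile


def run_analysis(candidate: str, production: str) -> dict:
    """Stand-in for the real analysis; returns a canned divergence score."""
    return {"candidate": candidate, "production": production, "divergence": 0.12}


def main(argv) -> int:
    parser = argparse.ArgumentParser(description="Compare a candidate model with production.")
    parser.add_argument("--candidate", required=True)
    parser.add_argument("--production", required=True)
    parser.add_argument("--max-divergence", type=float, default=None,
                        help="omit to write the artifact without gating")
    parser.add_argument("--out", default="analysis.json")
    args = parser.parse_args(argv)

    # The artifact is always written, so it can be reviewed by hand.
    report = run_analysis(args.candidate, args.production)
    with open(args.out, "w") as fh:
        json.dump(report, fh, indent=2)

    # Gate only when the caller opted in with a threshold.
    if args.max_divergence is not None and report["divergence"] > args.max_divergence:
        return 1
    return 0


out_path = os.path.join(tempfile.mkdtemp(), "analysis.json")
base = ["--candidate", "cand-v2", "--production", "prod-v1", "--out", out_path]

manual_exit = main(base)                               # artifact only
gated_exit = main(base + ["--max-divergence", "0.05"])  # threshold enforced
print(manual_exit, gated_exit)
```

In a real tool the return value would be handed to `sys.exit`, so a CI step fails on a nonzero code only for teams that chose to pass a threshold.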

This felt like a step backward from the original vision of a fully automated evaluation gate. I think it was actually right. The original vision assumed that teams had trustworthy evaluation sets and just needed automation on top. Most teams do not have trustworthy evaluation sets. Forcing an automated gate on top of a weak evaluation set does not make deployments safer -- it gives you a false sense of safety and eventually trains engineers to distrust or bypass the gate. The on-demand analysis at least gives you something to look at.

Pivot three: dropping the CI/CD wrapper (weeks 9-12)

The original prototype had generated CI/CD pipeline YAML. This was the feature I was most attached to, because it was the feature that would have made the tool feel complete -- give Pipeshift your GitHub Actions workflow or Jenkinsfile, and it would modify it to include the evaluation and rollback steps.

I dropped it around week nine.

The reason: the diversity of CI/CD configurations in the real world is enormous, and every generated YAML file needed hand-editing for the specific team's pipeline. The generated output was useful as a reference, but it was not something anyone trusted to run unchanged. Maintaining the generators for GitHub Actions, Jenkins, GitLab CI, and CircleCI was significant surface area, and I was one person building all of it.

More importantly, the teams I had talked to did not primarily want better CI/CD YAML. They wanted the analysis. The CI/CD integration was a way to trigger the analysis automatically; it was not itself the valuable thing. Stripping the generator and providing a documented integration pattern for each major CI/CD system -- "here is how to call the Pipeshift CLI from GitHub Actions" -- cost nothing in terms of actual capability and saved weeks of maintenance work.

The moment something clicked (week 11)

One of the original beta users came back in week 11 with a specific ask. They had promoted a model update three weeks earlier, and two weeks after the promotion they had noticed a pattern: user satisfaction scores on a specific use case had dropped about 8% since the update. They wanted to understand whether the model was causing it or whether something else had changed. They had not run the behavioral diff analysis at promotion time because they had not trusted the evaluation set enough to gate on it.

They asked whether I could run the analysis retrospectively -- compare the pre-update model against the post-update model using queries from the two-week post-promotion window. I had not built that path. The tool was designed around candidate models before promotion, not retrospective analysis after.

I built the retrospective mode in about four days. The engineer ran it, found a cluster of queries in the specific use case category where the new model was consistently producing shorter, less detailed responses than the previous model -- not wrong responses, just less comprehensive ones -- and used that to make a decision about whether to roll back or patch forward with additional fine-tuning.
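The shape of a retrospective run can be sketched as two steps: pull the queries logged in the post-promotion window, then replay them through both models and keep the pairs that differ. This is a simplified illustration under assumptions, not the code I shipped: the log format, the stand-in model functions, the `much_shorter` rule, and the dates are all made up.

```python
from datetime import datetime, timedelta


def post_promotion_queries(log, promoted_at, window=timedelta(days=14)):
    """Queries received in [promoted_at, promoted_at + window)."""
    end = promoted_at + window
    return [e["query"] for e in log if promoted_at <= e["received_at"] < end]


def retrospective_diff(queries, old_model, new_model, differs):
    """Replay queries through both models; keep the pairs that differ."""
    diverged = []
    for q in queries:
        old_out, new_out = old_model(q), new_model(q)
        if differs(old_out, new_out):
            diverged.append((q, old_out, new_out))
    return diverged


# Stand-in models: the "new" one gives terse answers on billing queries.
def old_model(q):
    return "Detailed answer to: " + q + " (with steps and caveats)"


def new_model(q):
    return "Short answer." if q.startswith("billing") else old_model(q)


def much_shorter(old_out, new_out):
    return len(new_out) < 0.5 * len(old_out)


promoted_at = datetime(2000, 1, 10)  # arbitrary placeholder date
log = [
    {"query": "billing: why was I charged twice", "received_at": datetime(2000, 1, 12)},
    {"query": "how do I reset my password", "received_at": datetime(2000, 1, 13)},
    {"query": "billing: refund status", "received_at": datetime(2000, 1, 2)},   # before promotion
    {"query": "billing: update card", "received_at": datetime(2000, 2, 15)},    # after the window
]

queries = post_promotion_queries(log, promoted_at)
diverged = retrospective_diff(queries, old_model, new_model, much_shorter)
print([q for q, _, _ in diverged])
```

The diverged pairs are what would then be clustered. Note that the replay needs the old model to still be available for inference.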

That use case was clearer to me than anything the first ten weeks had produced. A team has a production problem. They suspect a model change. They need to understand what changed and whether it explains the symptom. The retrospective analysis mode is a debugging tool for production model behavior, not just a pre-promotion gate. That framing -- Pipeshift as a model behavior debugger rather than a deployment safety gate -- is closer to what the tool is actually for.

What Pipeshift still cannot do

I am not going to end this with a roadmap. But I will be honest about the current gaps because they affect whether the tool is useful for a given team.

The behavioral diff analysis requires that you have access to both models simultaneously for inference. If your previous production model has been deleted, archived, or is no longer serving, you cannot run the retrospective mode against it. Several teams version their model artifacts but do not keep them accessible for inference. This is a workflow problem I have not solved.

The evaluation metric set is currently limited to semantic similarity, output length distribution, and a configurable set of task-specific classifiers for common output types (JSON validity, citation format adherence, structured output conformance). It does not do safety or harm evaluation. For teams whose primary concern is harm regression in a model update, Pipeshift is not the right tool. That capability requires infrastructure for human-in-the-loop review that I have not built.
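Two of those metric types are simple enough to sketch with the standard library. The functions below are illustrative assumptions about what a JSON-validity check and an output-length summary might look like, not Pipeshift's implementations, and the sample outputs are invented. Semantic similarity is left out because it needs an embedding model.

```python
import json
import statistics


def json_validity_rate(outputs):
    """Fraction of outputs that parse as JSON."""
    def is_valid(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(is_valid(o) for o in outputs) / len(outputs)


def length_summary(outputs):
    """Whitespace-token length statistics for a batch of outputs."""
    lengths = [len(o.split()) for o in outputs]
    return {"mean": statistics.mean(lengths), "median": statistics.median(lengths)}


# Made-up outputs: the candidate truncates its second response.
prod_outputs = ['{"status": "ok"}', '{"status": "ok", "items": [1, 2]}']
cand_outputs = ['{"status": "ok"}', '{"status": "ok", "items": [1, 2']

prod_rate = json_validity_rate(prod_outputs)
cand_rate = json_validity_rate(cand_outputs)
prod_lengths = length_summary(prod_outputs)
print(prod_rate, cand_rate, prod_lengths)
```

In this toy case the two batches have identical length statistics, so only the validity check catches the truncated output. That is one reason to run several metrics rather than rely on one.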

The cost of running dual inference on large models is not great. I have batching and quantization options that reduce the cost, but for 70B+ parameter models on a meaningful query sample the per-analysis cost is high enough that some teams will only run it selectively. I know this is a real limitation. I do not have a cheap solution for it yet.

What I would do differently

Starting with a narrower problem. The original prototype tried to add both evaluation and rollback to existing CI/CD pipelines. Both are real needs. Trying to do both in the first version split the product surface area and produced something that was mediocre at both rather than good at either.

The retrospective analysis use case -- model behavior debugging after a production event -- turned out to be clearer and more immediately useful than the pre-promotion gate. I arrived at it by accident in week 11. I could have found it in week two if I had been asking better questions in early conversations: "what do you do when you think a model change caused a production problem?" rather than "how do you manage model deployments?"

The CI/CD YAML generator was the piece I held onto longest because I had built it and was attached to it. Dropping it earlier would have freed up several weeks. The lesson is the standard one but I will say it anyway: the things you have already built are not automatically worth keeping.

Pipeshift is in closed beta. If you are dealing with the production model debugging problem or the pre-promotion evaluation problem and want to run it in your environment, reach out through mohakdeepsingh.dev/contact.