The ML Pipeline CI/CD Setup I Actually Use -- and the Failure That Forced Me to Build It Right

Full disclosure: I'm building Pipeshift, a tool for managing ML pipeline deployments. The architecture I describe here is the direct predecessor to what Pipeshift automates. I have an obvious interest in this problem space.

There's a version of ML CI/CD that gets written about a lot: you train a model, you push to a registry, a pipeline picks it up, runs some evals, deploys to production. Clean, linear, testable. The posts describing this setup are usually written by people who haven't yet hit the failure mode that made me build the current setup from scratch.

The failure mode is this: a model passes offline eval with flying colors, goes through your gate, gets promoted to production, and then degrades on real traffic because the distribution of real requests doesn't match the distribution of your evaluation set. Your offline numbers were accurate. They were just measuring the wrong thing.

I hit this on a classification model in late 2024. It passed a 94% accuracy gate on a held-out eval set. In production, accuracy on the actual incoming request distribution was 71%. The eval set had been curated from historical data that didn't reflect how the request distribution had shifted over the preceding three months. No test caught this because no test was looking at it.

That failure is what shaped every gate in the pipeline I'm about to describe.

The stack

Jenkins on a self-hosted agent (Ubuntu 22.04, 16 vCPU, 64 GB RAM), S3-compatible artifact storage (Cloudflare R2 in the setup I run for my own work, AWS S3 for client engagements), and a staging environment that mirrors production infrastructure at reduced scale. The Jenkins agent has access to both the staging and production Kubernetes namespaces via a scoped kubeconfig.

I'm not using MLflow for the registry layer. I'm using DVC for artifact versioning and a lightweight metadata store I wrote as part of Pipeshift's early scaffolding -- it records eval scores, traffic distribution fingerprints, and promotion decisions alongside the artifact pointer. Whether you use MLflow or something custom, the important property is that model artifacts are immutable and versioned, and promotion decisions are logged with the eval scores that justified them.
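
As a rough picture of what "immutable, versioned, and logged with the scores that justified it" can look like, here is a minimal sketch. It is not the actual metadata store: the `PromotionRecord` class, its field names, and the append-only JSONL log are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class PromotionRecord:
    model_version: str             # e.g. the short commit hash used as the tag
    artifact_uri: str              # pointer to the immutable artifact
    eval_scores: dict              # the scores that justified the decision
    distribution_fingerprint: str  # hypothetical: a hash or summary of eval traffic
    decision: str                  # "promoted" or "rejected"
    decided_by: str


def append_record(log_path: Path, record: PromotionRecord) -> None:
    """Append one record as a JSON line; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def load_records(log_path: Path) -> list[PromotionRecord]:
    """Read the full decision history back, oldest first."""
    with log_path.open(encoding="utf-8") as fh:
        return [PromotionRecord(**json.loads(line)) for line in fh if line.strip()]
```

The point of the append-only log is that a promotion decision can always be traced back to the numbers that were on screen when it was made, whatever storage backend sits underneath.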

The pipeline structure

The full pipeline runs on every merge to main in the model repository. A training job is not part of the CI pipeline -- that runs separately, on a schedule or triggered manually, and pushes a versioned artifact to S3. The CI pipeline picks up from the artifact.

// Jenkinsfile (abbreviated -- the real one is longer)
pipeline {
  agent { label 'ml-agent' }

  environment {
    MODEL_VERSION = "${env.GIT_COMMIT[0..7]}"
    ARTIFACT_BUCKET = 's3://ml-artifacts-prod'
    STAGING_NAMESPACE = 'ml-staging'
    PROD_NAMESPACE = 'ml-prod'
  }

  stages {
    stage('Pull Artifact') {
      steps {
        sh 'dvc pull models/${MODEL_VERSION}/model.tar.gz'
      }
    }

    stage('Schema Validation') {
      steps {
        sh 'python eval/validate_schema.py --artifact models/${MODEL_VERSION}/model.tar.gz'
      }
    }

    stage('Offline Eval') {
      steps {
        sh '''
          python eval/run_eval.py \
            --artifact models/${MODEL_VERSION}/model.tar.gz \
            --eval-set data/eval/current_distribution.jsonl \
            --output eval_results/${MODEL_VERSION}.json
        '''
      }
    }

    stage('Eval Gate') {
      steps {
        script {
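          // readJSON is provided by the Pipeline Utility Steps plugin
          def results = readJSON file: "eval_results/${MODEL_VERSION}.json"
          if (results.accuracy < 0.88) {
            error("Eval gate failed: accuracy ${results.accuracy} < 0.88 threshold")
          }
          if (results.p95_latency_ms > 120) {
            error("Latency gate failed: p95 ${results.p95_latency_ms}ms > 120ms threshold")
          }
          // Third number the eval gate checks; the field name here is an assumption
          if (results.schema_compliance_rate < 1.0) {
            error("Schema gate failed: compliance ${results.schema_compliance_rate} < 1.0 threshold")
          }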
        }
      }
    }

    stage('Deploy Staging') {
      steps {
        sh '''
          kubectl set image deployment/model-server \
            model-server=registry.example.com/models:${MODEL_VERSION} \
            -n ${STAGING_NAMESPACE}
          kubectl rollout status deployment/model-server -n ${STAGING_NAMESPACE} --timeout=300s
        '''
      }
    }

    stage('Staging Traffic Gate') {
      steps {
        sh 'python eval/staging_smoke.py --namespace ${STAGING_NAMESPACE} --duration 300'
      }
    }

    stage('Promote to Production') {
      input {
        message 'Promote to production?'
        ok 'Promote'
      }
      steps {
        sh '''
          kubectl set image deployment/model-server \
            model-server=registry.example.com/models:${MODEL_VERSION} \
            -n ${PROD_NAMESPACE}
          kubectl rollout status deployment/model-server -n ${PROD_NAMESPACE} --timeout=300s
        '''
      }
    }
  }

  post {
    failure {
      sh 'python scripts/notify_slack.py --status failed --version ${MODEL_VERSION}'
    }
  }
}

This is illustrative -- the actual pipeline has more environment-specific branching and the staging smoke test is more involved than a single script. But the structure is accurate: pull artifact, validate schema, run offline eval, gate on eval scores, deploy to staging, gate on staging behavior, manual promotion approval.

What each gate actually checks

The schema validation stage is the cheapest gate and the one most often skipped in pipelines I've inherited from clients. It runs before any other gate, as soon as the artifact has been pulled. It checks that the model artifact exposes the expected input/output contract: field names, types, and the allowed range of output values. This catches the class of error where a model is retrained after a feature engineering change that silently altered the input schema: the deployment itself would have succeeded, but the feature engineering code in the serving layer was still built around the old schema.
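
A contract check of this kind can be sketched in a few lines. This is not the contents of validate_schema.py: the example contracts, the field names, and both function names are hypothetical, and a real check would load the contract from the artifact's metadata.

```python
# Hypothetical contracts: field name -> expected Python type.
INPUT_CONTRACT = {"text": str, "account_age_days": int}
OUTPUT_CONTRACT = {"label": str, "score": float}
SCORE_RANGE = (0.0, 1.0)  # allowed range of the output value


def contract_violations(record: dict, contract: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Extra fields usually mean the schema changed upstream without notice.
    for field in sorted(record.keys() - contract.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


def output_violations(output: dict) -> list[str]:
    """Type check plus the range check on the output value."""
    problems = contract_violations(output, OUTPUT_CONTRACT)
    score = output.get("score")
    low, high = SCORE_RANGE
    if isinstance(score, float) and not low <= score <= high:
        problems.append(f"score {score} outside [{low}, {high}]")
    return problems
```

Running both checks against a handful of canned requests takes well under a second, which is why this gate is worth placing ahead of the expensive eval run.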

The eval gate checks three numbers:

Accuracy on the current distribution eval set. The key word is "current." The eval set is regenerated monthly from the last 30 days of production requests -- sampled, deduplicated, and labeled. This is the direct response to the 2024 failure. I don't care that a model achieves 94% on a held-out set from six months ago. I care that it achieves at least 88% on requests that look like what production is actually seeing right now.

P95 latency under load. The eval script runs the model against the eval set with 20 concurrent requests (matching staging's concurrency configuration) and records the p95. The threshold is 120ms for the classification workload I run this against most often. For generation workloads, the threshold is different -- latency profiles are not comparable across task types, which is why hardcoded global latency thresholds in pipeline templates are almost always wrong.

Output schema compliance rate. Even if the model produces correct predictions, if it's returning malformed JSON or missing expected fields on 0.5% of requests, that compounds fast in downstream systems. The eval script checks every output against the expected schema and gates on 100% compliance. No exceptions, because "almost always valid" breaks exactly the request you can't afford to break.
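
The three numbers can be derived from per-request eval records along these lines. This is a sketch under stated assumptions, not run_eval.py itself: the record layout, the function names, and the nearest-rank percentile definition are illustrative choices, while the thresholds are the ones quoted above.

```python
ACCURACY_MIN = 0.88        # accuracy on the current-distribution eval set
P95_LATENCY_MAX_MS = 120   # threshold for the classification workload
COMPLIANCE_MIN = 1.0       # 100% of outputs must match the schema


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def summarize(records: list[dict]) -> dict:
    """Each record is assumed to look like
    {"correct": bool, "latency_ms": float, "schema_ok": bool}."""
    if not records:
        raise ValueError("eval produced no records")
    n = len(records)
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "p95_latency_ms": p95([r["latency_ms"] for r in records]),
        "schema_compliance": sum(r["schema_ok"] for r in records) / n,
    }


def gate_failures(summary: dict) -> list[str]:
    """Return every failed gate so one run reports all problems at once."""
    failures = []
    if summary["accuracy"] < ACCURACY_MIN:
        failures.append(f"accuracy {summary['accuracy']:.3f} < {ACCURACY_MIN}")
    if summary["p95_latency_ms"] > P95_LATENCY_MAX_MS:
        failures.append(f"p95 {summary['p95_latency_ms']}ms > {P95_LATENCY_MAX_MS}ms")
    if summary["schema_compliance"] < COMPLIANCE_MIN:
        failures.append(f"schema compliance {summary['schema_compliance']:.4f} < {COMPLIANCE_MIN}")
    return failures
```

Collecting all failures instead of stopping at the first one is a design choice: a model that misses both accuracy and latency tells a different story than one that misses only latency.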

The staging traffic gate is a 5-minute live traffic window driven by the staging_smoke.py script. It routes a 5% sample of real production traffic to the staging instance, which requires weighted routing at the ingress or service-mesh layer, since a plain Kubernetes Service has no weight field of its own. The script collects 300 seconds of latency and error rate data and compares them against the production baseline. If staging p95 latency is more than 15% above production baseline, or error rate exceeds 0.1%, the gate fails and the pipeline halts before the manual promotion prompt appears.
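
The comparison at the end of the window reduces to two checks. Here is a minimal sketch of that decision, with the 15% and 0.1% limits taken from the description above; the function name, its parameters, and the choice to fail on an empty window are assumptions, not the logic in staging_smoke.py.

```python
def staging_gate_failures(
    staging_p95_ms: float,
    prod_p95_ms: float,
    staging_errors: int,
    staging_requests: int,
    latency_tolerance: float = 0.15,  # allow up to 15% above production p95
    max_error_rate: float = 0.001,    # 0.1%
) -> list[str]:
    """Return the reasons the staging window fails; empty list means pass."""
    if staging_requests == 0:
        # Assumption: an empty window is treated as a failure, not a pass.
        return ["no staging traffic observed during the window"]

    failures = []
    latency_limit = prod_p95_ms * (1 + latency_tolerance)
    if staging_p95_ms > latency_limit:
        failures.append(
            f"p95 {staging_p95_ms:.1f}ms exceeds limit {latency_limit:.1f}ms"
        )

    error_rate = staging_errors / staging_requests
    if error_rate > max_error_rate:
        failures.append(
            f"error rate {error_rate:.4%} exceeds {max_error_rate:.4%}"
        )
    return failures
```

Note that at a 5% split, a quiet five minutes can produce very few requests, so a minimum request count would be a sensible addition before trusting either number.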

The manual promotion decision

I kept the manual promotion step intentionally. There are arguments for fully automating this -- if all gates pass, promote automatically, reduce the deployment cycle from hours to minutes. I'm not convinced that's right for model deployments in the workloads I handle.

The reason: a model can pass every automated gate and still be wrong to promote on a given day. If there's a known data quality issue in the labeling pipeline from the past week, the eval set may be compromised even though the model scores look fine. If a major product change launched yesterday, the traffic distribution is in flux and staging metrics from a 5-minute window aren't representative. These are human judgment calls. The pipeline makes them cheap to act on by presenting the eval numbers clearly, but the decision to proceed remains mine (or the client team's).

This will probably change once I have a better distribution shift detector in place. The current setup monitors for distribution shift monthly when the eval set is regenerated -- it compares the embedding distribution of the new eval set against the previous month using a simple Maximum Mean Discrepancy test. If MMD exceeds a threshold, a Slack alert fires before the eval run so the team knows the distribution has shifted before looking at accuracy numbers. That's still not real-time, and it doesn't integrate with the promotion decision automatically. Getting that integration right is on the Pipeshift roadmap.
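
For readers who haven't used MMD, the monthly comparison can be sketched as follows. This is a minimal pure-Python version of the biased squared-MMD estimate with an RBF kernel; the kernel choice, the `gamma` default, the `MMD_THRESHOLD` value, and the function names are assumptions for illustration, and a real threshold needs calibrating against month-to-month noise on your own embeddings.

```python
import math

MMD_THRESHOLD = 0.05  # hypothetical value; calibrate on historical months


def rbf_kernel(x: list[float], y: list[float], gamma: float) -> float:
    """exp(-gamma * ||x - y||^2) for two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs: list[list[float]], ys: list[list[float]], gamma: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(
    previous: list[list[float]], current: list[list[float]], gamma: float = 1.0
) -> float:
    """Biased estimate of squared MMD between two embedding samples.

    Near zero when both samples come from the same distribution,
    larger as the distributions move apart.
    """
    return (
        mean_kernel(previous, previous, gamma)
        + mean_kernel(current, current, gamma)
        - 2 * mean_kernel(previous, current, gamma)
    )


def shift_detected(previous: list[list[float]], current: list[list[float]]) -> bool:
    """True when the estimate crosses the (assumed) alert threshold."""
    return mmd_squared(previous, current) > MMD_THRESHOLD
```

The pairwise loops are quadratic in sample size, which is fine for a monthly job on a few thousand sampled embeddings; a vectorized version would be the obvious next step for anything larger.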

What's still manual and why I haven't fixed it

Eval set curation. The monthly eval set regeneration involves a labeling step that's still partly manual. I have a weak label generator that produces preliminary labels from production ground truth signals (downstream user actions, correction signals from the application layer), but a sample goes to human review every month. Automating this fully is possible -- Hamel Husain's writing on LLM-based eval generation is worth reading here -- but I haven't done it because the monthly cadence is acceptable for the workloads I run and because getting the label quality wrong silently is worse than running human review on a sample. I might be wrong about that tradeoff.

Rollback. The current rollback process is to run kubectl set image with the previous tagged version, a manual CLI command someone has to execute. It's fast (under 60 seconds for the rollouts I've dealt with) but it's not codified in the pipeline. A rollback stage that can be triggered from the Jenkins UI with a version selector is on the list.
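
Until that stage exists, the manual step can at least be made less error-prone by deriving the command from the promotion history. A minimal sketch, assuming the history is an ordered list of promoted versions in which each version appears once; both function names are hypothetical, and the deployment, image, and namespace names are the ones from the Jenkinsfile above.

```python
def previous_version(promoted: list[str], current: str) -> str:
    """Return the version promoted immediately before `current`."""
    idx = promoted.index(current)  # raises ValueError if never promoted
    if idx == 0:
        raise ValueError("no earlier promoted version to roll back to")
    return promoted[idx - 1]


def rollback_command(version: str, namespace: str = "ml-prod") -> list[str]:
    """Build the kubectl argv that points the deployment at `version`."""
    return [
        "kubectl", "set", "image", "deployment/model-server",
        f"model-server=registry.example.com/models:{version}",
        "-n", namespace,
    ]
```

The returned list can be passed straight to `subprocess.run(cmd, check=True)`, which keeps the version lookup testable without touching a cluster.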

Cross-model dependency tracking. Several production systems I work with use a pipeline of models -- an extraction model feeds a classification model feeds a ranker. The CI/CD pipeline covers each model independently. If model A is promoted and causes a shift in the distribution that model B receives, model B's eval set doesn't capture that until the next monthly refresh. This is the next failure mode I'm trying to get ahead of. I don't have a good answer for it yet.

How this maps to Pipeshift

The pipeline structure above is what Pipeshift is designed to generate and manage for teams that don't want to build it from scratch in Groovy. Pipeshift is not a substitute for understanding the gates: the threshold values, the eval set maintenance cadence, and the decision about when manual promotion is right all require domain knowledge about the specific model and workload. What Pipeshift handles is the scaffolding: the pipeline structure, the artifact versioning integration, the gate runner, and the staging/production promotion flow.

I'm disclosing the founder relationship because the line between "here's what I built" and "here's what Pipeshift does" is not perfectly clean. The architecture described here predates Pipeshift, but building Pipeshift is also how I stress-tested the architecture and found the gaps.

The actual lesson from the 2024 failure

The 94%-to-71% production accuracy drop I described at the top was not a model quality problem. The model was doing exactly what it was trained to do. It was an eval infrastructure problem: the evaluation set was measuring performance on a distribution the production system had evolved past, and nothing in the pipeline checked whether the eval set itself remained valid.

The fix was not a better model. It was a monthly eval set refresh, a distribution shift detector on the eval set input, and a check that explicitly compares the new eval set's distribution against the previous month's before anyone looks at accuracy numbers. In that order.

If your ML pipeline CI/CD is sophisticated on model training and naive on eval infrastructure, this is the failure mode you're setting up for. The accuracy number means nothing without knowing what distribution it was measured on.

The eval gate design described here -- specifically the distribution-aware eval set refresh -- is something I help teams implement as part of ML infrastructure consulting engagements. If you're debugging a similar production accuracy gap or building this for the first time, get in touch.