Skip to content
Back to Blog
ml-pipeline-dag-checkpointing

Building the CI/CD Engine Inside Pipeshift: What the Architecture Looks Like and What I Got Wrong

I am the founder of Pipeshift, so everything in this post is written with that bias on the table. This is not an objective analysis of ML CI/CD tooling in general.

I am the founder of Pipeshift, so everything in this post is written with that bias on the table. This is not an objective analysis of ML CI/CD tooling in general. It is the specific decisions I made, the constraints that shaped them, and the places where the design is still wrong.

The short version of what Pipeshift does: it gives ML teams a CI/CD pipeline for model delivery -- the same discipline that software teams have had for 15 years, applied to the model-training-to-production lifecycle. A model change triggers a pipeline. The pipeline runs eval gates. If the gates pass, the model is promoted to staging, then production. If they fail, nothing moves. If something breaks post-deployment, rollback is one command and it goes back to the last known-good artifact.

The hard part of building that is not the pipeline execution itself. The hard part is the atomicity problem: model artifact, eval results, and deployment state are three separate systems, and making a deployment operation that touches all three feel atomic -- while making it actually recoverable when it is not -- took longer than anything else in the codebase.

The execution model: DAG with per-node checkpoints

Pipeshift pipelines are directed acyclic graphs. Each node is a step: data validation, training run, evaluation, packaging, deployment. Edges are explicit dependencies. The graph is defined in a YAML configuration that the pipeline engine ingests and validates at submission time.

I looked at using an existing workflow engine -- Prefect, Temporal, Apache Airflow -- and decided against all of them. The reasons were specific to the ML case.

Airflow's scheduler model was built around scheduled batch jobs. Running it for event-triggered model pipelines that can last anywhere from 10 minutes to 6 hours means carrying a lot of infrastructure that works against the use case. The DAG authoring model is also Python-native, which means the pipeline definition and the execution environment are coupled; that is a problem when the training step runs in a GPU container that is completely different from the environment where the scheduler runs.

Prefect and Temporal are both closer to the right shape. My objection to both was the same: they are general-purpose workflow orchestrators, and the ML pipeline case has specific concerns -- model artifact lineage, eval result storage, promotion gating -- that sit outside the workflow layer and require custom integration regardless. If I am building custom integration anyway, building the execution engine to fit exactly the model I need is justifiable.

The execution model I built is straightforward. Each pipeline run gets a unique run ID. The DAG is serialised and stored against that run ID at submission time. Each node transition is checkpointed: start timestamp, input artifact references, output artifact references, exit status. The checkpoint is written to persistent storage before the next node begins.

The checkpointing serves two purposes. First, reruns after partial failure resume from the last successful node rather than from the beginning. A training run that takes 4 hours should not have to repeat if a downstream packaging step fails. Second, the checkpoint record is the audit trail. If an eval gate fails two weeks after a deployment and someone asks what was in the model that passed -- what data version, what hyperparameters, what eval result -- the checkpoint record has the complete lineage.

The current implementation stores checkpoints in PostgreSQL. Each checkpoint row has: run ID, node ID, status (pending / running / succeeded / failed / skipped), started_at, finished_at, input_refs (JSONB array of artifact identifiers), output_refs (JSONB array), and a metadata JSONB column for node-specific structured output.

I might be wrong about PostgreSQL being the right store long-term. At higher pipeline volumes -- more teams, more concurrent runs -- the write pattern on checkpoint rows could become a bottleneck. The obvious candidates for replacement are a dedicated event store or a time-series database. I have not hit that wall yet. The current query patterns are reads by run ID and writes per node transition, which PostgreSQL handles without issue at current load.

Artifact versioning

Every object that crosses a node boundary in Pipeshift is a versioned artifact. A training dataset is an artifact. A trained model weight file is an artifact. A packaged model container image is an artifact. An eval result bundle is an artifact.

The artifact versioning scheme is content-addressed. Each artifact has an identifier that is a hash of its content plus its declared type. The same training dataset ingested twice produces the same artifact ID. This matters for pipeline reruns: if you resubmit a pipeline and the input artifact has not changed, the execution engine can skip producing that artifact again and use the cached version.

Artifact metadata is stored separately from artifact content. Content goes to object storage (currently OCI Object Storage; there is an S3 adapter for teams on AWS). Metadata -- type, size, content hash, provenance (which pipeline run produced it, from which input artifacts), created_at -- is stored in PostgreSQL alongside the checkpoint records.

The lineage graph you can reconstruct from this: given any artifact, you can walk backwards through its provenance chain to find every upstream artifact and every pipeline run that produced it. Given a model in production, you can answer: what dataset version trained it, what code revision built it, what eval results cleared it for promotion. That reconstruction is a query, not a manual investigation.

The thing I got wrong in the initial design: artifact type was a free-text string. I had types like model/pytorch, dataset/parquet, eval/json, but they were not enforced by a schema. The first time a team ingested a dataset artifact with the wrong type declaration, the downstream eval node received an artifact it could not interpret, and the error surface was confusing -- the node failed with a deserialization error rather than a type mismatch. I added a type registry with explicit schemas per artifact type. Artifacts that do not match their declared type schema are rejected at ingestion time rather than at consumption time. This sounds obvious in retrospect. I should have done it first.

Eval gates: trigger, execution, output

An eval gate is a node type with specific semantics. It receives a model artifact, runs a defined evaluation suite against it, and produces a structured evaluation result artifact. The gate has a pass/fail verdict configured by threshold rules. If the verdict is fail, the pipeline stops at that node. No downstream nodes run. The model does not move.

The trigger for an eval gate is automatic: any pipeline node of type eval runs when its upstream node completes successfully. There is no separate trigger mechanism. This is a deliberate design constraint -- eval gates that can be bypassed in practice are not eval gates.

What an eval suite looks like in Pipeshift: a YAML block on the eval node defines the evaluation to run. It specifies the evaluation framework (currently supports running a Python evaluation script in a container, with Eleuther LM Evaluation Harness integration in progress), the dataset to evaluate against (artifact reference), and the metrics to collect. The threshold rules map metric names to pass conditions.

nodes:
  - id: eval-accuracy
    type: eval
    depends_on: [package-model]
    eval:
      runner: python
      image: pipeshift/eval-runner:0.4.1
      script: evals/accuracy_suite.py
      dataset: ${artifact:validation-set:latest}
      metrics:
        - accuracy
        - f1_macro
        - latency_p99_ms
      thresholds:
        accuracy: ">= 0.87"
        f1_macro: ">= 0.83"
        latency_p99_ms: "<= 240"

The eval runner container receives the model artifact reference and the dataset artifact reference via environment variables, runs the evaluation script, and writes a structured JSON result to a known path. The pipeline engine picks up that JSON, validates it against the metrics schema, evaluates the threshold rules, and writes the eval result as an artifact -- type eval/result -- to object storage and the metadata database.

The structured result is what makes the gate useful beyond a binary pass/fail. If accuracy passes but latency_p99_ms fails, the verdict is fail, but the result artifact carries the full metric set. The reviewer sees exactly which threshold caused the failure and by how much. That specificity matters for the loop where teams iterate on model changes to make a failing gate pass.

I am not satisfied with the eval runner execution model. Currently, the eval runner is a container that the pipeline engine launches synchronously -- the pipeline node waits for the container to exit. For evaluation jobs that take 30-90 minutes on large validation sets, this means the pipeline node is held open for the duration. This works but it is architecturally unclean. The right model is async: the eval node submits the job, records the job ID, and polls or waits for a callback. I have not built this yet because it requires a job management layer the current architecture does not have.

The rollback mechanism

Pipeshift's rollback is a first-class operation with its own API endpoint and CLI command. The semantics: given a deployment target (a named environment -- staging, production), roll back to the previous successfully-deployed model version.

What rollback actually does under the hood is a deployment of the previous artifact set, not an undo of the current deployment. The distinction matters. An undo would attempt to reverse whatever operations the current deployment performed. A deployment of a previous artifact set uses the same deployment path as a forward deployment but with a pinned artifact reference pointing at an older version. This is safer because the deployment code path has been tested; the undo code path is a different execution that may have untested edge cases.

The rollback target -- the "previous successfully-deployed version" -- is determined from the deployment history record. Each deployment writes a record: environment, model artifact ID, eval result artifact ID, deployed_at, deployed_by, status. The rollback query finds the most recent deployment to the target environment with status succeeded that is not the current deployment, and promotes that artifact set.

The atomicity problem I mentioned in the opener shows up most clearly in rollback. A deployment touches three things: it updates the artifact serving reference (the pointer that says "this is the model serving traffic in environment X"), it records the deployment event, and it updates the eval gate status for the target environment (production is now running a model with these eval results). If any of those three writes fails mid-way, the state is inconsistent: the serving reference says one thing, the deployment record says another, the eval status reflects a model that is or is not actually running.

My current approach to this is a distributed saga pattern. Each of the three writes is its own compensating transaction pair: a forward operation and a compensating rollback operation. The saga orchestrator runs the forward operations in sequence and, on any failure, runs the compensating operations for everything that has already succeeded. The compensating operations themselves can fail, which is the part that still gives me trouble.

In practice: compensating operations for the serving reference update can fail if the serving infrastructure is in a bad state. The deployment record write to PostgreSQL is unlikely to fail independently of a broader database issue. The eval status update is a metadata write with no external dependencies. The highest-risk compensating operation is the serving reference revert. I handle this with a pre-write snapshot: before the forward serving reference update, the current state is captured. The compensating operation writes that snapshot back. If the compensating operation also fails, the saga records an inconsistent state rather than silently returning success, and an alert fires.

I have had three deployments land in inconsistent state in the history of the product. Each required manual resolution. I am not happy with three, but I am also not going to pretend the saga compensating-operation problem is solved, because it is not. The theoretical solution is to run everything through a two-phase commit protocol. The practical problem with that is the serving infrastructure -- whatever is behind the artifact serving reference, whether it is a Kubernetes deployment or an OCI Functions revision -- is not a database with 2PC support. You are doing distributed coordination across systems with different transactional semantics, and there is no clean answer.

What is still wrong

The first thing that is wrong is the eval result artifact not being a first-class citizen of the promotion decision. Currently, a human reviews the eval result and manually approves or denies the promotion from staging to production. The eval thresholds gate the pipeline automatically, but the staging-to-production promotion is still a manual click. Automating that promotion based on eval result thresholds is the most frequently-requested feature and the one I am most careful about getting right, because adding automatic promotion to production raises the consequence of a misconfigured threshold from "a pipeline passed that shouldn't have" to "a bad model went to production automatically."

The second thing that is wrong is artifact storage costs at scale. Content-addressed storage means you never store the same content twice, which is good for storage efficiency on small artifact sets. For large model weight files that differ by a single finetuning run -- say, a 7B parameter model at 14 GB, retrained weekly -- the content-addressed approach stores every version in full because even a tiny weight change produces a different hash. Delta storage for model artifacts is the right answer here; the implementation is not trivial for binary blobs, and I have not done it.

The third is the eval runner synchronous execution model I described above. At a pipeline volume of a few dozen runs per day, the synchronous model is fine. At higher volumes with long-running evaluations, it will become a constraint.

I am building Pipeshift in public -- this post is part of that. The architecture will look different a year from now, and some of what I have described here will have been replaced. If you are building something adjacent or have hit the same atomicity problems in a different context, I am interested in the conversation.

Pipeshift is in early access. If the pipeline execution and eval gate problems here are problems you recognize from your own ML infrastructure, the Pipeshift waitlist is open. The code architecture referenced in this post is proprietary; the patterns are not.