MLOps Is a Sales Category. Good ML DevOps Is Just Engineering Discipline.

The term "MLOps" started as a reasonable shorthand for "operational practices applied to machine learning systems." Somewhere between 2021 and now it became a vendor category, a conference track, a job title prefix, and a bucket of platform marketing. When a word does that much work, it stops meaning anything precise.

My working position: most teams that think they need an MLOps platform need something much more boring -- reliable CI/CD, a good eval gate, and version-controlled model artifacts. The platform vendors have done an excellent job conflating "problems that exist at Uber's ML scale" with "problems your 40-person company has." They are not the same problem.

I'll try to be specific about why I think this, where I think I'm wrong, and when an MLOps platform is actually the right answer.

What "MLOps" is actually selling you

Go read the feature list of any major MLOps platform -- MLflow, Weights & Biases, SageMaker Pipelines, Vertex AI Pipelines, Kubeflow, Metaflow, take your pick. The features cluster into roughly five categories:

  1. Experiment tracking (log runs, parameters, metrics, visualize comparisons)
  2. Model registry (version models, track lineage, promote across stages)
  3. Pipeline orchestration (DAG-based training pipelines, dependencies, scheduling)
  4. Feature stores (shared feature definitions, point-in-time correct feature retrieval)
  5. Serving infrastructure (model endpoints, A/B routing, canary deployments)

These are real engineering problems. Every item on that list represents something that will eventually bite you if you don't address it. The question is whether the right solution is a platform that bundles all five, or whether most of those problems are already solved by the tools you're already running.

Experiment tracking is a logging problem. If you're logging metrics and parameters to a structured store -- even just a PostgreSQL table or a JSON file in S3 -- and you can query and compare runs, you have experiment tracking. MLflow adds a UI and a Python API around that. Useful, but not irreplaceable.
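To make the "structured store" point concrete, here is a minimal sketch using SQLite from the Python standard library as a stand-in for the PostgreSQL table mentioned above. The table layout and the function names (open_store, log_run, best_runs) are assumptions made for illustration, not any tool's API.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the run store. ':memory:' keeps the example self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT PRIMARY KEY, started_at REAL, params TEXT, metrics TEXT)"
    )
    return conn


def log_run(conn, run_id, params, metrics):
    """Record one training run; params and metrics are stored as JSON text."""
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?)",
        (run_id, time.time(), json.dumps(params), json.dumps(metrics)),
    )
    conn.commit()


def best_runs(conn, metric, limit=5):
    """Return (run_id, params, metrics) tuples sorted by one metric, best first."""
    rows = conn.execute("SELECT run_id, params, metrics FROM runs").fetchall()
    parsed = [(rid, json.loads(p), json.loads(m)) for rid, p, m in rows]
    parsed = [r for r in parsed if metric in r[2]]
    return sorted(parsed, key=lambda r: r[2][metric], reverse=True)[:limit]
```

Calling log_run at the end of each training job and best_runs when comparing candidates covers the "query and compare runs" requirement. What it does not give you is a UI, which is the part a dedicated tracker adds.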

Model registry is an artifact versioning problem. Git and a blob store (S3, OCI Object Storage, GCS) solve 80% of this. You need a naming convention, a promotion mechanism, and auditability. DVC does this well. A disciplined model-registry/ folder structure in an S3 bucket with consistent tagging does this adequately.

Pipeline orchestration is a DAG-scheduling problem. Airflow, Prefect, or just GitHub Actions jobs chained with needs: dependencies handle this. The ML-specific DAG platforms add data lineage and rerun semantics, which matter -- but whether they matter enough to justify adopting a new platform is a team-size question.
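For readers who want to see how small the core of "DAG scheduling" is, the sketch below uses graphlib.TopologicalSorter from the Python standard library (3.9+). The step names and the run_pipeline helper are hypothetical. Real orchestrators add retries, scheduling, and parallelism on top of this ordering logic.

```python
from graphlib import TopologicalSorter

# Each key is a step; the set lists the steps it depends on.
# Step names are illustrative only.
PIPELINE = {
    "prepare_data": set(),
    "train": {"prepare_data"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_pipeline(steps, graph):
    """Run each step's callable in dependency order and return the order used."""
    executed = []
    for name in TopologicalSorter(graph).static_order():
        steps[name]()  # an exception here stops every downstream step
        executed.append(name)
    return executed
```

A step that raises stops everything after it, which is the same behavior a failed upstream job gives you in a CI system.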

Feature stores are a genuine hard problem that most teams don't have. A feature store solves training-serving skew (the feature computation at training time differs from the computation at serving time) and point-in-time correctness for temporal features. These are real failure modes. They're also failure modes that don't surface meaningfully until you're building complex feature pipelines across multiple models with different freshness requirements. At 50 people, you probably don't have that problem yet.

Serving infrastructure is a Kubernetes + deployment tooling problem. If you can deploy a Docker container, you can serve a model. Blue-green and canary deployments are solved by Argo Rollouts or just by weighted Kubernetes services. The ML-specific serving platforms add GPU scheduling and batching optimizations that matter at inference scale -- but again, the question is your scale.

The pattern I actually see at 50-person companies

When I work with teams at the 30-to-70-person company size, the actual ML operational problems they have are almost never the ones MLOps platforms are optimized for.

The actual problems:

Models get retrained manually with no audit trail. Someone runs a training notebook, gets better numbers, copies the model weights to a server, and updates a config file. No record of what changed, what data was used, what the previous model's performance was. This isn't a platform problem -- it's a process problem. A Makefile that runs training with logged parameters and uploads artifacts to a versioned S3 path fixes this. A Jenkins job that does the same thing on a schedule, alerts on failure, and requires a human to approve promotion to production fixes this properly.

No eval gate before deployment. The model gets deployed when the training job finishes without any automated check that it meets the performance baseline on a held-out eval set. This is the single highest-leverage thing most teams can add. If the new model doesn't beat the previous model's F1 by at least X% on the eval set, the deployment does not proceed. This is not a platform feature -- it's a script and a CI check.
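A minimal sketch of such a gate, assuming binary classification, F1 as the metric, and a relative-improvement threshold. The function names and the 1% default are assumptions for illustration; the predictions would come from running both models on the same held-out eval set.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 computed from scratch so the script has no dependencies."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def passes_gate(candidate_f1, baseline_f1, min_relative_gain=0.01):
    """True only if the candidate beats the baseline by the required margin."""
    return candidate_f1 >= baseline_f1 * (1 + min_relative_gain)


def gate_exit_code(y_true, baseline_pred, candidate_pred, min_relative_gain=0.01):
    """Return 0 when the candidate may deploy, 1 otherwise (CI exit-code convention)."""
    baseline = f1_score(y_true, baseline_pred)
    candidate = f1_score(y_true, candidate_pred)
    print(f"baseline_f1={baseline:.4f} candidate_f1={candidate:.4f}")
    return 0 if passes_gate(candidate, baseline, min_relative_gain) else 1
```

In a CI job, the script would end with sys.exit(gate_exit_code(...)), so a non-zero exit fails the build and the deployment stage never runs.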

No reproducibility. The model that's running in production can't be reproduced because no one pinned the library versions, the training data snapshot, or the random seed. This is a dependency management and data versioning problem. DVC + a requirements.txt with pinned versions + a data snapshot in S3 with a documented hash solves it. It's tedious to set up and not glamorous, but it works.
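The "documented hash" and seed pinning can be as small as the sketch below. The manifest fields and function names are assumptions, and it only seeds Python's own random module; NumPy, PyTorch, or any other library with its own RNG would need separate seeding.

```python
import hashlib
import random
import sys


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data snapshots do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seed_everything(seed):
    """Seed the stdlib RNG. Other libraries with their own RNG are NOT covered."""
    random.seed(seed)


def build_manifest(data_path, seed, requirements_path=None):
    """Collect what is needed to rerun training: data hash, seed, interpreter version."""
    manifest = {
        "data_sha256": sha256_file(data_path),
        "seed": seed,
        "python": sys.version.split()[0],
    }
    if requirements_path is not None:
        # Hashing the pinned requirements file records which dependency set was used.
        manifest["requirements_sha256"] = sha256_file(requirements_path)
    return manifest
```

Writing the manifest next to the model artifact is what lets someone later confirm that a retraining run used the same data snapshot and dependency pins.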

Stale models in production. A model trained six months ago on data from eight months ago is serving live traffic, and no one knows because there's no monitoring. This is a metrics and alerting problem, not a platform problem. If you're already shipping application metrics to Datadog or Prometheus, model prediction distribution metrics can go there too.
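One way to turn "prediction distribution metrics" into a single number you can alert on is the population stability index (PSI) between a reference window of scores and the live window. PSI is a common choice rather than anything prescribed above, and the sketch assumes scores in [0, 1] and non-empty inputs. The bin count and the clamp value are arbitrary illustration values.

```python
import math


def histogram(scores, bins=10):
    """Fraction of scores per equal-width bin over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]


def psi(reference, live, bins=10, eps=1e-6):
    """Population stability index; 0.0 means the two binned distributions match."""
    ref = histogram(reference, bins)
    cur = histogram(live, bins)
    total = 0.0
    for r, c in zip(ref, cur):
        r = max(r, eps)  # clamp empty bins to avoid log(0) and division by zero
        c = max(c, eps)
        total += (c - r) * math.log(c / r)
    return total
```

The resulting value can be shipped as an ordinary gauge through whatever metrics client the application already uses, with an alert threshold tuned to the model.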

None of these problems require an MLOps platform. They require the same engineering discipline you'd apply to any software system -- version control, CI/CD, monitoring, runbooks.

What I actually use

For the ML pipeline work I do -- including the training pipeline work I've built for clients at Optivulnix (I'm a co-founder, so take that context with appropriate skepticism) -- I use Jenkins and GitHub Actions, not a dedicated MLOps platform.

The setup that handles the common case:

  • Training jobs run as Jenkins pipeline jobs triggered on a schedule or on data change. Parameters are logged to a structured JSON file that gets committed to the run's artifact directory in S3. Nothing exotic -- Jenkins has been doing parameterized job logging for a decade.

  • Eval gate is a Python script that runs at the end of every training job. It loads the newly trained model and the previously deployed model, runs both against a fixed eval dataset, and compares on the metrics that matter for that use case. If the new model doesn't clear the threshold, the Jenkins job marks itself failed and the deployment stage is skipped. I can't overstate how much behavioral drift this has caught before it reached production.

  • Model artifacts go to a versioned S3 or OCI Object Storage path with a naming convention: models/[]/[]-[]/. The "promoted" model is tracked by updating a current.json pointer file that the serving layer reads. Rollback is updating that pointer file back to the previous path.

  • Serving is a FastAPI container deployed on Kubernetes via a standard Helm chart. The same deployment process as any other service. I use Argo Rollouts for canary traffic splitting when the team needs it; when they don't, a standard Kubernetes rolling deployment is fine.
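The pointer-file promotion and rollback described in the list above can be sketched as follows. This version uses the local filesystem as a stand-in for S3 or OCI Object Storage, and the "previous" field inside current.json is an assumption added so that rollback knows where to go; the actual file layout is not specified here.

```python
import json
import os
import tempfile
from pathlib import Path


def read_pointer(pointer_path):
    """Return the pointer contents, or an empty state if no model was promoted yet."""
    p = Path(pointer_path)
    if not p.exists():
        return {"current": None, "previous": None}
    return json.loads(p.read_text())


def _write_atomic(pointer_path, payload):
    """Write to a temp file, then rename, so readers never see a half-written pointer."""
    p = Path(pointer_path)
    fd, tmp = tempfile.mkstemp(dir=p.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, p)


def promote(pointer_path, new_model_path):
    """Point 'current' at the new artifact path and remember the old one."""
    state = read_pointer(pointer_path)
    _write_atomic(
        pointer_path,
        {"current": new_model_path, "previous": state["current"]},
    )


def rollback(pointer_path):
    """Swap 'current' back to the previously promoted artifact path."""
    state = read_pointer(pointer_path)
    if state["previous"] is None:
        raise RuntimeError("no previous model to roll back to")
    _write_atomic(
        pointer_path,
        {"current": state["previous"], "previous": state["current"]},
    )
```

The serving layer only ever reads the "current" field, so promotion and rollback are both a single small write that is easy to audit.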

This setup has no branded platform. It uses tools the team already understands. The whole pipeline is inspectable in a way that proprietary pipeline DSLs often aren't. When something breaks at 2am, I want to read a Jenkins log and a Python traceback, not debug a platform-specific execution graph.

I should be honest about the limitations: this setup doesn't give me the feature store capabilities I'd need if I were building complex temporal features across multiple models. It doesn't give me the experiment comparison UI that benefits researchers running hundreds of HPO trials per day. Those are real gaps. My claim is that most teams never hit them, not that they don't exist.

Where the 500-person company problem is genuinely different

At some scale, the "boring CI/CD + eval gate" approach starts accumulating real costs.

When you have 20 ML engineers running experiments simultaneously, the lack of a proper experiment tracking UI creates coordination overhead. Shared Jupyter notebooks and ad-hoc S3 naming conventions don't scale to 20 people. The W&B or MLflow UI is genuinely valuable there -- the search and comparison features save hours that add up when the team is large enough.

When you have multiple models in production with overlapping feature pipelines, training-serving skew becomes an incident risk rather than a theoretical concern. A feature store -- Feast, Tecton, or a custom solution -- starts paying for itself.

When inference is at a scale where GPU utilization efficiency translates to meaningful monthly cost differences, the serving optimizations in TorchServe, Triton, or vendor-managed endpoints matter. At low request volume, the difference between a naive FastAPI endpoint and an optimized serving platform is irrelevant. At high request volume with GPU backends, batching and memory management can cut your inference cost by 40-60%.

The honest version of the "when is an MLOps platform worth it" answer is: when the specific problems the platform solves have become your actual bottlenecks, not when a salesperson has told you they will be.

I see teams at 40 people buying Vertex AI Pipelines because they read that their competitors use it, then spending three months building integrations and learning the platform's execution model, and ending up with a more complex system than what they replaced. The tool is fine. The timing was wrong.

The part I might be getting wrong

I'm aware that this position has a selection bias problem. The clients I work with tend to come to me after they've already tried the platform-first path and found it expensive or overcomplicated. I'm not seeing the cases where a team adopted an MLOps platform early, had a smooth experience, and shipped faster as a result. Those teams don't show up in my consulting pipeline because they don't need external help.

It's also possible that the "boring CI/CD" path I advocate looks simpler because I know it well. The learning curve I've already absorbed on Jenkins and GitHub Actions is not visible to me. A team that's already invested in Kubeflow or SageMaker might reasonably say the marginal cost of adding ML pipeline orchestration in a system they know is lower than adopting Jenkins patterns they don't.

I also want to be careful about the 2026 context. The tooling has improved. MLflow 2.x has substantially less friction than earlier versions. W&B has gotten easier to instrument. If you're starting from scratch today, the experiment tracking options have lower setup cost than they did when I formed most of these opinions.

The thing that actually matters

Whether you use Jenkins, GitHub Actions, MLflow, Kubeflow, or Vertex AI Pipelines, the thing that determines whether your ML system is operationally sound is whether you have an eval gate before deployment.

Every other operational practice -- experiment tracking, model versioning, pipeline orchestration, feature stores -- is downstream of the basic question: "Is this model better than the one it's replacing, and how do I know?" If you can't answer that question with a script and a logged number, the platform sitting on top is solving problems in the wrong order.

The teams I've seen ship reliable ML systems do not all use the same tooling. They all have a forcing function that requires a model to clear a performance bar before it touches production. That's it. The bar can be a Jenkins build check. It can be a Kubeflow pipeline condition. It can be a GitHub Actions step. The mechanism is secondary.

Get the eval gate right first. Then argue about the platform.

My work with ML pipeline infrastructure at Optivulnix (where I'm a co-founder) and across client engagements is where most of these opinions were formed. Hamel Husain's writing on eval-driven ML development influenced a lot of my thinking here -- if you haven't read his newsletter, it's worth your time.