What I'd Do Differently If I Started Pipeshift Today
I am the founder of Pipeshift -- CI/CD migration intelligence for Jenkins-to-GitHub Actions moves.
Writing
Practical patterns, trade-offs, and lessons learned from real systems across OCI, AWS, and Azure — plus DevOps and agentic AI engineering.
The popular claim is that synthetic data closes the gap between a capable base model and a task-specific one -- no labeling budget, no data collection headaches, just generate-and-fine-tune. That claim is half right.
The system had been running fine for three weeks. Queries returning in under 200ms, recall looking reasonable in our offline evals, nothing alarming in the logs.
Most engineering teams that bring me in for AI consulting have the same latent problem: one or two people understand how LLMs actually work, the rest have absorbed enough LinkedIn content to be confused, and management has already promised.
Most RAG evaluation frameworks I encounter measure the wrong thing with high precision.
Let me be clear about where I am coming from before I say anything critical: I build agents. I am the founder of [Pipeshift](https://mohakdeepsingh.dev/products), which uses a multi-agent LangGraph pipeline under the hood.
The pitch for model routing is simple: not every task needs GPT-4 or Claude Sonnet.
The B2B SaaS context changes almost every architectural decision in an LLM system. You are not building a product with one corpus and one user population.
I want to be upfront about what this post is and isn't. It's not a comprehensive market survey.
There is a version of this post where I describe a clean prompt management architecture: a dedicated prompt store, semantic versioning, A/B deployment, automatic rollback on degradation, beautiful dashboards. That system exists.
I am the founder of Pipeshift, so everything in this post is written with that bias on the table. This is not an objective analysis of ML CI/CD tooling in general.
I use Claude daily. I have built parts of Pipeshift's pipeline tooling on top of the Claude API -- I am the founder of Pipeshift, so that context matters when you weigh anything I say here.
The API pricing page is not your cost. That number -- $15 per million output tokens, $3 per million input tokens, whatever it is this week -- is the floor.
I want to be specific about what happened, because the generic version of this story -- "AI agent did unexpected things" -- gets told without enough detail to be useful.
Every engagement starts the same way. The client has a use case -- a domain-specific assistant, a document QA system, a support bot that needs to sound like the company -- and someone on their team has already formed an opinion.
I did not set out to reduce my OpenAI usage. I set out to understand what I was actually paying for, and the answer surprised me enough that three workload categories are now running on something else.
Every tutorial for deploying ML models on Kubernetes follows the same path: create a Deployment, set up a Service, maybe wire in an HPA on CPU utilization, call it done. That path is fine for getting something running in an afternoon.
The honest version of "why I built my own eval harness instead of using an off-the-shelf tool" is not ideological. I did not build it because I think NIH is virtuous or because I distrust existing tools.
Every few months a model provider announces a larger context window as if it were a straightforward quality improvement. 200k tokens. 1M tokens. And yes, for some workloads those numbers matter.
At 2am on a Tuesday, one of my early agent systems was stuck in a loop -- tool call, failed parse, retry, failed parse, retry -- and the retries were not bounded. It had been running for forty minutes.
The term "MLOps" started as a reasonable shorthand for "operational practices applied to machine learning systems." Somewhere between 2021 and now it became a vendor category, a conference track, a job title prefix, and a bucket of platform marketing.
I got a Slack message at 3:14am on a Tuesday. The client -- a B2B SaaS company, anonymized here -- had deployed a RAG-powered internal knowledge assistant I had built for them about six weeks earlier.
The standard advice on RAG evaluation is to "use RAGAS and check your metrics." That advice is not wrong.
The demo looks like this: user types a question in natural language, a friendly response appears that cites specific records, the founder calls it "AI-powered." Investors nod. Engineers ship it.
The vector database benchmark posts I keep finding online share one characteristic: they were run by the vendors, on hardware the vendors control, against query distributions that favor their product. I don't find them useful.
Full disclosure: I am the founder of Pipeshift. Everything I write about Pipeshift is written from that position.
Every team I review thinks they built a different system. The product names differ, the domains differ, the models differ. The failure modes are nearly identical.
Full disclosure: I'm building [Pipeshift](https://mohakdeepsingh.dev/products), a tool for managing ML pipeline deployments. The architecture I describe here is the direct predecessor to what Pipeshift automates.
The answer to "what chunk size should I use?" is always "it depends," and I've always found that answer useless. It depends on *what*, specifically, is the question. I ran a benchmark to find out.
There was a specific moment. I was three hours into debugging a production retrieval failure, staring at a traceback that ran through six layers of LangChain internals before surfacing anything I could act on.
The Kubernetes provisioning runbook was a bash script that had grown to several hundred lines with conditional logic for three environments across two clouds. That is the point where you rewrite it in Go. Here is what that decision cost and what it gave back.
Most enterprise AI tooling rollouts are underprepared for the governance questions. Here is the framework I use: RBAC tiers, MCP server configurations, Keycloak SSO integration, and LLM acceptance criteria for engineering workflows.
The engagement did not start with building the OCI landing zone. It started with figuring out what actually existed in the tenancy. A week of terraform import, in dependency order, and what that methodology looked like.
Fixed-size chunking is the default for RAG tutorials and the wrong choice for hierarchically structured documents. The section-level strategy I built for a production Oracle HLD generator -- and where it still falls short.
Operational excellence, security, reliability, performance, cost, sustainability — applied to AI agents that actually survive production. The rubric most teams skip.
A practical account of migrating production CI/CD from Jenkins to GitHub Actions — the decisions, the tradeoffs, and the patterns that didn't survive the translation.
After running production Kubernetes on both OKE and EKS, here's where OCI wins, where AWS wins, and where the answer is genuinely 'it depends' for real reasons.
Remote state, environment isolation, and the guardrails that prevent your multi-cloud IaC from becoming a liability. What I've learned running OCI, AWS, and Azure in parallel.
What actually breaks when you take a LangGraph agent pipeline off your laptop and run it on OCI Functions — and the patterns that survive contact with production.
The actual levers for reducing Kubernetes spend without regressing reliability — with numbers from real clusters on OKE, EKS, and AKS.