Claude for Code: What Works and What Costs Me Time

I use Claude daily. I have built parts of Pipeshift's pipeline tooling on top of the Claude API -- I am the founder of Pipeshift, so that context matters when you weigh anything I say here. I also use Claude Code as a coding assistant on my own machine for consulting work. I am not going to write a marketing post about it. I am going to describe what it is genuinely good at, what it is bad at, and what prompting patterns I have landed on after enough months of use that I can say something with reasonable confidence.

The short version: Claude is the best tool I have found for certain narrow categories of software development work, and a genuinely poor tool for others. The gap between the two is larger than the marketing suggests. If you are using it as a general-purpose coding assistant without distinguishing between those categories, you are getting mediocre results on the hard cases and good results on the easy ones, and averaging them together into "it's pretty good."

Where it is genuinely good

Architectural discussion and tradeoff analysis. This is where I get the most value. When I am trying to decide between two design approaches -- say, whether a pipeline orchestration layer should be event-driven or polling-based for a specific latency target -- Claude is a useful thinking partner. It knows the relevant tradeoffs. It will push back when I describe a design that has a known failure mode. It has read more architecture discussion than I have.

The caveat: it knows the general patterns, not my specific constraints. I have to supply the specifics. "Given we're on OCI, our p99 latency budget for pipeline state transitions is 200ms, and our event volume is around 50k/day with burst to 500k, should we use Kafka or a polling loop against Postgres?" gets a useful answer. "What's better, Kafka or polling?" does not. The model is only as good as the context you give it.
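To make the numbers in that prompt concrete, here is a back-of-the-envelope sketch. It assumes "burst to 500k" means a sustained rate equivalent to 500k events per day, which the prompt itself does not pin down, and that a polling loop would need an interval no longer than the 200ms latency budget. The helper names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86_400


def events_per_second(events_per_day: int) -> float:
    """Average event rate if the daily volume were spread evenly."""
    return events_per_day / SECONDS_PER_DAY


def polls_per_day(interval_ms: int) -> int:
    """How many polls a fixed-interval loop issues in one day."""
    return SECONDS_PER_DAY * 1000 // interval_ms


baseline_rate = events_per_second(50_000)    # roughly 0.58 events/s
burst_rate = events_per_second(500_000)      # roughly 5.79 events/s (assumed meaning of "burst")

# Assumption: polling can be no slower than the 200ms p99 budget.
daily_polls = polls_per_day(200)             # 432_000 polls/day
events_per_poll = 50_000 / daily_polls       # well under one event per poll at baseline

print(f"baseline {baseline_rate:.2f}/s, burst {burst_rate:.2f}/s")
print(f"{daily_polls} polls/day, {events_per_poll:.2f} events per poll at baseline")
```

Under these assumptions most polls at baseline come back empty, which is exactly the kind of concrete tradeoff the specific prompt lets the model reason about and the vague one does not.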

Test generation for code with clear contracts. Give Claude a function with a well-defined input/output contract and ask it to write pytest or Jest tests, including edge cases. It is good at this. The coverage it generates is often more complete than what I would write in the same time -- it reliably tests the null input, the empty collection, the off-by-one, the type coercion edge case.
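As an illustration of what "clear contract" means here, the sketch below uses a hypothetical `chunk` helper (not from any real codebase) and the kind of edge-case tests described above. The tests are plain functions that pytest would discover. They avoid importing pytest so the sketch runs with only the standard library; in a real suite `pytest.raises` would be the usual way to write the last two.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def chunk(items: Iterable[T], size: int) -> List[List[T]]:
    """Split items into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    buf = list(items)
    return [buf[i:i + size] for i in range(0, len(buf), size)]


def test_empty_collection():
    assert chunk([], 3) == []


def test_exact_multiple_has_no_trailing_empty_chunk():
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]


def test_off_by_one_remainder():
    assert chunk([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4], [5]]


def test_generator_input_is_accepted():
    assert chunk((n for n in range(3)), 2) == [[0, 1], [2]]


def test_non_positive_size_is_rejected():
    for bad_size in (0, -1):
        try:
            chunk([1, 2, 3], bad_size)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for size={bad_size}")


def test_none_input_is_rejected():
    try:
        chunk(None, 2)  # type: ignore[arg-type]
    except TypeError:
        return
    raise AssertionError("expected TypeError for None input")
```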

The failure mode here: functions with implicit dependencies or side effects. If a function reads from a config file, queries a database, or has behavior that changes based on environment state, Claude's tests tend to be structurally correct but miss the mocking structure you actually need. It writes what a test should look like rather than what a test needs to be for your specific setup.
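A minimal sketch of the mocking structure such a function needs, using only the standard library's `unittest.mock`. The `load_batch_size` function, the `BATCH_SIZE` variable, and the `pipeline.json` file name are all hypothetical; the point is that the test has to control both the environment and the file read, and which of those to patch depends on your setup.

```python
import json
import os
from unittest import mock


def load_batch_size(path: str = "pipeline.json") -> int:
    """Hypothetical function with two implicit dependencies: env and a file."""
    override = os.environ.get("BATCH_SIZE")
    if override is not None:
        return int(override)
    with open(path) as fh:
        return json.load(fh)["batch_size"]


def test_env_override_wins():
    # Patch the environment only for the duration of the test.
    with mock.patch.dict(os.environ, {"BATCH_SIZE": "64"}):
        assert load_batch_size() == 64


def test_falls_back_to_config_file():
    fake_file = mock.mock_open(read_data='{"batch_size": 16}')
    # clear=True removes any real BATCH_SIZE so the file path is exercised.
    with mock.patch.dict(os.environ, {}, clear=True), \
            mock.patch("builtins.open", fake_file):
        assert load_batch_size() == 16
    fake_file.assert_called_once_with("pipeline.json")
```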

Explaining code I did not write. When I am dropped into a codebase I do not know -- a client engagement, a library I need to extend -- and I need to understand what a non-obvious piece of code is doing, Claude is faster than reading documentation and more reliable than grep-and-guess. I paste the code with the surrounding context, ask what it does and what breaks if I change a specific line, and the explanation is usually accurate. Not always. But usually.

Boilerplate that follows an established pattern. If I show Claude two or three examples of a pattern and ask it to generate a fourth instance -- a new OCI Terraform module that follows the same structure as existing ones, a new gRPC service stub that matches the convention in the codebase -- it does this well. The key is showing the examples explicitly rather than describing the pattern abstractly. It pattern-matches well. It does not invent well.

Refactoring for clarity. "Here is a 200-line function. Pull out the parts that are doing different things and name them." Claude handles this competently for code that is not doing anything exotic. The output usually needs cleanup but the structural decomposition is sound.

Where it costs me time instead of saving it

Novel algorithms or non-standard implementations. When I need something that is not a known pattern -- a custom scheduling heuristic, a novel graph traversal, something at the intersection of two domains where there are few canonical examples -- Claude produces code that looks correct but is not. It synthesizes from patterns in its training data in a way that sounds right and fails on the actual constraints. I have spent longer debugging Claude-generated novel code than I would have spent writing the correct implementation from scratch. The failure mode is subtle: the code often works on the happy path and fails on edge cases that are specific to the problem domain.

Large codebase context. This is the fundamental limitation that no amount of prompt engineering resolves: Claude does not know my codebase. I can paste relevant sections, but the moment a bug involves three files interacting in a non-obvious way, or depends on a pattern established somewhere I did not think to include, the suggestions are plausible-sounding and wrong. Simon Willison writes about this accurately -- the model can only reason about what it can see, and for any real codebase there is always relevant context outside the context window.

I want to be specific about how this fails: Claude will confidently suggest that I check a function whose implementation I have shown it, as if the implementation it can see is the only one. When the bug is in an override three layers up the call stack that I did not include, the suggestion is not just wrong -- it is wrong in a way that directs my attention to the wrong place. That costs time.
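One shape this can take, sketched with hypothetical class names: the base implementation is what got pasted into the prompt, but an override further down the hierarchy is what actually runs. The `defining_class` helper is an illustrative way to check which class supplies a method before trusting a suggestion about it.

```python
class BaseStage:
    """The class whose source was pasted into the prompt."""

    def normalize(self, value: str) -> str:
        return value.strip().lower()

    def run(self, value: str) -> str:
        return self.normalize(value)


class TenantStage(BaseStage):
    """Intermediate layer; adds nothing relevant here."""


class LegacyTenantStage(TenantStage):
    """Defined in a file that was never shown to the model."""

    def normalize(self, value: str) -> str:
        return value.strip()  # keeps case: the real source of the bug


def defining_class(obj: object, method_name: str) -> type:
    """Return the first class in the MRO that defines method_name."""
    for cls in type(obj).__mro__:
        if method_name in vars(cls):
            return cls
    raise AttributeError(method_name)


stage = LegacyTenantStage()
result = stage.run("  Orders_V2 ")
owner = defining_class(stage, "normalize").__name__
print(result, owner)
```

Reading only `BaseStage`, the lowercase conversion looks guaranteed. The instance in production says otherwise.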

Anything requiring runtime knowledge. Version-specific behavior, environment-specific issues, what actually happens when you hit a particular API with malformed input in production -- Claude's knowledge is a frozen snapshot. The knowledge cutoff the model reports when I ask is August 2025. In practice, anything that changed or was first publicly documented after mid-2025 is either absent or wrong. For OCI SDK changes, Terraform provider updates, and Kubernetes 1.31+ behavior, I have to verify everything Claude says against current docs. That is fine for code where I know the domain. It is an invisible trap for code where I am relying on Claude because I do not know the domain.

Debugging when the symptom is far from the cause. Claude debugs well when the symptom is close to the cause. "This function returns None sometimes" -- fine. "The pipeline occasionally hangs after 30 minutes with no error in the logs" -- Claude generates hypotheses, some plausible, none reliably targeting the actual cause. For this category of problem I use Claude to brainstorm hypotheses and trust it to verify none of them, rather than using it as the primary debugging tool.

The prompting patterns that actually work

After months of use, here is what I do differently now versus when I started:

State constraints explicitly, always. "Using Python 3.12, FastAPI 0.110, running on OCI A1 Flex instances with 4 OCPUs, the constraint is that the solution cannot make blocking calls" is a better prompt than "how do I handle concurrency in FastAPI." The more specific the constraints, the more the answer is actually about my problem rather than the general case.
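One way to make this a habit is to keep the constraints in a small structure and render them into every prompt. This is a sketch of that idea, not a prescribed format; the `PromptConstraints` class and its field names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptConstraints:
    """Hypothetical container for the facts every prompt should carry."""
    runtime: str
    framework: str
    platform: str
    hard_rules: List[str] = field(default_factory=list)

    def render(self, question: str) -> str:
        lines = [
            f"Runtime: {self.runtime}",
            f"Framework: {self.framework}",
            f"Platform: {self.platform}",
        ]
        lines += [f"Constraint: {rule}" for rule in self.hard_rules]
        lines += ["", f"Question: {question}"]
        return "\n".join(lines)


constraints = PromptConstraints(
    runtime="Python 3.12",
    framework="FastAPI 0.110",
    platform="OCI A1 Flex, 4 OCPUs",
    hard_rules=["the solution cannot make blocking calls"],
)
prompt = constraints.render("How should this endpoint handle concurrency?")
print(prompt)
```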

Show examples, do not describe patterns. When I want Claude to follow a convention, I show it two or three existing examples from the codebase. "Write a new handler following this convention: [example 1] [example 2]" produces better results than "write a handler using our standard pattern" -- it does not know the pattern unless I show it.
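A minimal sketch of assembling such a prompt from real source snippets. The function name, the separator format, and the two-example minimum are my own assumptions for illustration; the handler snippets are placeholders standing in for code copied from the codebase.

```python
from typing import List


def build_convention_prompt(task: str, examples: List[str]) -> str:
    """Put concrete examples ahead of the request so the convention is shown, not described."""
    if len(examples) < 2:
        # Assumption: one example is too easy to over-fit to.
        raise ValueError("show at least two examples of the convention")
    parts = [f"{task} Follow the convention used in these existing examples."]
    for index, source in enumerate(examples, start=1):
        parts.append(f"--- example {index} ---\n{source.strip()}")
    parts.append("--- new code ---")
    return "\n\n".join(parts)


# Placeholder snippets; in practice these are pasted from the codebase.
handler_a = "def handle_created(event):\n    return audit('created', event)"
handler_b = "def handle_deleted(event):\n    return audit('deleted', event)"

prompt = build_convention_prompt(
    "Write a new handler for 'archived' events.", [handler_a, handler_b]
)
print(prompt)
```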

Ask for the failure modes first. Before asking Claude to write code for something I am uncertain about, I ask: "What are the failure modes of using approach X for requirement Y?" The failure mode discussion is where Claude is most useful because it draws on a wide training corpus of "things that went wrong." Then I decide whether to proceed with X at all, and what to test for.

Use it as a reviewer, not an author, for anything critical. For anything where correctness matters and the problem is non-trivial, I write the code myself and ask Claude to critique it. "Here is my implementation. What are the cases where this fails? What would you do differently?" is a better use of the tool for complex code than "write this implementation for me." It catches things I miss. It does not introduce things I did not intend.

Chunk the work small. "Build me a rate limiter with Redis that handles distributed writes and has a fallback to in-memory when Redis is unavailable" produces mediocre code. I get better results by starting with "Write a Redis-backed token bucket rate limiter. Here is the interface it needs to implement. Ignore fallback behavior for now." Then, once that is solid: "Now add a fallback path. Here is the constraint for the fallback." Smaller, concrete requests with explicit interfaces get better results than large, end-to-end requests.
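To show what "here is the interface" can look like, the sketch below defines a hypothetical `RateLimiter` protocol and a token bucket that satisfies it. The implementation is deliberately in-memory rather than Redis-backed so the example runs without a Redis server; it is closer to the fallback path than to the first request above, and the method name `allow` and its parameters are assumptions.

```python
import time
from typing import Callable, Dict, Protocol, Tuple


class RateLimiter(Protocol):
    """The explicit interface handed to the model with each small request."""

    def allow(self, key: str, cost: int = 1) -> bool:
        ...


class InMemoryTokenBucket:
    """In-memory stand-in; a Redis-backed class would expose the same method."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        # key -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, key: str, cost: int = 1) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(key, (float(self.capacity), now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[key] = (tokens, now)
        return allowed


def handle(limiter: RateLimiter, key: str) -> str:
    """Caller depends only on the protocol, not on the backing store."""
    return "ok" if limiter.allow(key) else "rate limited"
```

Because callers depend only on the protocol, the later "now add a fallback path" request can swap implementations without touching them.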

How I use it in Pipeshift work specifically

Since I build Pipeshift -- an ML pipeline observability tool -- on top of the Claude API in some places, I should be clear about the distinction. Using Claude as an infrastructure layer (API calls, structured output extraction, classification) is different from using Claude Code as a coding assistant. The former is about whether the model's output is reliable enough for a specific bounded task; the latter is about whether the model is a useful collaborator across an entire software development workflow.

For Pipeshift, I use the Claude API for pipeline pattern classification and for explaining detected anomalies in natural language. Both are bounded tasks where the output goes through a structured validation step before it reaches a user. I do not use it for tasks where the free-form output could silently be wrong without a validation layer.
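A generic sketch of what a structured validation step can look like for a classification task. This is not Pipeshift's actual code: the label set, the JSON shape, and the `parse_classification` name are all hypothetical, and the model call itself is left out so the example only covers the check applied to the raw response text.

```python
import json
from typing import Dict, Union

ALLOWED_LABELS = {"batch", "streaming", "hybrid"}  # hypothetical label set


class InvalidModelOutput(ValueError):
    """Raised when a model response fails validation and must not reach a user."""


def parse_classification(raw: str) -> Dict[str, Union[str, float]]:
    """Accept only a JSON object with a known label and a confidence in [0, 1]."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidModelOutput(f"response is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidModelOutput("expected a JSON object")

    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise InvalidModelOutput(f"unknown label: {label!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise InvalidModelOutput("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise InvalidModelOutput("confidence must be between 0 and 1")

    return {"label": label, "confidence": float(confidence)}
```

Anything that raises `InvalidModelOutput` can be retried or dropped, so a silently wrong free-form answer never becomes user-facing output.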

As a coding assistant on Pipeshift's own codebase, I run into the large-codebase-context problem as hard as anyone. Pipeshift's backend is not a large codebase by enterprise standards, but it is already complex enough that multi-file reasoning breaks down regularly. I use Claude Code in short, focused sessions on specific subsystems rather than as a persistent pair programmer that has full context of everything.

The honest position

Claude is the best LLM I have used for code, and "best LLM for code" is not the same as "good at code" for the full range of software development work. For the categories where it is genuinely strong -- architectural discussion, test generation for well-specified contracts, pattern-following boilerplate, code explanation -- it is meaningfully faster than my alternatives. For the categories where it fails -- novel algorithms, large codebase reasoning, runtime knowledge, deep debugging -- it is a trap that produces confident-sounding wrong answers.

The mistake I see most often, including in my own early use, is treating it as a uniform capability that you either trust or do not trust. The useful calibration is per-task: what kind of work is this, and is that a category where the model's failure modes are low-cost or high-cost for me right now?

I might be wrong about some of the failure categories being permanent limitations. Context windows are getting longer. RAG-based codebase indexing is getting better. Some of what I am describing as fundamental limitations might be limitations of the current generation rather than the category. But I am writing about what I observe today, not a prediction about the roadmap.

Some of the Claude API usage described here is within Pipeshift -- I am the founder. For consulting on AI tooling integration or evaluation, the case studies page has more context on what that work looks like.