Why I Stopped Using LangChain

There was a specific moment. I was three hours into debugging a production retrieval failure, staring at a traceback that ran through six layers of LangChain internals before surfacing anything I could act on. The actual error -- a malformed prompt template due to a breaking change introduced in 0.1.14 -- was invisible until I had stepped through the BaseChain invoke path, understood why the new LCEL format was silently swallowing my custom formatting logic, and realized the fix was to rip out the abstraction entirely and write the prompt construction myself. Three hours to discover that the framework had been doing something I didn't ask it to do, in a way I couldn't inspect without reading the source.

That was the moment I decided LangChain had become a liability rather than an asset. Not because the project is bad engineering. It is not. It is because the contract between the framework and the person using it had drifted far enough that I was paying a serious debugging tax for abstractions I did not need.

I want to be precise about what I mean and what I don't mean. This is not a complaint about LangChain being slow to adopt standards, or about Harrison Chase's team making wrong architectural decisions. The project has been enormously useful to a large number of people. What I am describing is a specific category of friction that became expensive enough in my work to justify a different approach.

What I was actually building

For context: I run agentic AI pipelines -- RAG systems, multi-step reasoning workflows, LLM-assisted code analysis -- for a mix of enterprise clients and for Pipeshift, the CI/CD migration intelligence tool I'm building. (I'm the founder; I'm not pretending to be a neutral observer there.)

The common thread across these systems is that they need to be debuggable, maintainable by someone other than me, and stable across model updates. That requirement set is where LangChain's design choices started working against me.

The version problem is worse than the changelog suggests

I started seriously using LangChain in the 0.0.x series in late 2023. By mid-2024, the project had gone through the 0.1.x restructuring that split the monolith into langchain-core, langchain-community, and langchain. By late 2024, 0.2.x introduced LCEL as the primary composition model, deprecating the old Chain classes but not removing them yet. By the time I was doing serious work with it, the question "which LangChain are you using" had become load-bearing.

The practical consequence: I maintained a RAG pipeline for a client through this period. The pipeline was built against 0.0.352. Moving to 0.1.x required understanding which of the original chain classes had been moved, renamed, or deprecated in langchain-community. Moving to 0.2.x with LCEL required rewriting the composition logic because the pipe operator behavior changed in ways that affected my custom retriever integration. None of these were catastrophic on their own. Each required a non-trivial PR, a round of regression testing, and -- most expensively -- the time to understand what the new version actually did differently.

I was not primarily working on LangChain compatibility. I was working on retrieval quality, prompt engineering, and production reliability. The version upgrade cost was pure overhead.

Abstraction leakage at the worst possible time

The thing LangChain does that I find genuinely hard to defend: it abstracts over LLM providers, retrieval systems, and prompt formatting at a layer that is opaque until something breaks, and then suddenly very transparent in ways that require you to understand the internals anyway.

The specific failure that ended my use of RetrievalQAChain in any new project: I needed to control exactly what context was injected into the prompt before the generation call. The chain abstracted this into a stuff_documents_chain that handled the formatting internally. When I needed to add a system prompt prefix that was incompatible with how the chain formatted the input, I had three options: override the internal template (possible but fragile), monkey-patch the chain (not a serious option), or abandon the chain and write the retrieval-then-generation flow explicitly.

I chose the third option. The explicit version was about 30 lines of Python. It was also immediately debuggable, easy to test, and not affected by subsequent LangChain version changes. The question I could not answer satisfactorily afterward was: what was the abstraction saving me in the first place?

For simple cases -- "given a query, retrieve and generate" -- the chain was saving maybe 20 lines of code. For anything non-standard, the abstraction was a net negative. Most production RAG systems I have worked on are non-standard in at least one significant way.

The debugging story is a function of abstraction depth

This is the point I want to make most carefully because I think it is the core of why LangChain works well for prototyping and creates friction in production.

When your LLM pipeline fails in production, the failure could be at any of several layers: the API call itself, the prompt formatting, the output parsing, the retrieval logic, the post-processing. In a well-structured explicit pipeline, the failure is localized to the layer it actually occurred in. In a LangChain chain, the failure propagates through the chain's internal invoke machinery before you see it, and the traceback includes LangChain internals that you now need to understand to interpret.

I hit this concretely with LLMChain and output parsers. A structured output parsing failure in 0.1.x surfaced as a OutputParserException that was caught, wrapped, and re-raised inside LLMChain._call, which then surfaced as a LangChainException with the original exception accessible only as e.__cause__. The actual parsing error was a missing key in the JSON the model returned. Finding that took significantly longer than it should have, because the stack trace pointed at LangChain plumbing rather than at the place in my code that constructed the prompt asking for the JSON structure.

None of this is unique to LangChain. Deep abstractions always increase debugging distance. The question is whether the value the abstraction provides justifies that cost. For prototyping, it does. For a production system I need to operate at 2am, it often does not.

What LangChain got right

I do not want to write a hit piece on a project that has contributed meaningfully to the ecosystem, so I want to be explicit about what it got right.

The community tooling and integrations are genuinely valuable. The breadth of provider integrations in langchain-community -- vector stores, LLM providers, document loaders -- saved real implementation time, and the interface standardization meant those integrations were swappable. The concept of a standard Retriever interface was a useful contribution to the way the field thinks about retrieval components.

LCEL, introduced properly in 0.2.x, is actually a reasonable composition model. The pipe operator syntax for building chains is readable and the lazy evaluation behavior makes streaming straightforward. If I were building a new simple RAG application today -- not a production system, a prototype -- I would consider using LCEL with langchain-core only, specifically avoiding the higher-level chain classes that do too much.

LangSmith is a good product. The tracing and evaluation tooling is useful and genuinely fills a gap. I use a custom structured logging approach instead (described in my post on LangGraph production patterns), but I would not dismiss LangSmith for teams that want managed observability without building their own.

What I use instead

The transition happened in stages. I did not rewrite everything at once.

For LLM API calls: direct calls to the provider SDK. OpenAI's Python SDK, Anthropic's Python SDK, the OCI Generative AI service client. These are stable, versioned, well-documented, and I control exactly what I send and receive.

For prompt management: plain Python string templates with f-strings for simple cases, Jinja2 templates for cases with conditional sections or iteration. No framework. The templates are in version control as standalone files. Changes to a template are visible in a diff.

For retrieval: I build the retrieval logic explicitly. A function that takes a query, calls the vector store's search API, applies any post-processing filters, and returns structured results. About 40-80 lines of Python depending on complexity. Easy to unit test.

For orchestration of multi-step pipelines: I use LangGraph where the task genuinely benefits from a state machine (I have written about this separately). For linear pipelines with no branching, plain Python function composition -- a list of callables, a loop, a results dict. The LangGraph/plain-Python decision comes down to whether the pipeline needs resumability, conditional branching, or parallel fan-out. If not, a state machine is overkill.

The thin routing layer I built for Pipeshift -- which handles CI/CD pipeline pattern matching -- is about 600 lines of Python, no framework dependency beyond the provider SDKs. I know every line of it. I can profile it, mock it, and modify it without reading a framework's source code to understand what a wrapper class actually does.

The honest counterargument

The reasonable pushback to everything I have said is: LangChain saves time for people who are not building complex production systems. If you are building a simple RAG chatbot on top of a managed vector store and a single LLM provider, the framework provides real value and the debugging cost is low.

I think that is correct. The place I disagree with is the extension of that logic to systems that need to be maintained across model updates, scaled under load, and operated by people who did not write the initial implementation. Those requirements shift the cost-benefit calculation significantly.

I also think the ecosystem has matured to the point where the integrations LangChain provided -- which were the strongest argument for it in 2023 -- are now available through more targeted packages. pgvector has a first-class Python interface. Every major LLM provider has a stable SDK. The glue layer that LangChain used to be is now a thinner problem than it was two years ago.

What I tell people who ask

If you are prototyping and want to move fast, LangChain is fine. Use LCEL with langchain-core, be specific about your version pin, and expect to revisit the dependency when you move to production.

If you are building something that will run in production six months from now, I would start with explicit Python from day one. You will write more code initially. You will debug less in production. The 30-line retrieval function you write yourself is less impressive-looking than the three-line chain invocation it replaces, but it is yours -- and at 2am when retrieval quality degrades and you need to understand exactly what context is hitting the model, ownership matters.

The moment I stopped fighting the framework is the moment the pipeline became something I actually understood.

The direct-API approach I describe here is what I use for the retrieval and generation components in Pipeshift. The LangGraph post I reference -- on when a state machine is and isn't the right tool -- is at mohakdeepsingh.dev/blog/langgraph-production-patterns-oci.