I want to be upfront about what this post is and isn't. It's not a comprehensive market survey. It's my actual working opinions from ~18 months of building and consulting on agentic AI systems -- mostly on OCI, some AWS, one Azure engagement I'd prefer to forget. My sample size on some tools is two or three production deployments. Where that's the case, I'll say so.
I've organized this by category: orchestration frameworks, observability, eval, and gateway/proxy. Within each category I'll tell you what I use, what I've actively stopped recommending, and why. I'll try not to bury the opinions in qualifications.
Orchestration Frameworks
LangGraph: still the one I reach for, with reservations
I'm running LangGraph 1.1.x on OCI Functions for a billing reconciliation pipeline and it's what I used as the graph layer for Pipeshift's pipeline pattern matching (I'm the founder, so take that signal accordingly). The reason I keep reaching for it: it forces you to make the state structure explicit. Nodes are functions, edges are routing logic, and the state schema is typed. When something breaks, you have a graph topology to reason about instead of a prompt chain.
The reservations are real, though. MemorySaver -- the default checkpointer -- is a development toy. It stores state in process memory, which means zero durability on anything serverless or multi-process. I've seen teams ship this to production. I've also seen the resulting data loss. You need a persistent checkpointer (SqliteSaver for simple single-instance setups, or a custom implementation against your actual database), and the getting-started material doesn't make that clear enough.
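To make the durability point concrete, here is a minimal, library-agnostic sketch of what "persistent" means. It is not LangGraph's checkpointer interface: `SqliteCheckpointStore`, `simulate_restart`, and the `invoice-run-42` thread ID are hypothetical names for illustration, and it uses only the standard-library `sqlite3` module. State written before a simulated process exit is still readable from a fresh connection, which an in-process dict cannot offer.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


class SqliteCheckpointStore:
    """Hypothetical helper (not a LangGraph class): one JSON state blob per thread."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self._conn.commit()

    def save(self, thread_id: str, state: dict) -> None:
        # Overwrite any earlier checkpoint for this thread with the latest state.
        self._conn.execute(
            "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self._conn.commit()

    def load(self, thread_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._conn.close()


def simulate_restart() -> Optional[dict]:
    """Write state, close the store, then read it back through a fresh connection."""
    path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")
    first = SqliteCheckpointStore(path)
    first.save("invoice-run-42", {"node": "reconcile", "retries": 1})
    first.close()  # stands in for the process exiting

    second = SqliteCheckpointStore(path)  # stands in for a cold start
    try:
        return second.load("invoice-run-42")
    finally:
        second.close()
```

A local SQLite file only survives restarts of the same instance. On serverless platforms the file would have to live on shared storage, or the same idea would be pointed at a real database.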
The other thing: LangGraph's debugging story requires you to think in graph state. Engineers who haven't internalized that model spend a lot of time confused. The "where is my state right now and why is this edge routing to this node" question is answerable, but only if you've built up the mental model first. It's a real on-ramp cost.
What I've stopped doing with LangGraph: using it for single-step or linear RAG pipelines. If there's no meaningful branching or cycle, the overhead buys you nothing. A plain chain is easier to test and debug.
CrewAI: good for demos, problematic at the seams
I used CrewAI on one client engagement -- a document processing workflow that needed multiple "role-based" agents collaborating. The role abstraction is genuinely nice for explaining the system to non-engineers. The demo to stakeholders went well.
The problems started when we needed the agents to do something outside the happy path. CrewAI's inter-agent communication model -- agents calling each other through a task delegation layer -- becomes hard to debug when delegation fails silently or loops unexpectedly. When I asked "what exactly was passed between agent A and agent B at step 3 of this run," the answer was either buried in logs or not accessible at all without custom instrumentation.
The sequential task model also makes partial retries awkward. If your five-agent pipeline fails at step 4, retrying from step 4 isn't straightforward. You rebuild more state than you should have to.
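For contrast, this is roughly the property I'd want from any sequential pipeline: each step's output is recorded as it completes, so a retry skips straight to the step that failed. The sketch is generic Python, not CrewAI's API; `run_with_resume`, `demo`, and the step names are hypothetical, and a real version would persist `completed` somewhere durable rather than in a dict.

```python
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_with_resume(
    steps: list[tuple[str, Step]], initial: Any, completed: dict[str, Any]
) -> Any:
    """Run steps in order, skipping any whose output is already in `completed`."""
    value = initial
    for name, step in steps:
        if name in completed:
            value = completed[name]  # reuse the stored output, do not re-run
            continue
        value = step(value)  # may raise; earlier outputs stay in `completed`
        completed[name] = value
    return value


def demo() -> tuple[int, list[str]]:
    calls: list[str] = []
    failures_left = {"step4": 1}  # step4 fails once, then succeeds

    def make_step(name: str) -> Step:
        def step(value: int) -> int:
            calls.append(name)
            if failures_left.get(name, 0) > 0:
                failures_left[name] -= 1
                raise RuntimeError(f"{name} failed")
            return value + 1

        return step

    steps = [(f"step{i}", make_step(f"step{i}")) for i in range(1, 6)]
    completed: dict[str, Any] = {}
    try:
        run_with_resume(steps, 0, completed)
    except RuntimeError:
        pass  # first attempt dies at step4; steps 1-3 are already recorded
    result = run_with_resume(steps, 0, completed)
    return result, calls
```

In `demo()`, the second call re-executes only `step4` and `step5`; steps 1 through 3 run exactly once across both attempts.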
I wouldn't say avoid it entirely. For internal tooling where a rough retry-from-scratch is acceptable and the role metaphor genuinely helps onboarding, it works. For anything where you need production-grade fault handling, I'd want something that gives you more visibility into inter-agent state.
I'll also be honest: my experience is one production engagement and a handful of prototypes. I might be working around problems that have since been fixed in newer versions.
AutoGen: I cannot recommend it for stateful production workloads
I've used AutoGen on two engagements. The conversation-based abstraction -- agents that talk to each other in a message thread -- is conceptually elegant and maps well to how people think about multi-agent collaboration.
The production issues I kept hitting: AutoGen's state model is a conversation history, and that history grows. For long-running workflows, that history becomes the context for every agent call, which means latency and token cost grow with each step. More importantly, the implicit state-in-conversation model makes it hard to separate "what state does this agent need" from "what is the full conversation history." You end up either passing too much context (expensive, sometimes incoherent) or manually pruning the history (brittle, easy to accidentally break an agent's reasoning chain).
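The cost growth is easy to underestimate, so here is the arithmetic as a small sketch. The numbers are illustrative assumptions, not measurements from either engagement: a fixed system prompt of `system_tokens`, and every turn appending `tokens_per_turn` to a history that is resent in full on each call.

```python
def cumulative_prompt_tokens(steps: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Total prompt tokens sent over `steps` calls when the full history is resent."""
    total = 0
    history = 0
    for _ in range(steps):
        total += system_tokens + history  # this call pays for everything so far
        history += tokens_per_turn        # and then makes the next call bigger
    return total


def closed_form(steps: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Same quantity as n*s + t*n*(n-1)/2, which makes the quadratic term explicit."""
    return steps * system_tokens + tokens_per_turn * steps * (steps - 1) // 2


if __name__ == "__main__":
    for n in (10, 40):
        print(n, cumulative_prompt_tokens(n, system_tokens=500, tokens_per_turn=300))
```

Under those assumptions, 10 steps cost 18,500 prompt tokens in total and 40 steps cost 254,000: four times the steps, roughly fourteen times the tokens.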
The second issue: concurrency. AutoGen's native model is sequential agent turns. Parallel execution requires workarounds that, in my experience, introduce race conditions that are surprisingly hard to debug in a conversation-thread model.
I haven't used AutoGen extensively since version 0.2.x. I know the 0.4 rewrite (AutoGen Core) changed the architecture significantly. My opinions here are anchored to 0.2 and may not transfer cleanly to current versions. But I haven't gone back, which is itself a data point.
LlamaIndex Workflows: the sleeper pick
I want to call attention to LlamaIndex Workflows because I think it's underrated in the orchestration conversation. The event-driven execution model -- steps emit events, other steps listen for events -- is a natural fit for workloads with complex fan-out patterns. It also tends to produce cleaner separation between retrieval logic and orchestration logic than LangGraph, where the two often blur together.
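If the event-driven model is unfamiliar, a toy version helps. The sketch below is plain Python and deliberately not the LlamaIndex Workflows API: `Event`, `EventLoop`, and the handler names are hypothetical. It shows the two ideas from the paragraph above: handlers subscribe to event names, and one handler can emit several events to fan out.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Event:
    name: str
    payload: Any


Handler = Callable[[Event], list[Event]]


class EventLoop:
    """Toy dispatcher: handlers listen for event names and may emit new events."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def on(self, event_name: str, handler: Handler) -> None:
        self._handlers[event_name].append(handler)

    def run(self, start: Event) -> list[Event]:
        queue = deque([start])
        seen: list[Event] = []
        while queue:
            event = queue.popleft()
            seen.append(event)
            for handler in self._handlers[event.name]:
                queue.extend(handler(event))  # emitted events are processed later
        return seen


def split_query(event: Event) -> list[Event]:
    # Fan-out: one incoming query becomes one retrieval event per sub-topic.
    return [Event("retrieve", part.strip()) for part in event.payload.split(" and ")]


def retrieve(event: Event) -> list[Event]:
    # Stand-in for real retrieval; nothing listens for "retrieved", so the run ends.
    return [Event("retrieved", f"docs for {event.payload}")]


def demo() -> list[tuple[str, Any]]:
    loop = EventLoop()
    loop.on("query", split_query)
    loop.on("retrieve", retrieve)
    return [(e.name, e.payload) for e in loop.run(Event("query", "invoices and refunds"))]
```

Note that nothing in `retrieve` knows who produced its input or who consumes its output. That decoupling is the appeal, and also why the control flow is harder to read off the code than an explicit graph.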
The trade-off: the event model is less intuitive to newcomers than LangGraph's explicit graph topology. And the LlamaIndex ecosystem is large enough that you need to be deliberate about which abstractions you're using -- the library has grown fast and some of the older patterns are inconsistent with the newer ones.
I've used it in a context where the team was already deep in LlamaIndex for RAG and adding orchestration. That's the natural on-ramp. If your team is starting from scratch on orchestration, I'd still default to LangGraph and move to Workflows if you hit a specific ceiling.
Observability Tools
LangSmith: useful, with a data residency asterisk
LangSmith is the observability tool I've used most. The trace visualization is genuinely good -- seeing an agent run as a nested tree of LLM calls, tool invocations, and state transitions catches problems that would take much longer to find in structured logs. The prompt playground is useful for rapid iteration on prompt versions without a full re-deploy.
The asterisk: LangSmith sends your traces to LangChain's servers. For enterprise clients with data classification requirements -- which is most of the Oracle work I've done -- this is a blocker. You're sending LLM inputs and outputs (potentially including internal documents, schema details, business logic) to a third-party service. The enterprise on-premises option exists, but I haven't evaluated it.
For personal projects and non-sensitive workloads, LangSmith is the path of least resistance and it's good. For enterprise, you need to answer the data residency question before you enable tracing.
Helicone: my current first choice for team-level LLM observability
Helicone sits as a proxy in front of your LLM API calls and captures traces with essentially no code changes -- you redirect your OpenAI (or Anthropic, or Bedrock) base URL through Helicone and you get request/response logging, latency tracking, token cost per request, and error rates out of the box.
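In practice the "no code changes" claim comes down to client construction. The sketch below builds the keyword arguments you would pass to the OpenAI Python client. It assumes that client's `api_key`, `base_url`, and `default_headers` constructor arguments and Helicone's `Helicone-Auth` bearer header as I understand them from the docs. `proxied_client_kwargs` is a hypothetical helper, and the proxy URL is a parameter because it differs between the hosted service and a self-hosted deployment. Verify both against current Helicone documentation before relying on this.

```python
def proxied_client_kwargs(
    provider_api_key: str, proxy_base_url: str, proxy_api_key: str
) -> dict:
    """Build constructor kwargs that route an OpenAI-style client through a proxy.

    Assumption: the proxy authenticates via a `Helicone-Auth: Bearer <key>` header
    and forwards the provider key unchanged.
    """
    return {
        "api_key": provider_api_key,
        "base_url": proxy_base_url.rstrip("/"),
        "default_headers": {"Helicone-Auth": f"Bearer {proxy_api_key}"},
    }


# Usage (requires the `openai` package; not executed here):
#   from openai import OpenAI
#   client = OpenAI(**proxied_client_kwargs(openai_key, helicone_url, helicone_key))
```

Everything downstream of the client object stays the same, which is why the rollout cost is so low.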
The self-hosted option is what makes it practical for enterprise use. I've deployed it on OCI with the open-source version and the operational overhead is low -- it's a Docker Compose setup with a Postgres backend, and it handles the volume I've thrown at it without any tuning.
The gaps: the trace visualization for multi-step agent runs is not as rich as LangSmith's. You see individual LLM calls, not a stitched-together view of a graph execution. For simple pipelines or cost monitoring purposes, that's fine. For debugging a LangGraph agent that's misbehaving at a specific node, I'd want something that understands the graph structure.
Langfuse: promising, operationally heavier than I'd like
Langfuse's open-source self-hosted option is the strongest offering in the self-hosted LLM observability space right now. The trace model is more structured than Helicone's -- you can annotate traces with spans, link evaluations to specific traces, and run dataset-based evals through the UI.
My hesitation is operational. The self-hosted Langfuse stack (as of the version I evaluated, 2.x) requires Postgres, ClickHouse, Redis, and a separate background worker. That's four moving parts before you get to your actual application. For a dedicated ML platform team with infrastructure capacity, this is fine. For a two-person startup trying to get observability without a dedicated ops burden, it's a lot.
The cloud-hosted version is simpler to operate but returns you to the data residency question.
I use Langfuse when the team has the capacity to run the stack and wants the richer eval integration. I use Helicone when they don't.
Eval Tools
RAGAS: the right starting point for RAG evaluation, not the finish line
RAGAS is what I reach for first when a client asks "how do we know if our RAG pipeline is working." It gives you a set of metrics -- faithfulness, answer relevance, context recall, context precision -- several of which are reference-free, so you can run them against your pipeline without needing a labeled test set for every metric.
The important caveat I always give: RAGAS metrics are LLM-evaluated. The framework calls an LLM judge to score faithfulness, which means your eval quality is gated on your judge model's quality. I've seen RAGAS faithfulness scores look good while the pipeline was generating plausible-sounding but incorrect answers -- because the judge model was being fooled by the same surface plausibility that was fooling the end users.
Use RAGAS for directional feedback and regression detection -- "did this prompt change make faithfulness worse" -- not as a ground truth measure of pipeline quality. The moment you have any labeled ground truth for your domain, supplement RAGAS with evaluation against that ground truth.
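The regression-detection use is simple enough to sketch. This assumes you already have per-metric scores for a baseline run and a candidate run as plain dicts, however you produced them. `find_regressions`, the tolerance value, and the scores in the demo are all illustrative, not RAGAS API or real results.

```python
def find_regressions(
    baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.02
) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`."""
    regressions: dict[str, float] = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            continue  # metric not computed for the candidate; nothing to compare
        drop = base_score - new_score
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


def demo() -> dict[str, float]:
    # Made-up scores: the prompt change helped relevance slightly, hurt faithfulness.
    baseline = {"faithfulness": 0.91, "answer_relevancy": 0.88}
    candidate = {"faithfulness": 0.84, "answer_relevancy": 0.89}
    return find_regressions(baseline, candidate)
```

The tolerance matters because LLM-judged scores are noisy between runs. A threshold tighter than that noise will flag regressions that aren't there.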
DeepEval: good eval framework if you write the test cases
DeepEval is the testing-framework-style eval library I've started preferring for systematic evaluation. You write test cases as code, define metrics as assertions, and run them as part of CI. The mental model maps well to how engineers already think about testing.
The catch: it requires you to actually write the test cases. Unlike RAGAS where you can plug in your pipeline and get metrics back with minimal setup, DeepEval's value comes from the thought you put into defining what "correct" looks like for your specific domain. That's the right design -- it forces the eval work to be deliberate -- but it's also more upfront investment.
I've used it successfully on a client engagement where we had a domain expert who could define ground truth for a document Q&A system. Without that ground truth, you're back to relying on LLM judges.
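To show what "test cases as code" with expert-defined ground truth looks like in spirit, here is a plain-Python sketch. It is not DeepEval's API: `QACase`, `fact_coverage`, `failing_cases`, and the sample questions are hypothetical, and substring matching is a deliberately crude stand-in for a real metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QACase:
    question: str
    required_facts: tuple[str, ...]  # supplied by the domain expert


def fact_coverage(answer: str, case: QACase) -> float:
    """Fraction of required facts that appear (case-insensitively) in the answer."""
    text = answer.lower()
    hits = sum(1 for fact in case.required_facts if fact.lower() in text)
    return hits / len(case.required_facts)


def failing_cases(
    answer_fn: Callable[[str], str], cases: list[QACase], threshold: float = 1.0
) -> list[str]:
    """Questions whose answers fall below the coverage threshold."""
    return [
        case.question
        for case in cases
        if fact_coverage(answer_fn(case.question), case) < threshold
    ]


def demo() -> list[str]:
    cases = [
        QACase("What is the notice period?", ("30 days",)),
        QACase("Who approves refunds?", ("finance", "regional manager")),
    ]
    # Stub pipeline: gets the first answer right and half of the second.
    canned = {
        "What is the notice period?": "The notice period is 30 days.",
        "Who approves refunds?": "Refunds are approved by Finance.",
    }
    return failing_cases(canned.__getitem__, cases)
```

In CI, the check is just that `failing_cases(...)` comes back empty, or that any failures are ones you have explicitly accepted.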
Promptfoo: underrated for systematic prompt regression testing
Promptfoo is the tool I'd recommend to anyone doing iterative prompt engineering and wanting to catch regressions. You define test cases in YAML, specify your providers (OpenAI, Anthropic, local models), and run comparative evaluations across prompts or models. The output is a table showing pass/fail against your defined assertions for each provider/prompt combination.
The use case I find it best for: "I'm changing this system prompt -- does it break any of my 40 test cases across three model providers?" That workflow is genuinely painful without a tool, and Promptfoo handles it well.
It's not a production observability tool and it's not a full eval framework for complex pipelines. It's a prompt testing tool and it does that specific job well.
Gateway / Proxy Tools
LiteLLM: the right abstraction, operationally sensitive
LiteLLM is what I'd recommend as the first gateway layer for most teams. The unified API across OpenAI, Anthropic, Bedrock, Azure OpenAI, and OCI Generative AI means you write once against the OpenAI schema and your underlying model is configurable. That portability matters: I've moved a client from OpenAI to Bedrock by changing an environment variable rather than refactoring API calls throughout the codebase.
The self-hosted LiteLLM Proxy adds rate limiting, spend tracking, virtual API keys, and team quotas. The operational part I want to be honest about: the proxy adds a network hop and if it goes down, all LLM traffic for your application goes down. I've had two incidents where a LiteLLM Proxy config change caused unexpected routing behavior that took longer to debug than it should have, because the failure mode was "requests silently routing to the wrong model" rather than an explicit error.
The config is YAML, so version-control it. If you're running the proxy as a critical piece of infrastructure, treat it like infrastructure -- change control, tested rollbacks, health checks.
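One cheap guard against the silent-misrouting failure mode is to check which model actually served each response. The sketch assumes responses follow the OpenAI chat-completion schema with a top-level `model` field, and that your proxy reports the underlying model there, which is worth verifying for your setup. `check_served_model`, `ModelMismatchError`, and the `chat-default` alias are hypothetical names.

```python
class ModelMismatchError(RuntimeError):
    """Raised when a response was served by a model the alias should not map to."""


def check_served_model(
    alias: str, response: dict, allowed_prefixes: dict[str, tuple[str, ...]]
) -> str:
    """Return the served model name, or raise if it is not allowed for this alias.

    Prefix matching is used because providers often return a dated variant of
    the model name that was requested.
    """
    served = str(response.get("model", ""))
    prefixes = allowed_prefixes.get(alias, ())
    if not served or not any(served.startswith(prefix) for prefix in prefixes):
        raise ModelMismatchError(
            f"alias {alias!r} was served by {served!r}, expected one of {prefixes!r}"
        )
    return served


# Illustrative mapping from a gateway alias to acceptable upstream model prefixes.
ALLOWED = {"chat-default": ("gpt-4o",)}
```

Turning "wrong model, no error" into an explicit exception (or at least a metric) is what shortens the debugging time. Run it on a sample of traffic if checking every response is too noisy.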
Portkey: worth evaluating if you want the managed layer
Portkey is the managed gateway I've evaluated but not run in production. The feature set is similar to LiteLLM Proxy -- fallbacks, load balancing across providers, spend tracking -- but managed rather than self-hosted. The trace integration with their observability layer is cleaner than what LiteLLM gives you out of the box.
My hesitation is the same as with any managed critical path: your LLM traffic routes through Portkey's infrastructure. For teams with data classification requirements, that's another conversation to have before adoption.
I don't have enough production time with Portkey to give you a strong opinion on reliability or incident response. If you're evaluating it, I'd want to know what their SLA looks like and what your application's behavior is when the gateway is degraded.
OpenRouter: I use it for model access, not production routing
OpenRouter is useful for accessing a broad model catalog without managing multiple API accounts, and for getting access to models where you don't have a direct enterprise contract. I use it for evaluation and prototyping.
I wouldn't run it as the gateway for production traffic that matters. The abstraction adds a dependency on a third-party service that sits outside your control plane, and OpenRouter's reliability story is not the same as a managed cloud provider's. The pricing arbitrage angle -- accessing some models through OpenRouter at different rates than direct APIs -- is real but small at most production volumes.
The honest summary
The tools I actively use and would recommend without heavy qualification: LangGraph for stateful orchestration, LiteLLM for provider abstraction, Helicone (self-hosted) for team-level observability, RAGAS for RAG regression testing, Promptfoo for prompt regression testing.
The tools I've stopped recommending for production stateful workloads: AutoGen 0.2.x, and CrewAI for anything where inter-agent visibility matters.
The tools where my opinion is limited by sample size: Portkey (one evaluation, no production), LlamaIndex Workflows (two deployments, both greenfield), DeepEval (one client engagement). I'm flagging these explicitly because a confident wrong opinion from someone with one deployment is exactly the kind of advice that causes expensive migrations six months later.
The thing nobody says enough in these comparisons: the best tool is the one your team will actually instrument and debug at 2am when something breaks. Technical superiority in isolation doesn't matter much. If your team doesn't understand the abstractions, the sophisticated tool becomes a liability faster than the simple one does.
Some of this tooling thinking has fed directly into Pipeshift's architecture decisions -- I'm the founder, so that's an interested perspective. If you're evaluating an agentic AI stack for a production use case and want to compare notes, my calendar is open.