LangGraph in Production on OCI: Patterns That Hold Up

The thing about LangGraph demos is they work beautifully until someone asks what happens when the third node times out halfway through a graph execution. Most tutorials skip that question. This one doesn't.

I've been running a LangGraph billing reconciliation pipeline on OCI Functions for several months now. Five nodes, event-triggered, talking to Oracle DB 23ai for state persistence. What follows is the honest version — the decisions that actually mattered, the failure modes I hit in production, and the places where I'd tell you to use something simpler.

Why stateless LLM calls break under multi-step workloads

The standard pattern for LLM apps is request-in, response-out. You call the model, you get text back, you're done. That pattern starts breaking when your task has conditional branching, parallel sub-tasks, or state that needs to survive across multiple LLM calls.

Billing reconciliation is a good example. You can't do it in one shot. The full task is: ingest invoice data, classify each line item, resolve ambiguous entries against historical context, flag anomalies, generate a structured report. That's five distinct reasoning steps with dependencies between them. The naive approach — one big prompt with everything — has obvious problems: context windows blow up, you lose the ability to retry individual steps, and debugging a failure means staring at a 4,000-token prompt trying to figure out which part went wrong.

LangGraph solves this by giving you an explicit state machine. Nodes are functions. Edges define what runs next. Conditional edges let you branch on state. The graph holds a typed state dict that flows through every node, and you can checkpoint it between steps.

The value isn't that it's magic. The value is that it forces you to make the structure explicit. That explicitness pays dividends when something breaks.

What LangGraph's state model actually is

A StateGraph in LangGraph is built around three things: a typed state schema, nodes that transform state, and edges that define execution order.

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class ReconciliationState(TypedDict):
    invoices: list[dict]
    classified: Annotated[list[dict], operator.add]
    anomalies: list[dict]
    report: str
    error: str | None

Every node receives the full state and returns a partial update. LangGraph merges the update into the state dict — the Annotated[list, operator.add] on classified means node outputs append rather than overwrite, which matters for parallel fan-out patterns.

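def classify_node(state: ReconciliationState) -> dict:
    # call_llm_with_schema and ClassificationSchema are defined elsewhere in the project.
    results = []
    for invoice in state["invoices"]:
        classification = call_llm_with_schema(invoice, ClassificationSchema)
        results.append(classification)
    # Return only the key this node updates; the reducer appends it to state.
    return {"classified": results}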
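
To make the merge rule concrete, here is a small stand-alone simulation of how a partial update is folded into state. It is not LangGraph's internal code: apply_update is a hypothetical helper that only mimics the behaviour described above, where keys annotated with a reducer are combined and all other keys are overwritten.

```python
import operator
from typing import Annotated, Optional, TypedDict, get_type_hints


class ReconciliationState(TypedDict):
    invoices: list[dict]
    classified: Annotated[list[dict], operator.add]
    anomalies: list[dict]
    report: str
    error: Optional[str]


def apply_update(state: dict, update: dict) -> dict:
    """Hypothetical helper: merge a node's partial update into state."""
    hints = get_type_hints(ReconciliationState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] hints carry their reducer in __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # e.g. list + list
        else:
            merged[key] = value  # keys without a reducer are overwritten
    return merged


state = {"invoices": [], "classified": [{"id": 1}], "anomalies": [{"id": 9}]}
state = apply_update(state, {"classified": [{"id": 2}], "anomalies": []})
print(state["classified"])  # [{'id': 1}, {'id': 2}]
print(state["anomalies"])   # []
```

The classified list grows while anomalies is replaced, which is the difference that matters once several branches write to the same key.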

Conditional edges let you branch on state:

def route_after_classification(state: ReconciliationState) -> str:
    if state.get("error"):
        return "error_handler"
    if any(a["severity"] == "high" for a in state.get("anomalies", [])):
        return "escalation_node"
    return "report_node"

This is the pattern that makes LangGraph worth learning. Your business logic lives in routing functions, not buried in prompt text. That makes it testable.
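
As a sketch of what "testable" means here, the routing function can be exercised with plain dicts and no LLM, graph, or database. The test names and sample states are illustrative assumptions; the routing function is copied from above with a looser dict type so the block runs on its own.

```python
def route_after_classification(state: dict) -> str:
    if state.get("error"):
        return "error_handler"
    if any(a["severity"] == "high" for a in state.get("anomalies", [])):
        return "escalation_node"
    return "report_node"


def test_error_takes_priority_over_anomalies():
    state = {"error": "rate_limited", "anomalies": [{"severity": "high"}]}
    assert route_after_classification(state) == "error_handler"


def test_high_severity_anomaly_escalates():
    state = {"error": None, "anomalies": [{"severity": "low"}, {"severity": "high"}]}
    assert route_after_classification(state) == "escalation_node"


def test_clean_state_goes_to_report():
    state = {"error": None, "anomalies": [{"severity": "low"}]}
    assert route_after_classification(state) == "report_node"
```

These run under pytest as written, or can be called directly.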

Checkpointing: MemorySaver is a trap

LangGraph ships with MemorySaver for development. It stores checkpoints in process memory, which means they are gone as soon as the process is. On OCI Functions, every invocation is effectively stateless: the container may or may not be reused, and you can't rely on in-memory state persisting across invocations.

For production, you need a persistent checkpointer. The options are SqliteSaver (fine for single-instance, not for serverless), or a custom implementation against your database of choice.

I checkpoint to Oracle DB 23ai, partly because it was already in the architecture for vector search, partly because co-locating checkpoint data with the domain data (invoices, historical context) keeps the query patterns simple.

The checkpoint schema is straightforward:

CREATE TABLE langgraph_checkpoints (
    thread_id     VARCHAR2(128)  NOT NULL,
    checkpoint_id VARCHAR2(128)  NOT NULL,
    parent_id     VARCHAR2(128),
    state_blob    CLOB           NOT NULL,  -- JSON
    created_at    TIMESTAMP      DEFAULT SYSTIMESTAMP,
    CONSTRAINT pk_checkpoints PRIMARY KEY (thread_id, checkpoint_id)
);

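The custom checkpointer subclasses BaseCheckpointSaver from langgraph.checkpoint.base and implements the read, write, and list operations against this table (in current releases those are the get_tuple, put, put_writes, and list methods; get is a thin wrapper over get_tuple). This interface is stable in LangGraph 1.1.9. The main thing to watch is that the lookup must return None (not raise) when a checkpoint doesn't exist.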
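
A minimal sketch of the storage layer behind such a checkpointer follows. Two loud assumptions: it uses the standard-library sqlite3 module as a stand-in for Oracle DB 23ai so that it runs anywhere, and CheckpointStore is a hypothetical helper, not the BaseCheckpointSaver interface itself. It only illustrates the table access and the "return None when missing" rule.

```python
import json
import sqlite3
from typing import Optional

# SQLite translation of the table above (stand-in for the Oracle DDL).
DDL = """
CREATE TABLE langgraph_checkpoints (
    thread_id     TEXT NOT NULL,
    checkpoint_id TEXT NOT NULL,
    parent_id     TEXT,
    state_blob    TEXT NOT NULL,
    created_at    TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (thread_id, checkpoint_id)
)
"""


class CheckpointStore:
    """Hypothetical storage helper a custom checkpointer could delegate to."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def put(self, thread_id: str, checkpoint_id: str, state: dict,
            parent_id: Optional[str] = None) -> None:
        self.conn.execute(
            "INSERT INTO langgraph_checkpoints "
            "(thread_id, checkpoint_id, parent_id, state_blob) VALUES (?, ?, ?, ?)",
            (thread_id, checkpoint_id, parent_id, json.dumps(state)),
        )
        self.conn.commit()

    def get_latest(self, thread_id: str) -> Optional[dict]:
        # rowid gives insertion order in SQLite; on Oracle you would order by
        # created_at or a sequence column instead.
        row = self.conn.execute(
            "SELECT state_blob FROM langgraph_checkpoints "
            "WHERE thread_id = ? ORDER BY rowid DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None  # None, never an exception


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
store = CheckpointStore(conn)
store.put("thread-1", "cp-1", {"report": ""})
store.put("thread-1", "cp-2", {"report": "done"}, parent_id="cp-1")
print(store.get_latest("thread-1"))  # {'report': 'done'}
print(store.get_latest("missing"))   # None
```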

The latency tax is real. Each checkpoint is a write to the database. On a five-node graph, that's five round-trips to DB 23ai, each adding 10–30ms at typical OCI latency, or roughly 50–150ms per run. For a billing reconciliation job that runs in the background on event trigger, that's acceptable. For a user-facing chat assistant, where the same overhead lands on every conversational turn before the user sees a response, it's probably not.

OCI Functions deployment: the real numbers

OCI Functions is a good fit for event-driven AI workloads. You pay per invocation and per GB-second of execution time. For a reconciliation job that triggers on invoice upload, the traffic pattern is bursty rather than sustained — serverless is the right shape.

My function configuration:

Runtime: Python 3.11
Memory: 512 MB
Timeout: 300 seconds
Trigger: OCI Events rule on object storage PUT

Startup on a warm container is negligible. On a genuinely cold container (no prior invocation for 15+ minutes) I measure 4–6 seconds from trigger to first node execution. That's Python 3.11 + LangGraph 1.1.9 + the Oracle thin client initialising. Acceptable for a background reconciliation job; not acceptable for anything user-facing.

At current invocation volume (~8,000 invocations/month, average 45 seconds per run at 512 MB), the cost is:

Invocations: 8,000 × $0.00001417 ≈ $0.11
Compute: 8,000 × 45 × 0.5 GB × $0.00001417 ≈ $2.55
Total: ~$2.70/month — well within the free tier (2M invocations/month)

At 10x this volume the math is still cheap. The cost driver to watch isn't invocations — it's execution time. Every extra LLM call adds to that 45-second average.

A rough illustration at 10,000 invocations per month, each running for 60 seconds at 512 MB (0.5 GB):

Invocation cost: 10,000 × $0.00001417 ≈ $0.14
Compute cost: 10,000 × 60 × 0.5 × $0.00001417 ≈ $4.25
Total: ~$4.40/month, before the OCI Functions free tier (2 million invocations per month)

Both estimates are list-price arithmetic for workloads that still fit inside the free tier. Beyond it, the compute cost scales linearly with execution time, so the biggest lever is reducing per-invocation execution time, which mostly means minimizing redundant LLM calls and caching anything you can.
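
The arithmetic above can be wrapped in a small estimator to try other volumes. This is a sketch under one assumption: it reuses the single $0.00001417 figure from this post for both the per-invocation and per-GB-second terms, so check both rates against the current OCI price list before relying on the output. The function name is made up for illustration, and the free tier is ignored.

```python
RATE_USD = 0.00001417  # figure used in this post; verify against current OCI pricing


def estimate_monthly_cost(invocations: int, avg_seconds: float, memory_gb: float,
                          invocation_rate: float = RATE_USD,
                          gb_second_rate: float = RATE_USD) -> dict:
    """List-price estimate in USD, before any free-tier allowance."""
    gb_seconds = invocations * avg_seconds * memory_gb
    invocation_cost = invocations * invocation_rate
    compute_cost = gb_seconds * gb_second_rate
    return {
        "gb_seconds": gb_seconds,
        "invocation": round(invocation_cost, 2),
        "compute": round(compute_cost, 2),
        "total": round(invocation_cost + compute_cost, 2),
    }


current = estimate_monthly_cost(8_000, 45, 0.5)
heavier = estimate_monthly_cost(10_000, 60, 0.5)
print(current)  # compute 2.55, total 2.66
print(heavier)  # compute 4.25, total 4.39
```

Because the compute term multiplies invocations by seconds, shaving ten seconds off the average run moves the total far more than trimming the invocation count does.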

The five-node graph structure

The billing reconciliation pipeline runs as five stages with two conditional branches (the final stage is either report_node or escalation_node):

ingest_node
    |
classify_node
    |
[conditional: has_unresolved?]
    |-- yes --> resolve_node --> anomaly_node
    |-- no  --> anomaly_node
                    |
            [conditional: has_escalation?]
                    |-- yes --> escalation_node --> END
                    |-- no  --> report_node --> END

ingest_node: Reads invoice JSON from object storage, validates schema, normalizes fields. No LLM call — pure data transformation.

classify_node: Classifies each line item against a taxonomy using a structured output call. Uses with_structured_output(ClassificationSchema) to enforce the response format and avoid downstream parsing failures.

resolve_node: For line items the classifier marked as ambiguous, retrieves historical context from DB 23ai using vector similarity search and makes a second classification pass with that context injected.

anomaly_node: Compares classified totals against expected ranges. Rule-based, no LLM — anomaly detection is deterministic here, not probabilistic.

report_node / escalation_node: Generates a structured JSON report or triggers an escalation workflow depending on anomaly severity.

The conditional routing is what keeps the graph efficient — low-ambiguity invoices skip the resolve step entirely, cutting both latency and LLM cost.

Failure modes I've hit in production

Node timeout mid-graph. A classify_node call hit the LLM provider's rate limit and hung until OCI Functions hit the 300-second timeout. The checkpoint captured state up to that node, but the function exited with an error. On retry, the graph resumed from the last checkpoint — which is the intended behaviour. But I hadn't tested what "resume from checkpoint" actually looked like under rate limiting. It worked, but I got lucky. The fix: add explicit timeout handling inside the classify node with a configurable retry budget, and surface rate limit errors as a distinct error state rather than letting them bubble up as timeouts.
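
One way the retry budget could look, as a sketch rather than the code running in this pipeline. RateLimitError is a hypothetical stand-in for whatever exception the provider SDK raises, and call_with_retry_budget is an illustrative name. It also assumes the LLM client has its own per-request timeout configured, so a stuck call raises instead of hanging until the function timeout.

```python
import time
from typing import Any, Callable


class RateLimitError(Exception):
    """Hypothetical stand-in for the provider SDK's rate-limit exception."""


def call_with_retry_budget(call: Callable[[], Any], max_attempts: int = 3,
                           deadline_s: float = 60.0, base_delay_s: float = 1.0,
                           sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry on rate limits with exponential backoff, within a time budget."""
    started = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return {"result": call(), "error": None}
        except RateLimitError:
            delay = base_delay_s * 2 ** (attempt - 1)
            out_of_time = (time.monotonic() - started) + delay > deadline_s
            if attempt == max_attempts or out_of_time:
                break
            sleep(delay)
    # Distinct error state, so routing can tell this apart from a timeout.
    return {"result": None, "error": "rate_limited"}


attempts = {"count": 0}


def flaky_llm_call() -> str:
    # Fails twice with a rate limit, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError()
    return "classified"


outcome = call_with_retry_budget(flaky_llm_call, sleep=lambda seconds: None)
print(outcome)  # {'result': 'classified', 'error': None}
```

Keeping deadline_s well below the 300-second function timeout leaves room for the node to write its error state before the platform kills the invocation.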

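Conditional edge pointing at a missing node. A routing function returned a node name that didn't exist in the graph: a typo introduced in a refactor. LangGraph raised an error at runtime rather than at graph compile time. I added a validation step that runs in CI: instantiate the graph, call .compile(), and assert no errors. One caveat: .compile() can only check destinations it has been told about, so this catches the bug reliably only when each conditional edge declares its possible targets (for example via the path map argument to add_conditional_edges). With that in place, this class of bug is caught before deployment.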
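
A complementary check that needs no LangGraph import at all is to keep the routing destinations in a table and assert in CI that each one is a registered node. The sketch below assumes such a table exists; ROUTER_DESTINATIONS, has_unresolved, has_escalation, and find_unknown_destinations are illustrative names taken from the diagram above, not LangGraph APIs.

```python
REGISTERED_NODES = {
    "ingest_node", "classify_node", "resolve_node",
    "anomaly_node", "escalation_node", "report_node",
}

# Hypothetical table: every node name each routing function may return.
ROUTER_DESTINATIONS = {
    "has_unresolved": {"resolve_node", "anomaly_node"},
    "has_escalation": {"escalation_node", "report_node"},
}


def find_unknown_destinations(registered: set, destinations: dict) -> dict:
    """Map each router to the destinations that are not registered nodes."""
    return {
        router: sorted(targets - registered)
        for router, targets in destinations.items()
        if targets - registered
    }


def test_routing_targets_exist():
    assert find_unknown_destinations(REGISTERED_NODES, ROUTER_DESTINATIONS) == {}


# What a refactor typo looks like when the check runs:
broken = {"has_escalation": {"escalaton_node", "report_node"}}
print(find_unknown_destinations(REGISTERED_NODES, broken))
# {'has_escalation': ['escalaton_node']}
```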

LLM output schema drift. After a model update on the provider's side, structured output responses started including an extra field that wasn't in my ClassificationSchema. Pydantic raised a validation error. Downstream nodes received None instead of a classification result and produced nonsense reports before I caught it. The fix: make the classifier schema tolerant of unknown fields and log any extras as a monitoring signal, rather than hard-failing on schema additions. In Pydantic v2 that tolerance is controlled by the model's extra setting (extra="ignore" or extra="allow"), not by strict=False, which only relaxes type coercion.
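
The "tolerate and log" idea can be sketched independently of the validation library. The example below uses a standard-library dataclass as a hypothetical stand-in for ClassificationSchema (the field names are invented for illustration): unknown keys are logged and dropped, while a missing required field still fails loudly.

```python
import logging
from dataclasses import dataclass, fields

logger = logging.getLogger(__name__)


@dataclass
class Classification:
    """Hypothetical stand-in for ClassificationSchema."""
    category: str
    confidence: float


def parse_classification(payload: dict) -> Classification:
    known = {f.name for f in fields(Classification)}
    extras = sorted(set(payload) - known)
    if extras:
        # Monitoring signal: the provider has started sending new fields.
        logger.warning("classifier returned unexpected fields: %s", extras)
    # Missing required fields still raise TypeError, which is the desired hard failure.
    return Classification(**{k: v for k, v in payload.items() if k in known})


parsed = parse_classification(
    {"category": "software", "confidence": 0.92, "rationale": "new field"}
)
print(parsed)  # Classification(category='software', confidence=0.92)
```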

Context window blowup on large invoices. An invoice with 400 line items exceeded the context window when I naively concatenated all items into a single classify prompt. The node failed with a context length error. Fix: batch line items into chunks of 50, fan out classification calls, and use the Annotated[list, operator.add] reducer to merge results back into state.
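
A rough sketch of the batching step: the 400-item invoice from the paragraph above becomes eight batches of 50, and the per-batch results are concatenated with operator.add, which is the same function the state reducer applies. classify_batch is a hypothetical stand-in for the real LLM call, and the batches run sequentially here rather than as parallel graph branches.

```python
import operator
from functools import reduce


def chunked(items: list, size: int = 50) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def classify_batch(batch: list[dict]) -> list[dict]:
    # Hypothetical stand-in for one structured-output LLM call per batch.
    return [{"line_id": item["line_id"], "category": "unclassified"} for item in batch]


line_items = [{"line_id": i} for i in range(400)]
batches = chunked(line_items)
partial_results = [classify_batch(batch) for batch in batches]

# Same merge the Annotated[list, operator.add] reducer performs on state.
classified = reduce(operator.add, partial_results, [])
print(len(batches), len(classified))  # 8 400
```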

Observability

The single most useful thing I added was structured logging per node transition:

import json
import logging

logger = logging.getLogger(__name__)

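def log_transition(node_name: str, state: ReconciliationState, duration_ms: float,
                   thread_id: str | None = None):
    # thread_id is not a key in ReconciliationState as defined earlier, so pass it
    # in explicitly (LangGraph carries it in config["configurable"]["thread_id"]).
    # The state lookup is only a fallback.
    logger.info(json.dumps({
        "event": "node_complete",
        "node": node_name,
        "thread_id": thread_id or state.get("thread_id"),
        "classified_count": len(state.get("classified", [])),
        "anomaly_count": len(state.get("anomalies", [])),
        "error": state.get("error"),
        "duration_ms": duration_ms,
    }))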

OCI Logging ingests these and I can query them in OCI Log Analytics. The thread_id field ties every log line back to a specific graph execution, which makes debugging a specific failed run a matter of filtering on that ID rather than reading through interleaved logs from multiple invocations.
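
To get duration_ms without repeating timing code in every node, one option is a decorator that wraps each node function. This is a sketch, not the wrapper used in this deployment: timed_node and the updated_keys field are illustrative assumptions, and the log payload is trimmed to keep the block self-contained.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("reconciliation")


def timed_node(node_name: str):
    """Wrap a node so every call emits one structured log line with its duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state: dict) -> dict:
            started = time.perf_counter()
            update = fn(state)
            duration_ms = (time.perf_counter() - started) * 1000
            logger.info(json.dumps({
                "event": "node_complete",
                "node": node_name,
                "updated_keys": sorted(update),
                "duration_ms": round(duration_ms, 2),
            }))
            return update  # the node's partial update passes through untouched
        return wrapper
    return decorator


@timed_node("anomaly_node")
def anomaly_node(state: dict) -> dict:
    # Placeholder body; the real node compares totals against expected ranges.
    return {"anomalies": []}


result = anomaly_node({"classified": []})
print(result)  # {'anomalies': []}
```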

LangGraph 1.1.9 supports LangSmith tracing via LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY env vars — no code changes required. I'm not using LangSmith in this deployment — the OCI Logging approach is sufficient for the current complexity, and adding an external tracing service adds another dependency and another data exfiltration path that requires approval in enterprise environments.

When not to use LangGraph

The billing reconciliation case is a good fit: multi-step, event-triggered, background job, tolerance for a few seconds of latency, clear state boundaries. There are cases where it's the wrong tool.

Single-shot RAG. If your task is "retrieve relevant chunks and generate an answer," you don't need a state machine. A plain chain works, is easier to debug, and has less overhead. LangGraph's value is in graphs with branching or cycles — a linear chain with no branching is just complexity without benefit.

Sub-100ms user-facing latency requirements. The graph overhead, checkpoint writes, and the Python startup cost on a cold container are all working against you. If you're building a chat product where users expect token streaming within a second, serverless Python with a LangGraph state machine is probably not the right architecture. Edge runtimes with streaming and minimal Python are a better match.

Teams without Python fluency. LangGraph's debugging story requires reading graph state, understanding checkpointers, and tracing conditional edge evaluation. That's tractable for Python engineers. For teams primarily working in Go, Java, or TypeScript, the cognitive overhead is real and the tooling is thinner.

The honest version of any "use LangGraph in production" post has to include this section. The framework is genuinely useful for the right problem shape. But I've seen teams reach for it because it felt modern and then spend weeks fighting complexity that a simpler approach wouldn't have introduced.

I'm building Pipeshift — CI/CD migration intelligence that uses a multi-agent LangGraph pipeline under the hood. The patterns in this post come directly from that work.