
AI Agents in Production: What Engineering Leaders Need to Know Today

Stewart Moreland
The conversation around AI agents has shifted. A year ago, most engineering teams were asking whether agents were ready for production. Today, the more useful question is: which problems are they actually solving, and what does a defensible architecture look like at scale?
This post is aimed at CTOs and engineering leaders who are past the "should we explore this?" phase and into the harder questions: framework selection, governance, avoiding lock-in, and building systems that can be maintained by a team rather than a single champion. The landscape has moved fast enough that a lot of the conventional wisdom from 2024 is already out of date.
Moving from pilots to production
The pattern I keep seeing across engineering teams is a gap between successful demos and successful deployments. An agent that works brilliantly in a notebook — calling tools, reasoning through steps, producing coherent output — often falls apart when it hits real operational conditions: long-running tasks that get interrupted, multi-step workflows where a single tool failure cascades, or stateful processes that need to survive a pod restart.
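One way to keep a single tool failure from cascading is to wrap every tool call in a bounded retry with an explicit fallback. The sketch below is a minimal illustration, not a production library: `ToolError`, `call_with_retry`, and the injectable `sleep` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ToolError(Exception):
    """Hypothetical error type raised by a failing tool call."""


def call_with_retry(
    tool: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try the tool up to max_attempts times, then fall back instead of raising."""
    for attempt in range(max_attempts):
        try:
            return tool()
        except ToolError:
            if attempt < max_attempts - 1:
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                sleep(base_delay * 2 ** attempt)
    return fallback()


attempts = {"count": 0}


def flaky_lookup() -> str:
    """Simulated tool that fails twice and then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ToolError("upstream timeout")
    return "account: active"


result = call_with_retry(
    flaky_lookup, fallback=lambda: "escalate to human", sleep=lambda _: None
)
print(result)  # account: active
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in real use the default `time.sleep` applies the backoff.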
The distinction that matters most is between stateless inference and stateful execution. A model call is stateless by nature. An agent — something that takes actions over time, remembers context, coordinates with other agents, and potentially runs for minutes or hours — is not. Most of the engineering effort in production agent systems is not about the model; it is about the runtime that wraps it.
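To make the stateless versus stateful distinction concrete, here is a minimal sketch of durable execution: the workflow writes a checkpoint after every step, so a restarted process resumes where it left off. The file-based store, the `run_workflow` function, the `fail_at` switch, and the step names are all assumptions for illustration; a real deployment would more likely use a database or the checkpointing built into its framework.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

Step = Callable[[dict], dict]


def run_workflow(
    steps: list, checkpoint: Path, fail_at: Optional[str] = None
) -> dict:
    """Run (name, step) pairs in order, persisting state after each one."""
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
    else:
        saved = {"done": [], "state": {}}
    for name, step in steps:
        if name in saved["done"]:
            continue  # Completed before the restart, so skip it.
        if name == fail_at:
            raise RuntimeError(f"simulated crash before {name}")
        saved["state"] = step(saved["state"])
        saved["done"].append(name)
        checkpoint.write_text(json.dumps(saved))
    return saved["state"]


steps = [
    ("fetch", lambda s: {**s, "doc": "raw text"}),
    ("summarize", lambda s: {**s, "summary": s["doc"].upper()}),
]
path = Path(tempfile.mkdtemp()) / "checkpoint.json"

try:
    run_workflow(steps, path, fail_at="summarize")  # First run dies midway.
except RuntimeError:
    pass

final_state = run_workflow(steps, path)  # Second run resumes from the checkpoint.
print(final_state)  # {'doc': 'raw text', 'summary': 'RAW TEXT'}
```

The second call never re-runs `fetch`, which is the property that lets an agent survive a pod restart.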
This is why framework selection matters more than model selection for most teams right now. The model is increasingly a commodity decision. The orchestration layer is where your architectural choices compound.
Where agents are winning today
Before getting into frameworks, it is worth grounding this in where teams are actually shipping value. The honest answer is that agents work best in domains with three characteristics: tasks that are too complex for a single prompt but too variable for a rigid workflow, access to well-defined tool APIs, and tolerance for occasional errors that a human can review.
Customer support automation is the most common production deployment. Not replacing human agents wholesale, but handling the first layer of triage, account lookups, policy lookups, and resolution for common cases — with escalation paths when confidence is low. The economics are straightforward enough that even conservative organizations are moving here.
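The escalation path can be as simple as a confidence threshold in front of the agent's draft reply. This is a sketch under assumptions: the `Draft` type, the `route` function, and the 0.8 threshold are hypothetical, and how you obtain a confidence score (a classifier, a self-rating prompt, retrieval overlap) depends on your stack.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    ticket_id: str
    reply: str
    confidence: float  # Assumed to be a score in [0, 1] produced upstream.


def route(draft: Draft, threshold: float = 0.8) -> str:
    """Send confident drafts to the customer and everything else to a human queue."""
    if draft.confidence >= threshold:
        return "auto_reply"
    return "human_review"


drafts = [
    Draft("T-1", "Your refund was issued on Monday.", 0.93),
    Draft("T-2", "Your plan may or may not include this.", 0.41),
]
decisions = {d.ticket_id: route(d) for d in drafts}
print(decisions)  # {'T-1': 'auto_reply', 'T-2': 'human_review'}
```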
Internal productivity tooling is the second major category: agents that can query internal knowledge bases, draft communications, summarize meeting transcripts, or pull together cross-system reports. These deployments benefit from lower stakes — a wrong answer in an internal tool is less costly than a wrong answer facing a customer.
Financial services teams are using agents for research synthesis and document analysis: ingesting earnings reports, regulatory filings, or contract documents and producing structured summaries. The key pattern here is agents as accelerators for human decision-making, not replacements for it.
The common thread
The production use cases that are working share a design principle: agents augment a defined workflow rather than replacing an undefined one. Teams that start with a well-understood manual process and ask where an agent could reduce friction usually see faster time-to-value than those that try to automate an ambiguous process from day one.
The modern agent architecture
Before comparing frameworks, it helps to have a shared mental model of the layers involved.
A production agent system has at least three distinct concerns that should be decoupled:
The model layer — the LLM doing the reasoning. This should be swappable. Tying your agent architecture to a specific model provider is one of the most common sources of technical debt in early agent systems.
The orchestration layer — the framework that defines how agents are structured, how they hand off to each other, how state is managed, and how tool calls are routed. This is where LangGraph, AutoGen, and CrewAI live.
The runtime and infrastructure layer — how agents are deployed, how long-running sessions are persisted, how failures are handled, and how the system is observed. This is increasingly where managed services and protocols like MCP (Model Context Protocol) are entering the picture.
Most framework comparisons collapse these layers together. Keeping them separate is what gives you the flexibility to evolve each independently.
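One way to keep the model layer swappable is to have orchestration code depend on a small interface rather than a vendor SDK. The sketch below uses a Python `Protocol`; `ChatModel`, `EchoModel`, `UppercaseModel`, and `summarize_ticket` are hypothetical names, and the two stand-in classes take the place of real provider clients.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only model surface the orchestration layer is allowed to see."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for one provider's client."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """Stand-in for a second provider, swapped in without touching agent logic."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize_ticket(model: ChatModel, ticket: str) -> str:
    """Orchestration code: knows the interface, not the vendor."""
    return model.complete(f"Summarize: {ticket}")


print(summarize_ticket(EchoModel(), "login fails"))       # echo: Summarize: login fails
print(summarize_ticket(UppercaseModel(), "login fails"))  # SUMMARIZE: LOGIN FAILS
```

Because `Protocol` uses structural typing, neither stand-in has to inherit from `ChatModel`; any object with a matching `complete` method fits.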
Framework comparison: LangGraph, AutoGen, and CrewAI
Each of the three dominant open-source frameworks makes a different bet about what the hard problem is.
LangGraph
LangGraph [1] is a low-level orchestration framework from the LangChain team. Its core abstraction is a directed graph where nodes are functions and edges define control flow. State is typed and explicit: you define a schema (typically a TypedDict) that flows through the graph, and each node receives the current state and returns an update to a subset of its keys.
The simplest possible graph looks like this:
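```python
from typing import TypedDict

from langgraph.graph import START, StateGraph


class State(TypedDict):
    text: str


def node_a(state: State) -> dict:
    return {"text": state["text"] + "a"}


def node_b(state: State) -> dict:
    return {"text": state["text"] + "b"}


graph = StateGraph(State)
graph.add_node("node_a", node_a)
graph.add_node("node_b", node_b)
graph.add_edge(START, "node_a")
graph.add_edge("node_a", "node_b")

print(graph.compile().invoke({"text": ""}))
# {'text': 'ab'}
```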
The graph model becomes more powerful when you introduce conditional routing — which is where the supervisor pattern comes in. A supervisor node inspects state and decides which worker agent acts next:
```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph

# Assumed to be defined elsewhere: `llm` (a chat model with an `invoke` method)
# and the worker functions `researcher_node` and `writer_node`.


class SupervisorState(TypedDict):
    messages: Annotated[list, operator.add]  # Reducer appends instead of overwriting.
    task: str
    next_agent: str


def supervisor_node(state: SupervisorState) -> dict:
    decision = llm.invoke(
        f"Task: {state['task']}. Who acts next? researcher | writer | FINISH"
    ).content
    return {"next_agent": decision.strip()}


supervisor_graph = StateGraph(SupervisorState)
supervisor_graph.add_node("supervisor", supervisor_node)
supervisor_graph.add_node("researcher", researcher_node)
supervisor_graph.add_node("writer", writer_node)
supervisor_graph.set_entry_point("supervisor")
# Map each decision to a destination. FINISH must route to END, because the
# graph has no node named "FINISH".
supervisor_graph.add_conditional_edges(
    "supervisor",
    lambda state: state["next_agent"],
    {"researcher": "researcher", "writer": "writer", "FINISH": END},
)
supervisor_graph.add_edge("researcher", "supervisor")
supervisor_graph.add_edge("writer", "supervisor")
multi_agent_app = supervisor_graph.compile()
```
LangGraph's strength is control. The graph structure makes it possible to reason about exactly what paths an agent can take, which matters for compliance-sensitive applications. Its weakness is that low-level control requires more code — you are building the workflow yourself rather than describing it.
Teams that do well with LangGraph tend to have strong Python engineers who want to own the full control flow. It is a good fit for workflows where the business logic is complex enough that you would not want a framework making routing decisions for you.
AutoGen
AutoGen [2] from Microsoft takes a different approach: agents are conversational actors that collaborate by exchanging messages. The v0.4 rewrite, released in January 2025, introduced a cleaner async-first API and a set of team patterns, of which RoundRobinGroupChat is the simplest [3].
```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient


async def main() -> None:
    # The OpenAI client reads its API key from the environment.
    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    researcher = AssistantAgent(
        "researcher",
        model_client=model_client,
        system_message="You research facts.",
    )
    writer = AssistantAgent(
        "writer",
        model_client=model_client,
        system_message="You write copy based on facts.",
    )
    # Agents speak in a fixed rotation; max_turns caps the conversation length.
    team = RoundRobinGroupChat([researcher, writer], max_turns=4)
    async for event in team.run_stream(
        task="Research AI agents and write a 2-sentence summary."
    ):
        print(event)


asyncio.run(main())
```
The conversational model maps naturally to tasks that benefit from back-and-forth deliberation between specialized agents. It is also easier to get started with — the mental model of agents talking to each other is more intuitive than defining a graph.
The trade-off is that conversational systems can be harder to make deterministic. When you need a specific sequence of operations with predictable branching, the graph model in LangGraph gives you more precision.
CrewAI
CrewAI [4] sits at the highest level of abstraction. You define agents by role, goal, and backstory, then assemble them into a crew with a set of tasks. The framework handles the orchestration. This makes it the fastest path from idea to working prototype, which is genuinely valuable for exploring whether an agent-based approach fits a problem.
The engineering trade-off is the same one you face with any high-abstraction framework: when something does not behave as expected, the debugging path is longer because more is happening implicitly. For teams moving from prototype to production, this often means hitting a ceiling and needing to drop to a lower-level tool.
Framework selection is not permanent, but it is expensive to change
The orchestration framework shapes how your agents are structured, how state is managed, and how your team reasons about the system. Switching frameworks mid-project is possible but costly. Invest time in framework evaluation before you build significant business logic on top of any of them.
The infrastructure shift: managed agents and durable execution
The framework question is only part of the picture. The other shift happening right now is at the infrastructure layer.
In March 2026, Anthropic announced its managed agents API — persistent, sandboxed agent sessions with defined toolsets and explicit lifecycle management [5].
```bash
# 1. Create an agent with specific tools and permissions
curl https://api.anthropic.com/v1/agents \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: managed-agents-2026-04-01" \
  -H "content-type: application/json" \
  -d '{
    "name": "Security Scanner",
    "model": "claude-sonnet-4-6",
    "system": "You analyze codebases for security vulnerabilities",
    "tools": [{"type": "agent_toolset_20260401"}]
  }'

# 2. Launch a long-running, sandboxed session
curl https://api.anthropic.com/v1/sessions \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: managed-agents-2026-04-01" \
  -H "content-type: application/json" \
  -d '{
    "agent": "agent_xyz",
    "environment_id": "env_abc",
    "title": "Weekly security scan"
  }'
```
Illustrative API
The snippet above is adapted from Anthropic’s preview documentation and may change before general availability. Treat it as an example of the broader pattern, not a stable contract.
The interesting thing about this pattern is not the API shape — it is the conceptual shift it represents. Agents are moving from being request handlers to being long-running processes with identity. That changes how you think about authentication, resource limits, audit logging, and cost attribution.
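Treating an agent as a long-running process with identity implies a session record that carries who it acts for, what it may spend, and what it has done. The following sketch is illustrative only: `AgentSession`, its fields, `BudgetExceeded`, and the per-token price are assumptions, not part of any vendor's API.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when an action would push a session past its spending limit."""


@dataclass
class AgentSession:
    session_id: str
    owner_team: str      # Used for cost attribution.
    budget_usd: float    # Hard resource limit for the whole session.
    spent_usd: float = 0.0
    audit_log: list = field(default_factory=list)

    def record(self, action: str, tokens: int, usd_per_1k_tokens: float = 0.01) -> None:
        """Charge an action to the session and append it to the audit trail."""
        cost = tokens / 1000 * usd_per_1k_tokens
        if self.spent_usd + cost > self.budget_usd:
            raise BudgetExceeded(f"{self.session_id} would exceed {self.budget_usd} USD")
        self.spent_usd += cost
        self.audit_log.append({"action": action, "tokens": tokens, "cost_usd": cost})


session = AgentSession("sess-1", owner_team="security", budget_usd=0.05)
session.record("scan repo", tokens=3000)
session.record("write report", tokens=1000)
print(session.owner_team, len(session.audit_log))  # security 2
```

A third call that costs more than the remaining budget raises `BudgetExceeded` instead of silently running up spend, which is the behavior you want from a process that may run unattended for hours.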
Model Context Protocol (MCP) is the other infrastructure development worth tracking. The open specification, first published by Anthropic in late 2024, defines a standard interface for how agents discover and call tools, analogous to what the Language Server Protocol (LSP) did for editor integrations [6]. If MCP achieves broad adoption, tool integrations built for one agent framework should be usable in another. For teams worried about lock-in, this is worth watching.
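To show what tool discovery and invocation look like on the wire, here is a toy in-process handler for MCP's two tool-related JSON-RPC methods, `tools/list` and `tools/call`. The message shapes follow my reading of the specification; the `handle` function and the `lookup_policy` tool are hypothetical, and the real protocol also involves an initialization handshake and a transport, both omitted here.

```python
import json

# Hypothetical tool registry; "lookup_policy" is an invented example tool.
TOOLS = {
    "lookup_policy": {
        "description": "Return the refund policy for a plan.",
        "inputSchema": {
            "type": "object",
            "properties": {"plan": {"type": "string"}},
            "required": ["plan"],
        },
        "fn": lambda args: f"Plan {args['plan']}: refunds within 30 days.",
    }
}


def handle(request: dict) -> dict:
    """Toy server: answer tools/list and tools/call, reject everything else."""
    if request["method"] == "tools/list":
        result = {
            "tools": [
                {
                    "name": name,
                    "description": tool["description"],
                    "inputSchema": tool["inputSchema"],
                }
                for name, tool in TOOLS.items()
            ]
        }
    elif request["method"] == "tools/call":
        tool = TOOLS[request["params"]["name"]]
        text = tool["fn"](request["params"]["arguments"])
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        # -32601 is the standard JSON-RPC "method not found" code.
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": -32601, "message": "Method not found"},
        }
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call = handle(
    {
        "jsonrpc": "2.0",
        "id": 2,
        "method": "tools/call",
        "params": {"name": "lookup_policy", "arguments": {"plan": "pro"}},
    }
)
print(json.dumps(call["result"], indent=2))
```

The client never needs to know how `lookup_policy` is implemented; it discovers the name and input schema from `tools/list`, which is what makes the integration portable across frameworks.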
Strategic playbook for CTOs
A few principles that seem to hold across the teams navigating this well:
Treat the model as a dependency, not an identity. Abstract your model calls behind an interface you control. The cost and capability differences between frontier models will continue to shift, and you want the ability to swap without rewriting your agent logic.
Start with narrow scope, explicit state. The agents that work in production tend to have clearly defined inputs, outputs, and failure modes. Resist the temptation to build general-purpose agents early. A focused agent that reliably solves one problem is worth more than an ambitious agent that occasionally solves many.
Instrument everything from day one. Agent systems fail in ways that are harder to diagnose than traditional software. Logging the full message history, tool call inputs and outputs, and routing decisions is not optional — it is how you debug and improve the system over time.
Design for human oversight. The most durable agent deployments include explicit checkpoints where a human can review or redirect. This is not a limitation to design around; it is a feature that builds organizational trust and catches the failure modes that evaluation suites miss.
Governance is a first-class concern. As agents gain access to more systems and take more consequential actions, the question of what they are allowed to do — and how that is enforced — becomes as important as what they are capable of doing. Build your permission model before you need it.
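The instrumentation, oversight, and governance principles above can meet in a single choke point: a gateway that every tool call passes through. What follows is a minimal sketch with hypothetical names (`ToolGateway`, `PermissionDenied`, the example tools, and the `approve` callback); it shows an allowlist, a human approval hook for consequential actions, and a structured audit log.

```python
from typing import Any, Callable


class PermissionDenied(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


class ToolGateway:
    def __init__(
        self,
        tools: dict,
        allowed: set,
        needs_approval: set,
        approve: Callable[[str, dict], bool],
    ) -> None:
        self.tools = tools
        self.allowed = allowed
        self.needs_approval = needs_approval
        self.approve = approve  # Human-in-the-loop hook.
        self.audit_log: list = []

    def call(self, agent: str, tool: str, **kwargs: Any) -> Any:
        entry = {"agent": agent, "tool": tool, "input": kwargs, "status": "denied"}
        self.audit_log.append(entry)  # Log every attempt, including refusals.
        if tool not in self.allowed:
            raise PermissionDenied(f"{agent} may not call {tool}")
        if tool in self.needs_approval and not self.approve(tool, kwargs):
            entry["status"] = "rejected_by_human"
            return None
        entry["output"] = self.tools[tool](**kwargs)
        entry["status"] = "ok"
        return entry["output"]


gateway = ToolGateway(
    tools={
        "get_balance": lambda account: 120,
        "issue_refund": lambda account, amount: "refunded",
    },
    allowed={"get_balance", "issue_refund"},
    needs_approval={"issue_refund"},
    approve=lambda tool, args: args["amount"] <= 50,  # Stand-in for a real reviewer.
)
balance = gateway.call("support-agent", "get_balance", account="A1")
refund = gateway.call("support-agent", "issue_refund", account="A1", amount=500)
print(balance, refund)  # 120 None
```

Putting the checks in one place means the permission model exists before any individual agent needs it, and the audit log captures inputs, outputs, and refusals in the same structure.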
The teams moving fastest
The engineering teams making the most progress are not necessarily using the most sophisticated frameworks. They pick a well-understood use case, build a tight feedback loop between agent output and human evaluation, and iterate quickly. The framework matters less than the discipline around evaluation and improvement.
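A tight feedback loop can start as a small regression suite built from human-reviewed cases. The sketch below assumes hypothetical names (`EvalCase`, `run_evals`, `toy_agent`) and a simple substring check; real evaluations usually need richer scoring, but the loop of collect, score, and compare has the same shape.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # Expectation captured from a human review.


def run_evals(agent: Callable[[str], str], cases: list) -> dict:
    """Score the agent on every case and return the pass rate plus failing prompts."""
    failures = [
        case.prompt
        for case in cases
        if case.must_contain.lower() not in agent(case.prompt).lower()
    ]
    return {"pass_rate": 1 - len(failures) / len(cases), "failures": failures}


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call."""
    if "refund" in prompt:
        return "Refunds are available within 30 days."
    return "I am not sure."


cases = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("How do I reset my password?", "reset link"),
]
report = run_evals(toy_agent, cases)
print(report)  # {'pass_rate': 0.5, 'failures': ['How do I reset my password?']}
```

Each failure a reviewer catches in production becomes a new `EvalCase`, so the suite grows with the deployment.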
The infrastructure for building agents is maturing faster than most teams expected. The patterns that distinguish successful production deployments from expensive pilots are increasingly well understood. For engineering leaders, the work now is less about tracking what is possible and more about building the organizational muscle to ship and maintain these systems reliably.