Anatomy of a Production Multi-Agent System: LangGraph, .NET, and Context Management

"An agent that replies to emails" is a demo you write in an hour. "An agent that safely processes a refund or retries a fulfillment in a live production system" — that's engineering. I spent the past few weeks designing and building exactly that: a multi-agent automation for ticket handling, where each agent has a specific role, its own tools, and real impact on company systems.

This post is a look under the hood — architecture, design decisions, and the most important lesson I didn't see coming. I'm deliberately leaving out the client's business domain; the focus is on the engineering, which transfers to any process.

Contents:

1. Architecture: two brains, one queue
2. Agent graph instead of one big prompt
3. Tools, API, and external system integration
4. The most important lesson: context management
5. Safety and control: Copilot, confidence thresholds, Dry Run
6. Tech stack at a glance
7. Summary

1. Architecture: two brains, one queue

The starting point: separate "thinking" from "executing". The system is two processes connected by a message broker.

.NET - orchestration and I/O. Accepts webhooks, validates signatures, holds state (EF Core + Postgres), exposes internal APIs with data, and is the only process that talks to the outside world. Transactions, idempotency, and audit live here.
Python - AI Brain. Agent graph on LangGraph: classification, routing, tool selection, draft generation. No writes to the database, no calls to external APIs — that's .NET's job.
RabbitMQ - the bus between them. Queues with TTL and DLQ decouple both processes: AI Brain can crash, restart, and catch up on the backlog without losing anything.

The boundary between processes is a deliberate trust boundary: LLM has no direct write access to the database or external systems — every action call goes through .NET, where it's authorized and logged. Model non-determinism stays on one side of the bus.

End-to-end flow: from incoming event to ready response or escalation.

Backend (.NET): layers and event bus

The backend is five projects, dependencies pointing inward — the domain knows nothing about EF Core, RabbitMQ, or which helpdesk sits on the other side:

Events — event contracts, zero dependencies; any consumer can subscribe,
Domain — entities, interfaces, value objects,
Infrastructure - EF Core, MassTransit, typed HttpClients,
ApiClient — typed REST client for the helpdesk API,
API — webhook ingest, use-cases (MediatR), REST for GUI and internal AI endpoints.

Project dependencies (top) and message flow through the message broker (bottom).

The event bus runs on MassTransit + RabbitMQ: the producer doesn't know its consumers, so the next subscriber (analytics, SLA monitor, notifications) subscribes to the same exchanges without touching the sender. Delivery guarantees live in the endpoint configuration: per-message TTL, retry with backoff, and DLQ after retries are exhausted.

Webhooks: adapter per source

Webhooks land on /webhook/{source}. Based on {source}, we pick an adapter, validate the HMAC SHA256 from the header, and map the raw payload to a provider-agnostic internal IncomingTicketDto. New source = new adapter + DI registration, no domain changes. Every accepted event also lands in an append-only log (raw JSON in jsonb) — audit and replay for free.

2. Agent graph instead of one big prompt

Single-prompt design — one system prompt handling all intents — doesn't scale qualitatively: role boundaries blur, the model gets tools it doesn't need, and hallucinates where it shouldn't. We went the other way: a graph of specialized agents.

At the entry point sits the classifier agent. It identifies the ticket's intent and confidence level, then routes the case to the right specialist. Each specialist is narrow: a clear role and a limited tool set. The status agent can't process a refund — that's the point. Smaller surface area means fewer places to hallucinate and easier testing.

Classifier routes the ticket to the right specialist; each has its own tools. Green - action tools that perform real operations.

How it looks in LangGraph

The graph is nodes (agents) and conditional edges (routing). A single, explicit state flows between nodes — and it's deliberately narrow (more on that shortly). Routing depends on intent and confidence:

class AgentState(TypedDict):
    raw_message: p
    intent: str | None
    confidence:float
    reference: p | None # extracted ticket identifier
    case_data: dict | None # data pulled from client systems
    draft: p | None
    needs_escalation: bool
    trace: list[dict] # step-by-step, stored as jsonb

def route(state: AgentState) -> str:
    if state["confidence"] < 0.7: # configurable threshold
        return "escalation_agent"
    return ROUTES.get(state["intent"], "escalation_agent")

graph.add_conditional_edges("classifier_agent", route)

The confidence threshold (default 0.7) is a hard gate: below it, the case goes to escalation rather than automatic action. We treat uncertainty as a signal, not an error.

3. Tools, API, and external system integration

Most of the work went not into prompt engineering but into tools. We wired the system to the client's existing infrastructure through dedicated endpoints and tool functions built exclusively for this architecture — not generic "catch-all" APIs, but interfaces tailored to what each specific agent is allowed to do.

Tools in two categories:

Read-only - fetch context: case data, conversation history. In LangGraph, these are @tool-decorated functions calling a private REST endpoint in .NET (API key auth, private network). Idempotent, testable in isolation.
Action tools — execute a real operation: refund, retry fulfillment, escalate to a human queue. These are what turn a "chatbot" into a system that actually does something — and that's why they're gated behind a confidence threshold, Copilot mode, and Dry Run (section 5).

AI Brain never calls client systems directly – everything goes through .NET as the only trusted boundary. Tool functions available per-agent include: classify_intent, extract_reference, fetch_conversation_history, fetch_case_data, execute_action, generate_draft.

Models connected via OpenRouter, configured per-agent (model ID, temperature, max_tokens) in the database — changeable at runtime, no restart needed. The classifier gets a cheaper/faster model; agents generating the draft use a stronger one. New helpdesk provider = new adapter + DI entry, no changes to graph logic. Each deployment unit has its own Dockerfile.

4. The most important lesson: context management

I was convinced the hardest part of a multi-agent system would be picking the right model. I was wrong. The hardest — and most decisive for quality — turned out to be context management.

All quality comes down to one question: what exactly does each agent see?

Too much context — the model gets confused, mixes threads, pulls in irrelevant data, and hallucinates. Dumping the full history and all fields into every graph node is the quickest path to quality degradation.
Too little context — the agent makes a decision with half the picture (e.g., it doesn't see that the customer has already written three times about the same issue).

The solution isn't magic, but it requires discipline: each agent gets exactly as much context as its role requires — and nothing more. The graph state is an explicit contract, not a catch-all bag. The ticket subject gets appended to the body only where it aids classification. Conversation history is fetched deliberately, for agents that actually need it. Context, not the model, is the real lever here — you can swap the model in one config line; well-structured context isn't something any API can sell you.

5. Safety and control: Copilot, confidence thresholds, Dry Run

A system that executes real actions needs brakes baked into the architecture — not bolted on at the end.

Copilot mode (human-in-the-loop). By default, AI doesn't send responses to the customer — it prepares them as private notes for a human to approve. The final word stays with the operator.
Confidence threshold. Below a configurable threshold, cases automatically go to escalation instead of automatic action. Uncertainty is a signal, not an error.
Dry Run on by default. A new system starts in "calculate and suggest, but send nothing" mode. Full analysis is visible in the dashboard before you consciously flip the switch to production.
Full observability. Every run records a step-by-step trace: which agent acted, what tools it used, what decision it made, how long it took, and why it escalated. Without this, debugging a system that "sometimes makes different decisions" is a dead end.
Model-agnosticism. Different roles use different models (including OpenAI GPT and Anthropic Claude), selected for quality, cost, and speed. The classifier model is configured separately from the rest.

The trace is an ordered JSON array (jsonb), built during graph execution — one entry per node. It's what the dashboard uses to draw the flow, and what saves the day when debugging:

{
  "node": "classifier_agent",
  "agent": "ClassifierAgent",
  "duration_ms": 320,
  "skills_called": ["classify_intent", "extract_reference"],
  "decisions": { "intent": "status", "confidence": 0.95 },
  "output_fields": ["intent", "confidence", "reference"]
}

6. Tech stack at a glance

AI orchestration: Python + LangGraph - agent graph, conditional routing, explicit state
Backend: .NET (MediatR, EF Core), PostgreSQL
Messaging: RabbitMQ + MassTransit - TTL, retry with backoff, DLQ, multiple consumers
Integrations: adapters per source, webhooks with HMAC SHA256, internal REST API (read-only, API key)
Models: OpenRouter - GPT and Claude, per-agent configuration
Observability and control: GraphTrace (jsonb), Copilot mode, Dry Run, confidence threshold
Delivery: each unit with its own Dockerfile

7. Summary

The boundary between a non-deterministic model and production systems must be hard and deliberately designed. When an agent makes a mistake — and it will — GraphTrace tells you exactly which node, the confidence score tells you why it hesitated, and the DLQ holds the message for replay. Without this observability, you're debugging a system that "sometimes makes different decisions" with no foothold.

Context turned out harder and more important than model selection. Swapping a model is one line in the database. Well-structured graph state — what exactly reaches each node, when you fetch data, how you build the prompt — is weeks of work that no model hosting can replace.

Have questions about specific architectural decisions or building something similar? Reach out.