The first version of IntelliSupply had one agent. It received supply chain data, signals from external sources, and the instruction to "analyze this and recommend mitigation strategies."
It was impressive in demos. It was useless in production.
The agent would hallucinate delay estimates. It would produce recommendations that were logically structured but mathematically wrong — "rerouting through Supplier Y saves 3 days" when the supply graph made that physically impossible. When I traced the bad outputs, I found the same root cause every time: the LLM was doing math. It would reason about delay cascades, calculate propagation through the supply chain, and arrive at confident wrong numbers.
The fix sounds obvious in hindsight but took me longer than it should have to accept:
Use LLMs for language. Use algorithms for math.
These are not interchangeable roles. The moment I stopped asking the LLM to compute delay cascades and gave that job to a deterministic BFS algorithm, the outputs became reliable enough to actually use.
The Four Components
Before the architecture: what we're actually building.
| Component | What It Does |
|---|---|
| Digital Twin | A live graph model of your supply chain (suppliers → warehouses → distributors → customers) |
| Risk Agents | Monitor external signals (news, weather, financials) and flag potential disruptions |
| Simulation Engine | When a risk is flagged: run "what-if" scenarios on the supply graph |
| Recommendation Engine | Translate simulation outputs into actionable mitigations |
The LLM touches the first step (risk scoring from text signals) and the last step (generating human-readable recommendations). The simulation — the part that requires math — is entirely deterministic.
Why Not One Big Agent?
My first instinct was a single "supply chain expert" LLM agent with all the context. It failed for three reasons beyond the math problem:
- Context overload — a real supply chain graph with 200+ nodes doesn't fit in a prompt
- Reliability — a monolithic agent hallucinates more; no structured validation between steps
- Debuggability — when it gives a wrong recommendation, you can't tell which step went wrong
The right approach: specialist agents with structured Zod-validated handoffs.
Each agent receives a Zod-validated input and produces a Zod-validated output. No free-form text passes between agents.
Inter-Agent Contracts
The contracts between agents are more important than the agents themselves:
```typescript
import { z } from 'zod';

// Risk Monitor output → Simulation input
export const RiskAssessmentSchema = z.object({
  severity: z.enum(['low', 'medium', 'high', 'critical']),
  affectedNodes: z.array(z.string()),
  riskType: z.enum(['supplier_failure', 'logistics_delay', 'demand_spike', 'geopolitical']),
  confidence: z.number().min(0).max(1),
  estimatedTimeToImpact: z.number().describe('Hours until disruption'),
});

// Simulation output → Recommendation input
export const SimulationResultSchema = z.object({
  baselineDeliveryDays: z.number(),
  disruptedDeliveryDays: z.number(),
  affectedOrders: z.number(),
  estimatedRevenueLoss: z.number(),
  criticalPath: z.array(z.string()),
  alternativeRoutes: z.array(z.object({
    route: z.array(z.string()),
    additionalCostPct: z.number(),
    feasibilityScore: z.number(),
  })),
});
```
If the Risk Monitor returns severity "low" and confidence < 0.4, the orchestrator filters it out before it ever reaches the simulation agent. The orchestrator is the only component with flow control authority.
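That gate is simple enough to express as a small predicate in the orchestrator. A minimal sketch, mirroring the `severity` and `confidence` fields from `RiskAssessmentSchema` (the function name `shouldSimulate` is illustrative, not from the codebase):

```typescript
type Severity = 'low' | 'medium' | 'high' | 'critical';

interface RiskAssessment {
  severity: Severity;
  confidence: number; // 0..1, as constrained by RiskAssessmentSchema
}

// Only low-severity, low-confidence signals are dropped;
// everything else proceeds to the simulation agent.
function shouldSimulate(risk: RiskAssessment): boolean {
  return !(risk.severity === 'low' && risk.confidence < 0.4);
}
```

Keeping this check in the orchestrator, rather than inside any agent, preserves the single point of flow control.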
The Orchestrator: Escalation Ladder
The orchestrator follows a strict escalation ladder to keep costs predictable, and it cuts LLM costs by roughly 60%: about 80% of signals are low-severity noise that never gets past the cheap pre-filter. Only medium-or-higher risks trigger the simulation, and only high/critical risks trigger the expensive recommendation agent.
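The routing logic behind the ladder can be sketched as a pure function that decides which stages a signal triggers once the cheap pre-filter has assigned it a severity (stage names here are illustrative):

```typescript
type Severity = 'low' | 'medium' | 'high' | 'critical';
type Stage = 'full_analysis' | 'simulation' | 'recommendation';

// Each rung is only climbed when the previous one escalates.
function planStages(severity: Severity): Stage[] {
  if (severity === 'low') return []; // ~80% of signals stop here, after ~800 cheap tokens

  // 70b risk analysis, then the deterministic BFS (zero tokens).
  const stages: Stage[] = ['full_analysis', 'simulation'];

  if (severity === 'high' || severity === 'critical') {
    stages.push('recommendation'); // the most expensive rung, ~3,800 tokens
  }
  return stages;
}
```

Because the decision is a pure function of severity, per-signal cost is bounded and easy to audit.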
The Digital Twin: LLMs for Language, Algorithms for Math
The simulation engine is where the architecture decision that changed everything lives.
The BFS traversal propagates delays downstream through the supply graph: if Supplier A is disrupted with a 3-day delay, every node that depends on it accumulates that delay plus transit time. The critical path is the longest delay chain to any customer node. This is deterministic — given the same graph and the same disruption, the output is always the same.
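A minimal sketch of that propagation, assuming a simple adjacency-map graph (field names like `transitDays` are illustrative, not the production model):

```typescript
interface Edge { to: string; transitDays: number; }

// Propagate a disruption delay downstream through the supply graph.
// Returns the accumulated delay (in days) at every reachable node;
// the longest chain to a customer node is the critical path.
function propagateDelay(
  graph: Map<string, Edge[]>, // node -> outgoing edges
  disruptedNode: string,
  delayDays: number,
): Map<string, number> {
  const delays = new Map<string, number>([[disruptedNode, delayDays]]);
  const queue: string[] = [disruptedNode];

  while (queue.length > 0) {
    const node = queue.shift()!;
    const upstreamDelay = delays.get(node)!;
    for (const edge of graph.get(node) ?? []) {
      const arriving = upstreamDelay + edge.transitDays;
      // Keep the worst (longest) delay chain per node.
      if (arriving > (delays.get(edge.to) ?? 0)) {
        delays.set(edge.to, arriving);
        queue.push(edge.to);
      }
    }
  }
  return delays;
}
```

Given the same graph and the same disruption, this always returns the same map — which is exactly the property the LLM lacked.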
The LLM then reads the numeric simulation output and generates the recommendation. Not the numbers — the numbers are fixed. The reasoning about what to do about them, and communicating that clearly to a human decision-maker, is where the LLM adds value.
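One way to enforce that split is to render the fixed numbers into the recommendation prompt verbatim. A hedged sketch, using a subset of the `SimulationResultSchema` fields (the wording and function name are illustrative):

```typescript
interface SimulationResult {
  baselineDeliveryDays: number;
  disruptedDeliveryDays: number;
  affectedOrders: number;
  estimatedRevenueLoss: number;
}

// Every figure the LLM can cite comes verbatim from the deterministic
// engine; the prompt explicitly forbids computing new numbers.
function buildRecommendationPrompt(sim: SimulationResult): string {
  return [
    `Baseline delivery: ${sim.baselineDeliveryDays} days.`,
    `Disrupted delivery: ${sim.disruptedDeliveryDays} days.`,
    `Orders affected: ${sim.affectedOrders}.`,
    `Estimated revenue at risk: ${sim.estimatedRevenueLoss}.`,
    `Recommend mitigations. Cite only the numbers above; do not compute new ones.`,
  ].join('\n');
}
```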
The recommendation agent's output quality depends entirely on the simulation accuracy. If your BFS has a bug — wrong graph traversal, incorrect delay calculations — the LLM will confidently generate recommendations based on wrong numbers. Always validate the deterministic simulation separately before connecting it to the LLM.
What Impressed Users Most
Two features consistently got the strongest reaction in demos.
The "what-if" visualization. A supply chain graph where disrupted nodes glow red and the delay cascade propagates downstream in real time — people immediately understood the impact without explanation. Showing a 7-day customer delay as a red path through the supply graph landed in 3 seconds what a table of numbers couldn't communicate in 30.
Recommendations with numbers. Not just "use an alternative supplier" but: "switching to Supplier Y for 30 days costs an estimated ₹4.2L but reduces delay from 18 days to 6 days — feasibility score 0.84." Numbers make decisions actionable. Vague recommendations get ignored. Specific ones with cost estimates get implemented.
The second feature was only possible because the simulation ran first and produced exact numbers that the LLM could cite. This is the practical proof of "LLMs for language, algorithms for math" — the LLM produces better language when the math is done for it.
Live Dashboard with Supabase Realtime
Every agent run creates an immutable event record in Supabase. The dashboard subscribes to new incidents without polling:
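A sketch of that subscription with supabase-js v2, assuming a public `incidents` table (the table, channel name, and `renderIncident` hook are assumptions for illustration):

```typescript
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!,
);

// Hypothetical UI hook; in IntelliSupply this would update the live graph view.
function renderIncident(row: Record<string, unknown>) {
  console.log('new incident', row);
}

// Push-based: the dashboard receives each new incident row over a
// WebSocket the moment an agent run inserts it — no polling loop.
supabase
  .channel('incident-feed')
  .on(
    'postgres_changes',
    { event: 'INSERT', schema: 'public', table: 'incidents' },
    (payload) => renderIncident(payload.new),
  )
  .subscribe();
```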
The schema stores each agent's structured output as JSONB — giving you a full audit trail, replay capability, and the ability to query historical decisions for model evaluation. When the recommendation agent gives an unusual output, you can replay the exact simulation input that produced it.
Agent Performance
| Agent | Avg Latency | Token Usage | Model |
|---|---|---|---|
| Risk Monitor (pre-filter) | 0.3s | ~800 tokens | llama-3.1-8b (cheap) |
| Risk Monitor (full analysis) | 1.2s | ~2,400 tokens | llama-3.3-70b |
| Simulation | N/A | 0 | Deterministic BFS |
| Recommendation | 2.1s | ~3,800 tokens | llama-3.3-70b |
| Full pipeline (high/critical) | ~3.5s | ~7,000 tokens | — |
Resources
- LangGraph.js docs — State machine orchestration
- Supabase Realtime docs — WebSocket subscriptions
- GitHub — rocker1166 — Demo and architecture
IntelliSupply showed me that multi-agent systems are careful software engineering applied to LLM primitives. The hardest architectural decision — and the one that took longest to accept — was that the LLM shouldn't do the math. An LLM writing narratives about a supply chain disruption is producing its best work. An LLM computing delay cascades through a 200-node graph is producing its worst. Put each component in the role it's actually good at, and the system becomes both more reliable and more impressive. That's the insight. Everything else is implementation.