Shipping AI agents to production (not demos)
Most AI agent demos look magical and fall apart the moment they meet real data, real load and real consequences. The gap between a demo and a system you can run in production is not the model — it is everything around it.
A demo optimizes for the happy path
Demos are built to impress in two minutes: one clean input, one confident output. Production is the opposite. It is the malformed payload, the timeout, the ambiguous request, the edge case nobody scripted. An agent that only works on the happy path is a liability the first time it meets reality.
What production actually requires
When we put agents into finance and treasury workflows — where a wrong action has real cost — a few things are non-negotiable:
- Guardrails over autonomy. The agent proposes; constraints decide what it is allowed to do. High-impact actions are bounded, validated and reversible.
- Human-in-the-loop where it matters. Not every action needs approval, but actions that move money or change state should pause for human sign-off when the agent's confidence is low.
- Observability. Every decision is logged with its inputs, its reasoning and its outcome. If you cannot explain what the agent did, you can neither trust it nor debug it.
- Deterministic edges. The model handles judgement; plain code handles the parts that must be exact. You do not ask an LLM to do arithmetic that a function should do.
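As a rough illustration, the four requirements above could be sketched in Python along these lines. Every name here (ProposedPayment, review, MAX_PAYMENT, MIN_CONFIDENCE, the payee allowlist) is hypothetical and the threshold values are made up; this is a minimal sketch of the pattern, not a description of a real system, and it leaves out reversibility and the approval workflow itself.

```python
import json
import logging
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from enum import Enum

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.audit")

# Hypothetical policy values, chosen only for illustration.
MAX_PAYMENT = Decimal("10000.00")  # hard bound on any single payment
MIN_CONFIDENCE = 0.90              # below this, a human must approve


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    REJECT = "reject"


@dataclass(frozen=True)
class ProposedPayment:
    """What the model proposes. It never executes anything itself."""
    payee: str
    invoice_lines: tuple[str, ...]  # line amounts as decimal strings
    confidence: float               # model-reported confidence, 0.0 to 1.0
    reasoning: str


def total_amount(lines: tuple[str, ...]) -> Decimal:
    """Deterministic edge: exact decimal arithmetic in code, not in the model."""
    return sum((Decimal(x) for x in lines), Decimal("0"))


def review(p: ProposedPayment, allowed_payees: frozenset[str]) -> Decision:
    """Guardrails decide what happens to a proposal; the decision is logged."""
    try:
        amount = total_amount(p.invoice_lines)
    except InvalidOperation:
        amount = None  # malformed amount: never guess, reject below

    if (
        amount is None
        or not amount.is_finite()
        or p.payee not in allowed_payees
        or not (Decimal("0") < amount <= MAX_PAYMENT)
    ):
        decision = Decision.REJECT
    elif p.confidence < MIN_CONFIDENCE:
        decision = Decision.NEEDS_APPROVAL  # pause for a human
    else:
        decision = Decision.EXECUTE

    # Observability: one structured record per decision.
    log.info(json.dumps({
        "payee": p.payee,
        "amount": str(amount),
        "confidence": p.confidence,
        "reasoning": p.reasoning,
        "decision": decision.value,
    }))
    return decision


if __name__ == "__main__":
    payees = frozenset({"ACME GmbH"})
    proposal = ProposedPayment("ACME GmbH", ("0.10", "0.20"), 0.95, "Invoice matches PO")
    print(review(proposal, payees))
```

In this sketch the model's output is only ever a proposal. The bounds, the allowlist and the arithmetic live in plain code that can be unit-tested, and a malformed amount such as "12,50" is rejected rather than interpreted.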
Treat it like software, because it is
The teams that succeed with agents stop treating them as a magic black box and start treating them as software: versioned, tested, monitored, and shipped behind the same controls as the rest of the stack. The model is one component. The system is the product.
That is the difference between a demo and something a treasury team will actually let near their payments — and it is where we spend most of our time.