What "autonomous agents" actually means in a production context

A client once told us they wanted to build an "autonomous procurement agent." When we dug in, what they actually needed was a system that checked a spreadsheet each morning and sent a Slack message if inventory fell below a threshold. Useful. Worth building. Not autonomous by any meaningful definition.

This isn't a criticism of them — the framing came from the vendor pitching them. But it meant we spent the first two weeks of the project unwinding expectations before we could start designing anything real.

So before we take any engagement, we ask clients to place their idea on a spectrum of autonomy. It clarifies scope, it sets honest expectations, and it usually changes what they decide to build.

The autonomy spectrum

We use four categories internally. They're not industry-standard terminology — they're just what we've found useful in client conversations.

Triggered automation

An event fires. A predefined sequence of actions executes. The system doesn't reason — it responds. A webhook triggers a notification, a schedule triggers a report, a threshold triggers an alert.

Most of what gets called an "AI agent" in vendor marketing is this. It's valuable. It reduces manual work. But calling it autonomous is a stretch, because the entire decision tree is hardcoded by whoever set it up.
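As a rough sketch of what this category looks like in code: the function below assumes a hypothetical inventory mapping, a made-up `REORDER_THRESHOLD`, and a `notify` callback standing in for whatever sends the Slack message. None of these names come from a real client system. The point is that the only "decision" is a fixed comparison written in advance.

```python
from typing import Callable

# Hypothetical threshold; in practice this would live in configuration.
REORDER_THRESHOLD = 20


def check_inventory(stock: dict[str, int], notify: Callable[[str], None]) -> list[str]:
    """Send one alert per item below the threshold and return the flagged items."""
    low = sorted(item for item, qty in stock.items() if qty < REORDER_THRESHOLD)
    for item in low:
        # The message format is illustrative, not a real Slack payload.
        notify(f"Low stock: {item} ({stock[item]} left)")
    return low
```

Calling `check_inventory({"bolts": 5, "nuts": 120}, notify=print)` would alert on `bolts` only. There is no planning, no tool selection, and no state carried between runs.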

Supervised agent

The system can reason over a task — plan a series of steps, use tools, adapt if a tool call fails — but requires human sign-off at defined checkpoints. You'd use this where the downside of an error is significant: modifying a customer record, sending a legal document, committing a purchase above a certain value.

Most of what we build falls into this category. The agent handles the cognitive labour. A human approves before anything irreversible happens.

Delegated agent

The system operates independently within a clearly scoped domain. It can make decisions, take actions, and adapt — but only on tasks explicitly delegated to it. Think of an agent that fully owns a company's supplier invoice processing: it receives invoices, validates them against purchase orders, flags exceptions, and routes for payment — without human review of each one.

This requires well-defined scope, good tooling around edge case detection, and audit logging so someone can review what happened. When these things are in place, full delegation is practical and often faster than keeping a human in the loop.

General agent

A system that can reason about open-ended goals, break them into subgoals, spawn sub-agents, and manage the results, all without a predefined task structure. This is largely research territory for now. The foundations are moving fast, but production systems that operate reliably at this level are rare.

We don't build these. We watch the research carefully, and we're integrating techniques from this space into more constrained deployments. But anyone claiming they're shipping general autonomous agents into production at scale is either defining "production" loosely or they've solved problems the rest of the field hasn't.

Why the distinction matters for how you build

Each category has different engineering and operational requirements. This is the practical reason we care about placing a project on the spectrum before designing anything.

A triggered automation needs clear event definitions and error handling. That's it. You don't need a reasoning loop, you don't need tool registration, you don't need memory management.

A supervised agent needs the things a triggered automation can skip (a reasoning loop, tool registration, memory management) plus three more: a way to communicate its plan before executing, a checkpoint mechanism that pauses execution at human decision points, and a clear handoff UI so the human reviewer actually has what they need to make a good decision quickly. Skip any of those and you'll build a system that's theoretically supervised but practically never reviewed, which means you've built a delegated agent without the engineering that a delegated agent requires.
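One way to picture the checkpoint mechanism is the minimal sketch below. It assumes a hypothetical `Step` type and an `approve` callback that stands in for the handoff UI; a real system would persist the paused state instead of blocking in-process. The reviewer receives the whole plan alongside the pending step, so the decision is made with context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], str]
    irreversible: bool = False  # irreversible steps pause for human sign-off


def run_plan(steps: list[Step], approve: Callable[[list[str], Step], bool]) -> list[str]:
    """Run steps in order, asking `approve` before each irreversible one.

    Execution stops at the first rejected step; earlier results are kept.
    """
    plan = [s.description for s in steps]  # communicated to the reviewer up front
    results = []
    for step in steps:
        if step.irreversible and not approve(plan, step):
            results.append(f"stopped before: {step.description}")
            break
        results.append(step.action())
    return results
```

In this sketch, a reversible step such as drafting an email runs without review, while a step marked `irreversible=True` (sending it) waits on `approve`. Which steps count as irreversible is a design decision made per deployment, not something the code can infer.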

A delegated agent needs scope guards — hard limits on what the agent can and can't touch — plus anomaly detection, audit logging, and a clear escalation path for when the agent encounters something outside its scope. Without scope guards, you've built something that can plausibly do anything, which means operators won't trust it and it'll run with so many human checkpoints that you've accidentally built a supervised agent.
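A scope guard can be as simple as an allow-list checked before every action. The sketch below is illustrative only: the action names, the `MAX_PAYMENT` limit, and the `ScopedAgent` class are all assumptions, and anomaly detection is left out. It shows the guard, the audit log, and the escalation outcome working together.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list: the only actions this agent may take on its own.
ALLOWED_ACTIONS = frozenset({"validate_invoice", "route_for_payment", "flag_exception"})
MAX_PAYMENT = 10_000.00  # hypothetical hard limit on amounts the agent may route


@dataclass
class ScopedAgent:
    audit_log: list = field(default_factory=list)

    def out_of_scope_reason(self, action: str, amount: float) -> str:
        """Return an empty string if the request is in scope, else the reason."""
        if action not in ALLOWED_ACTIONS:
            return f"action {action!r} is not delegated"
        if action == "route_for_payment" and amount > MAX_PAYMENT:
            return f"amount {amount:.2f} exceeds limit {MAX_PAYMENT:.2f}"
        return ""

    def perform(self, action: str, amount: float = 0.0) -> str:
        reason = self.out_of_scope_reason(action, amount)
        outcome = "escalated" if reason else "executed"
        # Every request is logged, including the ones the guard refused.
        # A real agent would carry out the action here; this sketch only
        # records the decision.
        self.audit_log.append(
            {"action": action, "amount": amount, "outcome": outcome, "reason": reason}
        )
        return outcome
```

With this shape, `ScopedAgent().perform("delete_supplier")` returns `"escalated"` and leaves a log entry explaining why. The guard lives outside the agent's reasoning, so staying in scope doesn't depend on the model deciding correctly.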

The question clients almost never ask

Most conversations about autonomous agents focus on what the agent can do. Can it write emails? Can it query databases? Can it make bookings?

The question that actually determines whether a deployment succeeds is: what can the agent not do, and how does it know when it's reached that boundary?

This is scope definition, and it's unglamorous work. It's also where most projects that fail in production run into trouble. The agent works fine in testing, because testing was designed around the happy path. Then in production it encounters an invoice with a missing field, or a supplier name spelled two different ways in two systems, or an ambiguous approval status, and the system either fails silently or proceeds incorrectly.

Good scope definition means: you have a list of inputs the agent is designed to handle, you have a list of edge cases the agent should detect and escalate, and you've tested both. Not just the inputs. The edges.
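To make "test the edges" concrete, here is a minimal sketch built on the invoice example. The field names, the status values, and the `escalation_reason` helper are hypothetical. The cases table lists the happy path and the edges side by side, so a missing edge is as visible as a missing feature.

```python
from typing import Optional

# Hypothetical invoice schema and the approval statuses the agent understands.
REQUIRED_FIELDS = ("supplier", "amount", "po_number", "approval_status")
KNOWN_STATUSES = {"approved", "rejected"}


def escalation_reason(invoice: dict, po_suppliers: dict[str, str]) -> Optional[str]:
    """Return why an invoice must be escalated, or None if it is in scope."""
    missing = [f for f in REQUIRED_FIELDS if invoice.get(f) in (None, "")]
    if missing:
        return f"missing field(s): {', '.join(missing)}"
    expected = po_suppliers.get(invoice["po_number"])
    if expected is None:
        return "unknown purchase order"
    if expected != invoice["supplier"]:
        return f"supplier mismatch: {invoice['supplier']!r} vs {expected!r}"
    if invoice["approval_status"] not in KNOWN_STATUSES:
        return f"ambiguous approval status: {invoice['approval_status']!r}"
    return None


PO_SUPPLIERS = {"PO-1": "Acme Ltd"}
GOOD = {"supplier": "Acme Ltd", "amount": 120.0,
        "po_number": "PO-1", "approval_status": "approved"}

# Each case: (label, invoice, should_escalate).
CASES = [
    ("happy path", GOOD, False),
    ("missing amount", {**GOOD, "amount": None}, True),
    ("supplier spelled differently", {**GOOD, "supplier": "ACME Limited"}, True),
    ("ambiguous status", {**GOOD, "approval_status": "pending?"}, True),
]


def failing_cases() -> list[str]:
    """Return labels of cases where the escalation decision is wrong."""
    return [
        label for label, invoice, should_escalate in CASES
        if (escalation_reason(invoice, PO_SUPPLIERS) is not None) != should_escalate
    ]
```

An exact string comparison on supplier names is deliberately strict here: the sketch escalates on any mismatch without guessing that two spellings mean the same supplier. Whether to add fuzzy matching is a scope decision, and it would need its own edge cases.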

Autonomy is a trust relationship

The reason "autonomous" is worth defining carefully isn't philosophical. It's practical. Operators grant agents access to systems. They allow them to take actions. The level of autonomy they're comfortable granting corresponds directly to three things: how well they understand what the agent will do, how confident they are that it will stay within its scope, and what the recovery path is when something goes wrong.

We've seen well-designed agents lose operational trust because the team that built them couldn't answer these questions clearly. And we've seen simple, narrow automations earn genuine confidence because operators understood exactly what they did.

Autonomy isn't a feature you ship. It's a property that gets granted by the people who run the system, based on accumulated evidence that the system does what you say it does and doesn't do what you say it won't.

Build for that. The terminology will sort itself out.