Agent Engineering: A Framework for Building Production-ready, Remotely Deployed Agents

Parker Hegstrom

There’s no shortage of hype around “fully autonomous AI Agents.” Scroll through Twitter or read your favorite tech blogs, and you’ll find bold claims about Agents that can handle entire workflows, make complex decisions, and operate without human intervention.

The reality? Most Agents aren’t anywhere close to “fully autonomous.” Between the chatbot demo and the deployed Agent running in production, there’s a significant amount of iteration that needs to happen. Most teams don’t have a clear mental model for how to get there.

At Mural Pay, we call this process Agent Engineering.

Over the past few months, our engineering team has been using a simple levels framework to rapidly engineer Agents that can truly run autonomously. This post breaks down that framework. We found it useful not just for our technical team but also for our business development team.

Definitions

Agent: A chain-of-thought-loop execution with access to context, skills, tools, MCP servers or other Agents.

Agent Engineering: The process of iterating on an Agent — using tools like skills, MCP servers, sub-agents, and agent teams — to push that Agent toward its Complexity Boundary.

Complexity Boundary: The upper bound of task complexity an Agent can handle while still maintaining a single entry point (no human-in-the-loop once the chain-of-thought loop has been initiated) and generating repeatable, high-confidence results.

Keep these definitions in mind. They are the north star for everything that follows.

First Principles

Before jumping into the levels framework, a few foundational principles (hot takes?) worth internalizing:

I. Agents are for Tasks, not Roles

Don’t build a “Software Engineer Agent.” Build a set of Agents that can each perform a task a software engineer would do: root cause an error, write a unit test, open a pull request, notify your team the fix is ready. Then orchestrate those Agents via an event handler or cron job for remote deployment, or keep a human in the loop for now.

Agents that are defined by a generic role sound useful, but this is the wrong mindset if you are trying to build production-ready, remotely deployed Agents. Engineers know this. If you want to build something that actually works, you need to build the component pieces. And for Agent Engineering, the component piece is the atomic task an Agent can accomplish.
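The orchestration described above is ordinary code, not another Agent. A minimal sketch, assuming hypothetical event names and stand-in agent functions (a real system would invoke each Agent with a single prompt):

```python
# Hypothetical sketch: route events to task-scoped Agents instead of
# one monolithic "Software Engineer Agent". Each function is a
# stand-in for kicking off a single-prompt Agent invocation.

def root_cause_agent(event: dict) -> dict:
    # Stand-in: would launch an Agent that root-causes the error.
    return {"task": "root_cause", "error_id": event["error_id"]}

def notify_team_agent(event: dict) -> dict:
    # Stand-in: would launch an Agent that posts a summary to chat.
    return {"task": "notify", "channel": event.get("channel", "#eng")}

# The orchestrator is plain code: one atomic Agent per event type.
HANDLERS = {
    "error.raised": root_cause_agent,
    "fix.ready": notify_team_agent,
}

def handle_event(event: dict) -> dict:
    agent = HANDLERS[event["type"]]
    return agent(event)

result = handle_event({"type": "error.raised", "error_id": "E-42"})
```

The same dispatch table could just as easily be driven by a cron job or a queue consumer; the point is that the glue between Agents stays deterministic.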

II. The Best Agent Is One That Doesn’t Need You

The best Agent can accomplish its task without any human intervention. Here’s why this matters so much: if you have to be in the loop, you can never deploy your Agent remotely. An Agent that requires human-in-the-loop can be an incredible force multiplier for you personally — an Iron Man suit that makes you dramatically more effective. But there is only one of you.

If you want an AI coworker rather than an AI assistant, it must be able to run without you. Humans reviewing the output of an Agent’s work is fine and expected — but a human being required to steer the Agent mid-task means that Agent will never scale.

III. Maximize Complexity In Your Agent, But Maintain a Single Entry Point

The goal of Agent Engineering is to maximize your Agent’s Complexity Boundary — the sweet spot where your Agent handles maximum task complexity while maintaining:

  1. A single entry point (you kick it off once)

  2. Repeatable, high-confidence results

Think of a well-engineered Agent as an artifact — like a function in a library. You call it with a single input, it does its thing, and it returns a result. If you find yourself with a “god agent” that requires constant human-in-the-loop to course correct, your Agent is doing too much. Once you’ve found your Agent’s Complexity Boundary, you’re ready to deploy it. Not to run it locally, not to supervise it, but to deploy it as autonomous infrastructure — triggered by events, running on a schedule, operating as a true AI coworker.

That’s the real goal of Agent Engineering: to build Agents that can be deployed.
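One way to make the “Agent as artifact” idea concrete is to give every Agent the same function-like signature: one input, one result, no mid-task interaction. The interface below is an illustration, not a real SDK:

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    ok: bool
    output: str

def open_pr_agent(prompt: str) -> AgentResult:
    """Single entry point: call once, get a result back.
    The body is a stub; a real Agent would run its chain-of-thought
    loop here with no human in the loop."""
    return AgentResult(ok=True, output=f"PR opened for: {prompt}")

# Called like any library function: the shape of a deployable Agent.
result = open_pr_agent("merge feature/login into dev")
```

Anything that fits this shape can be triggered by an event handler or a scheduler; anything that needs mid-task steering cannot.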

The Agent Levels Framework

Think of these levels as a progression. Each level represents a more capable and, most importantly, a more deployable Agent. An Agent at any of the levels below already provides value, but constant movement up the framework is the goal.

Level 1: Basic Chat

What it looks like: You’re working with an AI in a back-and-forth conversation to complete a task. Write me an email. Summarize this document. Help me debug this function.

Products: Claude.ai, Claude Code in conversational mode, ChatGPT.

This is the entry point for most people. Useful, but not scalable. The human is doing all the orchestration.

Level 2: Basic Chat + Access to State

What it looks like: The Agent has read/write access to persistent state — a codebase, a config file, a Notion document or an Asana Task. You’re still working with the Agent, but now it can read context and persist an output to be used in subsequent Agent executions.

Products: Claude Code, ChatGPT with plugins/connectors, Claude with connectors.

This is where an Agent starts to feel genuinely useful. However, you’re still the orchestrator, and the Agent still can’t run without you.

Level 3: The Atomic Task Agent

What it looks like: You’ve codified a repeatable, bounded task into a ‘command’ or ‘skill.’ The Agent can now execute this task reliably from a single prompt, without you guiding it step by step.

Examples:

  • Open a PR from the current branch into dev

  • Write a unit test for a given function

  • Post a summary of today’s Datadog monitor triggers to a Slack channel

  • Produce a product proposal using Gong and meeting notes from our Google Drive

Why this matters: At this level, you have created a reproducible artifact. It can be shared with teammates to run locally or — more importantly — deployed and invoked remotely.

The key discipline here is specificity. Atomic tasks have clear inputs, clear outputs, and a bounded scope. If you can’t describe the task in one sentence, it probably isn’t atomic.
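That discipline can be captured in a tiny spec that forces each task to declare its inputs, its output, and a one-sentence description before it becomes a skill. This is a hypothetical illustration, with a crude one-sentence heuristic, not a real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicTask:
    # One-sentence description; if it won't fit, the task isn't atomic.
    description: str
    inputs: tuple   # the named inputs the Agent receives
    output: str     # the single artifact the Agent produces

    def __post_init__(self):
        # Crude heuristic: more than one period suggests more than
        # one sentence, i.e. the task is probably not atomic.
        if self.description.count(".") > 1:
            raise ValueError("description should be one sentence")

open_pr = AtomicTask(
    description="Open a PR from the current branch into dev.",
    inputs=("branch", "base"),
    output="pull request URL",
)
```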

Level 4: The Meta-Task Agent

What it looks like: Instead of a single atomic task, your Agent handles a meta-task — a composed sequence of atomic tasks that together accomplish something meaningful.

Here’s the insight: people already think in meta-tasks. When you’re fixing a bug, you don’t just “fix the bug.” You:

  1. Root cause the issue

  2. Formulate a plan

  3. Implement the plan

  4. Test your implementation

  5. Open a PR

  6. Notify the team

That’s six atomic tasks chained together under one goal: “fix the bug.” A Level 4 Agent can run that entire sequence from a single prompt.

What you’ll need to manage at this level:

  • Context window: Context Engineering becomes much more important at this level, since a bloated context window will confuse your Agent and degrade its output quality. Consider offloading specific sub-tasks to sub-agents (a Claude term, but generalizable) to preserve the main Agent’s context window.

  • Progressive disclosure of context: Don’t front-load your Agent with every piece of context it might ever need. Introduce additional context, skills, and tools only when the Agent needs them for a given sub-task. This keeps the context window clean and the Agent focused.

  • Runtime Performance: For complex meta-tasks, consider using sub-agents to parallelize work.
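When sub-tasks are independent (say, writing unit tests for several modules), sub-agents can run in parallel while the main Agent keeps only their summaries in context. A minimal sketch using a thread pool, with the sub-agent call stubbed out:

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str) -> str:
    # Stand-in for a sub-agent running in its own fresh context window;
    # only this short summary returns to the main Agent.
    return f"done: {task}"

def run_parallel(tasks: list[str]) -> list[str]:
    # The main Agent's context sees only the summaries, never each
    # sub-agent's full working context.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(sub_agent, tasks))

summaries = run_parallel(["test parser", "test auth", "test billing"])
```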

Level 5: Maximize the Agent’s Complexity Boundary

Level 5 is less a discrete level and more a ~mindset~.

It represents the ongoing work of incrementally expanding the purview of a Level 4 Agent — taking advantage of model improvements, new tooling, better harnesses, and accumulated engineering experience to do more with a single prompt.

Maybe today your Agent confidently handles 3 of the 5 sub-tasks in your meta-task. Tomorrow, with better prompting, a new MCP server, or a smarter sub-agent, it handles 4. This is the work of Level 5: continuously pushing the Complexity Boundary of what your Agent can accomplish autonomously.

But there’s an important caveat: more is not always better. At some point, the correct answer is not to make the Agent do more. It’s to orchestrate multiple Agents via an external mechanism such as an event handler, a cron job, or a workflow engine. Recognizing this inflection point is an art form.

That inflection point is the Agent’s Complexity Boundary.

Conclusion

The path from “I’m using AI to help me code” to “I have deployed Agents running autonomously in production” is not a leap — it’s a deliberate, incremental climb through these levels. Start with atomic tasks. Codify them into skills. Chain them into meta-tasks. Manage context carefully. Expand what a single prompt can handle until you find your Agent’s Complexity Boundary. Then deploy it.

The engineering muscle required to reach Level 4 and Level 5 is real, but it’s learnable. And the payoff — Agents that work while you sleep, that scale beyond what any individual can do, that operate as genuine AI coworkers — is worth it.