June 20, 2026

Running an AI Agent Team: What Actually Works

Two weeks operating 12 Paperclip agents taught me more about AI orchestration than any benchmark.

The setup

I run 12 AI agents through Paperclip, an open-source agent orchestration platform. Six agents work at my company (Alex Kerber AB, internally “Aliens”): a CEO, CTO, CPO, CMO, COO, and QA. Another six work at StarDust Meet (“Matrix”): same structure plus a community manager.

Each agent is an LLM wired into OpenClaw, a gateway that handles tool access, session management, and communication. They wake up on a heartbeat timer, check their inbox for assigned issues, do the work, and go back to sleep. Sounds clean. It’s not. It’s a productive, instructive mess.

This isn’t a benchmark. It’s field notes from two weeks of agent ops.

Orchestration vs autonomy: pick one

The appeal of autonomous agents is obvious: set them loose, let them work, wake up to a pile of completed tasks. That’s the demo. In practice, autonomy without orchestration is chaos.

My agents have defined roles. Ripley is CEO; she doesn’t write code. Bishop is CTO; he doesn’t do marketing copy. When an issue lands in the wrong inbox, it sits there. When nobody creates issues, the agents wake up, find nothing, and go back to sleep. I’ve had days where all 12 agents ran five heartbeats each and accomplished zero work because the queue was empty.

Agents don’t self-generate work. You need a human (or a very persistent orchestrator agent) feeding the pipeline. My setup has Henry, a COO agent that sits above the Paperclip hierarchy, checks agent status, creates issues, and escalates blockers. Henry is the only reason the other 11 agents ever do anything useful.

Autonomy works for well-defined, single-agent tasks. Orchestration works for everything else. Most real work is everything else.

Token cost management

Each agent heartbeat is an LLM call. Twelve agents, four-hour intervals, 30-day billing cycle. That’s 2,160 API calls a month just for heartbeats. Most of those calls return “no work queued, heartbeat complete.” Tokens spent to learn nothing.
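The arithmetic is worth keeping as a small function so you can try other intervals. This is a sketch; the function name is mine, and it counts calls only, since token prices vary by model.

```python
HOURS_PER_DAY = 24


def heartbeat_calls(agents: int, interval_hours: float, days: int = 30) -> int:
    """Number of heartbeat LLM calls in one billing cycle."""
    beats_per_day = HOURS_PER_DAY / interval_hours
    return round(agents * beats_per_day * days)


print(heartbeat_calls(12, 4))    # 2160: twelve agents on a four-hour interval
print(heartbeat_calls(12, 0.5))  # 17280: the same fleet on a 30-minute interval
```

Dropping from 30 minutes to four hours cuts heartbeat calls by a factor of eight before any model choice enters the picture.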

I picked GLM-5.2 Cloud as the heartbeat model. Not because it’s the smartest (it isn’t), but because it’s cheap enough that empty heartbeats don’t hurt, and smart enough that when there IS work, it can handle straightforward issues. For complex work, I override to a stronger model on specific runs.

The knobs that matter:

  • maxTurnsPerRun: Cap this. An agent that gets stuck in a loop will burn through your budget in one session. I cap it at 50 turns. Most tasks complete in under 10.
  • skillAllowlist: Don’t give every agent every tool. My CMO doesn’t need shell access. My QA agent doesn’t need to send Telegram messages. Restrict tools to what the role requires.
  • Heartbeat interval: Four hours, not 30 minutes. The default was too aggressive. Most issues don’t need a 30-minute response time, and the token cost adds up fast.
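To make the first two knobs concrete, here is a minimal sketch of what they enforce. It is not Paperclip’s or OpenClaw’s code: the `AgentLimits` class, `check_tool`, `run_with_cap`, and the tool names are all hypothetical, and only the knob names and the values 50 and four hours come from my setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentLimits:
    # Hypothetical container; fields mirror the knobs described above.
    max_turns_per_run: int = 50
    skill_allowlist: frozenset = frozenset()
    heartbeat_interval_hours: float = 4.0


class TurnLimitExceeded(RuntimeError):
    """Raised when a run uses up its turn budget without finishing."""


def check_tool(limits: AgentLimits, tool: str) -> None:
    """Refuse any tool that is not on the agent's allowlist."""
    if tool not in limits.skill_allowlist:
        raise PermissionError(f"tool {tool!r} is not allowed for this agent")


def run_with_cap(limits: AgentLimits, step) -> int:
    """Call step() until it returns True; return the number of turns used."""
    for turn in range(1, limits.max_turns_per_run + 1):
        if step():
            return turn
    raise TurnLimitExceeded(f"gave up after {limits.max_turns_per_run} turns")


# Example: a CMO that may search the web but has no shell access.
cmo = AgentLimits(skill_allowlist=frozenset({"web_search"}))
check_tool(cmo, "web_search")  # passes silently
```

The point of the cap is that a stuck agent fails loudly at turn 50 instead of quietly spending the budget.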

Auth and key management: where it all falls apart

If there’s one thing that will take down your entire agent fleet simultaneously, it’s key management. I’ve had three full outages in two weeks, all caused by the same thing: JWT tokens signed with the wrong secret.

Paperclip uses a JWT signing secret to validate agent tokens. Each agent has its own token file. When the secret changes (or when someone regenerates tokens with the wrong secret), every single agent starts getting 401s. The agents don’t recover gracefully. They flip to error status and stay there.

The fix is manual: regenerate all 12 tokens with the correct secret, save them to the right files, and reset each agent to idle. That’s a 20-minute task when you know what’s wrong. When you don’t, it’s hours of debugging.

Lessons:

  1. One secret, one place. Don’t keep copies of the signing secret in multiple .env files with different values. (Yes, I did this. Yes, it caused an outage.)
  2. Token validation should be part of the heartbeat. An agent that can’t authenticate should report it clearly, not silently flip to error and wait for a human to notice.
  3. Test auth after any config change. Not “test one agent.” Test all of them. I had nine agents working and three broken for two days because I only tested the one I was actively working with.
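Lesson 3 is easy to automate. The sketch below checks every agent’s token signature against the one signing secret and lists the agents that would get a 401. It assumes the tokens are standard HS256 JWTs, which I haven’t confirmed for Paperclip, and it only checks the signature, not expiry or claims. The helper names and the example secrets are made up; `sign_hs256` exists only so the example can mint test tokens.

```python
import base64
import hashlib
import hmac
import json


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(part: str) -> bytes:
    # JWTs strip base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Mint an HS256 JWT (used here only to build test tokens)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
        for obj in (header, payload)
    )
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(sig)}"


def signature_ok(token: str, secret: bytes) -> bool:
    """True if the token's HS256 signature matches the given secret."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        sig = _b64url_decode(sig_b64)
    except ValueError:  # wrong number of parts, non-ASCII, or bad base64
        return False
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return hmac.compare_digest(sig, expected)


def broken_agents(tokens: dict[str, str], secret: bytes) -> list[str]:
    """Names of all agents whose token fails verification."""
    return sorted(name for name, tok in tokens.items() if not signature_ok(tok, secret))


# Example: one token signed with the current secret, one with a stale copy.
current, stale = b"current-secret", b"stale-secret"
tokens = {
    "Ripley": sign_hs256({"sub": "Ripley"}, current),
    "Bishop": sign_hs256({"sub": "Bishop"}, stale),
}
print(broken_agents(tokens, current))  # ['Bishop']
```

Run something like this over all 12 token files after every config change, not just the agent you happen to be working with.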

The recovery loop bug

Paperclip has a feature called issue_assignment_recovery. When an agent crashes or times out mid-task, the recovery system re-dispatches the issue. Good idea, bad implementation.

The bug: recovery doesn’t check if the issue is already done or cancelled. It just re-queues. So an agent completes a heartbeat issue, marks it done, and five minutes later, the recovery system puts it back in the queue. The agent picks it up again, does the same work, marks it done, and the cycle repeats.

I found this at 2am during a marathon session where my CPO agent had processed 84+ identical recoveries of the same stale issues. The agent wasn’t doing new work. It was just closing the same tickets over and over.

The workaround was a daily cron job that scans for heartbeat issues in todo or in_progress status and force-closes them. The real fix, filtering terminal statuses from the recovery queue, is still pending.
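Both the pending fix and the workaround come down to a status filter. This sketch shows the logic on plain dicts. It is not Paperclip’s recovery code: the `kind` field, the issue shape, and the exact status strings `done` and `cancelled` are assumptions for illustration.

```python
TERMINAL_STATUSES = frozenset({"done", "cancelled"})
OPEN_STATUSES = frozenset({"todo", "in_progress"})


def recoverable(issues: list[dict]) -> list[dict]:
    """The real fix: never re-dispatch an issue that already reached a terminal status."""
    return [i for i in issues if i["status"] not in TERMINAL_STATUSES]


def stale_heartbeat_issues(issues: list[dict]) -> list[dict]:
    """The cron workaround: heartbeat issues still open, to be force-closed."""
    return [
        i for i in issues
        if i["kind"] == "heartbeat" and i["status"] in OPEN_STATUSES
    ]


issues = [
    {"id": 1, "kind": "heartbeat", "status": "done"},
    {"id": 2, "kind": "heartbeat", "status": "in_progress"},
    {"id": 3, "kind": "task", "status": "todo"},
    {"id": 4, "kind": "task", "status": "cancelled"},
]
print([i["id"] for i in recoverable(issues)])             # [2, 3]
print([i["id"] for i in stale_heartbeat_issues(issues)])  # [2]
```

With the filter in place, issues 1 and 4 never re-enter the queue, which is what breaks the loop.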

This is the kind of bug that only surfaces in production. In testing, you don’t run enough cycles to see the loop. In production, it eats your agents’ entire capacity.

What two weeks actually taught me

  1. Empty queues are the default state. Agents don’t find work. You feed them work. Most of my agent operations time is spent figuring out what to create and assign, not managing agents that are working.

  2. Error states are sticky. When an agent errors, it doesn’t self-heal. It sits in error until a human resets it. Build monitoring that catches this. I have a cron that checks agent status and alerts when something is stuck.

  3. The cheapest agent is the one that doesn’t run. Every heartbeat costs tokens. Every empty heartbeat is waste. Tune your intervals, use cheap models for heartbeats, and don’t be afraid to disable heartbeats for agents that don’t have work.

  4. Hierarchy matters. Flat agent structures sound egalitarian, but you need someone at the top whose job is to see the whole board, create work, and escalate. My COO agent (Henry) is the most valuable agent in the fleet, and he never writes a line of code.

  5. Agents are not developers. They’re interns with perfect memory and zero judgment. Give them well-defined tasks with clear success criteria, and they’re fantastic. Give them ambiguous problems, and they’ll burn your budget going in circles.
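The monitoring from point 2 doesn’t need to be fancy. Here is a rough sketch of the check a cron job could run, assuming you can get each agent’s status and the time it last changed. The record shape, the `status_since` field, and the one-hour threshold are my assumptions, not anything Paperclip exposes under those names.

```python
from datetime import datetime, timedelta, timezone


def stuck_agents(agents: list[dict], now: datetime,
                 max_error_age: timedelta = timedelta(hours=1)) -> list[str]:
    """Names of agents that have been in error status for at least max_error_age."""
    return sorted(
        a["name"] for a in agents
        if a["status"] == "error" and now - a["status_since"] >= max_error_age
    )


now = datetime(2026, 6, 20, 12, 0, tzinfo=timezone.utc)
fleet = [
    {"name": "Ripley", "status": "idle", "status_since": now - timedelta(hours=5)},
    {"name": "Bishop", "status": "error", "status_since": now - timedelta(hours=3)},
    {"name": "Henry", "status": "error", "status_since": now - timedelta(minutes=10)},
]
print(stuck_agents(fleet, now))  # ['Bishop']
```

The threshold keeps a transient failure from paging you while still catching an agent that has sat in error for hours.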

Where this is going

The current state of agent orchestration is roughly where CI/CD was in 2012. Everybody knows it’s the future and the tools exist, but most people are still doing it manually. Paperclip is the Jenkins equivalent: powerful, open-source, and full of sharp edges.

The next leap isn’t smarter models. It’s better plumbing. Auth that doesn’t break. Recovery that doesn’t loop. Queues that aren’t empty. Monitors that catch errors before humans do. The model is already smart enough. The infrastructure isn’t.

I’m going to keep running 12 agents because the upside is real: a small team that ships like a large one. But I’m under no illusion that the hard part is the AI. The hard part is everything around it.
