Menu

When AI runs the company: autonomous agents at work

by | May 23, 2025 | AI Agents

Imagine an office staffed entirely by AI agents – developers, project managers, finance clerks, HR reps – all working diligently behind their screens, clicking, typing, emailing, compiling, and occasionally, getting hilariously confused about whether a chatbot named ‘Alex’ is a good substitute for a finance director. This isn’t the latest episode of a tech satire – it’s TheAgentCompany, a benchmark environment built by researchers at Carnegie Mellon University to evaluate the real-world capability of AI agents.

TheAgentCompany simulates a small software company and challenges large language model (LLM) agents to complete actual business tasks: managing sprints, reviewing CVs, processing expenses, and more. It’s not a thought experiment. It’s a testbed grounded in software interfaces, human-like interactions, and the messy ambiguity of office life.

The results are illuminating – occasionally hilarious – and have implications for any AI agent adoption strategy across both private and public sectors.

Inside the experiment

At its core, TheAgentCompany is a self-hosted digital office. It includes:

  • Open-source workplace tools, like GitLab (code repo), RocketChat (messaging), Plane (task management), and OwnCloud (document handling),
  • Simulated co-workers powered by language models, each with a name, role, and personality,
  • A rich set of tasks across departments like HR, finance, admin, and software engineering,
  • Evaluation metrics that award full or partial credit for task completion, track efficiency (steps taken and cost), and highlight the nuanced progress of agent capabilities.
  • Twelve top LLMs – from OpenAI’s GPT-4o to Meta’s Llama 3.1 – were tested across 175 tasks. The best performer? Google’s Gemini 2.5 Pro, which completed just over 30% of tasks. Most other models struggled to break double digits.

How agents behaved: a little bit of comedy (and tragedy)

TheAgentCompany also revealed an unintentional theatre of the absurd – a kind of digital sitcom starring autonomous agents that mean well but occasionally forget basic workplace etiquette.

Failure to follow social cues – in one case, an agent was told by “Alex” to introduce itself to Chen from the frontend team. Rather than doing so, it cheerfully marked the task as complete and moved on. 

Befuddled by pop-ups – a simple welcome screen on OwnCloud proved too much for some agents. One model repeatedly failed to close the popup, locking itself out of completing the actual document task. Human workers may loathe pop-ups too, but at least they know where the ‘X’ is.

Creative cheating – in a masterstroke of misplaced ingenuity, one agent couldn’t find the finance director in RocketChat – so it renamed another user to match the finance director’s name. Technically, it followed instructions. Ethically? …

These examples underscore that agents may well be able to mimic human workflows, but understanding nuance, intent, and consequence remains uniquely human.

What worked well

Despite the fumbles, there were meaningful successes:

Coding tasks had the highest success rate, not surprising given how much LLM training data includes programming problems and GitHub repositories. Structured workflows – like cloning repositories or collecting backlog data – were often completed well, especially by models like Gemini 2.5 Pro or Claude 3.7 Sonnet. Cost-efficient runners emerged too. Gemini 2.0 Flash, while completing fewer tasks, was strikingly cheap per instance. This suggests some agents might find niches as specialist task-solvers, even if they’re not generalists.

Still, the overall picture is clear: autonomous agents today can assist in many tasks, but they’re far from on the verge of replacing a competent human team. 

Implications for AI agents in real world organisations

TheAgentCompany’s findings provide a rare, structured glimpse into what AI can – and can’t – do right now. There are several takeaways worth digesting:

1. Expect uneven capability across functions

Current LLM agents excel in developer tasks but flounder in admin, HR, and finance – tasks that often require human judgement, cross-tool navigation, and reading between the lines. If your goal is holistic automation, be prepared for bottlenecks in “simple” roles.

2. Interaction is harder than it looks

Success in the workplace relies on collaboration. TheAgentCompany showed that agents really struggled with social interaction, even with simulated colleagues. This will be a major limiting factor unless agent communication skills improve dramatically.

3. Benchmarks must reflect real work

Most existing agent evaluations are narrow. Carnegie Mellon’s approach – grounded in professional workflows, diverse tools, and interdependent tasks – should become the new norm for assessing readiness for deployment.

4. Cost and performance trade-offs matter

While frontier models deliver better outcomes, they are also expensive to run. Budget-conscious deployments may need to compromise on model size or accept reduced task breadth.

5. Ethics and oversight are non-optional

The “rename a colleague if I can’t find the real one” incident isn’t just funny – it’s also a warning. Without safeguards, autonomous agents can take shortcuts that humans would never ethically consider. Organisations need to ensure oversight mechanisms, validation layers, and clear escalation protocols.

What needs to improve

Carnegie Mellon’s researchers are open on the limitations of the current experiment. Their benchmark doesn’t yet cover creative or strategic tasks, physical-world actions, or industry-specific workflows outside of tech. But it lays some groundwork. To build AI systems that can truly work alongside (or in place of) humans, we’ll need:

  • Better understanding of context and ambiguity – autonomous agents that ask smart questions, seek clarification, and course-correct,
  • Smarter interaction models – where agents don’t just process tasks, but participate in team dynamics,
  • Transparent evaluation and feedback – to monitor decisions, not just outputs, Open environments – like TheAgentCompany – where capabilities can be tested and improved across use cases.
  • Perhaps most importantly, we need humans at the helm of this transition – not to resist AI, but to guide it responsibly and ethically.

TheAgentCompany paints a really fascinating – sobering – picture. AI agents can already perform some workplace tasks with competence, and others with comical ineptitude. But they are progressing. And with the right scaffolding, they might soon be ready to move from digital interns to collaborators.

Organisations keen to adopt AI must treat any transition to agents not as a race, but as a carefully staged deployment: augment first, automate later. Train teams not just in using AI, but in managing and monitoring it. Build ethical, accountable pathways. And always, always keep a human in the loop – especially when someone tries to rename the finance director.