Building a C compiler with a team of parallel Claudes

https://www.anthropic.com/engineering/building-c-compiler

Executive Summary

Anthropic used a method called "agent teams"—multiple instances of Claude (specifically Opus 4.6) working in parallel without human intervention. Over the course of two weeks, 16 agents executed nearly 2,000 sessions to build a 100,000-line Rust-based C compiler. The project cost approximately $20,000 in API fees and resulted in a compiler capable of building a bootable Linux 6.9 kernel and running classic software like Doom.

Key Findings & Insights

1. The Power of Parallel Autonomy

Scale: The agents processed 2 billion input tokens and generated 140 million output tokens.

Specialization: Instead of having one Claude do everything, the team used specialized roles. While some agents wrote core code, others were tasked with coalescing duplicate code, improving performance, critiquing Rust design, or writing documentation.

The "Loop" Strategy: To keep agents working without waiting for human prompts, Nicholas Carlini built a harness that placed Claude in an infinite loop. When one task finished, the agent immediately picked up the next.
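
A minimal sketch of this loop strategy, with illustrative names (`run_agent_session`, `agent_loop`) that are assumptions rather than the article's actual harness:

```python
# Hypothetical sketch of the "infinite loop" harness: as soon as an agent
# finishes one task, it immediately claims the next from a shared queue,
# so it never idles waiting for a human prompt.
from collections import deque

def run_agent_session(task: str) -> str:
    """Stand-in for one autonomous Claude session working on `task`."""
    return f"completed: {task}"

def agent_loop(tasks: deque, max_sessions: int) -> list:
    """Keep the agent busy: pop the next task the moment one finishes."""
    log = []
    while tasks and len(log) < max_sessions:
        task = tasks.popleft()
        log.append(run_agent_session(task))
    return log

log = agent_loop(deque(["parser", "codegen", "optimizer"]), max_sessions=10)
print(log)
```

In the real system each session would be a full model invocation and `max_sessions` a budget cap; the queue discipline shown here is the essential idea.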

2. Engineering for AI Agents

High-Quality Tests are Essential: Because the agents work autonomously, they need a "perfect verifier." If the tests are flawed, the AI will "solve" the wrong problem.
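
One common way to approximate a "perfect verifier" is differential testing: check a candidate implementation against a trusted reference on many inputs, so an agent cannot pass by solving the wrong problem. The toy functions below are purely illustrative, not from the article:

```python
# Differential-testing sketch of the "perfect verifier" idea: a candidate
# only passes if it agrees with a trusted reference on every tested input.
def reference_abs(x: int) -> int:
    """Trusted reference implementation."""
    return x if x >= 0 else -x

def candidate_abs(x: int) -> int:
    """Candidate the agent produced."""
    return abs(x)

def verify(candidate, reference, inputs) -> bool:
    """True only if candidate matches the reference everywhere tested."""
    return all(candidate(x) == reference(x) for x in inputs)

ok = verify(candidate_abs, reference_abs, range(-1000, 1000))
print(ok)  # True: the candidate matches the reference on all tested inputs
```

For the compiler itself, the analogous check would compare the behavior of compiled programs against a reference compiler's output on the same test suite.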

Addressing "Time Blindness": AI agents don't perceive time and may spend hours on redundant tests. The team had to implement deterministic sub-sampling (running only 1%–10% of tests) to provide fast feedback.
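
Deterministic sub-sampling can be done by hashing each test's name, so the same small subset is selected on every run. The hashing scheme below is an assumption, not the article's actual implementation:

```python
# Hedged sketch of deterministic test sub-sampling: hash each test name so
# the same ~10% subset is chosen on every run, giving fast feedback that is
# still reproducible (no flaky randomness between sessions).
import hashlib

def in_sample(test_name: str, percent: int) -> bool:
    """Deterministically include roughly `percent`% of tests."""
    digest = hashlib.sha256(test_name.encode()).digest()
    return digest[0] * 100 // 256 < percent  # map byte 0..255 onto 0..99

tests = [f"test_{i:04d}" for i in range(1000)]
subset = [t for t in tests if in_sample(t, 10)]
```

Because selection depends only on the test name, an agent sees identical results across repeated runs, while the full suite can still be run before merging.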

Managing Merge Conflicts: Using a simple synchronization algorithm (text file "locks" on specific tasks), the agents managed a shared Git repository. While merge conflicts were frequent, the models were capable of resolving them autonomously.
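
A text-file lock of this kind can be sketched as follows; the file layout and names here are assumptions, but the core trick (atomic exclusive file creation) is standard:

```python
# Illustrative sketch of text-file "locks" on tasks: an agent claims a task
# by atomically creating a lock file. If the file already exists, another
# agent owns that task and the claim fails.
import os
import tempfile

def try_claim(lock_dir: str, task: str, agent_id: str) -> bool:
    """Return True if this agent successfully claims the task."""
    path = os.path.join(lock_dir, f"{task}.lock")
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one agent wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(agent_id)  # record the owner for debugging
    return True

lock_dir = tempfile.mkdtemp()
first = try_claim(lock_dir, "register-allocator", "agent-3")
second = try_claim(lock_dir, "register-allocator", "agent-7")
print(first, second)  # True False: only the first claim succeeds
```

This prevents two agents from editing the same subsystem at once, while ordinary Git merges handle the remaining overlap.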

3. Capabilities & Benchmarking

Performance: The resulting compiler is impressive but not yet a threat to industry-standard compilers. The code it generates is less efficient than what GCC produces even with all optimizations disabled.

Limitations: The agents struggled with extreme architectural constraints, such as the 16-bit x86 "real mode" required to boot Linux, eventually "cheating" by calling out to GCC for that specific stage.

Clean Room Implementation: The project was done without internet access, so the agents relied entirely on their internal training knowledge of C and Rust.

4. The "Uneasy" Future

Productivity vs. Risk: Carlini notes that this approach allows for an "enormous amount of new code" to be written very quickly. However, he expressed concern about safety, as autonomous systems might pass tests while harboring subtle vulnerabilities that a human would have caught.