The Code Quality Battle Royale
Pitting 6 AI agents against each other to see who writes better code
Written by Tim Gilboy
Depending on which article you read, you’ll see someone claim either that AI is going to be able to write any piece of software or that AI generates complete slop that isn’t maintainable or usable long term.
I think we all know the reality lies somewhere in between, but I wanted to see if I could test out some objective metrics for the quality of the code produced by different AI agents and see how they compared against each other. So I pitted 6 AI agents against each other to see who would come out on top.
The contenders:
- Lovable
- Bolt
- V0
- Replit
- Claude Code
- Cursor
There are a whole host of additional agents I could have tested here (Copilot, Zed, Gemini CLI, RooCode, Cline, Codex, etc.), but I wanted to try out these 6 as a first pass to cover the browser-based agents, a CLI-based agent, and an IDE-based agent.
tl;dr: Bolt and Claude Code easily produced the best-quality code for this simple task. Claude wound up being a bit more verbose in places, but it outperformed all the other agents in terms of complexity and maintainability for something so simple.
Now for the details
The Task
The goal was to give each agent an identical task that could ideally be used as a one-shot prompt. I wanted to start with a relatively simple task that each of the agents should be able to handle well and that shouldn’t need a ton of code complexity.
Each agent was asked to create a simple personal finance tracking web app that could track individual transactions, monitor budgets for specific categories, and provide interactive charts and easy filtering to let the user drill into the details.
Every agent succeeded. Kind of…
The four browser-based agents nailed this task - probably not too surprisingly. Generating a simple web app is right in their wheelhouse, and they all delivered the basic functionality, though with relatively simplistic (and similar) designs.
Cursor and Claude needed a bit more nudging to actually wind up in the right state. Both took a more multi-step approach, generating a plan and then needing a slight nudge from me (mostly telling them to keep going/continue) to progress through it, while the browser-based agents did everything in a single step. Both Cursor and Claude then ran into the same bug in their implementations: neither had properly defined how to add a new transaction to the tracker. But it was easy enough to get them to fix that issue with a single prompt.
Claude’s take on a simple personal finance dashboard
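For what it’s worth, the fix was mundane. The sketch below is a hypothetical reconstruction rather than either agent’s actual code, but it captures the shape of the problem: the form’s submit handler relied on an add-transaction function that had never been defined, so nothing was ever saved until the follow-up prompt filled it in.

```js
// Hypothetical reconstruction, not either agent's actual code: the submit
// handler was calling addTransaction(), but no such function existed, so
// submitting the form did nothing. The one-prompt fix amounted to actually
// implementing it, roughly like this.
const transactions = [];

function renderTransactions() {
  // Minimal stand-in for the apps' real rendering logic.
  const list = document.querySelector("#transaction-list");
  list.innerHTML = transactions
    .map((tx) => `<li>${tx.description}: ${tx.amount} (${tx.category})</li>`)
    .join("");
}

function addTransaction(tx) {
  transactions.push(tx);
  renderTransactions();
}

document.querySelector("#transaction-form").addEventListener("submit", (event) => {
  event.preventDefault();
  const form = event.target;
  addTransaction({
    description: form.elements.description.value,
    amount: Number(form.elements.amount.value),
    category: form.elements.category.value,
  });
});
```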
Different designs and approaches
Almost all of the agents structured their projects differently, and those structures tended to reflect the core approach each agent takes to writing software.
The browser-based agents all took their standard project structure and approach:
- Lovable - created a React/TypeScript project with a feature-based architecture.
- Bolt - also used a React/TypeScript project, but took a slightly more streamlined approach than Lovable.
- V0 - used a Next.js router architecture with TypeScript. Probably the most complex overall approach of any of the agents I looked at.
- Replit - used a Python backend and a JavaScript frontend.
Lovable’s project structure, for example:
loveable-code-quality-analysis/
├── public/
├── src/
│ ├── components/
│ ├── hooks/
│ ├── lib/
│ ├── pages/
│ ├── types/
│ ├── utils/
│ ├── App.css
│ ├── App.tsx
│ ├── index.css
│ ├── main.tsx
│ └── vite-env.d.ts
├── bun.lockb
├── components.json
├── eslint.config.js
├── index.html
├── package-lock.json
├── package.json
├── postcss.config.js
├── README.md
├── tailwind.config.ts
├── tsconfig.app.json
├── tsconfig.json
├── tsconfig.node.json
└── vite.config.ts
Cursor and Claude took simpler approaches.
Cursor used vanilla JS with a single app file rather than going for a more component-based approach.
cursor-code-quality-analysis/
├── assets/
├── src/
│ ├── app.js
│ └── styles.css
├── node_modules/
├── index.html
├── package-lock.json
├── package.json
└── README.md
Claude went one step simpler still and just set up a flat project with a single HTML, JS, and CSS file.
claude-code-quality-analysis/
├── index.html
├── README.md
├── script.js
└── styles.css
I wouldn’t say there was anything inherently better or worse about the approach each agent took - although Claude’s might be so minimal that it would need to be significantly expanded if we were looking to extend the project in the future.
The front-end agents struggle with quality (except for Bolt)
There isn’t a completely objective way to measure code quality across different approaches for a project like this, but I wanted to see if a few quantitative metrics could capture the high-level differences in quality across agents. The major metrics I looked at were:
- Complexity - both Cyclomatic and Cognitive Complexity, to understand how difficult the code would be for a developer to understand (there’s a short illustration of the difference between the two after these lists).
- Halstead Volume - another way to represent complexity and a proxy for broader maintainability.
- Maintainability Index - a proxy for maintainability and bug risk.
- Function Length - I didn’t care at all about average length here, but checking for overly long functions was a good indicator of whether or not the code had quality issues.
I also looked at a handful of other common code smells including:
- Overly Large Classes
- Duplicate Code
- Complex Conditionals
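To make the two complexity metrics concrete, here’s a small illustrative function (written for this post, not pulled from any agent’s output). Cyclomatic complexity adds a point for every decision point, while cognitive complexity also penalizes nesting, which is why deeply nested code scores so much worse on it.

```js
// Illustrative only - not code from any of the agents. Cyclomatic complexity
// adds 1 per decision point; cognitive complexity (as SonarSource defines it)
// adds 1 per control structure plus 1 for each level of nesting it sits under.
function categoryTotal(transactions, category) {
  let total = 0;
  for (const tx of transactions) {    // cyclomatic +1, cognitive +1
    if (tx.category === category) {   // cyclomatic +1, cognitive +2 (nested once)
      if (tx.amount > 0) {            // cyclomatic +1, cognitive +3 (nested twice)
        total += tx.amount;
      }
    }
  }
  return total; // cyclomatic 4 (1 base + 3), cognitive 6
}
```

A function only needs to pick up a few more branches and levels of nesting before it blows past the > 8 threshold used below.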
The browser-based agents tended to structure their projects with more, smaller files (especially compared with Claude Code). But that didn’t lead to less complex code - if anything, their individual functions tended to be much longer and more complex than what I saw from Cursor or Claude.
Replit and Lovable in particular each had a number of highly complex (both cognitive and cyclomatic complexity > 8) functions in their solutions. V0 and Bolt were better, with few (2 each) high-complexity functions, but V0’s most egregious functions still had cyclomatic complexity > 10.
Cursor had similar issues to Replit and Lovable, with 9 highly complex functions (in a lot less code than Replit and Lovable generated), although its most egregious functions weren’t quite as complex.
When it came to complexity, Claude matched Bolt and V0 with only 2 highly complex functions, and its most complex functions weren’t as bad as Bolt’s or V0’s.
| Agent | Max Cognitive Complexity | Number of High-Complexity Functions |
|---|---|---|
| Lovable | 22 | 8 |
| Replit | 20 | 8 |
| Cursor | 10 | 9 |
| V0 | 20 | 2 |
| Bolt | 12 | 2 |
| Claude | 9 | 2 |
From a broader maintainability perspective we saw similar trends, with a couple of notable exceptions. When it came to the Maintainability Index, all 6 were relatively similar, with Lovable and V0 falling behind the others. In general, all of the code generated by the agents ranked as relatively maintainable, with the variance really only showing up in some of the more complex files.
Halstead Volume was where we saw some of the biggest variation between agents. The browser-based agents all had a lower average Halstead Volume than Cursor or Claude, but that masked some particularly unwieldy functions they had created. Replit and Lovable were particularly bad, with their worst functions having Halstead Volumes 2x and 3.5x the worst cases we saw from Bolt, V0, Claude, and Cursor.
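For anyone unfamiliar with these two metrics, this is roughly what the tools compute - a sketch of the standard formulas rather than any particular analyzer’s implementation, since most tools rescale the Maintainability Index onto a 0-100 range.

```js
// Halstead volume: V = N * log2(n), where N is the total number of operator
// and operand occurrences in a function and n is the number of distinct ones.
// Token counting is the analyzer's job; only the formula is shown here.
function halsteadVolume({ totalOperators, totalOperands, distinctOperators, distinctOperands }) {
  const length = totalOperators + totalOperands;           // N
  const vocabulary = distinctOperators + distinctOperands; // n
  return length * Math.log2(vocabulary);
}

// The classic (unnormalized) Maintainability Index combines Halstead volume,
// cyclomatic complexity, and lines of code; higher means more maintainable.
// Tools differ in how they rescale this, so absolute numbers vary by analyzer.
function maintainabilityIndex(volume, cyclomaticComplexity, linesOfCode) {
  return 171 - 5.2 * Math.log(volume) - 0.23 * cyclomaticComplexity - 16.2 * Math.log(linesOfCode);
}
```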
Bolt and V0 led the way when it came to keeping functions manageable, with only 2 and 4 functions respectively exceeding 50 lines (a bit of an arbitrary threshold, but a decent benchmark). Lovable, on the other hand, had 11 such functions and Replit had 18!
| Agent | Minimum Maintainability Index | Number of Long Functions (>50 lines) |
|---|---|---|
| Lovable | 31.7 | 11 |
| Replit | 53.7 | 18 |
| Cursor | 51.9 | 5 |
| V0 | 44.8 | 4 |
| Bolt | 54.6 | 2 |
| Claude | 58.5 | 5 |
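If you want to police similar thresholds in your own project, the complexity and length checks are easy to wire into ESLint. This is a generic sketch rather than the exact tooling I used for the numbers above - it assumes the eslint-plugin-sonarjs package for the cognitive complexity rule, and Halstead Volume / Maintainability Index would still need a separate analyzer.

```js
// eslint.config.js - a generic sketch, not the exact setup used for this post.
// Assumes eslint-plugin-sonarjs is installed; TypeScript files would
// additionally need typescript-eslint's parser configured.
import sonarjs from "eslint-plugin-sonarjs";

export default [
  {
    files: ["src/**/*.js"],
    plugins: { sonarjs },
    rules: {
      // Flag the same thresholds used in the comparison above.
      complexity: ["warn", { max: 8 }],            // cyclomatic complexity > 8
      "sonarjs/cognitive-complexity": ["warn", 8], // cognitive complexity > 8
      "max-lines-per-function": [
        "warn",
        { max: 50, skipBlankLines: true, skipComments: true },
      ],
    },
  },
];
```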
And the winner is…
Bolt and Claude Code both performed near or at the top on every quality metric I looked at. In terms of absolute metrics there is a case to be made that Bolt actually generated cleaner code across the board than Claude. But for the simplicity of the approach Claude took to this problem, I’d give it the edge.
Notably, none of the code generated by any of the agents was all that bad. Lovable and Replit felt over-engineered, heavily opinionated, and a bit too complex, but even they took an approach that would be easy to extend from where they left off. It definitely didn’t feel like any of the agents generated the complete garbage I’ve seen some people claim.
What’s next?
As I’ve said throughout, this was a very simple project that I asked these agents to tackle. To really understand how they impact code quality, I want to test them on a series of tasks with escalating complexity and then see how they handle extending those projects.
But maybe more importantly, it’s worth looking into what impact decreasing code quality has on an agent’s ability to successfully solve a problem. To do this I want to test out how a few different agents can (or can’t) accomplish the same sets of tasks as the starting codebase they’re working from deteriorates in quality.
In the next couple of weeks I’ll share the results of both of these sets of experiments.