Improving LLM Responses After the Fact

One of the first big problems we ran into when we started building our code review engine was that the in-line comments the LLMs produced were incredibly inconsistent. Some were brilliantly insightful, picking up issues in the code or flagging great potential improvements, some were tedious nitpicks, and some were dead wrong.

Early comments (good and bad) from Sourcery's reviews

Hallucinations and factual accuracy are likely to remain a problem for LLMs for the foreseeable future, even as the models improve. But what we started to realize was that for code reviews the issue isn’t simply whether a comment is true, but whether it is useful.

Usefulness as a metric

When we think about code reviews, we want the comments we leave to be actionable items that add value for the author of the pull request and help them make sure their code passes muster. This means that there is a whole class of comments that wouldn’t be “useful”:

Comments about the code that are factually wrong - either in general or specifically for that codebase
Comments about the code that are too generic to make a difference to that PR
Comments about the code that would not be actionable in some way
Comments about the code that are too nitpicky to be worth mentioning in a review

Looking at these classes of comments we didn’t want to make, we set up a relatively simple metric to measure the quality of the comments in our reviews: a usefulness score. The usefulness score is simply the number of useful comments divided by the total number of comments made.
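As a minimal sketch of that ratio (the function name and the sample labels here are hypothetical, not taken from Sourcery's codebase), the score can be computed from a list of true/false labels, one per comment:

```python
def usefulness_score(labels: list[bool]) -> float:
    """Fraction of comments labelled useful; 0.0 when there are no comments."""
    if not labels:
        return 0.0
    return sum(labels) / len(labels)


# Hypothetical labels for five review comments (True = useful).
labels = [True, False, True, False, False]
print(f"{usefulness_score(labels):.0%}")  # 2 of 5 comments were useful
```

Returning 0.0 for an empty review is an arbitrary choice in this sketch; skipping such reviews entirely would be just as reasonable.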

2 code comment examples, one we marked useful and one we did not

Usefulness is a weird metric because it’s inherently subjective - even among our own team we don’t always agree about whether a given comment is useful. But it wound up being a great aggregate metric over time. We review comments across our whole team and average out the results, which lets us see whether a set of changes we made to our reviews or to our validation improved our comment quality (or made it worse). We’ve continued to use usefulness as a metric for tracking several aspects of our reviews: individual comment usefulness and full review usefulness as analysed by our team, and individual comment usefulness as rated by the author of a PR.
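One way to picture the averaging, assuming each team member labels the same batch of comments, is to compute a score per reviewer and then take the mean. The reviewer names, the labels, and the choice to average per-reviewer scores (rather than, say, taking a majority vote per comment) are all assumptions made for illustration:

```python
from statistics import mean


def team_usefulness(ratings: dict[str, list[bool]]) -> float:
    """Average the per-reviewer usefulness scores across the team.

    `ratings` maps a reviewer name to that reviewer's True/False labels
    for the same batch of comments. Reviewers with no labels are skipped.
    """
    per_reviewer = [sum(labels) / len(labels) for labels in ratings.values() if labels]
    return mean(per_reviewer) if per_reviewer else 0.0


# Hypothetical labels for the same four comments, before and after a change.
before = {
    "alice": [True, False, False, False],
    "bob": [True, True, False, False],
    "carol": [False, True, False, False],
}
after = {
    "alice": [True, True, False, False],
    "bob": [True, True, True, False],
    "carol": [True, False, True, False],
}
print(f"before: {team_usefulness(before):.2f}, after: {team_usefulness(after):.2f}")
```

Individual reviewers disagree on specific comments in this made-up data, but the averaged score still moves in one direction, which is what makes it usable for comparing changes.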

Optimising for Usefulness - a first pass

At first we tried to improve the usefulness of comments simply by expanding our prompt to say the comments had to explicitly be useful. Unfortunately (and maybe we should have expected this) this didn’t help cut down on the number of not useful comments. LLMs, it turns out, have a fairly persistent sense that the content they generate is useful.

Our next try was to ask the LLM to walk through its reasoning about why a comment was useful while it was generating it (a chain-of-reasoning-style approach). But again, this didn’t meaningfully improve how useful the generated comments were.

Finally we tried sending a follow-up to the LLM after it created the comments, asking it which of its comments were useful. But it always managed to create a plausible argument for why what it had put forward was useful. As noted above, LLMs seem pretty reluctant to say that their own creation isn’t useful.

A different optimisation and filtering approach

With our pre-filtering approaches not working particularly well, we needed a different way to improve the usefulness rate of our comments. Generically filtering for “usefulness” within the same LLM conversation hadn’t worked, and we suspected that using a different LLM agent for the same type of check (“is this comment useful”) wouldn’t work particularly well either - and it didn’t. This simple post-filter check didn’t yield any real improvement in the quality of our comments (43% vs 42%).

Needing a different approach, we decided to look at what a comment needed in order to be useful. In general it needed to be:

Factually correct
Valid to the code
Actionable
Specific, and not generally applicable
Correct in the context of the code
Taking into account the other changes in the PR
Valuable to the author

We put together a validation request, sent to a separate LLM for every comment we generated, in which we asked for a boolean answer for each of the categories above. Along with that we asked it to explain why the comment in question did or didn’t fit the category. This gave us 28 potential validation combinations to check to see which of them were the most valuable at improving usefulness - and unlike our previous attempts, these actually moved the needle.
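A rough sketch of what such a validation result might look like is below. The identifier names and the data shape are hypothetical; the article does not describe the actual request or response format. It also shows one reading that happens to match the count of 28: each of the seven checks on its own (7) plus every pair of checks (21). That reading is an assumption, not something the article states.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical identifiers for the seven categories listed above.
CHECKS = (
    "factually_correct",
    "valid_to_code",
    "actionable",
    "specific",
    "correct_in_context",
    "accounts_for_other_changes",
    "valuable_to_author",
)


@dataclass
class CheckResult:
    passed: bool      # the validator's boolean answer for one category
    explanation: str  # the validator's stated reason for that answer


# One validator response: a CheckResult per check name (assumed shape).
Validation = dict[str, CheckResult]


def passes(validation: Validation, required: tuple[str, ...]) -> bool:
    """A comment survives a filter only if every required check passed."""
    return all(validation[name].passed for name in required)


# Candidate filters: each single check plus each pair of checks.
candidate_filters = [c for size in (1, 2) for c in combinations(CHECKS, size)]
print(len(candidate_filters))  # 7 singles + 21 pairs = 28
```

Each candidate filter can then be applied to a labelled set of comments to see how much it changes the usefulness score.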

A matrix comparing how combinations of the validation factors improved usefulness. Green is better, red is worse.

Ultimately we found that 4 of these checks had the most impact on improving our comment usefulness: Valid to the code, Actionable, Specific, and Valuable to the author. Combining all 4 led to the biggest lift in usefulness without filtering out too many truly useful comments. In the end, this relatively simple method increased the average usefulness of the comments in our code reviews from the low 40s to roughly 60%.
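The combined filter can be sketched as follows, assuming each comment carries the validator's boolean answers plus a human usefulness label. The field names and the sample comments are invented for illustration and are not Sourcery's data:

```python
# Hypothetical identifiers for the four checks that were kept.
KEPT_CHECKS = ("valid_to_code", "actionable", "specific", "valuable_to_author")


def keep_comment(checks: dict[str, bool]) -> bool:
    """Keep a comment only when all four selected checks pass.

    A check missing from the validator output is treated as a failure.
    """
    return all(checks.get(name, False) for name in KEPT_CHECKS)


def usefulness(comments: list[dict]) -> float:
    """Share of comments whose human label is useful; 0.0 if none."""
    if not comments:
        return 0.0
    return sum(c["useful"] for c in comments) / len(comments)


ALL_PASS = {name: True for name in KEPT_CHECKS}

# Made-up comments: validator answers plus a human "useful" label.
comments = [
    {"checks": ALL_PASS, "useful": True},
    {"checks": ALL_PASS, "useful": True},
    {"checks": {**ALL_PASS, "actionable": False}, "useful": False},
    {"checks": {**ALL_PASS, "specific": False}, "useful": False},
    {"checks": ALL_PASS, "useful": False},  # validator misses a bad comment
    {"checks": {**ALL_PASS, "valuable_to_author": False}, "useful": True},  # over-filtered
]

kept = [c for c in comments if keep_comment(c["checks"])]
lost_useful = sum(c["useful"] for c in comments if not keep_comment(c["checks"]))

print(f"before filtering: {usefulness(comments):.0%}")
print(f"after filtering:  {usefulness(kept):.0%}")
print(f"useful comments filtered out: {lost_useful}")
```

Tracking the number of useful comments that get filtered out alongside the score is what keeps a filter honest: a stricter filter can raise the ratio while discarding comments the author would have wanted.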

This won’t be the end of us trying to improve the usefulness of our comments and there are lots of approaches that we’ll want to take in the next few weeks/months. A few things we have in mind are:

Expanding the context that our initial review agent has access to when making comments
Changing from binary validators to score based validators so we can make more finely tuned decisions

We’re working to build a better automated version of code reviews at Sourcery. Try it out on any of your GitHub repos or reach out to hello@sourcery.ai to learn more.