Tackling Complex Tasks with LLMs

How we tackle reviewing PRs for code complexity issues


Feb 15, 2024

Workflow Diagram
An overview of our LLM Workflow

Large Language Models are getting better at dealing with complicated tasks, but they still struggle with difficult reasoning tasks. As we’ve been working on building a more thorough automated code review tool, we’ve been trying to figure out how to properly analyze and give feedback on issues around increasing complexity in code.

There are great solutions for finding quantitative measures of code complexity or flagging specific code smells, but these can’t catch everything. Our goal was to see how we could start flagging cases where new implementations were making the code base too complicated in ways that wouldn’t be caught by these existing solutions.

What we found worked well for us is a multi-step process involving requests to multiple LLM agents plus some post-process filtering, which we suspect will work well for other complex LLM-driven analysis decisions. Here’s an overview of how it works for our code analysis:

  1. Splitting the initial context
  2. Heuristic checks
  3. Expanding the context
  4. LLM analysis & first filter
  5. Structuring a useful response with an LLM
  6. Second LLM filter
  7. Post processing

Workflow Diagram
The generalized version of what our approach looks like

Complexity Analysis Workflow Diagram
Our approach for checking for complexity issues during code reviews
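The seven steps above can be sketched as a small pipeline. This is a minimal illustration of the control flow, not Sourcery’s actual implementation — every callable here is a hypothetical stand-in that you would wire up to real diff parsing and LLM requests.

```python
def review_complexity(diff, *, split, passes_heuristics, expand,
                      analyze, write_comment, is_generic, postprocess):
    """Run the multi-step review. All steps are injected as callables,
    so cheap deterministic filters run before any LLM is involved."""
    comments = []
    for chunk in split(diff):                      # 1. split the context
        if not passes_heuristics(chunk):           # 2. heuristic checks
            continue
        context = expand(chunk)                    # 3. expand the context
        verdict = analyze(context)                 # 4. LLM analysis + filter
        if not verdict.get("too_complex"):
            continue
        comment = write_comment(context, verdict)  # 5. structure a response
        if is_generic(comment):                    # 6. second LLM filter
            continue
        comments.append(comment)
    return postprocess(comments)                   # 7. post-processing
```

The point of the structure is that each stage only sees what survived the previous one, so the expensive LLM calls run on as little input as possible.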

Splitting into many pieces so we can be more efficient

One of the most common approaches to helping an LLM tackle a complicated task is to increase the context it has available. But this approach has two big problems:

  1. Increasing the size of the context provided to an LLM has diminishing returns and, at a certain point, seems to actually harm the accuracy of the output; in the case of code reviews this can lead to false positives or hallucinations.

  2. Increasing context is expensive! Especially if we need to provide a lot more of it.

When reviewing complexity issues in a code change we always start by looking at the diff. The diff alone isn’t enough to check for major complexity issues — we need to see how the changes interact with the surrounding code. But we usually can’t just send over the full code base, or even a set of full files before and after the changes, as context. It would be far too much noise, and incredibly expensive and inefficient.

So our first step in the review process is to break the code changes up into atomic chunks that we can analyze separately. Typically we group related changes — like those made within a single function or class — and split all the changes in a diff into a number of these chunks.

Now we’ve gone from a diff with many changes to look at to a bunch of small changes — but realistically not all of those small changes will be relevant to our analysis.

“Deterministic” filtering up front

Most of the types of code changes in a typical pull request aren’t likely to add problematic complexity. Maybe the change isn’t actually touching code, maybe it’s just adding a couple of new imports, maybe it’s simply too small a change. We encode these types of changes that we know are unlikely to impact code complexity into a number of heuristic checks, and run every small code change from step 1 through them. If we can see up front that a change isn’t going to impact complexity, there’s no reason to even get an LLM involved.

This won’t be relevant for every type of complex task you might be asking an LLM to do, but if there are ways to deterministically pre-filter out some sections of your input data, you can simplify the later analysis significantly and make the whole process more streamlined and cheaper.

INFO:__main__:Skipping hunk 13-13 in file diff: coding_assistant/query_handler/base.py (limited new code introduced)
INFO:__main__:Skipping hunk 29-29 in file diff: coding_assistant/query_handler/openai.py (limited new code introduced)
INFO:__main__:Skipping hunk 121-123 in file diff: coding_assistant/query_handler/openai.py (limited new code introduced)
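Heuristic checks like the ones producing the log lines above can be plain predicate functions. The specific thresholds and file patterns below are illustrative assumptions, not Sourcery’s actual rules.

```python
def passes_heuristic_checks(hunk):
    """Cheap deterministic pre-filters: return False for hunks that
    can't plausibly affect code complexity, so no LLM call is spent
    on them. `hunk` is a dict with 'file' and diff 'lines'."""
    # Skip non-code files entirely (illustrative pattern list).
    if hunk["file"].endswith((".md", ".txt", ".json", ".lock")):
        return False

    # Collect the newly added, non-empty lines of code.
    added = [l[1:] for l in hunk["lines"]
             if l.startswith("+") and not l.startswith("+++")]
    code_lines = [l.strip() for l in added if l.strip()]

    # Skip hunks that only add imports.
    if all(l.startswith(("import ", "from ")) for l in code_lines):
        return False

    # Skip hunks that introduce very little new code (threshold is
    # an assumption for illustration).
    if len(code_lines) < 3:
        return False
    return True
```

Because these checks are deterministic, they cost effectively nothing compared to an LLM request and their behavior is easy to test and tune.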

Now we can expand the context

Now that we’ve filtered the initial set of code changes down to the potentially relevant sections, we can think about adding additional context. For now, we keep the additional context we provide the LLM relatively simple. We take the chunks of code changes we want to check for complexity changes and turn them from the code diff into slightly expanded before-and-after versions of the code — adding in lines that are relevant to what the code does, but which get lost in the diff.

Expanding the Diff to the Before and After Code

There are a whole host of other ways we could expand the context further in the future — things like providing context about where else a function is used, issue context around the PR itself, etc. And because we’ve cut down on the number of places where we need to expand the context, we’re able to add more useful context more efficiently. In the example we’re looking at here, we saved more than 5,000 tokens by breaking the diff into multiple chunks and filtering them out before we turned to the LLM.
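The core of the before/after expansion can be derived from the hunk itself: context lines belong to both versions, removed lines only to the "before", added lines only to the "after". A minimal sketch — expanding further (e.g. to the whole enclosing function) would need access to the full files.

```python
def hunk_to_before_after(hunk_lines):
    """Reconstruct 'before' and 'after' code snippets from the lines
    of a unified-diff hunk."""
    before, after = [], []
    for line in hunk_lines:
        if line.startswith("@@"):
            continue  # hunk header carries no code
        if line.startswith("-"):
            before.append(line[1:])          # removed: before only
        elif line.startswith("+"):
            after.append(line[1:])           # added: after only
        else:
            # Context line (leading space in unified diff format).
            text = line[1:] if line.startswith(" ") else line
            before.append(text)
            after.append(text)
    return "\n".join(before), "\n".join(after)
```

Giving the LLM two readable snippets instead of raw diff syntax also removes a layer of notation it would otherwise have to interpret.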

Finally turning to an LLM

In a post about how we’re using LLMs for complex analysis, it’s taken me a while to actually get to the part where we start using an LLM for anything. Figuring out whether a code change is too complex is a complicated ask for an LLM, so we use a chain-of-thought approach: we ask it to walk us through how it’s deciding whether a given section of code is introducing an unreasonable amount of complexity, and then to give us an answer as to whether or not that code change is making things too complicated. Finally, we ask how it would improve the complexity if it has decided the code is too complex.

We don’t just want the output of this analysis — we also want the reasoning behind why the code is too complicated and how the LLM would fix it. All of that information will be very useful to feed into turning the result into a valuable code review. And for cases where the LLM has told us that the code change isn’t too complex, we can discard that chunk from the rest of our analysis.
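A chain-of-thought prompt for this step might look something like the sketch below. The wording is illustrative — it is not Sourcery’s production prompt — but it shows the structure: walk through the reasoning first, then commit to a verdict and a suggested improvement.

```python
def build_complexity_prompt(before, after):
    """Build a chain-of-thought prompt asking an LLM to judge whether
    a code change introduces unreasonable complexity."""
    return (
        "You are reviewing a code change for complexity issues.\n\n"
        f"Code before the change:\n{before}\n\n"
        f"Code after the change:\n{after}\n\n"
        "Think step by step:\n"
        "1. Walk through how the change alters the structure of the code.\n"
        "2. Decide whether the change introduces unreasonable complexity.\n"
        "3. If it does, explain how you would reduce that complexity.\n\n"
        "Answer with your reasoning, a yes/no verdict, and any suggested "
        "improvement."
    )
```

Asking for the reasoning and suggestion alongside the verdict means the later steps have everything they need without a second analysis call.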

Turning our results into useful comments

We now have a potential set of code changes that need to be flagged during a code review and a reason why we’re worried they’re too complicated. Now we need to figure out how to actually tell the developer about this potential issue and help them decide whether or not it needs to be fixed. This is probably the simplest step of our entire process — we can turn back to an LLM and ask it to convert the reasoning and suggested improvements into a useful comment for a developer, given the relevant code change.
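This step amounts to handing the analysis back to an LLM with a different instruction. Again, the wording below is an illustrative assumption rather than the production prompt.

```python
def build_comment_prompt(code_change, reasoning, suggestion):
    """Build a prompt that turns the complexity analysis into a
    review comment addressed to the developer."""
    return (
        "You are writing a code review comment for a developer.\n\n"
        f"The code change:\n{code_change}\n\n"
        f"Why we think it adds too much complexity:\n{reasoning}\n\n"
        f"A suggested improvement:\n{suggestion}\n\n"
        "Write a short, constructive comment that explains the concern "
        "and the suggested fix in terms of this specific change."
    )
```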

And onto another LLM to filter

Even though we’ve been pretty rigorous in cutting down the number of potential complexity risks we flag up to this point, we still consistently see false positives and unhelpful code review comments sneak through. As we were testing, we saw pretty consistently that these false positives and unhelpful comments came up in cases where our earlier LLM steps generated responses that were far too generic to ever be helpful during a code review.

To help cut down on these cases we can lean on another LLM request to check whether the feedback we would be giving a developer is too generic. If it is, we throw that comment out as a potential response. During our experimentation and development of the complexity check, we saw that this cut out 80+% of the false positives we were seeing at this stage.
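This final filter can be a single yes/no question to an LLM. In the sketch below, `ask_llm` is a hypothetical callable that sends a prompt to whatever model you use and returns its text response; the prompt wording is again an illustrative assumption.

```python
def is_too_generic(comment, ask_llm):
    """Second LLM filter: ask whether a drafted review comment is too
    generic to be useful, and parse the yes/no answer."""
    prompt = (
        "Here is a draft code review comment:\n\n"
        f"{comment}\n\n"
        "Is this comment too generic to be useful feedback on a specific "
        "code change? Answer only 'yes' or 'no'."
    )
    return ask_llm(prompt).strip().lower().startswith("yes")
```

Using a separate request for this check keeps the judgment independent of the model state that produced the comment in the first place.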

The process we take for these complexity checks has quite a few steps and involves a bit of back and forth with LLMs, but we think it’s worth it. In the example we’ve been looking at in this post, it reduced the number of tokens used in LLM requests by 77% and cut out several potential false positive responses compared with a single LLM request with expanded context.

We’re working to build a better automated version of code reviews at Sourcery. Try it out on any of your GitHub repos or reach out to hello@sourcery.ai to learn more.