Better LLM Prompting using the Panel-of-Experts

How roleplaying a panel discussion can improve LLM results

May 13, 2024

Photo by David Clode on Unsplash

A lot of the time you can ask a large language model (LLM) to perform a task and it will give you a decent answer. For example, here’s the start of a poem about a pirate doing a code review:

In the glow of a lantern, low and dim,
A pirate stares at the screen’s rim.
“Arr, mateys, this code be a mess,
Spaghetti lines in a deep distress.

However, for more complex tasks, or tasks involving multiple steps, LLMs are a lot less reliable. Anyone who’s tried to incorporate them into a business or software process has probably run into issues with hallucinations, stubborn refusal to pay attention to prompts, and other failures.

One approach that has seen great success in improving performance is Chain-of-Thought prompting. This involves asking the model to think step-by-step, laying out its reasoning process as part of the output before coming to a final conclusion. It has been shown to improve the ability of LLMs to perform complex tasks, and indeed we’ve been able to make great use of it at Sourcery.
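To make that concrete, here’s a minimal sketch of what Chain-of-Thought prompting can look like in code, using the OpenAI Python client. The prompt wording and model name are illustrative only, not what we run in production:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative Chain-of-Thought system prompt - not our production prompt.
COT_SYSTEM_PROMPT = (
    "You are a careful code reviewer. Think through the problem step by step, "
    "writing out your reasoning before giving your final answer on the last line."
)

def review_with_cot(diff: str) -> str:
    """Ask the model to reason step by step about a diff before answering."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical model choice
        messages=[
            {"role": "system", "content": COT_SYSTEM_PROMPT},
            {"role": "user", "content": f"Review this diff:\n\n{diff}"},
        ],
    )
    return response.choices[0].message.content
```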

The limitations of Chain-of-Thought

In a previous post I wrote about how we are trying to incorporate automatic reminders to update docstrings into our Sourcery code review. Take a look at that one if you want to see one way of working around an LLM which isn’t paying attention to negative instructions in the prompt.

In this post I’ll be talking about the next steps we took to improve the docstring reminders, which involved moving beyond Chain-of-Thought prompting.

The previous work focused on getting the model to only update existing docstrings (rather than add new ones). This worked well, but unfortunately we found that a lot of the suggested updates were still noisy. The noise was a combination of updates triggered by minor changes to a function that didn’t warrant touching the docstring, and ‘changelog-style’ updates, where the model would add something like ‘This function now pays attention to GrobState when executing the GrobBlinger.’ We wanted the suggested updates to trigger only where they would be necessary and valuable, and to describe the code as it is rather than act as a changelog.

Here's an example of the type of update we wanted to avoid:

**suggestion (docstrings):**
Please update the docstring for function: `get_github_installation_by_id`

Reason for update:
Functionality has changed to include verification that the installation belongs to the authenticated user.

Suggested new docstring:

"""
    Fetches a GitHub installation by its ID from the database.
    Now verifies whether the installation belongs to the authenticated user.
"""

I had recently stumbled across the ‘Panel-of-Experts’ approach, and decided to try it out.

What's Panel-of-Experts?

This is a prompting method inspired by the ‘Tree-of-Thought’ approach, which envisaged a tree-like reasoning process: the model comes up with several candidate solutions and iterates through them one step at a time using a search algorithm.

This tree of thoughts was then developed into a prompting strategy, using a prompt like this:

Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realises they're wrong at any point then they leave.
The question is...

This (at least anecdotally) was able to achieve better results on reasoning tasks than Chain-of-Thought.

The Panel-of-Experts extends this further, using a similar prompt but framing it as a panel discussion, drawing on the fact that the training data for LLMs contains many such discussions. A big problem with LLMs is that once they start down an incorrect line of reasoning they will usually follow it to a conclusion rather than recognise the error. A panel with different personas is more likely to introduce differing viewpoints and arguments, hopefully leading to better reasoning and outcomes.

Using the Panel-of-Experts in practice

To experiment with the new approach, we first created unit tests with a mix of examples that the existing prompt found easy and difficult. With the original prompt, the average error rate on this test set was around 40%.
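Our actual harness is more involved, but the basic idea is simple. Here’s a minimal sketch, where `Case` and `suggest_updates` are hypothetical stand-ins for our test cases and the prompt under test:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    diff: str            # the code diff shown to the model
    expected: set[str]   # names of functions whose docstrings should be flagged

def error_rate(cases: list[Case], suggest_updates: Callable[[str], list[dict]]) -> float:
    """Fraction of cases where the suggested updates don't match what we expect."""
    errors = 0
    for case in cases:
        suggested = {update["function"] for update in suggest_updates(case.diff)}
        if suggested != case.expected:
            errors += 1
    return errors / len(cases)
```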

We then tried out various modifications to our prompting. Here’s the start of our system prompt for the panel discussion:

You are a panel of three experts on code documentation - Alice, Bob and Charles.

When given a diff containing code changes, your task is to determine any updates required to the docstrings in the code.
You will do this via a panel discussion, trying to solve it step by step and make sure that the result is correct.

At each stage make sure to critique and check each other's work - pointing out any possible errors.

To ensure that we got coherent responses we also added a section asking the panel to collate results at the end:

Where you find a function with a docstring, one person should put forward the argument for updating it,
then the others should critique it and decide whether the update is necessary.

The existing docstrings must be present in the diff.
Only include a function if the update to the docstring is significant and is directly related to the changed lines.

Once you have the complete list of updates return them as a json list in the following format.
You should not add any further points after returning the json.
<json schema here>

This consistently returned a discussion followed by a JSON list which we could parse.
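Because the discussion comes before the JSON, the response needs a little post-processing to pull the list out of the free text. Here’s a minimal sketch of that step (the field names in the comment are illustrative - the real fields come from the schema, which I’ve left out):

```python
import json

def extract_updates(response: str) -> list[dict]:
    """Pull the trailing JSON list out of the panel's free-text discussion."""
    end = response.rindex("]") + 1        # ValueError here means no JSON list at all
    start = response.rindex("[", 0, end)
    while True:
        try:
            # Each update looks something like
            # {"function": "...", "reason": "...", "docstring": "..."} (illustrative).
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            # We landed on a nested "[" - step back to the previous opening bracket.
            start = response.rindex("[", 0, start)
```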

When given a diff this prompt could produce some interesting discussions, and I was able to see examples where one panellist started down an incorrect reasoning path and was rescued by another.

Here’s a snippet from one of the panel discussions where they decide to make an update:

Bob: Let's move on to the `spiral_double` function in the `gdsfactory/components/spiral_double.py` file.

Charles: The function `spiral_double` has an existing docstring, and the parameter `min_bend_radius` now allows `None` as a value,
         defaulting to the radius from a cross-section object.
         This change is not reflected in the docstring.

Alice: The developer did not update the docstring to reflect this change.
       We should suggest an update.

And here's a discussion where one panellist proposes an update, but the others catch that it isn't necessary:

Alice: Lastly, let's look at the new files and changes in `yaml_tests.py`.
       The changes here include the addition of `not_contains` in the `ExpectedComments` model and adjustments in `run_yaml_test` to handle this new field.
       These are significant changes to how the tests are run and what they expect.

Bob: However, Alice, there are no existing docstrings in the diff for `yaml_tests.py` that discuss the specifics of handling different types of expected comments.
     The changes are substantial, but without an existing docstring that we can see in the diff that needs updating, we can't propose a change.

Charles: That sums it up well. We don't have any existing docstrings visible in the diff that require updating based on the changes made.
         The developers have either updated them already or the changes are to parts of the code not covered by the existing docstrings.

It took some tweaking of prompts and iteration, but in the end this approach proved very fruitful - knocking the error rate on our test set from around 40% down to a more manageable 20% (remembering that the test set was chosen to be full of hard examples).

Cost Considerations

This improvement didn’t come for free, however. The panel approach has the model generate far more back-and-forth discussion - and therefore more output tokens - than our previous Chain-of-Thought-based prompt.

The panel of experts roughly doubled our cost to review the docstrings for a given diff. Given that the overall costs are still small and it led to a significant performance boost, we decided this was worth it, but it is something to keep in mind when considering this approach.
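To give a feel for where that doubling comes from, here’s a back-of-the-envelope sketch with purely illustrative prices and token counts (not our real figures): the diff going in is the same size either way, but the panel’s discussion is mostly extra output, and output tokens cost more.

```python
# Purely illustrative prices and token counts - not real figures.
PRICE_PER_1K_INPUT = 0.005
PRICE_PER_1K_OUTPUT = 0.015

def review_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost of a single docstring-review call."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (
        output_tokens / 1000
    ) * PRICE_PER_1K_OUTPUT

# Same diff in both cases; the panel mainly adds output tokens for the discussion.
chain_of_thought = review_cost(input_tokens=4_000, output_tokens=700)
panel_of_experts = review_cost(input_tokens=4_000, output_tokens=2_500)
print(f"CoT: ${chain_of_thought:.3f}  Panel: ${panel_of_experts:.3f}")
```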

Conclusion

Wrangling LLMs can be tough and frustrating, so finding a clear improvement to performance like this was a blessed relief. I hope you found this useful - do let me know if it was helpful or if there are other topics you’d like us to cover.


We’re working to build a better automated version of code reviews at Sourcery. Try it out on any of your GitHub repos or reach out to hello@sourcery.ai to learn more.