On the surprising stubbornness of large language models
Apr 26, 2024
Large language models (LLMs) are amazing pieces of technology, able to (fairly) reliably perform some advanced tasks that computers have not previously been able to do.
A big issue, though, is that their behaviour cannot be reasoned about like traditional software. If I’m coding up a script and there’s a failure case I want to avoid, it’s straightforward to tell the computer not to do it. When I run the script it will then do what I want every single time (threading considerations aside). An LLM is different: it behaves probabilistically. Maybe it will do what you want 90% of the time, maybe only 20% - and that’s a problem.
It's often very easy to quickly spin up a great demo using LLMs, or to get a product 70-80% of the way to being great. Pushing through that final 20-30% can be a hard slog, and a large part of that is the unreliability. When getting a system involving LLMs to work reliably, you can use prompt engineering to get as close as possible, then add non-LLM-based workarounds to handle the rest.
Often a model will do something mostly right, but insist on handling some cases in a way that you don’t want. It’s then tempting to instruct the model not to do that. Sometimes this works, but often it doesn’t.
In fact, mentioning something negatively in a prompt can draw the model’s attention to it and, perversely, make it more likely to do it.
At this point it’s tempting to ask the model NOT TO DO THE THING, or to resort to tips or threats, but the evidence on whether these work is inconclusive. A better way forward is to work with the grain of the model - if it really wants to do something, let it do it, but ask it to classify its behaviour. You can then add a later step that filters out the responses you don’t want.
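As a rough sketch of the pattern (the prompt wording, the `is_unwanted_case` field, and the function below are invented for illustration, not taken from a real system): ask the model to label the behaviour in its structured output, then filter on that label in ordinary code afterwards.

```python
import json

# Illustrative sketch of the "classify, don't forbid" pattern. The prompt
# wording and the "is_unwanted_case" field are invented for this example.
RESPONSE_FORMAT = """
Reply with a JSON list of suggestions. For each suggestion include:
  "text": the suggestion itself
  "is_unwanted_case": true if this is the case we handle separately
"""

def drop_unwanted(raw_model_output: str) -> list[dict]:
    """Parse the model's JSON reply and drop responses it has flagged itself."""
    suggestions = json.loads(raw_model_output)
    return [s for s in suggestions if not s.get("is_unwanted_case")]
```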
Here’s an example of this that I’ve been working on lately at Sourcery. We’d like to add a comment to our code reviews to remind users when they’ve forgotten to update the docstring of a function or method. This would be really nice - it’s easy to forget to update docstrings, and they then gradually decay and become less and less useful over time.
It would be very hard to do this with traditional static analysis tools - how can you tell whether the functionality of a method has changed and now differs from what the docstring says? Translation between code and natural English, however, is something an LLM does pretty well.
Here’s an example of such a comment:
**suggestion (docstrings):** Please update the docstring for function: `get_github_installation_by_id`
Reason for update: Functionality has changed to include verification that the installation belongs to the authenticated user.
Suggested new docstring:

```python
"""Fetches a GitHub installation by its ID from the database and verifies it
belongs to the authenticated user. Returns None if the installation does not
exist or does not belong to the user."""
```
To generate these comments we give the model an extended diff of the changes made, so that it can see which functions have been updated along with their existing docstrings. Here’s where the stubbornness of the model comes in - it’s extremely keen on adding docstrings to functions that don’t already have them. Whether to add a docstring to a function is very specific to each team and codebase, so we don’t want to start annoying users by constantly suggesting they add docstrings.
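To make that concrete, the relevant part of the input might look something like the hunk below - a simplified, invented reconstruction of the example above (the signature and helper names are guesses), where the function body changes but the docstring doesn’t:

```diff
 async def get_github_installation_by_id(
-    installation_id: int,
+    installation_id: int, user: User
 ) -> GithubInstallation | None:
     """Fetches a GitHub installation by its ID from the database."""
     installation = await db.get_installation(installation_id)
-    return installation
+    if installation is None or installation.owner_id != user.id:
+        return None
+    return installation
```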
We experimented with various prompt-based methods for getting the model to stop suggesting these brand-new docstrings. Sadly none of these approaches worked.
After thinking about it for a while we decided to try to go around the problem rather than tackling it directly. Rather than asking the model to stop producing responses for functions without existing docstrings, we ask it to classify whether the function has an existing docstring.
We ask for the results in JSON - so here’s the field we added to the schema:

```typescript
// Does the function have an existing docstring that is present in the diff?
// Look back at the diff to determine this.
has_existing_docstring: boolean;
```
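A raw response from the model might then look something like this (apart from `has_existing_docstring`, the field names here are guesses based on the comment format above, not our exact schema):

```json
{
  "function_name": "get_github_installation_by_id",
  "has_existing_docstring": true,
  "reason_for_update": "Functionality has changed to include verification that the installation belongs to the authenticated user.",
  "suggested_docstring": "Fetches a GitHub installation by its ID from the database and verifies it belongs to the authenticated user. Returns None if the installation does not exist or does not belong to the user."
}
```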
Unlike the previous approaches this actually worked pretty well!
We then added a filtering step to remove responses from the model with `has_existing_docstring` as `false`. This hasn’t eliminated the problem completely, but it’s close!
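Concretely, the filtering step can be as small as this - a simplified sketch rather than our production code:

```python
import json

def keep_update_comments(raw_model_output: str) -> list[dict]:
    """Keep only comments about functions that already have a docstring.

    Comments with has_existing_docstring set to false are the model's
    unwanted suggestions to add brand-new docstrings, so we drop them.
    """
    comments = json.loads(raw_model_output)
    return [c for c in comments if c.get("has_existing_docstring")]
```

Because this check runs outside the model, it behaves deterministically - exactly the reliability that prompting alone couldn’t give us.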
This isn't quite as good as if the model had never produced the responses - there's still a cost and a performance implication to getting output tokens that you then have to throw away. To us it's definitely worth it to eliminate noise.
We’ve had success applying this technique to various other aspects of our code review process - it’s definitely a useful tool in the prompt engineer’s arsenal.
If the model is showing an unwanted behaviour which negative prompting isn’t resolving, you can:

- let the model go ahead and produce the responses it wants to produce
- ask it to classify those responses as part of its output
- filter out the responses you don’t want in a later, non-LLM step
We’re working to build a better automated version of code reviews at Sourcery. Try it out on any of your GitHub repos or reach out to hello@sourcery.ai to learn more.