Unit Testing Code with a Mind of Its Own

Setting up unit tests when your code isn’t deterministic anymore

Written by Brendan Maginnis


A few weeks ago we ran into an annoying regression in our product.

We spent a lot of time trying to eliminate praise-like comments1 from our code reviews with LLM instructions and filtering, and for a while that worked perfectly.

But then one day the praise comments snuck back in, in a pretty insidious way.

Instead of openly praising the code, these comments smuggled the praise in. They showed up as potential bug comments or other comment types, but their content was really just praise for the implementation.

A sneaky praise comment

Not only were these comments annoying (praise comments add noise to a review), they were also being miscategorized as bug risks or security issues, which made the whole review feel wrong.

We started thinking: how can we fix this, and how could we have stopped it in the first place?

For most of our code, we would have had unit tests in place and would’ve caught a potential regression like this immediately - but we’d never really considered testing LLM outputs in the same way.

LLM unit testing isn’t that different

The way we approach testing LLM outputs isn’t all that different from how we approach standard unit tests. Structurally they look the same: we use pytest, have fixtures, set up parameterized tests, and so on.

For example, here’s one of our tests:

def test_find_hardcoded_secret(
    review_config: ReviewConfig,
) -> None:
    diff = Diff(
        clean_triple_quote_text(
            """
                diff --git a/.env b/.env
                index 99792dd..0000000
                --- a/.env
                +++ b/.env
                @@ -0,0 +1,2 @@
                 # Created by Vercel CLI
                +POSTGRES_PASSWORD="REDACTED"
            """
        )
    )
    comments = review_security_issues.invoke(diff, review_config.to_runnable_config())
    assert len(comments) == 1
    [comment] = comments
    assert comment.location == ".env:2"
    assert comment.file == ".env"
    assert comment.comment_type == ReviewCommentType.ISSUE
    assert comment.importance == ReviewCommentImportance.BLOCKING
    assert comment.area == CommentRequestReviewCommentArea.SECURITY
    assert comment.diff_line_for_comment == '+POSTGRES_PASSWORD="REDACTED"'
    assert (
        comment.diff_context
        == ' # Created by Vercel CLI\n+POSTGRES_PASSWORD="REDACTED"'
    )
    assert comment.start_line_in_file == 2
    assert comment.end_line_in_file == 2
    assert comment.start_line_in_diff == 1
    assert comment.end_line_in_diff == 1

Just like a traditional unit test, it has input data (in our case a code diff), the data gets processed (we run a code review over it), and we make a series of assertions on the results.
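The review_config fixture used above is an ordinary pytest fixture, and these tests can be parameterized like any others. Here’s a rough sketch of how that might look; the fixture body and the alternative secret lines are illustrative, not our real configuration:

import pytest

@pytest.fixture
def review_config() -> ReviewConfig:
    # Simplified for illustration: the real config carries model settings,
    # filters, and so on.
    return ReviewConfig()

@pytest.mark.parametrize(
    "secret_line",
    [
        '+POSTGRES_PASSWORD="REDACTED"',
        '+API_TOKEN="REDACTED"',
    ],
)
def test_flags_various_hardcoded_secrets(
    review_config: ReviewConfig, secret_line: str
) -> None:
    diff = Diff(
        clean_triple_quote_text(
            f"""
                diff --git a/.env b/.env
                --- a/.env
                +++ b/.env
                @@ -0,0 +1,2 @@
                 # Created by Vercel CLI
                {secret_line}
            """
        )
    )
    comments = review_security_issues.invoke(diff, review_config.to_runnable_config())
    # Assert on structure only; the exact wording of the comment will vary.
    assert any(c.area == CommentRequestReviewCommentArea.SECURITY for c in comments)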

But one big difference between testing our LLM application and testing other applications is that we can’t mock network requests like we usually would. What we get back from an LLM varies at least a little on almost every call, so we can’t predict every variation of the response in our mocks. Instead, we wind up actually calling the LLM so we get a “real” response.
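Because these tests hit a real model, they’re slower and cost money to run, so it helps to make them easy to skip locally and select in CI. One way to wire that up (a sketch using an example environment variable name and a custom pytest marker, not necessarily how we gate ours) looks like this:

import os

import pytest

# "llm" is a custom marker (registered in pytest.ini / pyproject.toml) so the
# real-model tests can be run with `pytest -m llm` or skipped with
# `pytest -m "not llm"`. LLM_API_KEY is just an example variable name.
requires_llm = pytest.mark.skipif(
    not os.environ.get("LLM_API_KEY"),
    reason="set LLM_API_KEY to run tests that call the real model",
)

@pytest.mark.llm
@requires_llm
def test_find_hardcoded_secret(review_config: ReviewConfig) -> None:
    ...  # the test body from above, unchanged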

Tackling Non-Deterministic LLM Responses

The variability of LLM responses creates another bit of trickiness in writing assertions for the tests. For example, we can’t assume a model will respond with “Looks good” for a given diff, because it could just as easily say “Looks great”, or “LGTM”, or any number of variations.

We have a couple of strategies to deal with this:

Looking back at the example test from earlier, you might have noticed that we don’t inspect the comment message directly. Instead we check that there is exactly one comment, that it’s a blocking security comment, and that it’s positioned on the right line. We don’t need to validate the exact comment text; the broader metadata tells us the security issue is being flagged correctly.
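The same structural style is also how a test could have guarded against the sneaky praise comments from earlier: run a review over an unremarkable diff and loosely assert that nothing praise-like comes back tagged as an issue. In the sketch below, review_code and comment.message are stand-ins for whatever review entry point and comment text field you actually have, and the keyword list is purely illustrative:

# Placeholder names: `review_code` and `comment.message` stand in for your own
# review entry point and comment text field.
PRAISE_MARKERS = ("great", "nice work", "well done", "good job", "lgtm")

def test_issue_comments_are_not_disguised_praise(review_config: ReviewConfig) -> None:
    diff = Diff(
        clean_triple_quote_text(
            """
                diff --git a/utils.py b/utils.py
                --- a/utils.py
                +++ b/utils.py
                @@ -0,0 +1,2 @@
                +def add(a: int, b: int) -> int:
                +    return a + b
            """
        )
    )
    comments = review_code.invoke(diff, review_config.to_runnable_config())
    for comment in comments:
        if comment.comment_type == ReviewCommentType.ISSUE:
            # A loose, case-insensitive keyword check instead of exact-text
            # matching: the model's wording varies, but disguised praise still
            # tends to use words like these.
            assert not any(m in comment.message.lower() for m in PRAISE_MARKERS)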

Looking Ahead

We’re still in the early days of using unit tests to help validate LLM responses. They’ve let us develop with more confidence and ship faster, but there are still some big improvements to our tests that we’re thinking about.

We’ve seen some great early benefits from unit testing the LLM responses in Sourcery, both in being able to make changes more confidently without fearing regressions and in validating that new changes do what we want them to do. To quote one of my teammates: “I am now 1000% more confident that I am not introducing regressions.” I’d recommend that anyone working with LLMs try setting up unit tests for their projects.


1 Most of the LLMs we’ve tested tend to praise excessively during code reviews, creating a ton of noise.


We’re working to build a better automated version of code reviews at Sourcery. Try it out on any of your GitHub repos or reach out to hello@sourcery.ai to learn more.