Unit Testing Code with a Mind of Its Own

Setting up unit tests when your code isn’t deterministic anymore

Written by Brendan Maginnis


A few weeks ago we ran into an annoying regression in our product.

We spent a lot of time trying to eliminate praise-like comments1 from our code reviews with LLM instructions and filtering, and for a while that worked perfectly.

But then one day the praise comments snuck back in, in a pretty insidious way.

Instead of openly praising the code, these comments smuggled the praise in. They showed up as potential bug comments or other comment types, but their content was really just praise for the implementation.

A sneaky praise comment

Not only were these comments annoying (praise comments add noise to a review), they were also being miscategorized as bug risks or security issues, which made the whole review feel wrong.

We started thinking: how can we fix this, and how could we have stopped it in the first place?

For most of our code, we would have had unit tests in place and would’ve caught a potential regression like this immediately - but we’d never really considered testing LLM outputs in the same way.

LLM unit testing isn’t that different

The way we approach testing LLM outputs isn’t all that different from how we approach standard unit tests. Structurally they look the same: we use pytest, have fixtures, set up parameterized tests, and so on.

For example, here’s one of our tests:

def test_find_hardcoded_secret(
    review_config: ReviewConfig,
) -> None:
    diff = Diff(
        clean_triple_quote_text(
            """
                diff --git a/.env b/.env
                index 99792dd..0000000
                --- a/.env
                +++ b/.env
                @@ -0,0 +1,2 @@
                 # Created by Vercel CLI
                +POSTGRES_PASSWORD="REDACTED"
            """
        )
    )
    comments = review_security_issues.invoke(diff, review_config.to_runnable_config())
    assert len(comments) == 1
    [comment] = comments
    assert comment.location == ".env:2"
    assert comment.file == ".env"
    assert comment.comment_type == ReviewCommentType.ISSUE
    assert comment.importance == ReviewCommentImportance.BLOCKING
    assert comment.area == CommentRequestReviewCommentArea.SECURITY
    assert comment.diff_line_for_comment == '+POSTGRES_PASSWORD="REDACTED"'
    assert (
        comment.diff_context
        == ' # Created by Vercel CLI\n+POSTGRES_PASSWORD="REDACTED"'
    )
    assert comment.start_line_in_file == 2
    assert comment.end_line_in_file == 2
    assert comment.start_line_in_diff == 1
    assert comment.end_line_in_diff == 1

Just like a traditional unit test, it has input data (in our case a code diff), the data gets processed (we run a code review over it), and we make a series of assertions on the results.
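The review_config fixture used above is an ordinary pytest fixture, and these tests can be parameterized like any others. Here’s a rough sketch of how that might look; the fixture body and the alternative secret lines are illustrative, not our real configuration:

import pytest

@pytest.fixture
def review_config() -> ReviewConfig:
    # Simplified for illustration: the real config carries model settings,
    # filters, and so on.
    return ReviewConfig()

@pytest.mark.parametrize(
    "secret_line",
    [
        '+POSTGRES_PASSWORD="REDACTED"',
        '+API_TOKEN="REDACTED"',
    ],
)
def test_flags_various_hardcoded_secrets(
    review_config: ReviewConfig, secret_line: str
) -> None:
    diff = Diff(
        clean_triple_quote_text(
            f"""
                diff --git a/.env b/.env
                --- a/.env
                +++ b/.env
                @@ -0,0 +1,2 @@
                 # Created by Vercel CLI
                {secret_line}
            """
        )
    )
    comments = review_security_issues.invoke(diff, review_config.to_runnable_config())
    # Assert on structure only; the exact wording of the comment will vary.
    assert any(c.area == CommentRequestReviewCommentArea.SECURITY for c in comments)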

But one big difference between testing our LLM application and testing other applications is that we can’t mock network requests like we usually would. What we get back from an LLM varies at least a little on almost every call, so we can’t predict every variation of the response in our mocks. Instead, we wind up actually calling the LLM so we get a “real” response.
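Because these tests hit a real model, they’re slower and cost money to run, so it helps to make them easy to skip locally and select in CI. One way to wire that up (a sketch using an example environment variable name and a custom pytest marker, not necessarily how we gate ours) looks like this:

import os

import pytest

# "llm" is a custom marker (registered in pytest.ini / pyproject.toml) so the
# real-model tests can be run with `pytest -m llm` or skipped with
# `pytest -m "not llm"`. LLM_API_KEY is just an example variable name.
requires_llm = pytest.mark.skipif(
    not os.environ.get("LLM_API_KEY"),
    reason="set LLM_API_KEY to run tests that call the real model",
)

@pytest.mark.llm
@requires_llm
def test_find_hardcoded_secret(review_config: ReviewConfig) -> None:
    ...  # the test body from above, unchanged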

Tackling Non-Deterministic LLM Responses

The variability of LLM responses creates another bit of trickiness in writing assertions for the tests. For example, we can’t assume a model will respond with “Looks good” for a given diff, because it could just as easily say “Looks great”, or “LGTM”, or any number of variations.

We have a couple of strategies to deal with this:

Looking back at the example test from earlier, you might have noticed that we don’t inspect the comment message directly. Instead we check that there is exactly one comment, that it’s a blocking security comment, and that it’s positioned on the right line. We don’t need to validate the exact comment text; the broader metadata tells us the security issue is being flagged correctly.
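The same structural style is also how a test could have guarded against the sneaky praise comments from earlier: run a review over an unremarkable diff and loosely assert that nothing praise-like comes back tagged as an issue. In the sketch below, review_code and comment.message are stand-ins for whatever review entry point and comment text field you actually have, and the keyword list is purely illustrative:

# Placeholder names: `review_code` and `comment.message` stand in for your own
# review entry point and comment text field.
PRAISE_MARKERS = ("great", "nice work", "well done", "good job", "lgtm")

def test_issue_comments_are_not_disguised_praise(review_config: ReviewConfig) -> None:
    diff = Diff(
        clean_triple_quote_text(
            """
                diff --git a/utils.py b/utils.py
                --- a/utils.py
                +++ b/utils.py
                @@ -0,0 +1,2 @@
                +def add(a: int, b: int) -> int:
                +    return a + b
            """
        )
    )
    comments = review_code.invoke(diff, review_config.to_runnable_config())
    for comment in comments:
        if comment.comment_type == ReviewCommentType.ISSUE:
            # A loose, case-insensitive keyword check instead of exact-text
            # matching: the model's wording varies, but disguised praise still
            # tends to use words like these.
            assert not any(m in comment.message.lower() for m in PRAISE_MARKERS)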

Looking Ahead

We’re still in the early days of using unit tests to help validate LLM responses. They’ve let us develop with more confidence and ship faster, but there are still some big improvements to our tests that we’re thinking about.

We’ve seen some great early benefits from unit testing the LLM responses in Sourcery, both in being able to make changes more confidently without fearing regressions and in validating that new changes do what we want them to do. To quote one of my teammates: “I am now 1000% more confident that I am not introducing regressions.” I’d recommend that anyone working with LLMs try setting up unit tests for their projects.


1 Most of the LLMs we’ve tested tend to praise excessively during code reviews, creating a ton of noise.


We’re working to build a better automated version of code reviews at Sourcery. Try it out on any of your GitHub repos or reach out to hello@sourcery.ai to learn more.