Setting up unit tests when your code isn’t deterministic anymore
Apr 25, 2024
A few weeks ago we ran into an annoying regression in our product.
We spent a lot of time trying to eliminate praise-like comments1 from our code reviews with LLM instructions and filtering, and, for a while, that worked perfectly.
But then one day the praise comments snuck back in, and in a pretty insidious way.
Instead of openly praising the code, the new comments disguised the praise. They showed up as potential bug comments or other comment types, but what they actually did was compliment the implementation.
Not only were these comments annoying (praise comments create noise in a review), they were also being miscategorized as bug risks or security issues, which made the whole review feel wrong.
We started thinking - how can we fix this and how could we have stopped it in the first place?
For most of our code, we would have had unit tests in place and would’ve caught a potential regression like this immediately - but we’d never really considered testing LLM outputs in the same way.
The way we approach testing LLM outputs isn’t all that different from how we approach standard unit tests. Structurally they look the same: we use pytest, have fixtures, set up parameterized tests, and so on.
For example, here’s one of our tests:
def test_find_hardcoded_secret(
    review_config: ReviewConfig,
) -> None:
    diff = Diff(
        clean_triple_quote_text(
            """
            diff --git a/.env b/.env
            index 99792dd..0000000
            --- a/.env
            +++ b/.env
            @@ -0,0 +1,2 @@
             # Created by Vercel CLI
            +POSTGRES_PASSWORD="REDACTED"
            """
        )
    )
    comments = review_security_issues.invoke(diff, review_config.to_runnable_config())
    assert len(comments) == 1
    [comment] = comments
    assert comment.location == ".env:2"
    assert comment.file == ".env"
    assert comment.comment_type == ReviewCommentType.ISSUE
    assert comment.importance == ReviewCommentImportance.BLOCKING
    assert comment.area == CommentRequestReviewCommentArea.SECURITY
    assert comment.diff_line_for_comment == '+POSTGRES_PASSWORD="REDACTED"'
    assert (
        comment.diff_context
        == ' # Created by Vercel CLI\n+POSTGRES_PASSWORD="REDACTED"'
    )
    assert comment.start_line_in_file == 2
    assert comment.end_line_in_file == 2
    assert comment.start_line_in_diff == 1
    assert comment.end_line_in_diff == 1
Just like a traditional unit test, it has some input data (in our case a code diff), the data gets processed (we run a code review over it), and we perform a series of assertions on the results.
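The review_config argument there is a pytest fixture. The real ReviewConfig comes from our internal code, so the import path and constructor arguments below are hypothetical, but a minimal sketch of the kind of fixture the test relies on looks something like this:

# conftest.py (hypothetical sketch: ReviewConfig and its arguments live in
# our internal codebase, so treat the specifics here as placeholders)
import pytest

from sourcery_review import ReviewConfig  # assumed import path


@pytest.fixture
def review_config() -> ReviewConfig:
    # Pin the model and temperature to keep runs as repeatable as possible.
    return ReviewConfig(model="gpt-4o", temperature=0.0)  # hypothetical arguments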
But one big difference between testing for our LLM application and other applications is that we can’t mock network requests like we usually would. What we get back from an LLM varies at least a little bit on almost every response, so we can’t predict every variation of the response in our mocks. Instead, we wind up actually calling the LLM so we get a “real” response.
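Because these tests hit a live model, we also have to be deliberate about where they run. One common pytest pattern for this (a sketch, not necessarily how our suite is wired up; the environment variable assumes an OpenAI-style provider) is to skip the live-LLM tests when no API key is available:

import os

import pytest

# Skip every test in this module unless an API key is configured, so the
# live-LLM tests only run in environments that can actually reach the model.
pytestmark = pytest.mark.skipif(
    not os.environ.get("OPENAI_API_KEY"),
    reason="live LLM tests need an API key",
)

Keeping these tests in their own module, or behind a custom marker, also makes it easy to run them separately from the fast, fully mocked unit tests.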
The variability of LLM responses creates another bit of trickiness in writing assertions for the tests. For example, we can’t assume a model will respond with "Looks good" for a given diff, because it might just as easily say "Looks great", or even "LGTM", or some other variation.
We have a couple of strategies to deal with this:
Looking back at the example test from earlier, you might have noticed that we don’t actually inspect the comment message directly. Instead, we check that there is a comment, that it’s a blocking security comment, and that it’s positioned on the right line. We don’t need to validate the specific wording of the comment; the broader structured information tells us the security issue is being correctly flagged.
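And for the rare cases where we do want to sanity-check the text itself, we can loosen the assertion instead of matching an exact string. A small helper along these lines works (the comment.message attribute name is an assumption about our comment model):

def assert_mentions_any(text: str, keywords: list[str]) -> None:
    # Loose text check: pass if the text mentions any of the expected terms,
    # rather than demanding one exact phrasing from the model.
    lowered = text.lower()
    assert any(keyword in lowered for keyword in keywords), (
        f"expected one of {keywords!r} in: {text!r}"
    )


# For the hardcoded-secret test this might be:
# assert_mentions_any(comment.message, ["secret", "password", "credential"])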
We’re still in the early days of using unit tests to help validate LLM responses. They’ve allowed us to be more confident in our development and push faster, but there are still some big opportunities to improve our tests that we’re thinking about.
Using LLMs to review LLMs: Some of our more complex tests might benefit from adding an intermediate layer of LLM analysis of the response to help identify certain response characteristics or bucket the response into a given category that we can then deterministically assert against. But this might also add a layer of uncertainty to the tests, so we’ll have to see how well this works.
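A sketch of how that could look (classify_comment_with_llm and the comment.body attribute are hypothetical placeholders for whatever judge call and comment field we'd actually use):

from enum import Enum


class CommentCategory(str, Enum):
    PRAISE = "praise"
    BUG_RISK = "bug_risk"
    SECURITY = "security"
    OTHER = "other"


def classify_comment_with_llm(comment_text: str) -> CommentCategory:
    """Hypothetical judge: ask a second LLM to put the comment into one of a
    fixed set of buckets, then parse its answer into the enum."""
    raise NotImplementedError


def test_review_does_not_sneak_in_praise(review_config: ReviewConfig) -> None:
    diff = Diff(...)  # a diff we've seen trigger disguised praise before
    comments = review_security_issues.invoke(diff, review_config.to_runnable_config())
    # The judge's output is a fixed category, so the final assertion stays
    # deterministic even though two LLM calls sit behind it.
    assert all(
        classify_comment_with_llm(comment.body) != CommentCategory.PRAISE
        for comment in comments
    )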
Figuring out CI tolerance: Because of the inherent variability within the tests, we don’t currently want our CI to fail if any single test fails. But there is a threshold at which it should fail. We’re working to identify what that threshold should be, how many times we should run each test per CI run, and what a “pass threshold” would look like for each test. Right now we require that 95% of our LLM tests pass, and this seems like a good level for the moment.
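As a rough sketch of what a per-test pass threshold could look like (the run count, the threshold, and the diff placeholder are illustrative rather than final):

RUNS_PER_TEST = 20
PASS_THRESHOLD = 0.95  # illustrative; in practice this would be tuned per test


def test_find_hardcoded_secret_pass_rate(review_config: ReviewConfig) -> None:
    diff = Diff(...)  # the same .env diff as in the earlier test
    passes = 0
    for _ in range(RUNS_PER_TEST):
        comments = review_security_issues.invoke(
            diff, review_config.to_runnable_config()
        )
        # Count a run as passing if a blocking security comment lands on the
        # right line, mirroring the key assertions from the earlier test.
        if any(
            comment.area == CommentRequestReviewCommentArea.SECURITY
            and comment.importance == ReviewCommentImportance.BLOCKING
            and comment.location == ".env:2"
            for comment in comments
        ):
            passes += 1
    assert passes / RUNS_PER_TEST >= PASS_THRESHOLD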
We’ve seen some great early benefits from unit testing the LLM responses in Sourcery, both in being able to make changes more confidently without fearing regressions and in validating that new changes do what we want them to do. To quote one of my teammates: “I am now 1000% more confident that I am not introducing regressions.” I’d recommend that anyone working with LLMs try setting up unit tests for their projects.
1 Most of the LLMs we’ve tested tend to praise excessively during code reviews, creating a ton of noise.
We’re working to build a better automated version of code reviews at Sourcery. Try it out on any of your GitHub repos or reach out to hello@sourcery.ai to learn more.