Embracing tools and improving the feedback.
Mar 06, 2023
Code review is the main quality gate in many software projects. Often, it also takes up a significant amount of developer time. This brings up the question: Can we automate at least parts of it? Can we make the review more consistent and free up some developer time by making tools do the job?
This post tries to come up with some guidelines: Which parts of the review can be done well by our current tools? Which parts benefit from human intuition?
To give a non-code example for both concepts: Checking a text with tools is super useful for pointing out many types of mistakes in spelling, style, and structure. But there isn't a tool that can tell you whether a text is understandable for your target audience. At least not yet.
We have amazing coverage tools to check whether a piece of code is executed by tests. But these tools can't tell us much about the quality of these tests.
As an example, consider this test for a create CLI command:
# runner is a CliRunner (e.g. from typer.testing) and app is the project's CLI app.
def test_basic():
    result = runner.invoke(
        app,
        ["create", "example-data", "--plain"],
    )
    assert result.exit_code == 0
It executes the create command => coverage ✅
Unfortunately, this test doesn't tell us anything besides a successful exit code. What does the command write to stdout and stderr?
Checking whether a test contains the necessary assertions feels like a great job for a tool. Yet, I'm not aware of the existence of such a tool.
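A hedged sketch of how this test could become more informative, reusing the runner and app from the snippet above; the expected output is an assumption about what the create command prints, not its real behavior:
def test_create_reports_created_rule():
    result = runner.invoke(
        app,
        ["create", "example-data", "--plain"],
    )
    assert result.exit_code == 0
    # Hypothetical expectations: adjust to what the command actually prints.
    assert "example-data" in result.stdout
    assert "error" not in result.stdout.lower()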
One aspect difficult to formalize and automate is relevance: Do tests check what matters to the users of the end product? To evaluate this, we need to consider the various ways the product can be used. What are the use cases? Which of them is more important?
The next tricky question is: Is the test data relevant? Even if we extend the test with various assertions, we'll only learn that everything is fine with example-data. But is example-data indeed a good example of what our users will input as this argument? Should we extend our test suite with very long words, special characters, spaces, URLs?
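A hedged sketch of how such inputs could be covered with pytest's parametrize. The sample values are guesses, and whether each of them should succeed or be rejected is exactly the domain question a human has to answer:
import pytest

# Hypothetical inputs; the real list should come from actual use cases.
NAME_CANDIDATES = [
    "example-data",
    "a-very-long-name-" * 20,
    "name with spaces",
    "nämé-with-spéciäl-chàrs",
    "https://example.com/data",
]

@pytest.mark.parametrize("name", NAME_CANDIDATES)
def test_create_with_varied_input(name):
    result = runner.invoke(app, ["create", name, "--plain"])
    # Assumes all of these should be accepted; some might need to fail instead.
    assert result.exit_code == 0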
To answer these questions, both domain and technical knowledge are necessary. Which means a review by at least one human, and often more.
What can we automate?
Where do we need human review?
Documentation often gets less attention and a less thorough review. That's strange if we consider the various roles it plays. While code is read (mainly) by the developers working on the project, documentation can have a much broader audience.
Daniele Procida phrases this in a very explicit way in a blog post about documentation at Canonical:
Our software documentation is part of how we talk to each other – our users, our colleagues, our community. It’s a way we demonstrate how we value each other – including how we value you. We’ve understood the importance of this for some time, but actually finding a way to express those values in our practice is less easy.
In his thought-provoking talk, How to wag a dog?, he also argues that documentation should inform product and code architecture decisions. While this idea might sound radical, it echoes a line from the Zen of Python:
If the implementation is hard to explain, it's a bad idea.
This topic has become more interesting in the past months. Folks have started to use generative AI to write both docstrings and longer forms of documentation. Most of the time, the conclusion is that ChatGPT & co write surprisingly good docs. They serve as amazing starting points, but need review and improvement from a human. This makes the question even more relevant: Where can we trust tools and where do we need human insight?
For documentation, we see a similar pattern as for tests. Tools can tell us whether a specific type of documentation exists. They can't tell us whether the documentation makes sense.
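A toy sketch of that first kind of check, using Python's ast module to flag public functions that lack any docstring; the function name and the notion of "public" are my own choices, not an existing tool:
import ast

def functions_without_docstring(source: str) -> list[str]:
    # Report public functions that have no docstring at all.
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.name.startswith("_")
        and ast.get_docstring(node) is None
    ]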
Consider this example, the most popular template of the Sourcery Rules Generator:
@app.command()
def create(
    # Various CLI options listed here.
):
    """Create a new Sourcery dependency rule."""
    interactive = sys.stdin.isatty() and interactive_flag
"Create a new Sourcery dependency rule." is a correct description. It makes a lot of sense, assuming that you know what Sourcery is and what kind of rules it uses.
However, this description doesn't do a good job of introducing the command to somebody who doesn't yet know what Sourcery is and what kind of rules it uses.
A better attempt is probably something like this:
@app.command()
def create(
    # Various CLI options listed here.
):
    """Create rules to check the dependencies between packages of a project.

    Create rules to verify that internal and external packages are imported only by the allowed parts of a project.
    """
What can we automate?
Where do we need human review?
Code quality is traditionally the focus of code reviews. This is also the topic with the biggest potential for discussions and disagreements. Many reviews contain both objective improvements and statements of personal taste, and the line between the two is blurred. Automated tools can bring multiple benefits here.
Let's look at a function that could benefit from both automated and human review.
def _read_source_type_from_tag(content: str) -> str | None:
    for line in content.splitlines():
        if line.startswith("- source/"):
            return line.partition("/")[2]
    return None
How can a review tool improve this function? It will probably replace this loop with the next built-in function. (In fact, that's exactly what the use-next rule from Sourcery does.)
Now, the code looks like this:
def _read_source_type_from_tag(content: str) -> str | None:
    return next(
        (
            line.partition("/")[2]
            for line in content.splitlines()
            if line.startswith("- source/")
        ),
        None,
    )
This is an improvement indeed.
What did the human reviewer say about this improved code?
She wasn't quite happy with that new version. Her biggest advantage compared to the reviewing tool: context awareness. She didn't just look at this function, but at the whole project. In fact, she also knew the goal of the project under review: extracting information from the front matter of markdown pages, and occasionally even editing that front matter.
With that crucial info about the context, the human reviewer was quick to point out that this function reads information from the front matter in YAML format using string functions like splitlines and startswith. She also spotted other parts in the same project where string functions were used not just for reading the front matter, but even for changing its content...
By using these string functions, the code assumes that there's only one line starting with "- source/" and that this is the tag containing the source type. The code also makes assumptions about how the YAML is formatted.
Thus the human reviewer's suggestion: use a YAML parser library and Pydantic to read and write the front matter, and explicitly validate the assumptions about the fields and their values.
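A minimal sketch of that suggestion, assuming PyYAML and Pydantic, and assuming the front matter sits between the first two "---" markers of the page; the FrontMatter model is made up for illustration:
import yaml
from pydantic import BaseModel

class FrontMatter(BaseModel):
    # Only the field needed here; a real model would declare all expected fields.
    tags: list[str] = []

def read_source_type(content: str) -> str | None:
    # The front matter is assumed to sit between the first two "---" markers.
    _, raw, _ = content.split("---", 2)
    front_matter = FrontMatter(**yaml.safe_load(raw))  # validates fields and types
    return next(
        (tag.partition("/")[2] for tag in front_matter.tags if tag.startswith("source/")),
        None,
    )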
You could argue that the issue raised by the human reviewer was much more important. Using string functions to manipulate YAML is a maintenance nightmare; a superfluous loop is a minor nuisance in comparison.
But a crucial point is that the reviewing tool and the human reviewer aren't in competition with each other. She was actually quite happy to have tools do a first sanity and aesthetic check. While looking at nicely formatted and cleaned-up code, she was able to take a broader view and ask more fundamental questions.
Way too often, we see very skilled developers raise 20 tiny issues during a code review and miss the big one, like the lack of a YAML parser.
What can we automate?
Where do we need human review?
Names are a crucial aspect of code quality. Python has the reputation of being very readable, similar to a human language. How readable a specific piece of Python code is depends highly on the names of the artifacts: modules, functions, classes.
It's easy to forget that this works in both directions: Misleading names and inconsistent terminology make the code more difficult to understand. They can give wrong cues about where to look for the root cause of a bug.
import glob

def find_pgn():
    f = glob.glob("*.pgn")
    return f[0] if len(f) == 1 else None
Several tools exist that will point out that the single-letter variable name f is an antipattern.
It's more difficult for a tool to detect the problems with the name find_pgn.
For that, again, we need a bigger context: looking not just at the name itself, but also at the code of the function. With that broader view, a human reviewer can point out that the name find_pgn fails to communicate two important pieces of information: the function looks only in the current directory, and it returns a path only if there is exactly one .pgn file.
Some possible improvements: a more explicit name, like find_only_pgn_in_current_dir, as sketched below.
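A hedged sketch of such a rename, keeping the original behavior and just making the function and variable names more descriptive:
import glob

def find_only_pgn_in_current_dir() -> str | None:
    # Return the single .pgn file in the current directory,
    # or None if there are zero or several of them.
    pgn_files = glob.glob("*.pgn")
    return pgn_files[0] if len(pgn_files) == 1 else None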
This is a pretty common pattern: taking a closer look at a bad name reveals further problems. That could be an interesting workflow for code quality tools as well.
But let's recap what our current code quality tools can and can't do. 🙂
What can we automate?
Where do we need human review?
Several aspects of a code review can be automated. While these aren't the biggest and most important parts of the review, it still makes sense to automate them. By automating the mundane stuff, the human reviewers are able to focus on the more interesting parts.