Code Review: What And What Not To Automate?

Embracing tools and improving the feedback.

Date

Mar 06, 2023


Photo by Kevin Ku on Unsplash

Code review is the main quality gate in many software projects. Often, it also takes up a significant amount of developer time. This brings up the question: Can we automate at least parts of it? Can we make the review more consistent and free up some developer time by making tools do the job?

This post tries to come up with some guidelines: Which parts of the review can be done well by our current tools? Which parts benefit from human intuition?

What To Automate?

  • Tools can determine whether something exists. They aren't good at determining whether it makes sense.
  • Tools are great at detecting problems. They usually can't verify that something is great.

To give a non-code example of both ideas: checking a text with tools is super useful for pointing out mistakes in spelling, style, and structure. But there is no tool that tells you whether the text is understandable for your target audience. At least not yet.

Tests

We have amazing coverage tools to check whether a piece of code is executed by tests. But these tools can't tell us much about the quality of these tests.

As an example, consider this test for a create CLI command:

from typer.testing import CliRunner  # assuming a Typer-based CLI

from my_project.cli import app  # hypothetical import path for the CLI under test

runner = CliRunner()

def test_basic():
    result = runner.invoke(
        app,
        ["create", "example-data", "--plain"],
    )

    assert result.exit_code == 0

It executes the create command => coverage ✅

Unfortunately, this test doesn't tell us anything besides a successful exit code.

  • What does the user see in stdout and stderr?
  • Has the right file been created?
  • Is it in the expected directory?
  • Does it have the right content?

Checking whether a test contains the necessary assertions feels like a great job for a tool. Yet, I'm not aware of such a tool.
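Until such a tool exists, these checks have to be added by hand. A possible extension might look like this; the file name, location, and output are assumptions about this hypothetical create command:

def test_basic(tmp_path, monkeypatch):
    # Run the command in an isolated working directory.
    monkeypatch.chdir(tmp_path)

    result = runner.invoke(
        app,
        ["create", "example-data", "--plain"],
    )

    assert result.exit_code == 0
    # What does the user see in stdout?
    assert "example-data" in result.stdout
    # Has the right file been created, in the expected directory,
    # and does it have the right content?
    created_file = tmp_path / "example-data.yaml"
    assert "example-data" in created_file.read_text()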

One aspect that's difficult to formalize and automate is relevance: Do the tests check what matters to the users of the end product? To evaluate this, we need to consider the various ways the product can be used. What are the use cases? Which of them matter most?

The next tricky question is: Is the test data relevant? Even if we extend the test with various assertions, we'll only learn that everything is fine with example-data. But is example-data really representative of what our users will pass as this argument? Should we extend our test suite with very long words, special characters, spaces, URLs?

To answer these questions, both domain and technical knowledge are necessary. Which means a review by at least one human (and often more).

What can we automate?

  • Line coverage (Coverage.py)
  • Branch coverage (Coverage.py)
  • Randomizing test data with property-based testing (Hypothesis), as sketched below
  • Is there an assertion for each important property of the system? (This feels like something that could be automated, but I'm not aware of a tool for it. 🤔)
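As a sketch of that property-based approach, Hypothesis can generate a wide range of inputs for the create command. The strategy and the accepted exit codes below are assumptions about this hypothetical command, not a prescription:

from hypothesis import given, strategies as st


@given(name=st.text(min_size=1, max_size=100))
def test_create_handles_varied_names(name):
    result = runner.invoke(app, ["create", name, "--plain"])

    # Assumption: the command either succeeds (0) or rejects odd input with
    # a usage error (2); an unhandled crash would surface as exit code 1.
    assert result.exit_code in (0, 2)

Whether long words, special characters, or URLs are realistic inputs still needs the domain knowledge discussed above; Hypothesis only makes it cheap to try them.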

Where do we need human review?

  • Do we have tests for the important use cases?
  • Is the test data realistic?

Documentation

Writing For Your Audience(s) and Wagging Dogs

Documentation often gets less attention and a less thorough review than code. That's strange, considering the various roles it plays. While code is read (mainly) by the developers working on the project, documentation can have a much broader audience:

  • end users
  • developers using a library
  • support team members trying to figure out a production issue

Daniele Procida phrases this in a very explicit way in a blog post about documentation at Canonical:

Our software documentation is part of how we talk to each other – our users, our colleagues, our community. It’s a way we demonstrate how we value each other – including how we value you. We’ve understood the importance of this for some time, but actually finding a way to express those values in our practice is less easy.

In his thought-provoking talk, How to wag a dog?, he also argues that documentation should inform product and code architecture decisions. While this idea might sound radical, it echoes a line from the Zen of Python:

If the implementation is hard to explain, it's a bad idea.

Written by AI, Reviewed by Humans?

This topic has become even more interesting in recent months. Folks have started to use generative AI to write both docstrings and longer forms of documentation. Most of the time, the conclusion is: ChatGPT & co write surprisingly good docs. They serve as amazing starting points, but need review and improvement from a human. This makes the question even more relevant: Where can we trust tools, and where do we need human insight?

Automated & Human Docs Review

For documentation, we see a pattern similar to the one for tests: tools can tell us whether a specific type of documentation exists. They can't tell us whether the documentation makes sense.

Consider this example, the most popular template of the Sourcery Rules Generator:

@app.command()
def create(
    # Various CLI options listed here, including an interactive_flag option.
):
    """Create a new Sourcery dependency rule."""
    interactive = sys.stdin.isatty() and interactive_flag

"Create a new Sourcery dependency rule." is a correct description. It makes a lot of sense, assuming that you know what Sourcery is and what kind of rules it uses.

However, this description doesn't do a good job of introducing this command to somebody who:

  • has no prior knowledge about Sourcery;
  • is interested in making the package structure of their project less chaotic.

A better attempt probably looks something like this:

@app.command()
def create(
    # Various CLI options listed here.
):
    """Create rules to check the dependencies between packages of a project.

    Create rules to verify that internal and external packages are imported only by the allowed parts of a project.
    """

What can we automate?

  • Is there a docstring for every module, class, and public function? (See the sketch after this list.)
  • Does the docstring mention all arguments?
  • Spelling and grammar checks.
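Tools like pydocstyle, interrogate, or Ruff's pydocstyle rules cover the first of these checks. A toy sketch of the core idea, using only the standard library:

import ast


def public_functions_without_docstrings(source: str) -> list[str]:
    """Return the names of public functions that have no docstring."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.name.startswith("_")
        and ast.get_docstring(node) is None
    ]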

Where do we need human review?

  • Does the docstring make sense for the intended audience?
  • Is there a record of the important decisions taken? Is this record in a place where readers of the code are likely to find it?
  • Have we introduced a concept where a visualization would make sense?

Code Quality

Code quality is traditionally the focus of code reviews. It's also the area with the biggest potential for discussion and disagreement. Many reviews contain both objective improvements and statements of personal taste, and the line between the two is blurred. Automated tools bring multiple benefits:

  • Increased objectivity.
  • Formalized agreements that prevent recurring discussions about the same topics.

Automated & Human Code Quality Review

Let's look at a function that could benefit from both automated and human review.

def _read_source_type_from_tag(content: str) -> str | None:
    for line in content.splitlines():
        if line.startswith("- source/"):
            return line.partition("/")[2]
    return None

How can a review tool improve this function? It will probably replace this loop with the next built-in function. (In fact, that's exactly what the use-next rule from Sourcery does.)

Now, the code looks like this:

def _read_source_type_from_tag(content: str) -> str | None:
    return next(
        (
            line.partition("/")[2]
            for line in content.splitlines()
            if line.startswith("- source/")
        ),
        None,
    )

This is an improvement indeed.

What did the human reviewer say about this improved code?

She wasn't quite happy with the new version. Her biggest advantage over the reviewing tool: context awareness. She didn't just look at this function, but at the whole project. In fact, she also knew the goal of the project: extracting information from the front matter of markdown pages, and occasionally even editing that front matter.

With that crucial context, the human reviewer was quick to point out that this function reads information from yaml-formatted front matter using string functions like splitlines and startswith. She also spotted other parts of the same project where string functions were used not just for reading the front matter, but even for changing its content...

By using these string functions, the code assumes that there's only 1 line that starts with "- source/" and this is the tag containing the source type. The code also makes assumptions about how this yaml is formatted.

Thus the human reviewer's suggestion: use a yaml library and Pydantic to read and write the front matter, and explicitly validate the assumptions about the fields and their values.
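A minimal sketch of that direction, assuming the front matter sits between --- markers and stores tags like source/blog in a tags list (the field names are assumptions, not the project's actual schema):

import yaml  # PyYAML
from pydantic import BaseModel


class FrontMatter(BaseModel):
    title: str
    tags: list[str] = []

    @property
    def source_type(self) -> str | None:
        source_tags = [tag for tag in self.tags if tag.startswith("source/")]
        # Make the "exactly one source tag" assumption explicit.
        if len(source_tags) != 1:
            return None
        return source_tags[0].partition("/")[2]


def read_front_matter(content: str) -> FrontMatter:
    # Expects: "---\n<yaml front matter>\n---\n<markdown body>"
    _, raw_front_matter, _ = content.split("---", 2)
    return FrontMatter(**yaml.safe_load(raw_front_matter))

Writing the front matter back then means dumping the validated model (e.g. yaml.safe_dump(front_matter.model_dump()) with Pydantic v2) instead of editing strings in place.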

You could argue that the issue raised by the human reviewer was much more important. Using string functions to manipulate yaml is a maintenance nightmare; a superfluous loop is a minor nuisance in comparison.

But a crucial point is that the reviewing tool and the human reviewer aren't in competition with each other. She was actually quite happy to have tools doing a first sanity and aesthetic check. Looking at nicely formatted and cleaned-up code, she was able to take a broader view and ask more fundamental questions.

Way too often, we see very skilled developers raise 20 tiny issues during a code review and miss the big one, like the missing yaml library.

What can we automate?

  • Detecting antipatterns.
  • Have we chosen the right data structure?
  • Sometimes: Is this code at the right place (package, module)?
  • Are some functions or modules too long?
  • Calculating general code quality metrics.

Where do we need human review?

  • Is this code at the right place (package, module)?
  • Do we use the right level of abstraction?
  • Is the code more complex than necessary?
  • Are we doing the right trade-offs?
  • Are we using the right tool for the job?
  • Which assumptions does this code have? Are these assumptions correct? Should we make them explicit?
  • Anywhere a bigger context is beneficial.

Names

Names are a crucial aspect of code quality. Python has the reputation of being very readable, almost like a human language. How readable a specific piece of Python code actually is depends heavily on the names of its artifacts: modules, functions, classes.

It's easy to forget that this works in both directions: misleading names and inconsistent terminology make the code harder to understand. They can give wrong cues about where to look for the root cause of a bug.

Automated & Human Naming Review

import glob


def find_pgn():
    f = glob.glob("*.pgn")
    return f[0] if len(f) == 1 else None

Several tools exist that will point out that the single-letter variable name f is an antipattern.

It's more difficult for a tool to detect the problems with the name find_pgn. For that, again, we need a bigger context: looking not just at the name itself, but also at the code of the function. With that broader view, a human reviewer can point out that the name find_pgn fails to communicate 2 important pieces of information:

  1. The scope of the search: the current directory.
  2. The assumption that the directory contains exactly 1 .pgn file.

Some possible improvements:

  • Rename the function to find_only_pgn_in_current_dir.
  • Or probably even better: add a parameter for the directory (sketched below).
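A possible version with the directory as a parameter; a sketch of the idea, not the project's actual code:

from pathlib import Path


def find_only_pgn(directory: Path) -> Path | None:
    """Return the single .pgn file in directory, or None if there isn't exactly one."""
    pgn_files = list(directory.glob("*.pgn"))
    return pgn_files[0] if len(pgn_files) == 1 else None

Both the scope of the search and the "exactly one file" assumption are now visible in the signature and the docstring.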

This is a pretty common pattern: taking a closer look at a bad name reveals further problems. That could be an interesting workflow for code quality tools as well:

  1. Analyze the names to identify the problematic parts in the code base (a rough sketch of this step follows below).
  2. Use further analyzers only for those problematic parts.
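A toy sketch of the first step, flagging single-letter names with the standard library's ast module (real linters such as Pylint do this with configurable naming rules):

import ast


def single_letter_names(source: str) -> set[str]:
    """Collect single-letter variable names as candidates for closer review."""
    tree = ast.parse(source)
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and len(node.id) == 1
    }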

But let's recap what our current code quality tools can and can't do. 🙂

What can we automate?

  • Detecting single-letter names and similar naming antipatterns (like the f above).

Where do we need human review?

  • Does the name describe well what the code is doing?
  • Is the name misleading in any way?
  • Is the terminology consistent with the rest of the codebase, the documentation, the user stories etc.?

Conclusion

Several aspects of a code review can be automated. While these aren't the biggest and most important parts of the review, it still makes sense to automate them. With the mundane stuff handled by tools, human reviewers can focus on the more interesting parts.

Related Sources

From Sourcery