Best practices for tracking down and resolving software issues
As developers, we spend a lot of our time tracking down and fixing bugs. Most of us see it as an unwelcome chore and would rather be writing exciting new code. There's some truth in that, and bug-fixing can certainly be frustrating (and unpleasant if there's time pressure).
However, tracking down and fixing software issues can also be a fascinating puzzle. Sleuthing your way to a complete understanding of an issue and then crafting a solution can make you feel like a cross between Sherlock Holmes and Scotty the Engineer.
To make the process quicker and more enjoyable, here are some tips on the best way to identify and fix bugs.
The first thing you have to do when faced with a bug report is reproduce the issue. This is absolutely vital, since if you can't see the evidence that the code is broken then there's no way of checking that it's been fixed.
The effort required to reproduce an issue can vary from trivial to impossible. If you've noticed the bug yourself or the bug report is detailed then there's not much to do except check that it can be reproduced reliably. Unfortunately most users are not technical experts and so will not automatically write detailed and informative bug reports. This means that you will often need to get more details. Try to get step-by-step instructions on how to reproduce the issue, along with any other useful context such as software versions and the operating system used. If you're on a helpdesk it can be helpful to talk directly to the person who experienced the bug to avoid long email exchanges.
If you can't immediately reproduce the issue then make sure you are using as similar an environment to the reporting user as possible, including things like OS version, software/browser version and any configuration settings.
If the issue is sporadic or just can't be reproduced at all then you'll have to use a logging-based approach (see later on in this article).
To narrow down the issue it's often best to see if you can identify the crucial steps needed to reproduce it. For example if the bug relates to placing an order then steps on the bug report related to viewing account information may not be needed. This will help identify the section of the code that is at fault, as well as saving time since you might need to reproduce/debug the issue many times before you've cracked it.
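One way to picture this narrowing-down is as a small search: drop one step at a time, and keep the shorter list whenever the bug still shows up. The sketch below assumes a hypothetical `reproduces` function that replays a list of steps and reports whether the bug appeared. In practice that is often you clicking through the application by hand, and the step names here are made up for illustration.

```python
def minimize_steps(steps, reproduces):
    """Return a shorter list of steps that still triggers the bug.

    `reproduces` is a hypothetical callable: it takes a list of steps
    and returns True if the bug appears after performing them.
    """
    current = list(steps)
    i = 0
    while i < len(current):
        # Try the reproduction without step i.
        candidate = current[:i] + current[i + 1:]
        if reproduces(candidate):
            current = candidate  # step i was not needed, drop it
        else:
            i += 1  # step i is crucial, keep it and move on
    return current


# Illustrative stand-in: the bug needs these three steps, in any order.
def bug_appears(steps):
    return {"log in", "add item to basket", "place order"} <= set(steps)


reported = ["log in", "view account", "add item to basket", "place order"]
crucial = minimize_steps(reported, bug_appears)
print(crucial)  # ['log in', 'add item to basket', 'place order']
```

This greedy approach needs one reproduction attempt per step, so it pays off most when a single attempt is cheap.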
If at all possible try to create a failing unit test that reproduces the issue. This will let you reproduce it instantly and capture it in the debugger, as well as verifying it is fixed once the test passes.
Bugs that can't be reproduced can be a real nightmare. The best thing to do in this case is to gather as much information about the issue as possible. This can involve getting logs from the machines where the issue occurs, turning on enhanced logging or adding extra logging to the application and delivering it. Screenshots of the issue occurring can also be extremely helpful in identifying the circumstances in which it occurs.
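A minimal sketch of what that extra logging could look like in Python, using only the standard library. The logger name `myapp`, the `APP_VERSION` value and the `verbose` switch are assumptions made for the example, not part of any particular application.

```python
import logging
import platform
import sys

APP_VERSION = "1.4.2"  # hypothetical version string


def environment_summary():
    """Collect the context that is most often missing from bug reports."""
    return {
        "app_version": APP_VERSION,
        "os": platform.system(),
        "os_release": platform.release(),
        "python": sys.version.split()[0],
    }


def configure_logging(verbose=False):
    """Set up a logger; verbose=True turns on the enhanced (DEBUG) output."""
    logger = logging.getLogger("myapp")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Record the environment once at startup so every log file carries it.
    logger.info("environment: %s", environment_summary())
    return logger
```

Logging the environment up front is cheap, and it means the details are already there when a log file arrives from a user.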
An example where logging came in handy was at a previous company where users of our desktop application started reporting strange behaviour. After using the application for some time other applications would shut down, or ours would, or the whole OS would behave strangely or crash. We tried and failed to reproduce it locally so started collecting logs from users.
At first we focused on memory usage, thinking a memory leak must be to blame (the application was written in Delphi). However the logs didn't show a clear memory leak, and we were stuck on the issue for days.
Fortunately we had decided to log the OS of the users, and I noticed that all the ones having trouble were on Windows XP. This brought to mind an old blog post I had read on the limits of Windows. The post mentioned that XP had a much lower limit on memory allocated to USER objects (representing windows, icons, popups, etc.) than later versions. This limit was small enough to not show much of an impact on overall memory allocation.
Armed with this hypothesis I looked at recent changes to how we handled UI elements, and discovered a change to how we created popup windows - this was the cause of the USER object handle leak. In an XP VM I was then able to reproduce the issue by right-clicking 40-50 times (something users would naturally do in the course of a day but that we didn't in our shorter testing sessions).
There are several ways of finding the code which is the cause of the errant behaviour. Sometimes it is obvious, either through the nature of the error or previous experience. Where it's not, try one of the following:
Determine exactly when it started happening, and in which versions, and look for commits that went live at that time. Depending on where you work there may be particular files (or developers) that are more prone to issues than others.
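If you have a version that is known to be good and a later one that is known to be bad, finding when the bug started can be treated as a binary search. The sketch below assumes the versions are in release order and that once the bug appears it stays in every later version; `is_bad` is a hypothetical check, such as running your reproduction against that version. `git bisect` automates the same idea over commits.

```python
def first_bad_version(versions, is_bad):
    """Binary search for the first version where the bug appears.

    Assumes versions[0] is good, versions[-1] is bad, and the bug
    persists in every version after the one that introduced it.
    """
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # bug already present: look at mid or earlier
        else:
            lo = mid + 1  # still good: the bug arrived later
    return versions[lo]


# Illustrative data: the bug was (hypothetically) introduced in 1.3.
releases = ["1.0", "1.1", "1.2", "1.3", "1.4"]
culprit = first_bad_version(releases, lambda v: v in {"1.3", "1.4"})
print(culprit)  # 1.3
```

With a few hundred candidate commits this takes under ten reproduction attempts, which matters when each attempt is slow.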
If you're starting to get stuck it's often super useful to ask someone. Firstly the process of having to explain the issue to someone else can often jog something in your brain and help you to solve it. Secondly a more experienced dev, or someone who is knowledgeable about that part of the system may be able to shortcut hours or even days of work by identifying the problem straight away. As a senior developer being able to point people in the right direction was one of the most rewarding parts of the job. Don't be afraid to ask for help as long as you've had a good go at figuring out the issue and are able to explain the problem in a detailed and logical way.
What conditions in the code must be true for the bug to occur? If you write these down, along with the assumptions that you're making, it can often help locate the issue. When you eventually find the cause of a problem it's often an obvious thing that you assumed couldn't go wrong.
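Writing the assumptions down can be as literal as turning each one into a named check and running it against the data involved in the bug. This is only a sketch: the `order` dictionary and the three assumptions are invented for illustration.

```python
def violated_assumptions(order):
    """Return the names of the assumptions that do not hold for this order."""
    items = order.get("items", [])
    checks = {
        "order has at least one item": len(items) > 0,
        "every quantity is positive": all(i["quantity"] > 0 for i in items),
        "currency is set": bool(order.get("currency")),
    }
    return [name for name, holds in checks.items() if not holds]


# A made-up order taken from a bug report.
suspect_order = {"items": [{"sku": "A1", "quantity": 0}], "currency": ""}
print(violated_assumptions(suspect_order))
# ['every quantity is positive', 'currency is set']
```

Anything that comes back in that list is a place where the code may be relying on something that isn't true.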
A recent example came when developing a new VS Code extension for Sourcery. The Language Server Protocol API call that we were trying to use wasn't working, and we spent a long time tweaking the parameters assuming that we were doing something wrong. In the end it turned out that our dependencies weren't up to date and we were using an old version of the API that didn't contain that particular call.
When you've been tracking down a bug long enough it's easy to convince yourself that it can't be in your code and that some outlandish problem that's outside your control must be to blame. Unfortunately in 99.99% of cases it's a bug in your code. Unless the problem is a one-time or extremely sporadic issue that is fixed by rebooting, you'll just have to keep plugging away at it. If you're totally stuck, it's time to step away and sleep on it.
Staring at an issue long into the night is very rarely helpful. A good night's sleep will often provide the answer, and you'll come in the next morning with the solution in hand, or at least with new approaches to try.
Once you've found the code that's to blame it's tempting to patch it right away, but it's worth spending a little more time digging. Make sure that you really understand the root cause of the issue, and don't just paper over the cracks. For example, if memory is overflowing, increasing the heap size won't stop the issue from occurring again later. You need to find and fix the leak.
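For the memory example, one way to look for the leak itself in a Python program is the standard library's `tracemalloc` module, which can compare two snapshots and show which lines allocated the memory that is still held. The leaky `handle_request` function below is invented for the demonstration, and other languages have their own profilers for the same job.

```python
import tracemalloc

_cache = []  # hypothetical leak: grows on every request and is never cleared


def handle_request(payload):
    _cache.append(payload * 100)  # keeps a large copy alive forever
    return len(payload)


def top_growth(n_requests=1000, limit=3):
    """Return the source lines whose retained memory grew the most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for i in range(n_requests):
        handle_request(f"request-{i}")
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Differences are sorted with the biggest change first.
    return after.compare_to(before, "lineno")[:limit]


for stat in top_growth():
    print(stat)
```

The first line printed should point at the `_cache.append(...)` call, which is the thing to fix rather than the memory limit.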
Since you reproduced the issue at the start, you can now test it with and without your fix to make absolutely sure it goes away. If possible you should also add or extend a unit test for the issue. This makes sure that if it regresses then you'll know about it before your users do.
Thanks for reading! If you work in Python and want refactoring suggestions to avoid bugs in the first place please check out Sourcery. Also I would love to hear your best bug stories - the best place to contact me is on Twitter.