Comparing to None in Python and Pandas

Truthy and falsy values, None, and comparison in Python and Pandas.

Date: May 19, 2023


Photo by Evan Buchholz on Unsplash

Missing data is a frequent source of headaches (and bugs 🐛). Often, it's far from obvious whether a value can be empty. And if it can, that usually means introducing several conditionals in the code and edge cases in the tests.

Truthy vs Falsy Values and None in Python

The Concept of Truthy and Falsy

To make conditionals that check for missing data more concise, many programming languages, including Python, have a concept of "truthy" and "falsy" values. Thanks to this, various non-boolean data types can be interpreted in boolean contexts. It's important to distinguish between:

  • The literal True or False values: These are boolean.
  • Truthy or falsy values: These can be any boolean or non-boolean data type.
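
As a small illustration of that distinction: an empty list is falsy, yet it is not equal to the literal False.

print(bool([]))     # prints: False -- an empty list is falsy...
print([] == False)  # prints: False -- ...but it is not equal to the literal False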

By default, the following are considered "falsy" in Python:

  • Constants defined to be false: None and False.
  • Zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
  • Empty sequences and collections: '', (), [], {}, set(), range(0)
Note that an empty string counts as an empty sequence, too, which is why it appears in the list above.

Any other value is considered "truthy".
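
A quick sanity check with bool(), just to illustrate the list above:

from decimal import Decimal
from fractions import Fraction

# Every value from the list above evaluates to False in a boolean context.
falsy_values = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
                '', (), [], {}, set(), range(0)]
print(all(not value for value in falsy_values))  # prints: True

# Anything else -- non-zero numbers, non-empty strings and containers -- is truthy.
print(bool(42), bool("hello"), bool([None]))  # prints: True True True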

Comparisons

The concept of truthy and falsy values has a big benefit: It allows you to use non-boolean expressions in conditions and other boolean operations. This makes the code more concise.

data = []

if data:
    print("Data is truthy!")  # This won't print.
else:
    print("Data is falsy!")  # This will print.

data is an empty list, which is falsy, so the else branch gets executed.

Falsy Values with Special Meaning

The concise check above rests on one big assumption: that all falsy values should produce the same behavior. It no longer works if a falsy value has a special meaning, for example if the code needs to behave differently for None and an empty string.
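
For instance, a plain truthiness check can't tell the two cases apart (describe below is just a made-up helper to show the problem):

def describe(value):
    # A truthiness check lumps None and "" together.
    if not value:
        return "missing"
    return "present"

print(describe(None))  # prints: missing
print(describe(""))    # prints: missing -- same branch, even if "" should mean "explicitly empty"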

Giving a special meaning to a falsy value is usually bad practice, because it can easily lead to confusion and bugs 🐛. For example, Django's documentation discourages using multiple values for "no data":

Avoid using null on string-based fields such as CharField and TextField. If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.

Django 4.2 Documentation / Model field reference / Field options

Comparing to None (if You Absolutely Have To)

What if (after considering the trade-offs above) you've decided to give a special meaning to a falsy value? How do you check whether a value is actually None?

There are 2 ways to achieve this:

  • using the equality operator (==) ❌
  • using is (or is not) ✅

PEP 8 has a clear recommendation:

Comparisons to singletons like None should always be done with is or is not, never the equality operators.

The problem with == is that a class can override the __eq__ method (which determines the behavior of ==), and that can lead to unexpected results.

class AlwaysEqual:
    def __eq__(self, other):
        return True


obj = AlwaysEqual()  # avoid naming it "object", which would shadow the built-in

print(obj == None)  # prints: True
print(obj is None)  # prints: False

The AlwaysEqual class overrides the __eq__ method to always return True. Therefore, even though obj isn't None, comparing it to None with == returns True.
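
Following the PEP 8 advice, the safe pattern is a plain identity check:

value = None

if value is None:          # identity check against the None singleton
    print("value is missing")

if value is not None:      # and the explicit opposite (doesn't print here)
    print("value is present")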

Pandas: Use isna

Just like when comparing to None in Python, there are 2 ways to detect missing values in Pandas, and one is clearly preferred:

  • Comparing to numpy.nan with the equality operator ==
  • Using the isna or isnull function ✅

An equality check with the == operator, like df['column'] == np.nan, behaves differently than you might expect. This stems from a peculiar property of numpy.nan: it is not equal to any value, not even itself. (Note that this differs from Python's None, which is a singleton, so None == None returns True.)
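
You can verify this property directly in a Python session:

import numpy as np

print(np.nan == np.nan)  # prints: False -- NaN is not equal even to itself
print(None == None)      # prints: True  -- None is a singleton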

Let's consider this DataFrame with 2 columns A and B as an example:

import numpy as np
import pandas as pd

data = {"A": [1, 2, np.nan, 4], "B": [9, 10, 11, 12]}
df = pd.DataFrame(data)

Comparison with == numpy.nan

print(df["A"] == np.nan)

The output:

0    False
1    False
2    False
3    False
Name: A, dtype: bool

The returned value is always False, even for np.nan.

Comparison with isna()

print(df["A"].isna())

The output:

0    False
1    False
2     True
3    False
Name: A, dtype: bool

The returned value is:

  • True for np.nan
  • False for every other value.
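
In practice, the boolean mask returned by isna is typically used to filter or clean the data. A short example, reusing the df from above:

# Rows of df where column A is missing.
print(df[df["A"].isna()])

# Rows where column A is present (notna is the inverse of isna).
print(df[df["A"].notna()])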

For more info, check out Pandas Docs / Missing Data.

Summary

Dealing with missing and empty values is tricky. In this post, we've discussed 3 guidelines that make it less error-prone:

  • Don't assign special meaning to falsy values.
  • When comparing to None in Python, use is or is not.
  • When looking for missing values in Pandas, use the isna or isnull functions.

Resources

Related Sourcery Rules