Getting the most out of Python collections

A guide to comprehensions, generators and useful functions and classes

Feb 18, 2020

Photo by Rumman Amin on Unsplash

One of Python's best features is that it has awesome capabilities for creating and handling collections. Building a good understanding of these will help you write clean Pythonic code.

To get you started, this guide covers comprehensions, generators, some really useful built-in functions, and a couple of handy collection types from the standard library.

Create and filter Python collections with Comprehensions

List, Dictionary and Set comprehensions are a powerful feature of the Python language that enable you to declare these collection types and fill them with values at the same time. This lets you do all sorts of filtering and mapping operations in a simple and readable way.

List Comprehensions

Without comprehensions a common pattern used to create a list and fill it with values is as follows:

cubes = []
for i in range(20):
    cubes.append(i**3)

Each operation is carried out separately - an assignment to create the empty list, and then a for loop iterating over a sequence within which the list is appended to. We end up with a list of cubed numbers sitting in the cubes variable.

This can be shortened to the following using a list comprehension:

cubes = [i**3 for i in range(20)]

Let's unpack what's going on. At first glance it can be confusing, with lots of the elements from the for loop version moved into different places in the code.

We know the purpose of the code is to create a list of cubed numbers. We therefore have an assignment using square brackets to signify that we are creating a list. Inside the comprehension we can start with the second half: for i in range(20). This is a standard for loop header, but instead of a loop body we put an expression in front of it giving the element that we will create at each iteration. In this case that expression is i**3, cubing the target variable i. The expression can be simple, or can involve any sort of function calls or operations that you like. It does not even have to use the target variable.

We can also add an if clause at the end to filter the list:

odd_cubes = [i**3 for i in range(20) if i % 2 != 0]

Often squeezing code onto one line can make it more difficult to read. In the case of comprehensions all the elements that you need are nicely presented, and once you are used to the syntax it is actually more readable than the for loop version of the same code.

Dictionary Comprehensions

You can also directly create dictionaries using a dictionary comprehension. Here you set the keys and values at once using the following syntax:

squares = {x: x**2 for x in range(100)}

This creates a dictionary mapping numbers to their squared values. As you can see you set the keys and values while iterating over a sequence. Again you can use if clauses in your dictionary comprehensions in order to perform filtering at the same time.

Filtering would look like this:

even_squares = {x: x**2 for x in range(100) if x % 2 == 0}

Set Comprehensions

Moving on, there are also set comprehensions, which create a set, so all values are unique:

square_roots = {math.sqrt(number) for number in number_list}

Again these can be filtered with an if clause at the end. If instead you want to transform values conditionally, you use an if/else conditional expression, and it sits in front of the for rather than after it:

positive_square_roots = {
    math.sqrt(number) if number > 0 else 0 for number in number_list
}
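
If you just want to drop the non-positive numbers rather than mapping them to 0, the filtering if goes at the end as before (a small sketch, assuming math has been imported and number_list is a list of numbers):

filtered_square_roots = {
    math.sqrt(number) for number in number_list if number > 0
}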

More complex comprehensions

Comprehensions can also replace nested for loops. Take the example of flattening a list of lists:

list_of_lists = [[1, 2], [3, 4]]
flat = []
for sub_list in list_of_lists:
    for element in sub_list:
        flat.append(element)

This produces [1, 2, 3, 4]. As a comprehension this turns into:

list_of_lists = [[1, 2], [3, 4]]
flat = [element for sub_list in list_of_lists for element in sub_list]

You can see that the ordering of the for loops has been maintained. Using comprehensions for nested for loops in this way becomes a bit unintuitive. Past a certain level of complexity it is more difficult to understand than the same code would be using a for loop.

Just because you can achieve something in a one line comprehension using multiple for and if clauses doesn't mean that it's a good idea.

Iterating with Generators

A generator is a special type of Python function that returns a lazy iterator. You can loop over these results without the whole list of them having to be stored in memory.

For a concrete example take this function that generates the Fibonacci sequence of numbers:

def fib():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

You might then call it with code like this:

def print_fibonacci(max_size):
    for number in fib():
        if number < max_size:
            print(number)
        else:
            break

So what's going on here? From the point of view of the calling function, fib() just looks like something it can iterate over. When fib() is first invoked it executes as normal until it hits the yield b statement. It then returns control to its caller along with the value, just like return b would do. However, fib() has only been suspended, and all its state has been saved. When the iterator it returns is queried again it resumes from where it left off instead of starting again from the beginning like a normal function.
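
You can watch this suspend-and-resume behaviour directly by driving the generator by hand with next() (a quick sketch):

fibs = fib()
next(fibs)  # 1 - runs until the first yield, then pauses
next(fibs)  # 1 - resumes after the yield, with a and b preserved
next(fibs)  # 2
next(fibs)  # 3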

The other way of defining generators in Python is using generator expressions. These use the same syntax as comprehensions, but are wrapped in round brackets:

cubes = (i**3 for i in range(20))

These return an iterator rather than a collection. If the iterator is exhausted and you ask it for another item it will raise a StopIteration exception.
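
For example, a generator expression over a short range runs out after a couple of calls to next():

short = (i**3 for i in range(2))
next(short)  # 0
next(short)  # 1
next(short)  # raises StopIteration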

In fact a comprehension behaves like wrapping a generator expression in a call to list() or set():

cubes = list(i**3 for i in range(20))

Note that when a generator expression is the only argument to a function call, Python allows you to omit its own pair of brackets (as in the example just above, where we are making a call to list()).

One tip is that functions like any, all and sum allow you to pass in a generator rather than a collection. This may lead to performance improvements due to the lazy evaluation (i.e. the function may be able to return before evaluating all of the elements from the generator).
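
For example, summing the first twenty cubes with a generator expression never builds the intermediate list (and, as noted above, the extra pair of brackets can be dropped):

total = sum(i**3 for i in range(20))  # 36100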

The primary benefit of generators comes from the aforementioned lazy evaluation. This can make the memory overhead of a generator much lower than creating the whole collection at once and then iterating over it.
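
A rough way to see this is to compare the size of a list with the size of the equivalent generator, which only stores its current state (a sketch; the exact numbers vary by Python version and platform):

import sys

as_list = [i**3 for i in range(100_000)]
as_generator = (i**3 for i in range(100_000))

print(sys.getsizeof(as_list))       # hundreds of kilobytes
print(sys.getsizeof(as_generator))  # a couple of hundred bytes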

Take the example of reading lines from a file. If the file is large then reading the whole of it into memory may be extremely slow or impossible. The file object returned by open() is already a lazy iterator over its lines, so it can be used like this:

def read_lines(file):
    with open(file) as f:
        for line in f:
            print(line)

Useful types of collection from the collections module

An important way to improve your collections code is to make use of the specialized types in the standard library's collections module. Here are two of the handiest:

defaultdict

This is a subclass of dictionary that calls a factory function to supply missing values.

This lets you replace this:

fruit_sizes = {}
for fruit in fridge:
    if fruit.name in fruit_sizes:
        fruit_sizes[fruit.name].append(fruit.size)
    else:
        fruit_sizes[fruit.name] = [fruit.size]

with this:

fruit_sizes = defaultdict(list)
for fruit in fridge:
    fruit_sizes[fruit.name].append(fruit.size)

Here any missing values are initialized to empty lists. This avoids cluttering up the code with checks and initializations.

Other common factory arguments are int, float, dict and set, which create defaultdicts with those types as the default values.
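
For example, defaultdict(int) gives you a simple tally, because missing values start at 0 (a minimal sketch):

from collections import defaultdict

word_counts = defaultdict(int)
for word in ["apple", "banana", "apple"]:
    word_counts[word] += 1

# word_counts is now defaultdict(<class 'int'>, {'apple': 2, 'banana': 1})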

To create one which sets the default to anything you like you can use a lambda function. For example, if every student is assumed to have a small green hat you would use:

student_hats = defaultdict(lambda: Hat("green", "small"))
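
Looking up any student then creates and returns that default hat, even for a key that has never been seen before (continuing the hypothetical Hat example, with a made-up student name):

student_hats["alice"]  # Hat("green", "small"), created on first access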

Counter

Need to count occurrences of multiple different hashable objects? A Counter is your friend. It is a dictionary subclass where the elements are the keys and their counts are the values.

When creating or updating the counter you give it an iterable. It iterates through it and counts the occurrences of each item, grouping items that compare equal.

An easy way to get started on understanding Counter is to count all of the occurrences of strings in a list.

counts = Counter(["Fred", "Samantha", "Jean-Claude", "Samantha"])

counts then looks like this:

Counter({"Samantha": 2, "Fred": 1, "Jean-Claude": 1})

You can then update the Counter with further iterables, or other instances of Counter.
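
For example, update() accepts any iterable (or another Counter), and most_common() returns the highest counts first:

counts.update(["Fred", "Fred"])
counts.most_common(1)  # [('Fred', 3)]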

Useful functions for collection handling

Once you have a collection or a generator the next stage is to do something useful with it. Python contains many useful functions for collection handling that can improve how you write your code.

Using any() and all()

A common pattern to establish something about a collection is the following:

any_green = False
for hat in hats:
    if hat.colour == "green":
        any_green = True
        break

This can be replaced with a call to any() using a generator:

any_green = any(hat.colour == "green" for hat in hats)

This returns whether any item in the generator satisfies the condition, and will short-circuit the evaluation just as the above for loop does with the break statement.

all() behaves similarly, but returns whether all items satisfy the condition.
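
Using the same hypothetical hats, that looks like:

all_green = all(hat.colour == "green" for hat in hats)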

Iterating over collections together with zip()

You often find yourself with multiple collections which you need to iterate over, collecting data together from each one.

for name, email in zip(names, emails):
    create_person(name, email)

This steps through both collections at the same time, returning a tuple consisting of elements from each one. Here we have unpacked the tuple into separate values. zip() can take as many collections as you like, but will stop when the shortest one is exhausted. If you'd like to deal with all values you can use itertools.zip_longest(*iterables, fillvalue=None). This will carry on until all collections are exhausted, filling in missing values with the fillvalue.
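
A short sketch of zip_longest(), assuming there are more names than emails:

from itertools import zip_longest

for name, email in zip_longest(names, emails):
    create_person(name, email)  # email is None once emails runs out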

One neat use of the zip() function is to iterate over pairs of elements in the same collection. For example:

differences = [next_elt - elt for elt, next_elt in zip(items, items[1:])]
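
For example, with items = [1, 4, 9, 16] this gives [3, 5, 7].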

Chaining over multiple collections

If you need to iterate over multiple collections one at a time, you can use itertools.chain().

for name in itertools.chain(first_name_list, second_name_list):
    create_person(name)

This iterates through the first collection until it is exhausted, then moves on to the next and so on.
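
A close relative is itertools.chain.from_iterable(), which takes a single iterable of iterables, so the flattening example from earlier could also be written as:

flat = list(itertools.chain.from_iterable(list_of_lists))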

There are lots more interesting and advanced functions in the itertools module, and we will return to them in a further blog post.
