A guide to comprehensions, generators and useful functions and classes
Feb 18, 2020
One of Python's best features is that it has awesome capabilities for creating and handling collections. Building a good understanding of these will help you write clean Pythonic code.
To get you started this guide covers comprehensions, generators, and some really useful built-in functions and collection types.
List, dictionary and set comprehensions are a powerful feature of the Python language that enable you to declare these collection types and fill them with values at the same time. This lets you do all sorts of filtering and mapping operations in a simple and readable way.
Without comprehensions a common pattern used to create a list and fill it with values is as follows:
cubes = []
for i in range(20):
    cubes.append(i**3)
Each operation is carried out separately - an assignment to create the empty
list, and then a for loop iterating over a sequence within which the list is
appended to. We end up with a list of cubed numbers sitting in the cubes
variable.
This can be shortened to the following using a list comprehension:
cubes = [i**3 for i in range(20)]
Let's unpack what's going on. At first glance it can be confusing, with lots of the elements from the for loop version moved into different places in the code.
We know the purpose of the code is to create a list of cubed numbers. We therefore have an assignment using square brackets to signify that we are creating a list. Inside the comprehension we can start with the second half: for i in range(20). This is a standard for loop, but instead of a loop body we put an expression in front of it giving the element that we will create at each iteration. In this case we have i**3, cubing the target variable i. The expression can be simple, or can involve any sort of function call or operation that you like. It does not even have to use the target variable.
We can also add an if clause at the end to filter the list:
odd_cubes = [i**3 for i in range(20) if i % 2 != 0]
Often squeezing code onto one line can make it more difficult to read. In the case of comprehensions all the elements that you need are nicely presented, and once you are used to the syntax it is actually more readable than the for loop version of the same code.
You can also directly create dictionaries using a dictionary comprehension. Here you set the keys and values at once using the following syntax:
squares = {x: x**2 for x in range(100)}
This creates a dictionary mapping numbers to their squared values. As you can see you set the keys and values while iterating over a sequence. Again you can use if clauses in your dictionary comprehensions in order to perform filtering at the same time.
Filtering would look like this:
even_squares = {x: x**2 for x in range(100) if x % 2 == 0}
Moving on, there are also set comprehensions, which create a set, so all values are unique.
import math

square_roots = {math.sqrt(number) for number in number_list}
Again these can be filtered with a trailing if clause. If instead you want to substitute a value rather than drop elements, you can use an if/else conditional expression, which moves to the front of the comprehension:
positive_square_roots = {
    math.sqrt(number) if number > 0 else 0 for number in number_list
}
Comprehensions can also replace nested for loops. Take the example of flattening a list of lists:
list_of_lists = [[1, 2], [3, 4]]
flat = []
for sub_list in list_of_lists:
    for element in sub_list:
        flat.append(element)
This produces [1, 2, 3, 4]. As a comprehension this turns into:
list_of_lists = [[1, 2], [3, 4]]
flat = [element for sub_list in list_of_lists for element in sub_list]
You can see that the ordering of the for loops has been maintained. Using comprehensions for nested for loops in this way becomes a bit unintuitive. Past a certain level of complexity it is more difficult to understand than the same code would be using a for loop.
Just because you can achieve something in a one line comprehension using multiple for and if clauses doesn't mean that it's a good idea.
A generator is a special type of Python function that returns a lazy iterator. You can loop over these results without the whole list of them having to be stored in memory.
For a concrete example take this function that generates the Fibonacci sequence of numbers:
def fib():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b
You might then call it with code like this:
def print_fibonacci(max_size):
    for number in fib():
        if number < max_size:
            print(number)
        else:
            break
So what's going on here? From the point of view of the calling function, fib() just looks like a collection. When fib() is first invoked it executes as normal until it hits the yield b statement. It then returns control to its caller along with the value, just like return b would do. However, fib() has only been suspended, and all its state has been saved. When the iterator it returns is queried again it resumes from where it left off instead of starting again from the beginning like a normal function.
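You can watch this suspend-and-resume behaviour directly by driving the generator by hand with next():

```python
def fib():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

numbers = fib()
print(next(numbers))  # 1 - runs until the first yield
print(next(numbers))  # 1 - resumes after the yield, with a and b preserved
print(next(numbers))  # 2
print(next(numbers))  # 3
```

Each call to next() advances the generator by exactly one yield, which is all that a for loop is doing behind the scenes.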
The other way of defining generators in Python is using generator expressions. These use the same syntax as comprehensions, but are wrapped in round brackets:
cubes = (i**3 for i in range(20))
These return an iterator rather than a collection. If the iterator is exhausted and you ask it for another item it will raise a StopIteration exception.
In fact a comprehension behaves like wrapping a generator expression in a call to list() or set():
cubes = list(i**3 for i in range(20))
Note that when passing a generator expression to a function as its only argument, Python allows you to omit one of the pairs of brackets (as in the example just above - we are making a call to list()).
One tip is that functions like any(), all() and sum() allow you to pass in a generator rather than a collection. This may lead to performance improvements due to the lazy evaluation (i.e. the function may be able to return before evaluating all of the elements from the generator).
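For instance, any() stops pulling values from a generator as soon as it finds a match, so it can even finish over a generator that never ends:

```python
def naturals():
    # infinite generator: 1, 2, 3, ...
    n = 0
    while True:
        n += 1
        yield n

# any() returns as soon as it sees a truthy value, so this terminates
# even though naturals() itself never does.
has_multiple_of_seven = any(n % 7 == 0 for n in naturals())
print(has_multiple_of_seven)  # True

# sum() over a generator expression avoids building the full list in memory.
total = sum(i**2 for i in range(10))
print(total)  # 285
```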
The primary benefit of generators comes from the aforementioned lazy evaluation. This can make the memory overhead of a generator much lower than creating the whole collection at once and then iterating over it.
Take the example of reading lines from a file. If the file is large then reading the whole of it into memory may be extremely slow or impossible. The file object returned by open() is already a lazy iterator over lines, so it can be used like this:
def read_lines(file):
    with open(file) as f:
        for line in f:
            print(line)
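You can build your own lazy pipelines over a file in the same way. A small sketch of a generator that yields only the non-empty, stripped lines (using io.StringIO here as a stand-in for a real file object):

```python
import io

def stripped_lines(file_obj):
    # Yields one cleaned line at a time; the whole file is
    # never held in memory.
    for line in file_obj:
        line = line.strip()
        if line:
            yield line

fake_file = io.StringIO("first\n\n  second  \n")
for line in stripped_lines(fake_file):
    print(line)
# prints:
# first
# second
```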
The collections module
An important way to improve your collections code is to make use of the available built-in types. Here are two of the handiest:
defaultdict
This is a subclass of dictionary that calls a factory function to supply missing values.
This lets you replace this:
fruit_sizes = {}
for fruit in fridge:
    if fruit.name in fruit_sizes:
        fruit_sizes[fruit.name].append(fruit.size)
    else:
        fruit_sizes[fruit.name] = [fruit.size]
with this:
from collections import defaultdict

fruit_sizes = defaultdict(list)
for fruit in fridge:
    fruit_sizes[fruit.name].append(fruit.size)
Here any missing values are initialized to empty lists. This avoids cluttering up the code with checks and initializations.
Other common factory arguments are int, float, dict or set, giving defaultdicts of those types.
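As a quick sketch, defaultdict(int) supplies 0 for any missing key, which makes tallying occurrences trivial:

```python
from collections import defaultdict

letter_counts = defaultdict(int)
for letter in "banana":
    # missing keys are created with int() == 0, then incremented
    letter_counts[letter] += 1

print(dict(letter_counts))  # {'b': 1, 'a': 3, 'n': 2}
```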
To set the default to anything you like you can use a lambda function, since defaultdict expects a callable rather than a value. For example, if all students are assumed to start with a small green hat you would use:
student_hats = defaultdict(lambda: [Hat("green", "small")])
Counter
Need to count occurrences of multiple different hashable objects? A Counter is your friend. It is a dictionary subclass where the elements are the keys and their counts are the values.
When creating or updating the counter you give it an iterable. It iterates through and counts the occurrences of items, matching them by their hashes.
An easy way to get started on understanding Counter is to count all of the occurrences of strings in a list.
from collections import Counter

counts = Counter(["Fred", "Samantha", "Jean-Claude", "Samantha"])
counts then looks like this:
Counter({"Samantha": 2, "Fred": 1, "Jean-Claude": 1})
You can then update the Counter with further iterables, or with other instances of Counter.
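A short sketch of updating a Counter and then asking for the most frequent items with most_common():

```python
from collections import Counter

counts = Counter(["Fred", "Samantha", "Jean-Claude", "Samantha"])
counts.update(["Fred", "Fred"])  # add further occurrences from any iterable

print(counts.most_common(2))  # [('Fred', 3), ('Samantha', 2)]
```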
Once you have a collection or a generator the next stage is to do something useful with it. Python contains many useful functions for collection handling that can improve how you write your code.
any() and all()
A common pattern to establish something about a collection is the following:
any_green = False
for hat in hats:
    if hat.colour == "green":
        any_green = True
        break
This can be replaced with a call to any() using a generator:
any_green = any(hat.colour == "green" for hat in hats)
This returns whether any item in the generator satisfies the condition, and will short-circuit the evaluation just as the above for loop does with the break statement.
all() behaves similarly, but returns whether all items satisfy the condition.
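A quick sketch of both, over a hypothetical list of hat colours:

```python
hat_colours = ["green", "red", "green"]

any_green = any(colour == "green" for colour in hat_colours)
all_green = all(colour == "green" for colour in hat_colours)

print(any_green)  # True
print(all_green)  # False
```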
zip()
You often find yourself with multiple collections which you need to iterate over, collecting data together from each one.
for name, email in zip(names, emails):
    create_person(name, email)
This steps through both collections at the same time, returning a tuple consisting of elements from each one. Here we have unpacked the tuple into separate values. zip() can take as many collections as you like, but will stop when the shortest one is exhausted. If you'd like to deal with all values you can use itertools.zip_longest(*iterables, fillvalue=None). This will carry on until all collections are exhausted, filling in missing values with the fillvalue.
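For example, with one list longer than the other (example data assumed here), zip_longest pads the shorter input with the fillvalue:

```python
from itertools import zip_longest

names = ["Ada", "Grace", "Alan"]
emails = ["ada@example.com", "grace@example.com"]

# zip() would stop after two pairs; zip_longest keeps going.
pairs = list(zip_longest(names, emails, fillvalue="unknown"))
print(pairs)
# [('Ada', 'ada@example.com'), ('Grace', 'grace@example.com'), ('Alan', 'unknown')]
```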
One neat use of the zip() function is to iterate over pairs of adjacent elements in the same collection. For example:
differences = [next_elt - elt for elt, next_elt in zip(items, items[1:])]
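With concrete numbers this looks like:

```python
items = [1, 4, 9, 16]

# zip pairs each element with its successor: (1, 4), (4, 9), (9, 16)
differences = [next_elt - elt for elt, next_elt in zip(items, items[1:])]
print(differences)  # [3, 5, 7]
```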
If you need to iterate over multiple collections one at a time, you can use itertools.chain().
import itertools

for name in itertools.chain(first_name_list, second_name_list):
    create_person(name)
This iterates through the first collection until it is exhausted, then moves on to the next and so on.
There are lots more interesting and advanced functions in the itertools module, and we will return to them in a future blog post.