There are possibly as many programming languages out there as there are presidential candidates in the early stages of U.S. elections. All of them have been developed with a reason – some have been tweaked to specific purposes, such as the scientific languages R, Julia, or Fortran; some have been developed as improvements over previous attempts, such as C++, or TypeScript. And while some serve only the purpose of general amusement, such as Rockstar (if you want to have a laugh, I recommend this great video by its creator, Dylan Beattie), some have become the foundation of whole scientific fields, such as Python.
Naturally, all of these languages solve certain problems differently. One of these problems is how to deal with repetitive tasks. “Repetitive tasks?,” you might wonder now. Yes, I am talking about functions. Every programming language has a concept of a function that takes some input, does something with it, and returns an output. Take the following example:
    def read_file(path):
        with open(path, 'r') as fp:
            return fp.readlines()
This function simply takes a file path, opens it, and returns a list of lines in that file. This works fine, and there is nothing to worry about. But now, imagine you don’t have small files, but actually files that are several Gigabytes large. Then you have a problem.
The Problem: Memory and Big Data
Imagine a file that contains 4GB of text. This is an amount that we regularly see in machine learning tasks. And now imagine that your computer has the (current) default of 8GB of RAM. And now imagine you forgot to shut down Google Chrome and run this function.
What will happen is that Python will open that file and read it in its entirety. And when it has finished, it will hand you a list of all the lines in that file, possibly smiling at you innocently.
Now, what will happen on any computer where you have about 2GB of RAM left and attempt to push 4GB of data into it? Exactly: Your operating system will begin to move some of the contents of that RAM onto your hard drive so that there is enough space for your big file (this is what the swap partition on Linux systems is for. Windows has a similar concept, but worse, obviously). And this will cause the whole function to take a lot of time, because your operating system needs to write data from the fast RAM onto the not-so-fast hard drive. And if you are dealing with a lot of expensive computations anyway, you don’t want your very first step in the analysis to become the computational equivalent of super glue, right?
So what should we do? The simplest idea is: Just don’t read in the whole file at once, and try to keep all the RAM requirements for your whole code below that critical mass that would crash your Chrome tab number 42 and, subsequently, because we know that beast of a browser, your whole computer.
Introducing the Concept of Streams
“Yeah, sure, don’t read the full file. But we need the full file!,” you might think now. And yes, over the course of our data analysis, there is a very high chance that we need the whole file. But not at once. Rather, we want one line at a time.
Luckily, this problem has been solved many times before us. A file is basically like a stack of paper sheets. If you need all of them, you will, one after another, take a piece of paper from that stack, work with it, and put it somewhere else, until the whole stack has been worked through. If you open a file, you don’t actually read the file, but rather tell your computer “Hey, I want to use that file!” And in a second step, you then read the file itself. And you can do this using streams.
A stream is basically the computational equivalent of a bureaucrat sifting through piles of forms, processing one at a time. In terms of file parsing, this means that the stream will give you back one part of a file at a time – a line, for instance (or a certain amount of characters in it). Under the hood, Python’s file reader does the same: It opens the file, reads one line, then the next, and so forth.
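To make the "one part at a time" idea concrete, here is a minimal sketch. It assumes an in-memory `io.StringIO` object as a stand-in for a real file, so that it runs without any file on disk; a file opened with `open()` offers the same `readline()` and `read()` methods.

```python
import io

# io.StringIO stands in for a real file so the sketch is self-contained.
stream = io.StringIO("first line\nsecond line\nthird line\n")

first = stream.readline()         # reads up to and including the first newline
some_chars = stream.read(6)       # reads only the next 6 characters
rest_of_line = stream.readline()  # continues where the last read stopped
```

The stream remembers its position between calls, which is why the third call picks up in the middle of the second line.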
However, if you call readlines(), this will tell Python “give me that full file! Now! Entirely!” So what do we need to do? The answer is to use a generator.
A generator is also a function, but one that only processes a small piece of data at once, returning that piece to you and waiting for you to request the next piece of data.
The file object itself (the one that you get when you call open()) behaves like a generator: strictly speaking it is an iterator, but it also hands out one line at a time, on request. And we can make use of this to turn our file reader into a generator itself! We just need to change two lines of code. Look at the following example:
    def read_file(path):
        with open(path, 'r') as fp:
            for line in fp:
                yield line
Note: The examples in this post are written for Python 3. Iterating over a file object line by line also works in Python 2, but there the method behind it is called next() instead of __next__(). So don’t be mad at me if your Python barks at you.
What has changed compared to our example from the beginning is that we are no longer returning a list containing all the lines at once, but rather one line at a time. (Note the yield keyword. It’s basically the same as return, but better. Bear with me.)
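A minimal sketch of what "one piece at a time" looks like when you ask for the pieces by hand. The function `count_up_to` is a made-up example, not part of the file reader; the built-in `next()` requests the next value from a generator.

```python
def count_up_to(limit):
    """Hypothetical example generator: yields 1, 2, ..., limit."""
    number = 1
    while number <= limit:
        yield number
        number += 1

counter = count_up_to(3)   # no code inside the function has run yet
first = next(counter)      # runs the function up to the first yield
second = next(counter)     # resumes after the yield, runs to the next one
remaining = list(counter)  # collects whatever is left
```

Calling `count_up_to(3)` does not compute anything by itself; the values are only produced when something asks for them.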
The good thing is that the for loop is pretty good at dealing with generators, because a generator is iterable. That means you can iterate over a generator just like you can iterate over a list. You could use the above function like this:
    def process_lines(file_path):
        for line in read_file(file_path):
            words_in_line = line.split(' ')
            # Do something with the words here
            yield words_in_line
But, to go one step further, you really want to do something with the lines, right? For example, feed them to some classifier to train it for some task. Imagine the following code:
    def make_batch(file_path):
        batch_size = 25
        batch = []
        for word_list in process_lines(file_path):
            batch.append(word_list)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # Hand out the last, possibly smaller batch
        if len(batch) > 0:
            yield batch

    def train_model():
        training_data = '/path/to/data.txt'
        model = CreateClassifier()
        for batch in make_batch(training_data):
            model.train(batch)
        # At this point you have a trained model
The function make_batch creates an empty list, to which it appends lists of words until it reaches the size batch_size. Then it gives you that whole batch and starts a new, empty one. The classifier is then trained on just one batch at a time.
Main Takeaway: This means that, at any time while running the program, you will have at most batch_size sentences of the big file in memory! The memory usage is generally determined by the one part of your program that “keeps” the most data in memory. In our case this is make_batch(), which fills a list up to a certain point and only returns the whole list once that point is reached. All other functions only return smaller pieces of data. If you now increased batch_size to equal the number of lines in the big file, you would end up where we started: After make_batch() is done with one (that is: the only) batch, Python will have put 4GB of data in your computer’s memory, and you will know that because Google Chrome will be very angry with you.
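To see the batch-size bound in action, here is a small, self-contained sketch. It is a simplified variant of the functions above, not the original code: `read_lines` and `make_batches` are illustrative names, the batch size is passed as a parameter, and an `io.StringIO` object with seven short lines stands in for the big file.

```python
import io

def read_lines(stream):
    for line in stream:
        yield line

def make_batches(lines, batch_size):
    batch = []
    for line in lines:
        batch.append(line.split())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller batch
        yield batch

# Seven lines of fake data instead of a 4GB file.
stream = io.StringIO("".join(f"word{i} again{i}\n" for i in range(7)))

# Only the size of each batch is kept, not the batches themselves.
batch_sizes = [len(batch) for batch in make_batches(read_lines(stream), batch_size=3)]
```

No batch ever holds more than three lines, no matter how long the input is.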
To understand it further, here is how this program will run:
- You call make_batch() and request one batch of training data.
- make_batch() will then call process_lines() (in our example) 25 times, retrieving 25 processed lines.
- process_lines() will, for every line of those 25 we request, call read_file().
- read_file() will, for every requested line, call fp, which is, as we remember, the “original” generator.
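The order of these calls can be made visible with a small tracing sketch. The names `read_items`, `process_items`, and `events` are made up for this illustration, and a plain list stands in for the file.

```python
events = []

def read_items(items):
    for item in items:
        events.append(f"read {item}")
        yield item

def process_items(items):
    for item in read_items(items):
        events.append(f"process {item}")
        yield item.upper()

pipeline = process_items(["a", "b"])
events.append("pipeline created")  # nothing has been read at this point
first = next(pipeline)             # only now is the first item read and processed
events.append("got first result")
```

The second item is never touched, because nobody asked for it.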
You will have noticed that, instead of return, I have written yield. This is basically just a keyword that tells Python: “Instead of returning content and then leaving the function, return the content but keep the function as it is, and the next time I call the function, do not start all over, but rather continue where you were.”
In short, yield just tells Python to convert your function into a generator.
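A minimal sketch of the "continue where you were" behaviour. The function `running_total` is a hypothetical example: its local variable `total` survives between requests, which an ordinary function with return could not do.

```python
import types

def running_total(numbers):
    total = 0  # local state, kept alive between yields
    for number in numbers:
        total += number
        yield total

totals = running_total([5, 10, 20])
is_generator = isinstance(totals, types.GeneratorType)
a = next(totals)  # total is 5
b = next(totals)  # continues with total == 5 and adds 10
```

Strictly speaking, calling the function hands you a generator object, and it is that object you keep asking for the next value.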
And this involves a lot of magic. Are you ready for more?
Bonus Round: Magic Functions and What The F*** is Going On?!
I might have said it already in one of my earlier posts, but I don’t really like Python. The main reason is that I actually miss curly braces. Indentation with spaces also looks clean, but if you really want to know which lines belong to which loop, it just takes longer to figure this out than when you have curly braces. Because as soon as you see } you know that some expression, loop, or if/else statement is over. This has opened the gates for really bad coding habits, yeah, but I won’t let this argument count. Curly braces are much safer. Period. There, I said it.
Code Conversion: List Comprehension
Another reason why I don’t like Python is that it produces a lot of side effects that are invisible while writing your program. Python will, for instance, convert a lot of the stuff you have written into different code before running it. Let me show you two examples. The second one will be generators, but first, I want to tackle a concept called “list comprehension.”
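    # List comprehension means to write for-loops
    # on one single line.
    # This means, Python will treat this ...
    count = sum([1 for element in my_list])

    # ... roughly as if you had written this:
    def __function(some_list):
        new_list = []
        for element in some_list:
            new_list.append(1)
        return new_list

    count = sum(__function(my_list))

    # (Without the square brackets it would be a generator
    # expression, which produces the values one at a time
    # instead of building a list first.)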
As you can see: It’s much shorter to write (and there are probably a lot of optimisations going on under the hood), but it basically alters your code. And this requires you to understand why this works, if you really want to bring list comprehension to good use. It’s basically all about turning more code into less so that it fits on one line. Probably to justify why Python doesn’t use curly braces, right? RIGHT?!
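The difference between square brackets and round brackets is easy to miss, so here is a short sketch under the assumption of a tiny example list. Square brackets build a complete list, while round brackets create a generator that produces its values on request.

```python
import types

my_list = ["a", "b", "c"]

as_list = [1 for element in my_list]       # list comprehension
as_generator = (1 for element in my_list)  # generator expression

list_is_list = isinstance(as_list, list)
generator_is_generator = isinstance(as_generator, types.GeneratorType)

count_from_list = sum(as_list)
count_from_generator = sum(as_generator)
```

Both give the same count; the generator version just never holds all the ones in memory at the same time.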
And now the second example: generators. Take our example from above:
    def read_file(path):
        with open(path, 'r') as fp:
            for line in fp:
                yield line
Internally, Python treats this cute little function roughly like the following monstrosity that doesn’t need to hide behind Cthulhu:
    class read_file(object):
        def __init__(self, path):
            self.path = path
            self.fp = open(self.path, 'r')

        def __iter__(self):
            return self

        def __next__(self):
            try:
                line = self.fp.__next__()
                return line
            except StopIteration:
                raise StopIteration()
Note that the whole try/except block is not really necessary, but I included it here so that you understand it: As fp itself follows the same rules, it will behave like the generator that I have written here. You probably have a lot of questions right now. Let’s answer them one by one.
What are those Underscore-Functions?
First, this class features several functions that are surrounded by four underscores (two on each side): __init__, __iter__, and __next__. These are called Magic Methods. They are called “magic” because you normally don’t see them, but they are quite useful. For instance, a for-loop won’t work on everything; it only works on variables that are iterable. This means that, internally, a for-loop will take the variable and take a good look at it. If the variable is an object and exposes a function called __iter__, the for-loop will be happy and call it. Next, the for-loop will call the __next__ function, and write whatever comes out of this into your variable. So if you have a for loop like this:
    for element in iterator:
        print(element)
Python will call iterator.__next__() and put whatever this function returns into element. When there are no more elements to return, an iterator must raise an error (throw in other programming languages), namely the StopIteration exception. The for-loop will automatically catch this exception and simply stop looping.
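Written out by hand, the work of such a for-loop looks roughly like the following sketch. The built-ins `iter()` and `next()` are the official way to call `__iter__` and `__next__`; the list `elements` and the list `seen` are just example data.

```python
elements = ["x", "y", "z"]
seen = []

iterator = iter(elements)  # calls elements.__iter__()
while True:
    try:
        element = next(iterator)  # calls iterator.__next__()
    except StopIteration:
        break  # the for-loop does this silently for you
    seen.append(element)
```

This does the same as `for element in elements: seen.append(element)`, just with the conventions spelled out.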
Main Takeaway: Magic methods are called magic because they enable you to do stuff that will happen automatically. “Magic” functions are less magic and more simply conventions: The Python developers have decided that there is a bunch of those magic functions (not just the few you see here), and if they exist, Python will automatically call them for you. For instance, the
Okay, now how do Generators Work?
For this, you need to understand how classes work in general. If you do, then it should be relatively self-explanatory. First, such a generator class will take everything before the loop as internal variables. This is called the state of the generator object. In our case, it’s just the file pointer. Then, the generator class will create such a __next__ magic function and in that replace your yield with a return. Every time the __next__ function is called, it will read one line from your file pointer and return it.
Because your file pointer follows the same convention, it will raise an exception, StopIteration, as soon as it hits the end of the file. This is an additional convention. For instance, if we swallowed the StopIteration error instead of re-raising it, we would have created an endless generator, so the for-loop would never end.
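Here is a minimal sketch of these conventions in a class written by hand. `Countdown` and `NoIter` are hypothetical example classes: the first one implements both magic methods, the second one leaves out `__iter__` to show what the for-loop does in that case.

```python
class Countdown:
    def __init__(self, start):
        self.current = start  # the state of the iterator

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # tells the for-loop to stop
        value = self.current
        self.current -= 1
        return value

class NoIter:
    # Has __next__ but no __iter__, so it is not iterable.
    def __next__(self):
        return 1

values = [n for n in Countdown(3)]

try:
    for _ in NoIter():
        pass
    error = None
except TypeError as exc:
    error = str(exc)
```

The for-loop refuses to even start on `NoIter`, because without `__iter__` Python does not consider the object iterable.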
That all of this is working so flawlessly is just because of a lot of conventions and the magic methods. For instance, if you omitted the __iter__ method, the for-loop would give you an error that the object is not iterable (because it does not have this method).
But there is another reason why you might want to use generators: To create an endless stream of some data. Consider the following code:
    def get_id_number():
        number = -1
        while True:
            number += 1
            yield number
This is an endless generator, so it will never run out of values. This means: Do not use it in a for-loop (at least not without a break)! Rather, you could do the following:
    id_generator = get_id_number()

    do_something()

    if we_need_a_new_id:
        # next() calls id_generator.__next__() for us
        my_id = next(id_generator)
This way you have this one function, and it is guaranteed to give you a unique ID every time you call it. You could write other code as well, but it might be that such a generator is just an easier way of doing so.
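If you only want a limited number of values from an endless generator, the standard library can help. The following sketch repeats the generator from above so that it runs on its own, and uses `itertools.islice` to take the first five IDs. `itertools.count` is a ready-made endless counter that could replace the hand-written one.

```python
import itertools

def get_id_number():
    number = -1
    while True:
        number += 1
        yield number

# Take exactly five values, then stop asking for more.
first_five = list(itertools.islice(get_id_number(), 5))

# The standard library ships an endless counter as well.
builtin_ids = itertools.count()
first_builtin = next(builtin_ids)
```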
Note: In many programming languages, such an ID generator would not generate unique IDs forever. A counter stored in a signed 64-bit integer tops out at 9,223,372,036,854,775,807. This is quite large, but not infinite. As soon as the counter reaches this amount and you increase it by one, you get what is called an Integer Overflow, and the number wraps around (to zero, if you are lucky). Python’s integers have arbitrary precision, so number simply keeps growing until you run out of memory. (Python 2 silently switched from int to long once the value exceeded sys.maxint.)
So what do we learn? Generators can be pretty decent if you have to create or process an amount of data but don’t know how much you actually need, or have. If you don’t know how large your file is, or if you don’t know how many numbers you need, that’s a good sign a generator is for you. But don’t forget that a generator is basically just a neat wrapper around stuff you could also do by hand, if you really wanted to.
Nevertheless, generators allow you to think better about what your code is doing. For instance, I like to think of a generator as a well. A well has some amount of water in it, and every time you call __next__ you take a bucket of water out of it. But even though you don’t know how much water is in there, it’s a limited amount. And at some point, your well will be drained, that is, have no more water in it. The StopIteration exception is when you realise that there is nothing left.
And this metaphor of “draining the well” can be applied to generators as well. I’ll leave you today with a final piece of code that will “force-drain” your generator, meaning that it will pull every last drop of data out of your generator until it reaches the end:
    drainage = [element for element in generator]
The above statement will forcefully call __next__ again and again until your generator is completely empty and raises the StopIteration exception. And there you have it: All of the somewhat dubious concepts of the programming language that I introduced in this blog post (list comprehension and generators) in one single line!
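One consequence of draining is worth seeing once: a drained generator stays empty. The following sketch uses a made-up generator, `three_buckets`, and the built-in `list()`, which drains a generator just like the one-liner above.

```python
def three_buckets():
    """Hypothetical well with exactly three buckets of water."""
    yield "bucket 1"
    yield "bucket 2"
    yield "bucket 3"

generator = three_buckets()
drainage = list(generator)        # pulls out everything
second_attempt = list(generator)  # the well is already dry
```

If you need the data a second time, you have to create a fresh generator by calling the function again.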