Humans operate by heuristics. Much of our daily lives are driven by heuristics, where we can infer from past experience how a current situation will most likely play out. Instead of having to consciously think about, say, calling the elevator, we already know what that button does, so we waste no conscious thought on this little action. Instead, we can reserve our mental capacities to new and unseen problems. In the social sciences, this has been referred to as the “System I/System II” approach. You have likely already heard of Daniel Kahneman’s book “Thinking, fast and slow.”
One thing we rarely think about, however, is that every heuristic is always also an assumption. And in science, assumptions are risky decisions. The reason I am writing all of this – in case you wondered (which you probably did, right…?) – is because recently I had the unfortunate experience of a heuristic turning into a wrong assumption.
Background: Reproduction
First, some background. For an R&R, I recently wanted to reproduce my main results using a different data set. As a reminder, I work with U.S. Congress, and so the data was obviously additional congressional floor speeches. I essentially had to download some new data for more recent congresses, and mold it such that it looked somewhat like my existing data. Then, all it took was copying over my old code, checking that it worked, and using it to run the same analysis on the new data.
In this context, I experienced both a heuristic (which was clever), and an assumption (which of course did not hold true). So, let me share my experience.
Heuristic
First the heuristic. One issue I had was that the new data did not have any references to Congresses, but instead actual dates. My old data had references to the congress in which a speech was held, so I somehow needed to get that for the new data, too. I could have indeed extracted the exact start and end dates of each Congress, written some logic to check if a speech’s date lies between a congress’s start and end dates, and called it a day.
That would have been perfectly precise. But also a headache to implement, because dates can be messy. So instead I decided to go the lazy route and use a heuristic. To check which congress a speech was held in, I came up with this genius contraption:
from math import ceil

def year_to_cong(year: int) -> int:
    return ceil((year - 1788) / 2)
What?! Well, this function is a heuristic turned into code. Each congress is two years long, and the first congress convened in 1789. So, if we want to know which congress a speech on November 11, 2011, was held in, we just take the year (2011), subtract 1789–1 from it (223), divide it by 2 (111.5) and round it up (112). A quick check tells us that a speech held on Nov. 11 of that year would have indeed fallen in the 112th U.S. Congress. Why minus 1, you ask? Well, if we wanted to know the congress of the year 1789, just using 1789 would give us $\frac{1789-1789}{2} = \frac{0}{2} = 0$, which is incorrect. However, using one less gives us $\frac{1789 - 1788}{2} = \frac{1}{2} = 0.5$, which, when rounded up, is $1$. A neat side effect of this is that the second year of that congress, 1790, will give us $\frac{1790-1788}{2} = \frac{2}{2} = 1$, which remains $1$ even after rounding up.
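A quick sanity check of the heuristic (restating the function so the snippet is self-contained):

```python
from math import ceil

def year_to_cong(year: int) -> int:
    # heuristic: each Congress spans two years, the first convened in 1789
    return ceil((year - 1788) / 2)

print(year_to_cong(1789))  # 1   (the very first Congress)
print(year_to_cong(1790))  # 1   (still the first Congress)
print(year_to_cong(2011))  # 112
print(year_to_cong(2025))  # 119 (the current one)
```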
Now, if you know anything about U.S. Congress, you may now have thought to yourself “Wait a moment, that doesn’t work out! Congress doesn’t convene on January 1st! Also, this has changed over time! Until the early 20th century, congresses started in March, a whole quarter year in!”
And yeah, that’s correct! And that’s what makes this a heuristic: I can rely on two additional pieces of context. First, the new data only encompasses congresses 111 through 119 (the current one), and these all convened on January 3rd. This means that at most two days will be “wrong.” And second, this is only for an interesting tidbit, not for the main results. Which means it does not have to be perfect.
In this particular instance, this is a perfectly reasonable assumption to make.
Which brings me to an important insight: Every heuristic is also an assumption. If we were talking about the main results of this paper, the reviewers would be absolutely correct to slap my wrist for doing something like this. But since I knew I wasn’t going to lose a quarter of a year with this assumption, but rather only 2 days of an entire congress (which is $\frac{2}{365 \times 2} \approx 0.003 = 0.3\%$), it felt like a reasonable tradeoff to make.
But as with any assumption, these can go wrong.
Assumptions
After I verified several times that the shape of my new data was the same as that of my old data, I copied over my old analysis code, only adjusting it for the newer dates I was dealing with. By keeping the code as close to the original as possible, I wanted to ensure that the steps performed on the new data were the same as those performed on the old. And it all worked perfectly!
…until it didn’t.
You see, I was relying on the numpy library for a lot of numerical calculations in the analysis. And numpy is nice enough to let you know if there is any division by zero.¹ And this happened in my code. But there was also a different warning, one that I hadn’t seen before: “overflow encountered in scalar multiply.” Wait, what?
I dug a bit deeper, but could not find any scalar multiplication, so I dismissed it, because – again – the code itself ran just fine. But then I checked the results, and I had impossible numbers that could not have been right. So I had to investigate.
After several hours of checking every line of my code, I was almost ready to give up, until I tried something that made these warnings disappear, and that fixed the results.
But I wanted to understand what happened. So let’s unwrap this.
The Malicious Code
I had a series of vectors with whole numbers (integers), and the central operation was to calculate cosine similarity on them. So my code looked essentially like this:
import numpy as np
from scipy.spatial import distance

vec1 = np.array([0, 1, 1, 0, 2], dtype=np.uint16)
vec2 = np.array([1, 0, 1, 2, 1], dtype=np.uint16)
cos = 1 - distance.cosine(vec1, vec2)
If you have encountered this issue before, you may already see the problem. To give you a hint, if you want to try to solve it yourself, here’s the solution that fixed the issue for me:
vec1 = np.array([0, 1, 1, 0, 2], dtype=np.uint16)
vec2 = np.array([1, 0, 1, 2, 1], dtype=np.uint16)
cos = 1 - distance.cosine(vec1.tolist(), vec2.tolist())
If you still wonder what is wrong with this, let me explain:
I instantiate the vectors with the data type uint16, that is, an unsigned 16-bit integer. I do so because I don’t need more. 16 bits are sufficient to store numbers from 0 to 65,535. This choice rested on the assumption that no element of these vectors could become negative (hence the “unsigned”), and that they would stay well below the 65,535 limit.
And these two assumptions were, indeed, right. So nothing wrong here.
Next, these vectors get passed to a cosine distance calculation. This function calculates the distance between two vectors; for non-negative vectors like mine, it ranges from 0 (they point in the same direction) to 1 (the vectors are orthogonal to each other). Since I needed the cosine similarity instead, I simply subtracted the distance from 1, which turns the distance into a similarity.
And this code is also correct. This is why it took me so long to find the error: There is nothing wrong with this code.
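For the record, the numbers themselves also check out. For the two example vectors, the similarity can be computed by hand, mirroring the textbook formula $\cos = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$ (a sketch using only numpy, with the vectors explicitly in float64):

```python
import numpy as np

u = np.array([0, 1, 1, 0, 2], dtype=np.float64)
v = np.array([1, 0, 1, 2, 1], dtype=np.float64)

# cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(float(cos_sim), 4))  # 0.4629, i.e. 3 / sqrt(6 * 7)
```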
Understanding Memory Layouts
Instead, the issue lies somewhere else. To understand this, we have to properly understand how numpy works, because it differs quite tremendously from how Python works. Python does not have many types of numbers. Essentially, it knows the difference between integers and floating point numbers, but that’s mostly it. (There are some nuances, but these are not important here.) numpy, however, knows a lot more types of numbers. It knows signed and unsigned numbers, and various widths of these numbers (that is, how large they can be), ranging from 8 to 128 bits.
This is not because it would be mathematically necessary, but rather due to technical reasons. You see, math has infinite precision, but computers have to represent every number in a finite number of binary digits – zeros and ones. Furthermore, numpy aims to be a very fast library, because it is intended to work with large amounts of data, including the kind of numerical workloads that underpin modern machine learning. And once you need that kind of number crunching to run reasonably fast, you need to start thinking heavily about optimization.
One of the most fundamental optimizations you can make is to write your code in a language that compiles down to machine code. This is why numpy is actually not written in Python, but mostly in C, with its linear algebra backed by a numerical library such as OpenBLAS. This makes the code much, much faster than if it just used regular Python code.
A second optimization is to perform calculations not one by one, but many at a time, in parallel – so-called vectorization. This is one of the central reasons numpy-style numerical code runs so fast on ordinary personal computers. But how does this work in detail? Well, I’m glad you asked!
Here is (a simplified version of) how numpy will store the array [0, 1, 1, 0, 2] in memory:
00000000 00000000
00000000 00000001
00000000 00000001
00000000 00000000
00000000 00000010
You will notice that each row is one number and each row has exactly 16 bits. That’s what happens if you tell numpy to store some data as a uint16 (“unsigned 16-bit integer”). Here are the same numbers, but stored as IEEE 754 64-bit (double) floating point values, numpy’s default floating point type:
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000
00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
01000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Quite a bit different! First, you can clearly see that this representation of the same numbers uses four times as much space (320 bits instead of 80). And secondly, how the numbers are represented is wholly different. It does not matter too much how exactly this representation works, but if you are inclined, here is a demo that allows you to play around with some numbers.
What matters here is primarily that uint16 requires a quarter of the memory of float64 to store the same data – and that the two formats arrange their bits in entirely different ways.
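You can verify these sizes directly (a small sketch; `itemsize` is the bytes per element, `nbytes` the size of the whole buffer):

```python
import numpy as np

ints = np.array([0, 1, 1, 0, 2], dtype=np.uint16)
floats = np.array([0, 1, 1, 0, 2], dtype=np.float64)

print(ints.itemsize, ints.nbytes)      # 2 10  -> 16 bits per number, 80 bits total
print(floats.itemsize, floats.nbytes)  # 8 40  -> 64 bits per number, 320 bits total
```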
The False Assumption: numpy performs type checking
With that context, here is finally what goes wrong with the code above:
When we pass two uint16 arrays into the cosine distance calculation, at some point the function computes dot products: the dot product of the two vectors, and the dot product of each vector with itself (needed for the norms). And this is where things go wrong, because a dot product multiplies scalar values with each other and sums them up. For example, $3 * 4 = 12$. After that step follows a division.
Here is the catch: numpy performs this arithmetic in the data type of the input arrays. Our arrays are uint16, so every intermediate result also has to fit into 16 bits. The moment a product – or the product of the two squared norms, which the function multiplies together before dividing – exceeds 65,535, the value silently wraps around modulo 65,536. That is integer overflow, and it caused the first warning: scalar overflow.
Then, when the library tried to divide by such a wrapped-around Frankennumber, it could very well be that the denominator had wrapped around to exactly zero – and that is the textbook definition of a division by zero, which is undefined. That led to the second runtime warning.
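The overflow itself is easy to reproduce in isolation: multiplying two uint16 scalars whose product exceeds 65,535 triggers exactly this warning (a sketch of the mechanism, not of scipy's internals):

```python
import numpy as np

a = np.uint16(60000)
b = np.uint16(2)

# 60000 * 2 = 120000, which does not fit into 16 bits; numpy emits a
# RuntimeWarning and wraps around modulo 65536: 120000 - 65536 = 54464
print(int(a * b))  # 54464
```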
Now the next question is: why did all of this get solved once I called the tolist function of each numpy array before passing them to the function? Well, because that function turns the numpy array into a regular Python list of plain Python numbers. For integers, this conversion is lossless – Python integers have arbitrary precision and grow as large as they need to. (It would only be lossy the other way around, for exotic float types wider than Python’s native 64-bit double.) And in this case it worked, because it removed an assumption from my code.
Essentially, numpy tries to be fast, and as such it relies on the user to ensure everything is correct.² numpy itself assumes quite a lot of things, and if you – the user – trust numpy a bit too much (which I did), it will fail you. If numpy actually checked every individual array, this would severely slow down the operations, and would make things such as locally running AI almost impossible. Thus, it foregoes the type checking and just assumes that, if you give it a numpy array, it will “just work,” because it hopes that you have done the thinking, and it can just start churning numbers.
But in my case, that was wrong. Once I gave numpy a regular Python list, I essentially forced it to convert that list back into a numpy array so that it could do its magic. And by default, numpy infers a 64-bit type from a list of plain Python numbers – a 64-bit integer for whole numbers, a 64-bit double for fractional ones. Either way, 64 bits leave more than enough room for these dot products. That’s what fixed the code.
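A small sketch of what numpy does with a plain Python list (the exact integer width can vary by platform, but on a typical 64-bit system it is 64 bits):

```python
import numpy as np

a = np.asarray([0, 1, 1, 0, 2])  # built from plain Python ints
print(a.dtype)                   # a 64-bit integer type on most platforms
print(int(np.dot(a, a)))         # 6 -- no chance of wrapping around here
```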
Investigating Further
Finally, why did this become a problem for my data reproduction step, but was not an issue while I was calculating my main results? Well, for the main analysis I actually saved down the data to CSV files and then loaded them in after the fact. And, since at one point I also did try to work with floating point arrays, not simple integers, here’s the line of code that loads the arrays from disk before passing them to the cosine function:
vector = np.array([float(l) for l in loadings], dtype=np.float64)
Do you spot it? Right: I loaded the data from disk and cast it directly to float64. This works with both integers (because there’s enough space for them) and floating point values. If you pass an array of these types of numbers into the cosine function, the memory layout of the numbers is as expected and nothing goes wrong.
Here’s my mistake: For the new data, since I already had the code, I decided never to save down the data in between steps, because it’s a much smaller dataset. And as such, I decided in my infinite wisdom to use uint16 because it saves on memory. Well, now you see what I got from trying to be smart!
Lessons Learned
So, what do we learn from this? I personally have learned quite a few things:
- Optimizing code too heavily is usually dangerous. In my example, using 64-bit floats would not have dramatically increased my memory consumption (after all, the data isn’t all that big), but because I was afraid it would, I violated an assumption of numpy.
- If you are working with an interpreted language such as Python or R, it is easy to forget about the intricacies of how everything is stored in memory. Your programming language typically does the heavy lifting for you and decides whether to store something as a float or an integer. But numpy isn’t Python; it is just very well hidden C code. Because it wasn’t obvious, I did not consider the memory layout (even though my explicit use of the dtype argument would have you assume I did).
- You will overlook the fine nuances between what your old code did and what the changes to your new code actually imply. That’s why it is always important to (a) check your results, and (b) trust your gut. If your gut tells you that some numbers can’t be right, do not give up until you find the issue.
- Pay attention to errors. If there is an error – or even just a warning – and you’re dealing with scientific code, do not dismiss it. It will come back to bite you.
- Raise errors where errors are due! This one goes out to the developers who decided that integer overflows and divisions by zero should not raise an exception by default (forcing you to fix them before your code runs). What were you thinking? It is very easy for humans to ignore warnings, but very hard to ignore errors. A division by zero is by definition undefined. It is impossible to return a value in that case, not even inf. If you still do, you violate mathematical rules. By making this a mere warning and returning a nonsensical number, you endanger the trust users have in your library. Indeed, the first thing I did after understanding the issue was go back to my old analysis code to see if it would now also raise these warnings. Fortunately, it didn’t, and I found the reason why. But I actually did continue to use my wrong data until the next step of the analysis in R raised an exception that forced me to finally go back and fix the warnings I had ignored at first. Ignoring warnings is indeed convenient.
- And, lastly: data is invisible. When you write code, you have the delicate task of writing explicit statements that deal with data flowing through them, usually without your observing it happen. You have to trust your abilities and do a lot of debugging to make your code work. And this can lead to wrong assumptions, as in this case. So always try to remember: your data is invisible in your code.
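Incidentally, numpy can grant that wish about raising errors: its error-handling settings let you promote these warnings to hard exceptions. A minimal sketch:

```python
import numpy as np

# promote overflow and division-by-zero warnings to exceptions
with np.errstate(over="raise", divide="raise"):
    try:
        np.uint16(60000) * np.uint16(2)  # would silently wrap without errstate
    except FloatingPointError as exc:
        print(f"caught: {exc}")
```

Setting this once at the top of an analysis script is a cheap way to make sure numbers like mine never survive long enough to reach a paper.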
Let’s see how many weeks that insight lasts until I make this mistake again ;)
Thanks for reading!
1. Nota bene: Who decided that this should only print a warning and return some default value, instead of raising an exception?! In what world does a division by zero warrant a little wave of the finger and a return value of anything other than undefined (or None, in the case of Python)? ↩
2. This is sometimes referred to as a “footgun,” because it allows you to conveniently shoot yourself in the foot. ↩