Imagine I told you that earlier today I observed people at the checkout at my local supermarket. Two people were buying ten and twelve apples respectively, and a third person bought a bread. Just ignore for a second the urging question of why one would do that in the first place, and imagine that I now told you to write that info down for me. How would you go about this? You would probably create a table:

Person | Item | Amount |
---|---|---|

1 | Apple | 10 |

2 | Apple | 12 |

3 | Bread | 1 |

Tables are everywhere. We use them to represent data, calculate stuff, and we even used them to create websites back in the 90s. We are so adept at working with tables that we sometimes not even realize that we are in fact dealing with tables. One could even say: we are thinking in tables.

In this text I will be going to argue that this is a big problem, and it will become an even bigger problem going forward. After introducing why the representation of tables is so natural to us, I will outline a few instances in which this natural idea of representing data with table structures has turned into big problems. Then, I will guide you through abstracting away from a table representation into more suitable data structures depending on the task before concluding with some remarks about why we all might want to understand computers better before beginning to play with large amounts of data.

# Tables as Data Representation

Humans make sense of the world by thinking in symbolic terms. We do not think in essences or ontologies of things. When we see a stone on the street, we don’t care about whether it’s technically a metal, a crystal, or a salt. It looks rigid to us, we need a lot of force to break it, and if we were to throw it at someone, that would probably hurt them. If we are a geologist, however, we may in fact be interested in what that stone actually is (“in essence”). But since most of us are not, we just call it “stone”.

This example already points at a problem we are all facing: how do we describe something? With regard to the stone, it’s very easy for most of us, but if we are interested in geology, this might look different. But neither the geologist who tells us that it’s in fact a salt nor we, who demand to call it a “stone”, are technically wrong. Both terms describe the thing we found on the street equally well.

Turning to data, the same problem occurs. If we have ten people in a room, we might ask some researchers to describe these people to us. The sociologist would probably create a survey and ask them about basic demographics such as gender, age, place of residence, and occupation. The economist might ask them about their salaries, capital investments, and last tax rate. The biologist may record more physical properties such as height, weight, and hair color. You, finally, may then say “These are my friends I wanted to go to the concert over there. Could you now please stop harassing Eric and let us go enjoy the show?”

Two observations follow from this. First, most disciplines, if they are to record data, will naturally think about representing it in the form of tables. The only thing that differs will be what kind of information is of interest. And second, sometimes the datum of interest can be summarized without any table whatsoever.

When we – as data scientists – deal with data, the type of data representation that we choose is a huge deal. Depending on the format we choose to represent our data, some conclusions will come more naturally than others. If we are interested in finding out how high people are on average, recording a table with all measurements will make absolute sense. But if we want to summarize a book, no table of frequency counts will help us. Likewise, if we want to find out common shapes in images, again, no data table will absolve us.

Most of us, however, have been trained to represent data solely with the help of tables, because that is the most natural representation of the type of regular statistical data that we all know very well. Each observation is a row, and each column contains a single variable that we observe. Many statistical procedures have been adapted to work well with this representation, and when we, e.g., calculate mean and standard deviation of a single column within such a data table, we just *know* what this means, and what the implications are.

# Everything is a Table

The way we have represented data ever since the inception of the statistical discipline somewhere in pre-modern times has had such a large impact on how we think about data that it engraved the table-representation not just in our minds, but also in the tools we use.

Stata’s own programming language, for instance, has been modeled to fit the table representation of data. The same holds true for the more general language R. When you load data, e.g., by calling `read.csv`

, R will always give you a data frame object and you really have to force it to give you back anything else. Other data types, such as arrays or objects, are second-class citizens, and working with them can be a real pain.

When I started switching to Python for data analysis, the first thing I learned was pandas. Pandas is basically the Python equivalent of R’s data frames. You load a CSV file, and then you get back a table, with which you can run all the analyses that you already know. That was great, because for me in my role as a data scientist, everything was a table. Everything I *knew* was a table.

As with most misconceptions about the world, we are rarely forced to reflect on them, except for when they break down. And data tables break down very quickly once you leave the shallow waters of more traditional data analysis and a modicum of data.

# When Tables Break Down

Tables work great. As long as you don’t have too many observations and/or too many variables, data access and calculating metrics is fast and enjoyable. However, I noticed that there may be limits to tables early on.

When I was a student assistant at the Bonn center for development research (ZEF), I had this dataset of approximately 1.5 GB of data. All I had to do was to crunch the data to derive new variables. At the end of the project and writing about 1,000 lines of Stata code, the full code ran for about two hours. Back then I did not know how to actually optimize my code, but it dawned upon me that maybe crunching that amount of data shouldn’t take *that* long.

I had a very similar experience recently, when I tried to load in a big chunk of the metadata for my own PhD as a pandas data frame. Calculating a single metric suddenly took seconds – and that on a fairly fast computer. That felt eerily similar to my experience years ago. This time, I already knew how to speed up my code, and with a few changes – one of them was to completely drop pandas and load the data differently – the code ran blazing fast again.

But I’m not the only one noticing such limits of table representations of data. Recently, a student ran a topic model on a relatively small corpus of text, and they reported that one model run – using a very small amount of topics – took more than a day of constant computing. At first I was confused, since even slow and cheap laptops should be capable of running a topic model of that size within at most two to three hours. But not 24 hours. So what was going wrong? The student was using R to run the topic model, and in R, everything is a table. And that is what broke the neck of the laptop when it came to training a topic model.

You may now think: “Yeah, well, what do you expect when you work with *text*? Text is always very resource-demanding!” But then you are wrong. A topic model does not really work with text. Rather, it works with what is commonly referred to as a document-term matrix (DTM). And that thing literally does not contain any text. Rather, each word is represented as an index in that matrix, and the cell at position $i, j$ contains a one if document $i$ contains word $j$, otherwise a zero. So a topic model works – in a very traditional sense – just with numbers.

But then, what was making that whole thing so slow? It was, in fact, the same three causes that also made my own experiences with tables awful: the memory footprint, data access times, and the versatility of tables.

## Tables have an insane memory footprint

The first cause of the sluggishness as experienced by me and the student has to do with the memory footprint of tables, and there are two contributing factors to this. First, a table always must be read in its totality. You cannot read in partial tables. So if you have a data file, for example, that is 1.5 GB large, then the resulting table representation of that data in your program will take up *at least* these 1.5 GB. And if your computer does not sport 32 or even 64 GB of system memory, that is likely already too much for your computer. Always remember that this data table is not the only thing in your memory: so is your data analysis software, the text editor you have open on the side to take notes, and your web browser (I have written extensively about this in this article). All of this makes the available system memory run out much faster than one might believe.

But there is a second factor that contributes to the memory footprint of a data table: the amount of information that needs to be stored in order for the program to actually fulfill your access requests. For example, when you want to access a column in the data, you may write something along the lines of this: `df[["Variable One", "Variable Seven"]]`

. How that data is then actually accessed surely depends on the implementation of your specific program, but since a data table cannot know whether you want to work with a slice of *rows* or a slice of *columns*, it needs to store an excess of information about where that data actually resides in memory. In order to not make you wait forever until it has retrieved your data, it needs to store much more information in redundant ways.

If you have enough money, you can certainly circumvent these memory problems and make sure you don’t run out of space. But even then, there is a factor that will make data tables very sluggish once you go past a certain threshold of data that it can easily work with.

## Accessing data in tables takes time

This factor is much more difficult to understand and very technical, but nevertheless an important factor in how fast you can work with data tables: access times. Let us assume you have solved the memory problem with the remaining funds of your PI’s project grant and all of your data sits in memory comfortably. How does a program actually access your data? For you, it will seem very natural: When you write `df[["Variable One", "Variable Seven"]]`

, of course you just want it to pluck those two columns out of your imaginary table and give them back to you. However, as easy as it is to *visualize* what will happen, this is merely a bad analogy. In system memory, there is no table. All there is is an awfully long line of memory addresses. In other words: Your computer stores *everything in one single long list* that is either eight, sixteen, or thirty-two billion (= giga) bytes long.

The reason that your computer isn’t always painfully slow is that – as long as you know the correct memory address of your data – accessing the data is instantaneous. It does not matter if you need some data from the first Gigabyte of system memory or from the last. As long as you can tell your computer “I need *that* there!” it will comply in an instant. When it comes to accessing those two columns, however, you are at the mercy of the implementation. If the implementation stores all your information column by column you may be lucky, because the program can just tell the computer “Please give me the data between address X and Y” and that will then be the full column. But if your implementation stores the data row by row, you may be out of luck, since instead of requesting one single chunk of memory it needs to request 1,000 small chunks. And that will be exactly 1,000 times slower than the former.

## Data Tables Can Store Anything

The third factor contributing to the sluggishness of data tables is that you can basically store anything in a table. Do you have strings of characters and numbers at the same time? No problem, a data frame will deal with this easily. Some implementations even support arbitrary binary data such as images or videos.

However, that contributes both to a bigger memory footprint as well as slower access times. First, a data table needs to remember what kind of data is stored in memory, and that takes up additional space. When you have a column full of numbers these have to be treated differently by the program than a column of strings. Remember that in a computer everything is just zeros and ones, but what you expect those zeros and ones to *represent* makes a huge difference. Take the numbers 32 and 33. These will be stored as `00100000`

and `00100001`

respectively. If you now want to add up those numbers, you expect it to become `01000001`

or 65. But what if those numbers should actually represent strings? In memory, they will look *exactly the same*, because your computer has no concept of “text”. A program merely defines that the number 32 represents a space, and 33 represents an exclamation mark. The *numerical* sum of these numbers, 65, can be represented as an A. If you want to add those numbers, what you want the computer to *actually* do is simply concatenate them, since it would be hard for you to convince anybody that `_ + ! = A`

.

These shenanigans of data types also means that access times will be slow. Remember that, if you tell the computer exactly what memory address you want, it will return you the data stored there instantaneously. But if you have differing data types, then calculating that memory address will be more difficult. The program would have to first take the length of one single value of the first column times the number of rows *plus* the length of a single value in the second column times the number of rows, etc. This is much slower than if the data frame just had one single type of data — say, an integer.

This leads to our first step away from tables: matrices.

# From Tables to Matrices

If you only have numbers as values in your data, you do not need to utilize the full might of a data table. Rather, you can utilize a matrix. A matrix is much better on all accounts than a data frame. So much better, in fact, that if there is any way of re-coding your data into only numbers, it is probably much easier to utilize a matrix to begin with and actually go through all that re-coding hassle. Remember the document-term-matrix from earlier? This is such a way where some non-numerical data has been re-coded so that it can be represented with a matrix rather than a data table.

A matrix has a much smaller memory footprint, because there is only one single data type inside the table: integers, for example, or floating-point numbers. A matrix doesn’t have to store multiple data types at once. This means that it doesn’t have to store redundant data since the speed differential between accessing a row and a column is for the most part negligible. Also, accessing data is fast, because the program just has to compute $i * j * \text{size}$ to precisely know where your value is in memory.

When you re-code your data, all you have to keep in memory is a small lookup table that maps indices inside the matrix to column names, i.e.: index 0 is always ‘apple’, index 1 is always ‘bread’, etc. Whenever you need to know, say, the mean of the column for the amount of apples, you can just look up the index of the column in the matrix where that information is stored and then calculate the mean of that column. Lastly, since matrices are normally optimized for calculation, getting the mean from a column of a matrix is many times faster than calculating the mean from a table column.

“Hold on!”, you may say now. “Earlier you mentioned that this one student had problems with a *matrix*, not with a data table! So are matrices now also bad, or what?”

Very good point. Matrices are still not the one-shot solution for everything that ails data table representations. When dealing with text, as in the document-term-matrix above, the matrices will generally be very large. Think about it: unlike data tables, whose number of *columns* normally does not change because we are observing one fixed set of variables, such DTMs grow exponentially. Since text follows Heaps’ Law, adding more documents (=rows in the matrix) to your corpus will exponentially increment the amount of words (=columns in the matrix). Additionally, since we have to store information on the *non-occurance* of a word as well, the actual size of the matrix does not just grow quadratically, *it grows quadruply*! And at some point, you will run into the same problems with matrices as you do with data tables: they become too large and your computer begins to swap data from memory, making your analysis come to a crunching halt.

Luckily, solving these problems with matrices is so much easier than solving them with data tables. The first good idea to implement when dealing with large matrices is to actually split them up. As long as each column index in every partial matrix always represents the same word, you can repeatedly push only, say, 1,000 documents through your algorithm, rather than the whole corpus. Most algorithms are designed to work iteratively, precisely to avoid these size problems.

But even a 1,000 by 20,000 matrix will still be quite large. There is a second property of a document-term matrix that we can utilize: such a DTM contains many more zeros than ones. In fact, one of the DTMs I constructed consisted of almost 99% zeros! There is surely something we can do about that, right?

Of course! And that is called: Sparse matrices (on which I have written extensively here).

# Entering Deep Waters: Sparse Matrices

Now we are going another step away from the known, shallow waters next to shore, into the deep sea of primitive data types. If you have dealt with matrices in school, you may already be vaguely familiar with the mathematical distinction between sparse and dense matrices. In mathematics, a sparse matrix is one where most elements are zero, whereas a dense matrix is one where most elements are non-zero.

In engineering, however, the distinction between sparse and dense is different. Every regular matrix that you are going to use is dense by definition, regardless of the ratio of zeros in it. The reason it is called dense is because *every element is stored in memory*. That means that every single zero within a (mathematically) sparse matrix will be actually stored. But engineers quickly discovered that it’s kind of wasteful to store every single zero. So they devised sparse matrices.

A sparse matrix in the engineering sense is defined as a matrix that only stores the non-zero elements of a matrix. So when you have 99 % zeros in a matrix, that matrix will be up to 99 % smaller when you store it as a sparse matrix format. However, as with every optimization in computing, “There is no free lunch”. That means that, depending on what your aim is, you likely need different algorithms to construct such sparse matrices. Some algorithms are very efficient for *building* such a sparse matrix, but they perform awfully when you want to *access* the matrix’ elements. For other algorithms, it’s the other way around. A sparse matrix in LiL format (“linked list”), for example, is very good when you need to quickly build a document-term matrix, but it is a good idea to afterwards convert that LiL-matrix into a CSR-matrix (“compressed sparse rows”), since the latter format offers very fast access speeds. And when you perform topic modeling, you only need to create the matrix once, but then access it oftentimes, so the time it takes for the computer to convert the matrix back and forth is negligible.

# Thinking With Your Data

All of this arrives at the key point I want to make today: We have to learn to think *with* our data, not against it. If we have a mental model of a data table in our mind and try to think of ways of how to force some data into the schema of a table, this can do utter violence to your data. And in fact tables make up only a very tiny fraction of the data our computers are dealing with every day.

The tools we devise always reflect what we need them to do. We wouldn’t use a hammer to drill a screw into a wall. However, data tables have become the hammer of statistics, and with the large data sets that are more and more on the forefront, it becomes more and more difficult to think of every problem as just another nail. We have to acknowledge an important fact: For a computer, dealing with a table representation of data is painful.

In order to provide you, the researcher, with an interface that corresponds to your mental model of the data as a table, it has to perform a lot of magic under the hood, and that has a serious impact on speed. In an age of ever increasing amounts of data, it is about time that we – not our computers – rethink the way we think about data. The times when all we had was a small table with a few observations are coming to an end. Even the traditional datasets for sociology – panel data or cross-sectional surveys – are growing and growing, since it becomes easier and cheaper to ask more people about their opinions.

It turns out that the biggest problem we have with data is neither the amount of data nor the speed of our computers. No, it is the *representation* of our data that we choose.

Once we’re in the gigabyte realm of dataset size, this mental model we have curated for centuries begins to break down. And sometimes, a few hundred megabyte suffice to turn an exciting analysis into an excruciating nightmare of waiting until the computer is finally done crunching the numbers.

All it takes to quit the nightmare is to rethink our mental models, and start to think with our data instead of forcing it into the shape of a nail. And you don’t even have to learn that many new mental models. A few heuristic rules suffice for the most part:

- Never use data tables unless you have a small amount of data and just want to do a quick initial analysis of it
- If you frequently have to look up certain values or just check whether some value is contained in a list, use a lookup table or a hash map (dictionaries in Python, hash map in R)
- If you actually need to work with a long list of data and, e.g., transform it, use a regular list (vector in R) instead
- If that list consists only of numbers, use a
`numpy`

vector instead (only applies to Python where you have both`numpy`

as well as regular lists) - If you have too much data for your system memory (as a rule of thumb: anything above 100 Megabyte generally qualifies), try to see if you can load it in row by row (tuples in Python) and utilize data streaming

Anything beyond these few rules counts as actual code optimization and requires deep knowledge of the programming language and its quirks to perform, which is much harder to achieve. In general just remember: Depending on what representation you choose for your data, your analysis will be either an enjoyable ride through the depths of your data set, or a hellish nightmare of crashed computers and waking nights.

Always remember: There are many data structures available, and if you think with your data and choose an appropriate structure, this will make your life a lot less hellish.

# It’s called *Computational* Social Science

Let me end today’s article with a call to arms. As social science transitions from an era of tabular data, we have to put much more emphasis on the “computational” part of computational social science.

In the article “Analytical Sociology and Computational Social Science”, published in the very first issue of the Journal of Computational Social Science, Marc Keuschnigg, Niklas Lovsjö, and Peter Hedström have explained:

Powerful computers in combination with the digitalization of the social world allow for rigorous empirical analyses of large and complex social systems. These new tools and data sources make it possible to address the traditional core questions of the discipline in an equally rigorous fashion as survey-based researchers were able to answer questions about the behavior of independent individuals. (Keuschnigg, Lovsjö, and Hedström 2018, 7)

This is absolutely correct as I have affirmed in another article. *However*, this perspective lacks one crucial point: Not just do we need the right *tools* for the job. If our mental models of the data are not updated accordingly, we cannot utilize those tools to their fullest extent. And that will have real consequences.

Remember the story of the student who had to wait for a full day until their analysis was done? This can be a deal breaker. Even if you don’t have any strict deadline approaching fast, having to wait a full day until you get possibly unusable first results can be a strong motivational killer.

Even though we are committed to rigorous analysis, we are all but human. And if some part of our analysis starts to feel more like a burden, we will subconsciously be driven towards searching for easy ways out. And we may just search for simpler analyses that work better with the one mental model we have, rather than overthink deeply-held convictions of how our data is “supposed” to look like.

We do not just have to adapt new tools and data sources, we also have to understand how those tools and data work. And for that, I fear, there is no way around putting a little less focus on the “social science” part, and more focus on the “computational” part. Because only when we know how computers see our data can we benefit from the new available tools and perform research without the feeling of powerlessness that arises when we watch a progress bar fill up for the second night in a row.

# References

- Keuschnigg, Marc, Niclas Lovsjö, and Peter Hedström. 2018. ‘Analytical Sociology and Computational Social Science’.
*Journal of Computational Social Science*1 (1): 3–14. https://doi.org/10.1007/s42001-017-0006-5.