A Rant

Abstract: R sucks, and in this – very opinionated – article, I collect a selection of the various issues I have with this language. Do not get me wrong: R is great for statistics, but for everything else it really, really sucks.


R sucks. And I will stop pretending that it’s due to my lack of knowledge about it. R is not a programming language. R is a specialized statistical analysis language akin to Stata. Nothing more, nothing less. Do not get me wrong: R is very capable for the tasks it is intended to solve. But at this point, many people are locked inside the walled garden of Base R, the tidyverse, and CRAN. I get the feeling that, rather than using a language whose ecosystem may already provide the necessary libraries for their task, they will go and produce yet another R library for their problem, losing a lot of time in the process. R is deeply ingrained in the canon of the social sciences, and that is perfectly fine. But as a field, we have to stop pretending that R can do everything, because it can’t.

R is a hammer, but we have more things to put into walls than nails. Sometimes we just need the power drill. And R is not a power drill, even though, under the hood, it is effectively mostly C++. Want to analyze text? Run it through Python first, produce a CSV, and load that into R. Want to write a paper/book/essay and include figures? Use LaTeX, Markdown, or literally Word (!)1 instead of RStudio. Need to digest more data than you have system memory? Use a different language (Julia, Python, or, I don’t know, Fortran) to preprocess it instead of hoping that R will not crash (because it will).

The timing of this rant is not random: in the past two months I have written barely any Python code and have instead focused all my mental energy on learning the intricacies of R. And now that I know them, I can definitely say: R sucks for many of the things we social scientists hope to use it for.

This is a rant, not a deep dive into everything that is wrong with R, so please take it with a grain of salt. I am not attacking anyone in particular; I exclusively focus on the language. Yes, I am frustrated, but I know that it is not due to my lack of skills. And I’m not alone in this: many of our master’s students have expressed the wish to learn more Python instead of treading the R-mill over and over, because they have the feeling it will empower them to do more. And I agree.

So without any further ado, let’s dive in. Buckle up!

What R is Good At

Let me start by outlining what R is good at, because of course R is not Satan itself. It is only evil and frustrating if you try to use it for every task imaginable. But if we – just for a moment – focus on what R actually is – a statistical analysis language – then it is actually quite good.

Working with already existing data frames

R is excellent when it comes to working with data frames. Since those are first-class citizens in R, working with them is quite enjoyable. Generating new variables, summarizing, and grouping data is pleasant, and throwing all of this into a wide variety of models is fast and efficient. You can quickly generate labels, recode variables, and prepare any kind of data for analysis. Joining data is painless, too: we frequently have multiple data frames that we have to glue together, and R makes this kind of work much more pleasant than Python.
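To illustrate (a minimal sketch with made-up toy data, assuming dplyr is available), this is the kind of recode, group, summarize, and join workflow that R makes genuinely pleasant:

library(dplyr)

# Toy stand-ins for a cleaned survey and a country-level file
survey <- data.frame(
  id      = 1:6,
  country = c("SE", "SE", "DE", "DE", "FR", "FR"),
  age     = c(25, 41, 33, 58, 29, 62),
  income  = c(2100, 3400, 2900, 4100, 2500, 3900)
)
country_info <- data.frame(
  country = c("SE", "DE", "FR"),
  gdp_pc  = c(56000, 48000, 42000)
)

survey %>%
  mutate(age_group = ifelse(age < 40, "young", "old")) %>%      # recode a variable
  group_by(country, age_group) %>%                              # group the data
  summarise(mean_income = mean(income), .groups = "drop") %>%   # summarize per group
  left_join(country_info, by = "country")                       # glue on country-level data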

The requirement for all of this, however, is that the data is already somewhat cleaned. You can absolutely do some additional data cleaning in R, but what I realized in the past weeks is that R cannot actually create data frames from scratch. If you have some actual raw data, getting a proper data frame out of it becomes frustrating fairly quickly. Just because R can read arbitrary text files does not mean it works well with them. My entire work started off with piles of raw data. There was no CSV in there, no cleaned data source; I had to produce everything from scratch. And given the amounts of data I had, it would’ve been impossible had I used R exclusively. I will come back to this.

Working with small-scale data

R is very versatile. Because it tries to pose as if it were a proper programming language, you can do quite a lot with it, and with small-scale data it is quite possible to do many things – including text analysis. As long as your data is reasonably sized, R can do almost anything.

However, beyond a certain size, you will notice that optimizing R code is sometimes just not possible, so if you have larger data sets (read: if you are analyzing any data produced after 2000) you will hit a lot of roadblocks that are not due to your inability, but due to limitations of R. I will come back to this.

Running statistical models

This is hands down the most enjoyable part of R. Running a model is ergonomic, and the comparison I made between R and Stata above was not accidental: to me, working with R feels like working with Stata back in the day. Indeed, I believe that to enjoy R, you must realize that it is basically just an open-source version of Stata, and not an alternative to Python.

Compare how running a linear regression works in Python vis-à-vis R, for example:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")
model = LinearRegression()
model.fit(df[['predictor']], df['outcome']) # X must be two-dimensional, hence the double brackets
print(model.coef_) # View the coefficient
print(model.intercept_) # View the intercept
# TODO: Do a ton of work producing a regression table, plotting, etc.

And now the same in R:

df = read.csv("data.csv")
model = lm(outcome ~ predictor, df)
summary(model) # Prints out a regression table

R has been made with statistics in mind, and I think this shows you exactly what that leads to: running any type of model in R is as straightforward as it can possibly be, while in Python you have to do a ton of manual work that R already does for you under the hood. Do you know how to add additional predictors to a Python linear regression? It takes work, but in R it's as easy as adding + other_predictor to the formula.
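For illustration, sticking with the hypothetical column names from the snippet above, extending the model is just extending the formula:

# Additional predictors and an interaction, still a single line
model <- lm(outcome ~ predictor + other_predictor + predictor:other_predictor, data = df)
summary(model)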

This is literally the only thing where I can hands-down say that no “real” programming language like Python, PHP, Rust, C or others comes even close to the ergonomics of running models. Typing lm(dependent_variable ~ IVone + IVtwo + interaction * IVthree) is something you can’t reproduce with any other language. And that’s why you should use R when you want to run statistical models. It just works.

What R is Bad At

As you can see, R is not all bad. But it is very bad for a lot of other things. And now it is time to face reality. One note beforehand, however: When I now punch R, I do not ignore that all other programming languages also suck in their own right. But this article is on R, and not other languages, so I’m going to focus my efforts here.

R Cannot Produce Flexible Descriptives and Regression Tables

The biggest irritation I stumbled upon is the fact that R cannot reliably produce robust (HTML, Markdown, or LaTeX) tables for descriptives or regression coefficients. Yes, there are packages like stargazer out there. But none of them works reliably. Whenever I have to produce such a table, I more often than not find myself having to manually clean errors out of the output. That is fine if you do it once, but this is not how science works. We run a regression dozens of times, and it is not feasible to re-apply the same patches to the erroneous output every time just to get a proper LaTeX export.

I once had a long discussion with a colleague who is the R wizard at our institute, and he confirmed that (a) there are (too) many packages for producing such tables out there; (b) none of them works reliably across cases; and (c) it is sometimes impossible to find them, because R package authors are more concerned with finding a package name that includes a pun than with being descriptive about what their package does. I will come back to this.

R is not Scalable

The next drawback of R is that it is not scalable. The hundreds of textbooks using R to demonstrate techniques work fine because they all work with dummy data. But we are in an age of big data, and when the amount of data becomes large, R becomes slower and slower until it inevitably crashes. The reason is that R’s memory management is not really great. And, to add insult to injury, R makes it decidedly difficult to steer against this.

Most programming languages I have come across over the years also try to be “smart” about memory allocation, and, as in R, this works reasonably well most of the time. But there are always instances where you want to do something that the compiler/interpreter can’t really figure out, and it will work less efficiently than it could. The difference between R and other languages is that in other languages it is mostly trivial to rewrite parts of the code to account for that and speed things up. R makes this very difficult, and sometimes impossible.

This is not least due to the fact that data types in R have blurry demarcations. A data.frame can often stand in for a matrix, and a matrix is really just a vector with a dimension attribute. Moving between the various data types is often an implicit action, because a function may expect a matrix but happily accepts a data.frame and silently converts it without you knowing. This is great if you don’t want to think about what data types you have. But if you want to ensure that certain optimizations happen (because – I don’t know – you have 12 GB of data), this “convenience” becomes a big liability.
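Here is a minimal illustration of how silently these boundaries get crossed:

df <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5))
cor(df)       # expects something matrix-like and silently coerces the data.frame
m <- as.matrix(df)
class(df)     # "data.frame"
class(m)      # "matrix" "array"
typeof(m)     # "double": the integer column was promoted without a word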

Also, there are many tasks that require multi-processing because they’re CPU-bound. And with parallel, R even features a base package for that! Great! … is what I would say if it actually worked as expected. You see, multi-processing works by spawning a set of worker processes that run on different cores of your CPU. Each of these processes has its own separate memory space, which means that all data a worker needs access to must be copied into every single one of them.

Now, tell me what happens if you have about 1 GB of data in memory and spool up ten workers to do some expensive calculations. Right: the entire environment will be copied ten times so that each worker has access to it. And you know what happens then in many cases? Exactly: R crashes because it flooded your system memory. Other implementations are much more transparent about this. Python, for example, makes it clear that it will basically run the same script ten times in different processes, so you know that each worker should explicitly read in only the data it needs instead of hoping that some environment variables will be available. Yes, this involves understanding mutexes and inter-process communication, but that is a little overhead that will save you days of waiting for your computer to finish.
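To make this concrete, here is a minimal sketch using the default PSOCK cluster from the parallel package; clusterExport serializes the object and copies it into every single worker:

library(parallel)

big <- rnorm(1e8)                 # roughly 800 MB of doubles

cl <- makeCluster(10)             # ten worker processes
clusterExport(cl, "big")          # 'big' is copied into each of the ten workers
res <- parLapply(cl, 1:10, function(i) sum(big) + i)
stopCluster(cl)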

This is something R lacks: it doesn’t force you to be explicit about things, which would make it easy to see where an issue might arise. It tries to be convenient, but that convenience turns into a liability. R is simply not scalable because of this.

R cannot handle text

This has been a long-time pet peeve of mine. Basically all programming languages have a somewhat convenient way of working with text … except R.

Take just a few examples: Printing out interpolated text with two variables in various languages:

  • JavaScript: `Some text ${var1} and other text ${var2}`
  • PHP: "Some text $var1 and other text $var2"
  • Python: f"Some text {var1} and other text {var2}"
  • Rust: println!("Some text {} and other text {}", var1, var2)
  • R: print(paste0("Some text ", var1, " and other text ", var2))

This doesn’t just hold true when it comes to printing out debug messages, this also applies to working with text more generally. Since R is biased towards numerical operations, text handling is always somewhat cumbersome. Of course, any reasonable R package will implement actual text analysis work in C++ instead of base R, but a lot of what we have to do runs in base R.

Also, writing such C++ libraries effectively means that the authors have to write a C++ package first, and the corresponding R bindings second. And this intermingling of two languages provides fertile ground for bugs. I will come back to this. Using R for text analysis or anything that involves text is cumbersome, and you should consider using another language for that.

“Everything is an Object”

Now, this is an accusation that many other languages have to face as well; among the most popular ones are Java and JavaScript, but Python is also guilty here. In many languages, everything is an object, and running "Some text".lower() is a common occurrence.

Normally, this isn’t too much of a problem, but for R, it becomes very difficult as there is no good way to distinguish different objects from each other. As I mentioned above, the demarcations between data.frame, matrix, vector, and list are very blurry and there is a lot of implicit conversion happening under the hood. But in addition, do you know what data type a vector has? Exactly: It depends on the data type of the values it stores! Effectively, this means that a vector “hides” itself from your own code. You yourself may know which variable is a vector and which is a primitive value, but your code has little way of knowing. Here’s an example:

> class(2.3)
[1] "numeric"
> class(c(2.3, 1.2))
[1] "numeric"
> typeof(2.3)
[1] "double"
> typeof(c(2.3, 1.2))
[1] "double"

Have fun writing code that distinguishes between vectors and primitives.
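The closest workaround I know of is a heuristic on the length, not a real type check:

# There is no scalar type in R: a "primitive" is just a length-one vector,
# so the best your code can do is check the length
is_scalar <- function(x) is.atomic(x) && length(x) == 1
is_scalar(2.3)           # TRUE
is_scalar(c(2.3, 1.2))   # FALSE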

Let me rephrase the section’s heading: the problem is less that everything is an object in R, and more that R is extremely sloppy with its data types. Do you have a list of numbers but need an atomic vector? No problem: unlist(your_list). Need a matrix instead? as.matrix() will happily oblige. And if you don’t even care, R will take care of it for you. This makes it extremely difficult to track down problems with your variables, because more often than not it does matter whether something is a vector or a list.

Lastly, accessing an element inside a list requires double brackets (my_list[[2]]), but the same syntax also works on vectors without being a syntax error, even though vectors only require single brackets for indexing. This makes you sloppy, too: do you remember which of your variables are lists and which are vectors? Since both accept pretty much the same syntax, it’s hard to keep track.
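See for yourself:

v <- c(10, 20, 30)     # an atomic vector
l <- list(10, 20, 30)  # a list
l[[2]]  # 20: the "proper" use of double brackets
v[[2]]  # 20: works on the vector too, without the slightest complaint
v[2]    # 20: the idiomatic form for a vector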

The R Garbage Collector is Broken

Something I found out more recently is that R’s garbage collector is broken. Quick refresher: a garbage collector’s job is to look through all the memory that R has allocated and free whatever is no longer used. Basically, it tells the operating system: “Do you see this gigabyte of memory? I don’t need that anymore.” This is especially helpful when you are dealing with large objects, since you can forcefully free up memory to make space for the next large object and keep more control over your code’s memory consumption.

However, as I had to find out the hard way, calling gc() doesn’t necessarily free up all unused memory. I don’t know where exactly the bug originates, and I have more important things to do than track down obscure bugs, but it could be anything from a faulty memory size estimation to an issue with the reference counter or simply a faulty implementation of gc(). I don’t know. The only thing I do know is that this has exhausted my memory more than once in the past 48 hours, and I do not like that.2

R doesn’t know pointers

This problem is further exacerbated by something I deem a major hindrance to the language’s efficiency: R doesn’t know pointers. Specifically, you cannot pass by reference in R. This means that when you have some big matrix x and want to extract a chunk of it, calling y <- x[1:10000, ] will copy that entire chunk. Which also means: if you need to do this several hundred times in a row, you should never forget to consistently call rm() and gc() on the intermediary variables to aggressively free up memory so that R does not crash once again – if those functions don’t bug out on you.
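This is the pattern I mean, as a minimal sketch:

x <- matrix(rnorm(1e7), ncol = 100)  # a reasonably large matrix
y <- x[1:10000, ]                    # this copies the rows; it does not reference them
# ... do something with y ...
rm(y)                                # drop the copy explicitly
gc()                                 # and ask R (politely) to hand the memory back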

This is much better in Python. Even though I don’t like writing statistical models there, matrix operations are much more efficient thanks to numpy and the fact that Python usually passes references around. Specifically, there are two concepts in numpy that make this kind of operation efficient: views and broadcasting. When you create a “view”, you create a variable that merely refers to a chunk of the original data instead of copying it. Writing subset = matrix[10:100, :] stores in subset only a reference to the affected rows, and when you perform an in-place operation on subset (for instance, broadcasting an operation over the whole view with subset += 1), it mutates the data in matrix.

“This is bad!”, you may now say – and two years ago I would’ve agreed with you. But how often do you actually need immutability when you deal with a single, large matrix? The biggest part of our business of getting data into shape consists of mutating the same data over and over. And having two or three copies of (almost) the same matrix floating around in memory will give you a bad time.

“But what if I made a mistake and accidentally deleted several thousand rows of data?!” Well, go ask an SQL programmer who forgot to start a transaction before calling DROP TABLE. This is a general problem and not entirely avoidable. Immutability is good because it doesn’t destroy data, but there is a balance to be struck between that and having your computer crash when dealing with gigabytes of data. Weighing the benefits and drawbacks, data loss is 99% avoidable with proper pipeline-style programming: just split the data preparation into multiple steps, each of which writes an intermediary CSV file from which you can always restart. Then the only thing you lose is at most the few seconds it takes to reload that file. If you’re still afraid: I am a heavy Python user and I almost never clone an object, yet I have had less data loss in the past three years than in the few weeks I have actively used R since.
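What I mean by pipeline-style programming, sketched with hypothetical file and column names:

# step_01_clean.R
raw   <- read.csv("raw_export.csv")
clean <- na.omit(raw)
write.csv(clean, "01_clean.csv", row.names = FALSE)

# step_02_recode.R: restarts from the intermediate file, never from the raw data
clean         <- read.csv("01_clean.csv")
clean$logwage <- log(clean$wage)
write.csv(clean, "02_recoded.csv", row.names = FALSE)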

R Scopes are a Joke

Another issue I have with R is that its scoping is a joke. All programming languages have some concept of scope, usually a global scope, a module scope, a block scope, and/or a local scope. For example, if you define a variable inside a JavaScript block, that variable is not visible outside that block. Rust is a programming language that basically centers its whole existence around managing this scope problem.

And R? Well. First, it doesn’t have a global scope, it has environments. The “global” scope is just the default environment (the thing you see in the top-right of RStudio’s default layout). Any variable you declare anywhere in your code will just be slapped onto it. A few days ago, I wanted to better compartmentalize my data frame build module by using environments, in the hope that variables declared in one script file wouldn’t bleed into other code. And guess what: exactly, it did not work.

A variable defined somewhere deep inside one of your script files will afterwards pop up in your global environment, and there is little you can do about it. It turns out that R really litters its data everywhere.
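A small sketch of the kind of littering I mean; loops and sourced scripts evaluate straight into the surrounding environment:

for (i in 1:3) {
  tmp <- i * 2   # looks local to the loop body...
}
i    # 3: the loop variable survives the loop
tmp  # 6: and so does everything defined inside it

# source() evaluates a script in the global environment by default, so every
# variable that script creates ends up there, too (hypothetical file name):
# source("build_dataframe.R")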

R’s Syntax is Inconsistent

Another issue is that R’s syntax is inconsistent. It is common for programming languages to give you multiple ways to achieve the same outcome, and that is perfectly fine – after all, a programming language is intended to give you control over your computer. But in R, this also happens at places where it really shouldn’t.

I think this is something where an actual argument could be had about whether that is bad or not. But remember the list/vector issue from above: lists require double brackets for accessing their contents, whereas vectors only require single brackets. But vectors also accept double brackets and pretend it’s just single brackets. This reduces the mental load you have to carry, but it also nudges you toward sloppiness. It feels especially egregious to me because R is at the same time very picky about commas in function calls. In many languages, you can end an array literal or argument list with a trailing comma and the language won’t complain (JavaScript, PHP, and Python all allow that). But R will give you an obscure error if you do this.

And it feels very inconsistent to me that R is so picky about commas, yet happily lets you intermix various data types.
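Both halves of that inconsistency, in two lines (the exact error wording may vary between R versions):

c(1, 2, 3, )     # trailing comma: hard error, "argument 4 is empty"
c(1, "2", TRUE)  # mixed types: no complaint, everything silently becomes character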

RStudio Sucks

Another thorn in my side is RStudio. I hate it with a passion. The two main issues I have with it are (a) that it’s single-window and doesn’t support split views, and (b) its editor, ACE. That thing is notoriously bad when it comes to keyboard handling. I have similar gripes with WinterCMS, which also uses ACE under the hood. But at least WinterCMS is currently trying to move to a more modern code editor.

Here, I mainly have an issue with how RStudio treats code insertion. My list of complaints includes, among others:

  • You cannot move lines up and down using Alt+ArrowKey, or duplicate them by adding the Shift key, which is a time saver in many situations. All other code editors I’ve used allow that.
  • You cannot properly use the Cmd-key to delete an entire line on macOS.
  • Selecting text and wrapping it with quotes or brackets removes the selection and moves the cursor to after the wrapped quote/bracket, making it extremely cumbersome to efficiently move back to the start of your original selection.
  • Generally, moving through text is a hassle, because almost any character is considered a word boundary, so holding the Alt key (macOS) or Ctrl key (Windows/Linux) to move word-by-word requires many more presses of the arrow keys to get where you want than in VS Code (which uses the Monaco editor) or any app that implements CodeMirror.

Then, the design is stuck in the 1990s. You can’t open more than one file at the same time to copy/paste/compare two files; you cannot customize the layout of the app aside from some very basic changes; and opening more than one window is more of a utopian dream than reality.

However, it’s also somewhat difficult to switch to VS Code, simply because its R support is less than ideal, and the only thing I actually like about RStudio – the ability to quickly view a data frame – works poorly there. Thus, despite all these issues, it is still quicker to work with RStudio than with VS Code. And that sucks.

The R Community Has no Standards

By “standards” I don’t mean that they have no manners, but rather that they don’t have any conventions. It becomes really apparent when you compare various packages that most people who write R libraries do not have a software engineering background. Let me give you three concrete examples.

Verbosity Where no Verbosity is Needed

Recently, I had to implement some code that calculates a bunch of Kullback-Leibler divergences between various distributions. After some googling I found that apparently the best package for that is philentropy. It implements a bunch of divergence and distance metrics, one of which is the KL divergence. It offers one basic function, distance(), that lets you specify which metric to use and gives you a few options. Additionally, it offers “shortcut” functions that call distance() with reasonable defaults, among them KL.

So I went ahead, installed the package, and implemented calls to KL across matrices of several thousand distributions. Then I ran the code and … it turns out that KL will always print a message saying that it just calculated a KL divergence across two distributions:

> KL(rbind(c(0.8, 0.15, 0.05), c(0.5, 0.2, 0.3)))
Metric: 'kullback-leibler' using unit: 'log2'; comparing: 2 vectors.
kullback-leibler 
       0.3509538 

Why does it do that? If I call a function named KL, I know that I am calculating a Kullback-Leibler divergence. Why does it need to tell me that? You can imagine what happens to your console when this happens thousands of times a second. The default behavior, not just across most R packages but across all programming languages, is that library functions don’t log anything unless you tell them to (e.g., for debugging purposes). So I went through the documentation to see how I could turn that off, and indeed, there is a parameter mute.message that defaults to FALSE. You know what the best part is? This parameter is not exposed in any of the derived functions. In other words, you can suppress this message when you call distance() directly, but not when you call any of the derived functions, such as KL.

So I was convinced this was a bug and went to their GitHub repository. And indeed, I found a discussion of this issue. It was closed, and the dead-serious answer by the maintainer was “just use suppressMessages() and the message will go away”. And, sure thing:

> suppressMessages(KL(rbind(c(0.8, 0.15, 0.05), c(0.5, 0.2, 0.3))))
kullback-leibler 
       0.3509538 

Cool. Problem solved, right? Well, no. Because (a) this makes the code chunky and less readable; (b) this still should not be standard behavior; and (c) this exacerbates the memory and efficiency issues that R already has!

Consider the following: printing a message and then suppressing it is similar to running echo "This message will be suppressed" > /dev/null in the terminal. But you know what this means? R will allocate some memory, interpolate the variables into the message string, and signal the message, only for it to be caught and discarded. All the CPU cycles it took to generate that message and shoot it off are gone for good. This may not be an issue when you calculate a single KL divergence once in a while, but if you have to do it a million times in a row, this loss of efficiency accumulates.
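If you want to see this cost for yourself, here is a rough benchmark sketch; the chatty helper is hypothetical and merely stands in for functions like KL:

noisy <- function(x) {
  message("Computed a value for input ", x)  # built and signalled on every call
  x * 2
}
quiet <- function(x) x * 2

system.time(for (i in 1:1e5) suppressMessages(noisy(i)))  # pays for the message anyway
system.time(for (i in 1:1e5) quiet(i))                    # the same work without it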

Implicit Type Conversions Will Screw You

At this point, I originally included a story about the R package seededlda and a lesson about implicit type conversions. Specifically, a master’s student I co-supervise reported issues with this package, namely with its parameter batch_size.

You may already know that running an LDA is frequently done in batches, since fitting an entire corpus of text at once is likely to exhaust your system memory. Therefore, most text analysis libraries offer you the ability to set the batch size, i.e., how many documents are processed at a time. So does seededlda.

“But what’s the problem?” you may now ask. Well, take a look at the documentation of the batch_size parameter:

\item{batch_size}{split the corpus into the smaller batches (specified in
proportion) for distributed computing; it is disabled when a batch include
all the documents \code{batch_size = 1.0}. See details.}

This caught me by surprise. Specifically, I had never heard of an LDA being split up across multiple threads, since that seems very difficult to do: first because LDA requires a lot of sequential processing, and second because the usual way of dealing with this is to split a corpus into batches of documents to be processed one after another. It turns out that, while fitting an entire corpus to an LDA model can take a very long time, splitting the corpus into batches of x documents heavily speeds up the LDA, which I have confirmed on the Python side of things (which arguably runs similar code, since the LDA models there are implemented in C++ as well).

This library is different in that it does not expect a batch size of x documents, but rather a proportion of the corpus, e.g., 0.01 (i.e., 1%). When following this parameter down into the underlying C++ functions, I realized that what had been a double in the R code suddenly became an integer in C++. And if R had actually done an implicit type conversion from double to int, it would have turned any value less than 1.0 into a 0, since as.integer(0.1) is 0.

However, a reader of this blog thankfully pointed me to the line in the package’s R code that actually calculates the batch size that gets passed to the underlying C++ function: batch <- ceiling(ndoc(x) * batch_size).

So the parameter does indeed do what it says on the tin: split the corpus into proportional batches using however many threads you assigned to the library. My original interpretation that there is an implicit type conversion was therefore wrong. The point of the heading – that implicit type conversions will screw you – still holds in many cases, but it turns out that my example was not the right one to make it.

But while the original type-conversion argument no longer holds, this is still a good example of my "standards" argument: it is very unusual to define a batch size parameter as a "proportion of something", and that rightly made me suspicious. The reason they do this is a 2009 paper that they did not reference in the documentation, but rather somewhere in the source code. I didn't spot this reference until I revisited the code, but basically it says that you can parallelize LDA by approximating the (sequential) Gibbs sampling process. I do not want to go too deep into technical details, because this is a post about R and you can read up on this yourself if you would like to, but I still rest my case on the notion that defining a batch size parameter as a proportion, with no real explanation, is not ideal.

The main reason why people rarely seem to suspect issues in R packages (and also why I immediately suspected an issue in a package), I believe, is that CRAN (the R package repository) gives the impression of being high-quality overall, when there are really two sides to quality, and CRAN can only check for one of them.

One side of the quality coin is the absence of bugs. I often hear colleagues complain that one of their R libraries is in danger of being removed from CRAN because some automated unit tests failed for their code. CRAN really seems to be very strict on this side – which is good!

But what these automated tests don’t capture is code style, conventions, and the general usability of libraries. This other side of the quality coin is just as important as the first. Of course, having no bugs is important for widely distributed packages, but being easy to use is just as important, since improper handling can also induce bugs. There is a saying about unit tests and test-driven development (TDD): all of it is for naught when your unit tests do not capture user intentions.

A total of three people did not understand the package’s batch_size parameter, and it took a fourth person – a reader of this blog and heavy user of this package – to finally explain what is happening. That is at least two people more than it should take to understand a simple parameter.

Painting Code With magrittR

The last problem is more meta and concerns naming practices. R users like to give their packages fancy names instead of names that convey what they do. Sometimes this is okay, but if every small library developer with 10 users tries this, it gets ridiculous.3 There are so many different packages that it’s impossible to remember those nondescript names, and instead you always have to search the repository to find the correct package for your use case. Except you can’t, because CRAN has no search function!

You can even play bullshit-bingo with library names. Let’s try it out:

Which package implements the %>% operator?

  1. pipeR
  2. operatRs
  3. magrittR
  4. coelho

Right, it’s magrittR. What does René Magritte have to do with any of this? Je ne sais pas ¯\_(ツ)_/¯

Next: Which package implements descriptives and regression tables for R?

  1. tableR
  2. stargazer
  3. tidyR
  4. reporter

Again correct: It’s stargazer.

Final question: Which package implements coefficient plots (sometimes called “dot-whisker plots”)?

  1. dotwhisker
  2. telescopR
  3. fruitbasket
  4. docore

That’s a trick question: it’s dotwhisker, the one name that actually describes what the package does. But you get my point.

Reticulate has no reason to exist

A final problem I have with R is the existence of reticulate. reticulate’s job is pretty easy to summarize: Offer a way to call Python code from within R.

Now, I understand that some functions are obviously only implemented in Python and not in R. Then, there are two options. The reasonable idea would be to just learn Python, write a few lines of code to accomplish what you want, and load the data back into R to continue where you left off. Or you can be completely unreasonable and develop a package to interface with Python.

Now, reticulate is widespread, so by writing these lines I realize that I am putting myself in the line of fire, but hear me out. I do get that it is natural to assume that learning a new language takes longer than setting up a package that promises to handle all the other-language stuff for you. For the longest time, I was fine with this: if it is more efficient to set up some package than to set up an entire Python environment, there are benefits to never leaving R, especially when it comes to your mental model of the data structures.

However. Recently it has come to my attention that reticulate requires you to do the one thing that is actually, genuinely annoying about Python: setting up an environment. This means the only thing it saves you from is actually writing and running Python code; you still have to set up an environment yourself. And at this point, I believe that the reason for reticulate to exist has ceased to be valid.

Exchanging data between R and Python is extremely simple and can be done in four lines of code:

# Save down DataFrame from R
write.table(df, "df.csv", row.names = FALSE, sep = ",")
# ...
# Load modified DataFrame into R
df <- read.csv("df_from_python.csv")

And in Python:

# Read in DataFrame into Python
import pandas as pd
df = pd.read_csv("df.csv")
# ...
# Save down modified DataFrame from Python
df.to_csv("df_from_python.csv", index=False)

That’s it. It doesn’t cost any time, and then you have your data from R available in Python, can do everything you have to do in Python, and export it again. As I mentioned in the beginning, R is really bad at working with text data or creating data frames from scratch, and after an initial learning phase of maybe a week it will actually be faster to just switch back and forth between Python and R to achieve whatever you need, with no layer in between. Because, remember: even reticulate requires you to set up a Python environment, and once you have done so, just running Python on your computer is a no-brainer.

Where to Go From Here

So what do we make of this mess?

Accept That Not All Our Problems Can be Solved by R

I can’t help but get the feeling that scientists are afraid of other programming languages, or of exploring the space of possibilities that opens up once you leave the sandbox of R code. In my years in academia I have seen people use a TensorFlow implementation in R, which just doesn’t work with R’s syntax and produces completely unmaintainable code. I have seen people spend hours trying to get reticulate to work just so that they don’t have to write a few lines of Python code. I have seen people lose weeks of valuable research time trying to get a Latent Dirichlet Allocation model to run in R because it is just so insanely inefficient.

I have only two potential explanations for why this happens. Either people believe that the fact that their code isn’t working (as it should) is exclusively due to their own inability rather than a problem with the language. Or they believe that learning a new programming language would take even more time than they are already losing trying to get their code to work. I don’t know, but I really want to understand why.

Use the Proper Tools for the Problem

There are three ingredients to any successful programming language: its syntax, its ecosystem, and its community. What you will always see in front of your eyes is the syntax: Once you know a language, you become faster and faster at writing working code in it. Understanding the syntax and its intricacies is a large part of being an efficient coder.

But you also require a tight community that forms around a language. R is so widespread in the social sciences because the R community consists mostly of social scientists. If you have a problem with R, it’s easy to just go to your colleague across the office, and you will likely get an informed answer. If you use Python, however, you may have to knock on the door of the computer science department next door.

Lastly, you require a thriving ecosystem that works for your use case. R’s ecosystem features amazing libraries for statistics and network analysis. But the text analysis part of R’s ecosystem is just subpar. All the packages I have come across have many issues and just don’t work as expected. When you go to Python, however, you will see that its ecosystem is the gold standard for text analysis. And at this point this is just historical coincidence. It could’ve been different, but it isn’t, so we had better accept that fact. Yes, Python has a different syntax, and its community has far fewer social scientists. But in my opinion, the ecosystem outweighs the lack of community when you do text analysis.

Remember in the first section when I wrote about the beauty of writing formulas: it works so well because a formula is its own data type in R. R understands the syntax for writing regression models; no other language does. On the other hand, R has no concept of regular expression syntax, which makes writing regular expressions cumbersome and error-prone. JavaScript, by contrast, is the language for working with regular expressions, since they are part of the actual syntax. Writing a regular expression in JavaScript is just as enjoyable as writing a formula in R. Even Python has semi-native support for regular expressions with “r-strings”. In short, a very good heuristic for determining whether a language is intended for a certain use case is to look at what the syntax itself supports. R just is not made for text, and any attempt at using it for that is futile.
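To make the contrast concrete: in R, a regex pattern is just an ordinary string, so every backslash has to be escaped once more before the regex engine ever sees it:

# Matching dates like 2024-04-05: each \d must be written as \\d
grepl("\\d{4}-\\d{2}-\\d{2}", c("2024-04-05", "no date here"), perl = TRUE)
# [1]  TRUE FALSE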

In short, use the proper tools for the problem you’re facing. And since we’re in an age of big data, we cannot rely exclusively on R anymore. R is good for statistics, but for all the preparatory work before we can run some model, it becomes more and more of a liability.

Learn Other Languages

The second conclusion we can draw is: for god’s sake, learn another language! R is fine, and I will never tell you to forget it. As I have written, there are some use cases for which no other language is as well suited as R. But there are a lot of use cases where social scientists use R when other languages really shine.

There is no reason to be afraid. Once you know one programming language, you already know large parts of all the others. Programming languages are formal languages that do not leave any room for ambiguity. If we read some text that is missing a verb, our brains can fill it in to make the sentence legible. In a programming language, that’s a plain syntax error and your code won’t run. This stiffness also means that the syntax of programming languages isn’t as diverse as that of natural languages. Learning another programming language takes a few days, whereas we all know how difficult it is to learn another natural language.

I invite you to try things out. There are still languages that I personally would like to learn more of, like Bash for automating a lot of my manual code-calling. And every time I encounter a problem for which Python just isn’t made, I try to learn more of a language that is made for it.

Remember Who You Are

In conclusion, to get the most out of R and do your job efficiently, you should abandon R for many of the tasks you are probably using it for. Because in the end, your job is to pose research questions, collect data that can help you answer them, theorize the data-generating process, run an analysis, and write down your results. That is your job. Not trying to get R to work.

Social scientists often spend days or even weeks trying to fix some problem in R code that stops them from answering their research questions. However, since the social sciences aren’t about programming per se, there is no culture of searching for better solutions to our problems. If you have no measure of how much time you would save by re-implementing your code in a different language, it’s no wonder that “spend two weeks fixing a single issue” is perceived to be the fastest route.

In other words, what I am saying is that by learning one or even two more programming languages, you will net out spending less time trying to get code to work, and more time working on the actual core of your job. You want to do research, not bend a statistical language to your will. Trying to get a regular expression to work in R may take hours, but learning how to run regular expressions in Python and then figuring out a proper expression will take less time in total. And the more often you do that, the less time it takes.

R is a fantastic language for the purposes it is intended for. But for everything else, R sucks.

Update Apr. 10: A reader wrote me an email and said that the section on implicit type conversion was based on a false interpretation of the package's parameter and pointed me to a few sources that confirmed that my initial reading of it was wrong. I have adapted the corresponding section to now correctly describe what is happening.


1 If you are a recurring reader of this blog, you know how bad things must be if I recommend Word.
2 “But maybe your data is just too large for your computer to handle,” you may now shout. But don’t worry: in Python I can load the exact same data and add a few additional dictionaries on top without the interpreter crashing. I checked: it’s really R, and not my computer.
3 To be clear: This is also common in the Python environment, where authors frequently try to hide a “py” inside their package name. But there is one important difference: In the Python community, if you can’t think of a name that includes both “py” and a description of what the package does, people usually choose the descriptive name instead of the pun.

Suggested Citation

Erz, Hendrik (2024). “A Rant”. hendrik-erz.de, 5 Apr 2024, https://www.hendrik-erz.de/post/a-rant.
