New Open Access Paper on Text as Data: Variable Extraction via LSTM networks | Hendrik Erz

Abstract: In today's article I want to share with you a quick TL;DR on a novel paper on text as data that I recently published. Continue reading for some fascinating insights into computational text analysis!

Today, I am more than delighted to share some good news with you all: I have just published a paper on a novel text analysis method in Open Access!

In this article, I want to give you a very quick rundown of what the paper is about, and why it may be interesting to you – even if you are not part of the “text as data” crowd. The paper can be downloaded from the DGS website here. While it is unfortunately in German, I think that you may want to cite it nevertheless. If you have any questions or are unsure whether it suits your research, please drop me a mail – I’m happy to help you out!

First the TL;DR.

In the paper “Text als Daten: Extraktion von Variablen mittels LSTM-Netzwerken” (Text as Data: Variable Extraction via LSTM networks), which has recently been published as part of the 2022 conference proceedings of the German Association of Sociology (DGS), my colleague and co-author Anastasia Menshikova and I demonstrate a novel method to extract latent information from text. Specifically, we implement an LSTM-type neural network to extract gender information of subjects and objects in English sentences. Having verified that our training data was roughly equally divided into male- and female-led sentences, we can show that articles from the New York Times have a surprisingly well-balanced mixture of male and female subjects. In U.S. Congressional speeches, by contrast, we can show that there are not only many more male subjects involved, but that the speeches are also heavily about non-human subjects (such as bills, regulations, or companies, none of which have an assigned gender).

Furthermore, we can show that the accuracy of the network surpasses 75 % even with only a small amount of training data and that it generalizes reasonably well to various types of text.
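For readers who have not worked with LSTMs before, the following is a minimal sketch of what an LSTM-type sentence classifier computes, written in plain Python without any libraries. It is not the implementation from the paper: the weights are random and untrained, and the label set, the vector sizes, and all function names are assumptions made for illustration only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class LSTMCell:
    """A single LSTM cell with random (untrained) weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One weight matrix and bias per gate: input (i), forget (f),
        # candidate (g), and output (o). Each acts on [x; h].
        self.W = {k: init_matrix(hidden_size, input_size + hidden_size, rng)
                  for k in "ifgo"}
        self.b = {k: [0.0] * hidden_size for k in "ifgo"}

    def step(self, x, h, c):
        z = x + h  # concatenate the current input and the previous hidden state
        pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])]
               for k in "ifgo"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        g = [math.tanh(v) for v in pre["g"]]
        o = [sigmoid(v) for v in pre["o"]]
        # The cell state keeps part of the old memory and adds new information.
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

# Hypothetical label set; the labels used in the paper may differ.
LABELS = ["female", "male", "none"]

def classify(token_vectors, cell, W_out):
    """Read a sentence token by token, then score the final hidden state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in token_vectors:
        h, c = cell.step(x, h, c)
    return softmax(matvec(W_out, h))

# Usage with made-up inputs: five tokens, each a 4-dimensional vector.
rng = random.Random(1)
cell = LSTMCell(input_size=4, hidden_size=8)
W_out = init_matrix(len(LABELS), 8, rng)
sentence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
probs = classify(sentence, cell, W_out)
print(dict(zip(LABELS, probs)))
```

Because nothing here is trained, the printed probabilities are meaningless. The point is only the shape of the computation: one small recurrent cell plus a linear output layer, which is why networks of this kind remain cheap to train.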

Now, while the empirical research question is interesting in itself, the real juice is in the method that we implemented. Let me guide you through its methodological implications.

Our Method Makes Neural Text Analysis Available Even on Old Hardware

The first highlight I would like to bring to your attention is that our method shows that you don’t need transformer models to extract variables from text.

Now, of course, that depends a bit on the context. There are research questions that necessitate the use of transformer models, and in those cases you may be worse off with our smaller network. Additionally, transformer models are currently considered state-of-the-art, so if you want or need to squeeze the last bit of performance out of your methods, transformers are still the go-to solution.

However, LSTM networks – the smaller sister of transformers – can perform reasonably well, especially for social scientific research. As long as extremely high F1-scores aren’t explicitly required, an LSTM network lets you run your code on older hardware just fine.

As a baseline: I’ve run the models multiple times on an older Intel-based MacBook Pro with a 2.3GHz dual-core Intel Core i5 processor. Each model run has taken at most an hour, which is absolutely reasonable for training your own neural network.

So this method is especially suitable if your institution does not have access to a high-performance computing (HPC) cluster. While access to compute is improving, there are still many researchers without such resources. This much smaller network therefore gives you access to neural text analysis methods regardless of where you are.

Our Method Helps Avoid Methodological and Theoretical Issues in the Model Selection Process

One benefit that I didn’t even realize our method has until earlier this year is that this approach – unlike transformer models – makes it easier for you to avoid potentially grave ethical, methodological, and theoretical issues. Let me elaborate.

We already know that the pre-training of the very large transformer models that are currently considered state-of-the-art often involves unethical behavior. For example, many of the large pre-training datasets include copyrighted material, infringe on personal rights, or were built with low-paid labor. An LSTM requires you to train it fully yourself. While this does add some overhead to your research, it ensures that – unless you want to use ethically questionable data – you are less likely to run into ethical issues.

But even aside from ethical concerns, there are two more issues that pre-trained language models may come with that I believe our method can circumvent reasonably well. While the specifics are part of another paper I am currently working on, the short version is that pre-trained language models come with their own set of theoretical and methodological assumptions that may run counter to your own assumptions. If that happens, there are no guarantees that the results you obtain are even usable.

Our LSTM-based method enables you to make use of feature engineering. In other words, one big weakness of LSTMs is actually a benefit when you look at it from a methodological or theoretical angle. The weakness is that you need to think about features of the language (such as grammar) or other covariates to improve the performance. This adds some mental overhead when constructing your model, but it has a benefit: if your theory predicts that certain covariates or linguistic features help prediction, then with LSTM networks you can actually test that hypothesis. This is much more difficult to achieve with transformer models, which are more precise but also less customizable.
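As a rough illustration of what such feature engineering can look like, the sketch below appends a one-hot encoding of each token's dependency label to its word vector before the sequence is handed to the network. Everything in it is an assumption for the sake of the example: the tag set in `DEP_LABELS`, the two-dimensional toy word vectors, and the function names are hypothetical and not taken from the paper.

```python
# Hypothetical, heavily shortened tag set; real dependency parsers use many more labels.
DEP_LABELS = ["nsubj", "dobj", "det", "ROOT", "other"]

def one_hot(label, vocabulary):
    return [1.0 if label == v else 0.0 for v in vocabulary]

def build_inputs(tokens, word_vectors, use_dependencies=True):
    """Turn (word, dependency) pairs into one input vector per token.

    With use_dependencies=False the grammatical feature is left out,
    which gives a baseline to compare against.
    """
    dim = len(next(iter(word_vectors.values())))
    unknown = [0.0] * dim  # fallback for words without a vector
    rows = []
    for word, dep in tokens:
        vec = list(word_vectors.get(word.lower(), unknown))
        if use_dependencies:
            dep = dep if dep in DEP_LABELS else "other"
            vec += one_hot(dep, DEP_LABELS)
        rows.append(vec)
    return rows

# Toy word vectors with made-up values, purely for illustration.
word_vectors = {"she": [0.2, -0.1], "signed": [0.5, 0.4], "bill": [-0.3, 0.1]}
tokens = [("She", "nsubj"), ("signed", "ROOT"), ("the", "det"), ("bill", "dobj")]

with_deps = build_inputs(tokens, word_vectors, use_dependencies=True)
without_deps = build_inputs(tokens, word_vectors, use_dependencies=False)
print(len(with_deps[0]), len(without_deps[0]))
```

Training the same network once on `with_deps`-style inputs and once on `without_deps`-style inputs, and then comparing performance on held-out data, is one simple way to check whether a feature that your theory considers relevant actually helps prediction.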

To put this another way: transformers offer precision at the cost of customizability. If your specific research question happens to run counter to assumptions that the creators of the model have made, you can’t use the transformer, because it violates your methodological or theoretical assumptions. Cases where this can happen range from the age of the text in your own data vis-à-vis the transformer’s pre-training data to violations of statistical independence. As I said, the specifics are part of another paper that I can hopefully share at the beginning of next year (view a poster presentation of it here).

An Aside: Grammar is Predictive of Gender Even in English

One interesting finding that didn’t make it into the paper, but which I find fascinating, is that grammar is predictive of gender in English. This comes as a surprise, since English doesn’t have grammatical gender the way French (la and le) or German (der and die) do. Yet, in the process of assigning the gender of subject and object, the LSTM did align the grammatical dependencies of the words in a peculiar manner.

Specifically, the subject classifier, i.e., the network that was tasked with assigning the gender to subjects, sorted the various dependencies of the pronouns “he” and “she” to opposing edges of the embedding space. The object classifier, on the other hand, likewise created two large groups of embeddings, although it did not sort the pronouns themselves.

The reason this didn’t make it into the paper is that we didn’t have the time to double-check and verify this finding. But I wanted to share this with you, because I find it highly interesting.
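For the curious, one simple way to start checking a pattern like this is to compare the centroids of the two groups of embeddings. The sketch below does this with cosine similarity. The vectors are made-up toy values, not embeddings from our networks, and the variable names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings with invented values, for illustration only.
he_dependencies = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
she_dependencies = [[-0.9, 0.1, 0.0], [-0.8, 0.3, -0.1]]

separation = cosine(centroid(he_dependencies), centroid(she_dependencies))
print(separation)
```

A cosine similarity close to -1 would mean that the two groups point in roughly opposite directions, while values around 0 or above would speak against a clean separation. A proper verification would of course need more than this, for example a comparison against randomly formed groups.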


I do hope that you find the paper and its implications as fascinating as we do, and I hope that these quick bits about it have shown you its value. If you have any additional questions regarding this paper, don’t hesitate to drop me a mail. I would love to hear from you!

Suggested Citation

Erz, Hendrik (2023). “New Open Access Paper on Text as Data: Variable Extraction via LSTM networks”., 11 Nov 2023,
