Today, I have a riddle for you. Imagine you had a text corpus in front of you with millions of documents – way too many for any single person to read. Furthermore, let’s assume there is some signal in there that you want to extract. For example, let’s assume you want to label each document in your corpus according to whether it’s talking about cats or dogs.
Since it’s impossible to label every document without exploiting too many poorly paid student assistants, you want help; ideally from a computer because (a) you don’t have to pay it and (b) it doesn’t suffer due to the boredom of the task. Therefore, you decide to utilize “Active Learning”.
Active Learning is easy to explain. First, you draw a small sample from your corpus, say, 1,000 documents, and hand-code them into either the category “cats” or “dogs”. Second, you train some language model – for example BERT or RoBERTa – on that sample, and have it subsequently label your entire corpus. The model will probably have no idea what it’s doing, so you extract another small subset of the corpus that has now been labelled, say, an additional 200 documents, and hand-code them again. This you repeat until your model is good enough to distinguish cat-people from dog-persons. Then, you let that final model run over your corpus once again, and you have finished the task without any exploitation. Hooray!
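The loop described above can be sketched in a few lines of Python. This is only a minimal outline under stated assumptions, not the implementation from any particular paper: `hand_code`, `train`, and `select` are hypothetical callables that you would supply yourself (the human annotator, the fine-tuning of a model such as BERT, and the resampling rule, respectively), and the default sizes simply mirror the numbers used in the text.

```python
import random
from typing import Callable, Dict, List, Sequence

Probs = List[float]             # one probability per category
Model = Callable[[str], Probs]  # maps a document to class probabilities


def active_learning(
    corpus: Sequence[str],
    hand_code: Callable[[str], int],                             # hypothetical: the human annotator
    train: Callable[[Dict[int, int], Sequence[str]], Model],     # hypothetical: fits a model on labelled docs
    select: Callable[[List[Probs], List[int], int], List[int]],  # hypothetical: picks docs to annotate next
    seed_size: int = 1000,
    batch_size: int = 200,
    rounds: int = 5,
    rng_seed: int = 0,
) -> List[int]:
    """Return one predicted category index per document in the corpus."""
    rng = random.Random(rng_seed)

    # Step 1: draw a small random sample and hand-code it.
    seed = rng.sample(range(len(corpus)), min(seed_size, len(corpus)))
    labelled = {i: hand_code(corpus[i]) for i in seed}

    for _ in range(rounds):
        # Step 2: train on everything labelled so far, then predict the whole corpus.
        model = train(labelled, corpus)
        probs = [model(doc) for doc in corpus]

        # Step 3: pick a new batch among the still-unlabelled documents and hand-code it.
        pool = [i for i in range(len(corpus)) if i not in labelled]
        for i in select(probs, pool, batch_size):
            labelled[i] = hand_code(corpus[i])

    # Final pass: retrain once more and label the entire corpus.
    model = train(labelled, corpus)
    predictions = []
    for doc in corpus:
        p = model(doc)
        predictions.append(max(range(len(p)), key=p.__getitem__))  # argmax
    return predictions
```

Everything interesting hides in `select`: the loop itself does not care how the next batch is chosen, which is exactly why that choice is easy to overlook.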
However, there is one problem: How do you choose which documents to draw for the resampling? Since you want to minimize the amount of hand-coding, you want to know which new documents to draw so that your model gets better at detecting cat and dog content quickly.
What Selection Metric Do You Need?
This riddle is not fictional. It is precisely what I needed to solve recently. A highly intriguing paper had used Active Learning to categorize speeches by candidates in U.S. presidential elections into “populist” and “not populist” (Bonikowski et al. 2022), and I used that as a kind of blueprint for my own implementation. The issue with that paper, however, is that it does not explain well how the authors actually performed their iterations of Active Learning. And it turns out that there are tons of small decisions to make along the way that can have a real impact on the results.
One of these decisions is how to sample new documents. My initial idea was to simply select those documents where the model was especially uncertain, because these documents indicate problematic cases for the model. My supervisor, however, disagreed and recommended I use those documents with the highest score for a given category, i.e., where the model was especially certain.
So I needed to find out which strategy to use. I performed an initial run of the model and had it predict my entire corpus. From these predictions, I then drew two samples: one contained those documents where the model was especially certain, indicated by high probabilities, whereas the other contained those where the model was especially uncertain, indicated by a low Kullback-Leibler-Divergence from a uniform distribution.[1] To me, the results looked pretty clear: The sample drawn by uncertainty had a lot more documents that were relevant to my category, whereas the sample with high probabilities contained a lot of garbage.
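To make the two scores concrete, here is a small sketch of how they could be computed from a model's output probabilities. The function names are my own for illustration, and the example vectors are the ones from the footnote, not actual model output.

```python
import math
from typing import Sequence


def kl_from_uniform(probs: Sequence[float]) -> float:
    """KL-Divergence (in nats) of `probs` from the uniform distribution.

    Low values mean the model is uncertain; high values mean it is certain.
    """
    k = len(probs)
    # Each term is p * log(p / (1/k)); terms with p == 0 contribute nothing.
    return sum(p * math.log(p * k) for p in probs if p > 0)


def max_probability(probs: Sequence[float]) -> float:
    """The probability of the most likely category."""
    return max(probs)


certain = [0.01, 0.93, 0.06]
uncertain = [0.30, 0.36, 0.34]

for name, p in [("certain", certain), ("uncertain", uncertain)]:
    print(f"{name}: max_prob={max_probability(p):.2f}, kl={kl_from_uniform(p):.3f}")
```

For the certain prediction, the divergence comes out at roughly 0.82 nats; for the uncertain one, it is close to zero (roughly 0.003).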
Case settled: I went with implementing the KL-Divergence metric to select new documents to sample. I ran the model again, looked at the new sample … and, well, let’s say it didn’t look good. The newly sampled documents were anything but relevant to the categories of interest to me. With my supervisor’s advice in mind, I quickly swapped the selection method for high-probability sampling and re-ran the model. And, lo and behold: while the newly sampled documents were still not spot-on, I could already see that they were a lot more relevant to what I was interested in than the previous ones.
Why the Selection Metric Matters
So, what happened? First, a quick reminder about how language models work: They must under all circumstances classify a document. There is no way for them to say “Uhm, this document is too spicy, I’d rather not classify it at all.” That leaves us two options to go with for a corpus.
The first situation is that you have a corpus in which you want every document labelled, and every label is important to you. Taking up our example from above, let’s assume you know your corpus only has documents talking about cats or dogs, so every document in your corpus can be labelled with either of the two. Then, everything is relevant.
The other situation is that you are only interested in a few documents within your corpus. Let’s assume you only care about cat-documents, but your corpus contains documents about all kinds of animals. Then you still need two categories, but only one of the two matters to you. The second category, the “trash bin”, so to speak, is not of interest to you. It should now become much clearer that it’s harder to tell the model what is not relevant (quite a lot) than what is actually relevant (only a few documents).
Depending on which of the two situations you face, you need to choose an appropriate metric to select documents to resample. If you are only interested in a subset of documents, then KL-Divergence will gain you nothing, especially if that subset is much smaller than the set of documents that are of no interest to you. In that case, you need to sample documents that have the highest scores in the category you are interested in, and ignore the trash bin. Anything that doesn’t look interesting will end up there anyway. If you are interested in more than one category, you can resample from all those categories equally. (For example, if you are interested not just in cats, but also dogs, you’d have three categories, and you can sample high-scoring documents from the cat and dog categories equally.)
But if you have a corpus that you want to divide up into categories that are all equally important to you, then KL-Divergence is the way to go. Then, documents with high uncertainty are what you want to go for, because you want to tell your model that everything is significant and that it should make a decision.
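The two selection rules could look roughly like this. Again, this is a hedged sketch rather than my production code: the function names, the `targets` parameter, and the convention that category indices 0 and 1 are “cats” and “dogs” while index 2 is the trash bin are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def kl_from_uniform(probs: Sequence[float]) -> float:
    """KL-Divergence (in nats) of `probs` from the uniform distribution."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0)


def select_high_scoring(
    probs: Sequence[Sequence[float]],
    pool: Sequence[int],
    k: int,
    targets: Sequence[int],
) -> List[int]:
    """For a few categories of interest: take the highest-scoring documents
    of each target category in equal shares, ignoring the trash bin."""
    per_target = k // len(targets)
    chosen: List[int] = []
    taken = set()
    for t in targets:
        candidates = [i for i in pool if i not in taken]
        ranked = sorted(candidates, key=lambda i: probs[i][t], reverse=True)
        picked = ranked[:per_target]
        chosen.extend(picked)
        taken.update(picked)
    return chosen


def select_most_uncertain(
    probs: Sequence[Sequence[float]],
    pool: Sequence[int],
    k: int,
) -> List[int]:
    """For equally important categories: take the documents whose predicted
    distribution is closest to uniform, i.e. has the lowest KL-Divergence."""
    return sorted(pool, key=lambda i: kl_from_uniform(probs[i]))[:k]
```

Both functions take the predicted probabilities, the indices of the still-unlabelled documents, and a batch size, so either one can be swapped in as the resampling step without touching the rest of the loop.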
In other words: If there are only a few documents of interest, you want to iteratively tell the model “This document that you have labelled as interesting is not actually interesting, don’t do that!” Likewise, if everything is of interest, you want to iteratively tell the model “This document you were unsure about belongs to that category.”
This goes to show you how essential appropriate metrics are. Depending on your use-case, you want metrics that allow you to select documents where the gain is going to be highest. If you need to extract only a specific signal, you need a metric that marks the relevant documents, and if you need to label everything, then you need a metric that tells you which documents were the hardest to sort.
So one crucial step during Active Learning is to decide which metric to use for resampling documents, and this decision is anything but trivial.
Lessons Learned / Concluding Remarks
One final question is still open, however: why did I get such good results with the wrong metric the first time? The reason, I figure, is simple: it was a bug. Not a bug that would throw an error, but rather a logical bug where something just didn’t compute the way I hoped it would.
What does this example now tell us? First, it shows how important it is for scientists to actually publish the analysis code they’ve used alongside their papers: so that other people can understand the seemingly minor decisions that they have made along the way. Because as programmers will frequently tell you: Just because it was obvious to you does not mean it is obvious to everyone.
Second, and maybe more importantly, trust your supervisor, but also don’t take everything they say at face value. In this case, it was a simple matter of framing/perspective, since I simply did not understand what my supervisor meant by their initial email, which read:
Using the maximal probability is a classic trick that allows you to go for the low-hanging fruits, and to produce a lot of annotations with limited efforts. By contrast with the entropy approach, which favors nuance, this one reinforces the classifier by emphasizing what it knows.
It wasn’t clear how divergence measures favor nuance (and, I would even argue that they don’t in this instance as per my explanation above).
It turned out that my supervisor was right in the end, but without a proper understanding of why that is the case, I couldn’t have progressed in my academic knowledge. As I showed, my initial results seemed to contradict my supervisor, and thus, before I took a deep dive into the results, both metrics looked equally good. But now I’m smarter – and I hope you are, too!
- Bonikowski, B., Luo, Y., & Stuhler, O. (2022). Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in U.S. Presidential Campaigns (1952–2020) with Neural Language Models. Sociological Methods & Research, 51(4), 1721–1787. https://doi.org/10.1177/00491241221122317
[1] To disentangle these techy-sounding terms: Kullback-Leibler-Divergence, or KL-Divergence, is a metric that measures how much two distributions differ. Take the classic example of the counterfeit die. In a perfect world, a die lands on each of its six sides with equal probability. How do you detect a counterfeit die? Its distribution will probably diverge from that ideal one, and KL-Divergence measures by how much. It ranges from zero, which indicates that the two distributions are identical, to some positive number whose maximum is determined by the distributions you have. A perfect die would therefore have a KL-Divergence of zero from the uniform distribution over its six sides, whereas a counterfeit die would have a skewed, i.e., non-uniform, distribution and thus a KL-Divergence greater than zero.
To see how KL-Divergence can help in machine learning, you need to know that language models that classify some input do not output just a single number, but a probability distribution over all possible categories. Normally, you just take the largest probability and fix that as the assigned category. If a model is very sure about the category, the probability distribution with three categories will look like $[0.01, 0.93, 0.06]$, but if the model is very uncertain, it will look more like $[0.30, 0.36, 0.34]$. The latter distribution is much closer to a uniform distribution, hence it has a much smaller KL-Divergence from the uniform distribution than the first one.
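For readers who prefer a formula, the divergence from the uniform distribution can be written out explicitly. The symbols are notation introduced here for illustration: $p = (p_1, \dots, p_K)$ is the model's predicted distribution over $K$ categories, $u$ is the uniform distribution, and $H(p)$ is the Shannon entropy of $p$.

$$
D_{\mathrm{KL}}(p \,\|\, u) = \sum_{i=1}^{K} p_i \log \frac{p_i}{1/K} = \log K + \sum_{i=1}^{K} p_i \log p_i = \log K - H(p)
$$

Since $\log K$ is a constant, selecting the documents with the lowest KL-Divergence from the uniform distribution is the same as selecting those with the highest entropy, so the “entropy approach” from the quoted email and the KL-based selection rank documents identically. The formula also shows that, for this particular comparison, the maximum value is $\log K$, reached when the model puts all probability on a single category.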