The debates about where the limits of computers lie are as old as the fields of computer science and artificial intelligence themselves. As in many other disputes, partisans on each side accused the others of either under- or overestimating the abilities of computers. In 1948, the mathematician Claude Shannon published one of the most influential treatises of information science, “A Mathematical Theory of Communication” (Shannon 1948). In it, Shannon developed a mathematical model of information and thereby pioneered ways to measure the amount of information contained in a message. These ideas can be used, for example, to quantify the amount of information in text, speech, or images. They also form part of the basis for, e.g., compression algorithms.¹
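To make the idea of "measuring information" a little more tangible, here is a minimal sketch of Shannon's entropy applied to the characters of a message. The function name `shannon_entropy` and the toy messages are my own choices for illustration; treating every character as an independent symbol is a strong simplifying assumption, not how Shannon models language in the paper.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Average information per character in bits, assuming each character
    is drawn independently (a simplification made for this sketch)."""
    if not message:
        return 0.0
    total = len(message)
    counts = Counter(message)
    # H = sum over symbols of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Toy messages, invented for illustration.
for text in ["aaaaaaaa", "abababab", "abcdabcd"]:
    print(text, shannon_entropy(text))
```

The more predictable a message is, the lower the value: a message that repeats a single character carries no information per character under this measure, while a message that uses four characters equally often needs two bits per character.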
Shannon's ideas prompted the linguist John Firth to declare: “You shall know a word by the company it keeps” (Firth 1957, cited in Eisenstein 2018, p. 326). What he meant is that, solely by looking at the context in which a word occurs, you can extract a lot of information about what that word means. Take the following three sentences that my computational linguistics teacher showed us earlier this year:
Garrotxa is made from milk. Garrotxa pairs well with crusty country bread. Garrotxa is aged in caves to enhance mold development.
What is Garrotxa? You have probably never encountered the word (unless you are Spanish), but you will have no difficulty guessing that it is probably a cheese. If you generalize from this example, you will see that the same principle enables us to create a statistical model of language that is indeed able to correctly predict words in sentences – using something that is known today as word embeddings.
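The Garrotxa example can be sketched in a few lines of code. The following is only a toy illustration of the distributional idea, not a real word-embedding model: it counts which words share a sentence with a target word and compares those counts with cosine similarity. The sentences about "cheddar" and "granite" are invented for this sketch, as are the helper names `context_vector` and `cosine`.

```python
import math
from collections import Counter

# The first three sentences come from the example above; the remaining
# four are hypothetical and were made up for this illustration.
SENTENCES = [
    "garrotxa is made from milk",
    "garrotxa pairs well with crusty country bread",
    "garrotxa is aged in caves to enhance mold development",
    "cheddar is made from milk",
    "cheddar pairs well with apples and bread",
    "granite is quarried from mountains",
    "granite forms deep underground",
]


def context_vector(target: str, sentences: list[str]) -> Counter:
    """Count every word that shares a sentence with the target word."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


garrotxa = context_vector("garrotxa", SENTENCES)
cheddar = context_vector("cheddar", SENTENCES)
granite = context_vector("granite", SENTENCES)

print("garrotxa vs. cheddar:", cosine(garrotxa, cheddar))
print("garrotxa vs. granite:", cosine(garrotxa, granite))
```

On this tiny corpus, "garrotxa" ends up closer to "cheddar" than to "granite", purely because of the company the words keep. Actual word embeddings are trained on far larger corpora and compress such co-occurrence information into dense vectors, but the underlying intuition is the same.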
However, in the very same year that Firth uttered his famous sentence, another linguist cautioned against the rising popularity of statistical models of language. The linguist in question was Noam Chomsky (yes, he’s still alive) and he argued that a statistical model would be unable to distinguish between the grammatical sentence “Colorless green ideas sleep furiously” and an ungrammatical permutation of the words, such as “furiously sleep ideas green colorless” (Eisenstein 2018, p. 127, footnote 1).
Chomsky implied that statistical language models would have no understanding of language and as such were inferior to the “manual” work of human beings. But are they really doomed never to really understand language? Do colorless green ideas really sleep furiously? Reddit disagrees, and so does Peter Norvig, another luminary of computational linguistics.
Nowadays, nobody really believes that language models are useless, and the quote by Noam Chomsky is rarely encountered in debates around linguistics. In fact, Chomsky himself was arguably never really opposed to applying some statistical measures to text. He merely insisted that language is not solely self-contained. And that is something that other linguists and philosophers have also emphasized – just take the Saussurean pairing of signifier and signified. On the other hand, there are people like Ludwig Wittgenstein. In his Tractatus Logico-Philosophicus, he argued that all of our perception is confined by language, and he ended his work by saying “Whereof one cannot speak, thereof one must be silent.”
So the debate around the abilities of language models is less one about the capabilities of computers, and more one about whether language is completely self-contained or always relies on context. That is why I propose we ask the question again: Do colorless green ideas really sleep furiously?
The debate is certainly not over. On the contrary, just a few months ago, a very influential paper was published which talks about “the Dangers of Stochastic Parrots” (Bender et al. 2021). The paper was so disruptive that it cost a few Google employees their jobs and sparked outrage across the deep learning community. The computer scientist Yoav Goldberg published a lengthy critique of the paper, which sparked quite a discussion in the comments. The central argument of Bender et al. is that in the past years, language models have just grown in size, not in understanding of language. Current flagship models such as GPT-3 are dauntingly good at imitating human speech – but they achieve this by sheer size, not because they understand human language. Broken down, the argument says that modern language models have only managed to increase the resolution of text generation, not the finesse that humans are able to exercise when speaking.
This insight again points to the camp of Chomsky and Saussure: that language always requires something that is external to it. Was Firth wrong? Are we in fact unable to extract the meaning of a word just by looking at the words that occur in its proximity?
This whole debate is circular: First, there is the empirical observation that statistical language models are able to model human language. Then, there is the theoretical insight that language is not self-contained. And then there is more empirical evidence that statistical language models can produce comprehensible text. And then there is, again, a theoretical rebuttal of that empirical evidence. No matter where you look, you can find equal amounts of theoretical arguments against, and empirical evidence for, statistical language models. In fact, just by looking at the field of sociology, you can find many papers that seemingly provide evidence that statistical language models can accurately model language (Garg et al. 2018; Rudolph and Blei 2018; Kozlowski, Taddy, and Evans 2019; Stoltz and Taylor 2021; Bodell, Arvidsson, and Magnusson 2019; Nelson 2021).
But do they, though? All of these papers used word embeddings, that is, they trained a language model on huge corpora of text, and then analyzed their results. And that is the final clue: They analyzed them. How did they analyze them? By interpreting them. So in fact these studies show that both camps are right: Language models do not have an understanding of language, but they still work. They help us humans in sifting through excessive amounts of data, but the final interpretation of what that data means rests with us.
In fact, once we leave the debate and focus on the nuances of what the abilities of language models are exactly, we find another debate. There is an interesting essay by Magnus Sahlgren titled “The Distributional Hypothesis” (Sahlgren 2006). What Sahlgren outlines is that textual data contains enough information for language models to work with:
By grounding the representations in actual usage data, distributional approaches only represent what is really there in the current universe of discourse. When the data changes, the distributional model changes accordingly; if we use an entirely different set of data, we will end up with an entirely different distributional model. Distributional approaches acquire meanings by virtue of being based entirely on noisy, vague, ambiguous and possibly incomplete language data. (Sahlgren 2006, p. 15)
So, where does this leave us? First, it shows that language itself forms a nexus between the humanities and the sciences. Because language is interesting as a human product, it is being studied by linguists. But ever since text became so ubiquitous that humans were no longer able to comprehend every bit of it, computer scientists have also taken an interest in the subject. This has led to the foundation of the field of computational linguistics and sparked new research into the question of what parts of textual data computers can handle, and where we as humans must take over.
Because language models are so ubiquitous nowadays, it may seem as if there is nothing new to learn. But we are only at the beginning of discovering the fusion of human language with machines. There is much more to see, and in the coming years, we will see more and more results.
The important part is that engineers stop merely applying language models without caring about the how, and that social scientists begin applying language models more often, precisely by excavating that how.
- Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. Virtual Event Canada: ACM. https://doi.org/10.1145/3442188.3445922.
- Bodell, Miriam Hurtado, Martin Arvidsson, and Måns Magnusson. 2019. ‘Interpretable Word Embeddings via Informative Priors’. ArXiv:1909.01459 [Cs, Stat], September. http://arxiv.org/abs/1909.01459.
- Eisenstein, Jacob. 2018. Natural Language Processing.
- Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. ‘Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes’. Proceedings of the National Academy of Sciences 115 (16): E3635–44. https://doi.org/10.1073/pnas.1720347115.
- Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. ‘The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings’. American Sociological Review 84 (5): 905–49. https://doi.org/10.1177/0003122419877135.
- Nelson, Laura K. 2021. ‘Leveraging the Alignment between Machine Learning and Intersectionality: Using Word Embeddings to Measure Intersectional Experiences of the Nineteenth Century U.S. South’. Poetics, March, 101539. https://doi.org/10.1016/j.poetic.2021.101539.
- Rudolph, Maja, and David Blei. 2018. ‘Dynamic Embeddings for Language Evolution’. In Proceedings of the 2018 World Wide Web Conference on World Wide Web - WWW ’18, 1003–11. Lyon, France: ACM Press. https://doi.org/10.1145/3178876.3185999.
- Sahlgren, Magnus. 2006. ‘The Distributional Hypothesis’. Stockholm.
- Shannon, Claude E. 1948. ‘A Mathematical Theory of Communication’. The Bell System Technical Journal 27 (October): 379–423, 623–56.
- Stoltz, Dustin S., and Marshall A. Taylor. 2021. ‘Cultural Cartography with Word Embeddings’. Poetics, May, 101567. https://doi.org/10.1016/j.poetic.2021.101567.
This is just a footnote since it does not push forward my argument, but I still find this insight fascinating: If you create a ZIP archive, for example, what the computer will do is try to replace all duplicate occurrences of words with smaller entities, such as numbers. For example, the word "information", when replaced with, e.g., the number 14 everywhere, will use up much less space. There is obviously more to it, but the central idea is that you can reduce the size of a text by removing as much redundancy as possible; the entropy of the text marks the limit below which lossless compression cannot go. ↩