The Transformer Architecture: A Visual Guide | Hendrik Erz

Abstract: A few days ago, Grant Sanderson a.k.a. 3Blue1Brown started uploading videos offering a visual introduction to the transformer architecture. I think they are great and you should watch them. They also reminded me that two years ago I produced something very similar, but as a PDF, because I like large images that show both the entire process in all its gory glory and the detail at the same time.

Five days ago, Grant Sanderson a.k.a. 3Blue1Brown started uploading a series of videos on how today’s chatbots, such as ChatGPT or Llama, work. First, he uploaded an explainer of what GPT is and how the broad architecture of the model is laid out, before today releasing a video that focuses squarely on the attention mechanism.

This is a perfect opportunity to do something I have wanted to do for a while but apparently kept forgetting. Two years ago I created a large poster, which now hangs next to my desk, in which I tried to explain to myself how a transformer works. Many people have attempted to visualize the transformer architecture, but they often left out a lot of detail that I personally would’ve loved to see included. Having watched 3Blue1Brown’s videos, I am delighted that Sanderson has chosen many of the same visual metaphors for his explainer, so if you’ve already seen his videos, the poster should look very familiar.

There is one large difference, however: In his most recent video, Sanderson mentions that he will focus only on generative transformers, not the “original” translation transformer from the 2017 paper “Attention is all you need”. In other words, 3Blue1Brown covers only the decoder-only architecture of GPT, not the encoder/decoder architecture used for translation.1 The latter is precisely what my poster focuses on. I meant to upload it ages ago, so it’s a nice coincidence that his videos reminded me of it.

So, without much further ado (I did, after all, produce a very long text just two days ago), here’s the poster for you!

Direct Download (PDF; ~500kb)

1 These are terms he didn’t (yet) introduce, but they are relatively straightforward. An encoder takes some text and encodes it into a matrix that is supposed to contain the information from that text. A decoder then takes such a matrix and generates a next-word prediction. Translation models have two encoder stages and one decoder stage: the encoders “encode” the input text and the already-translated text, and the decoder “decodes” the resulting matrices to predict the next word. Chatbots such as GPT or Llama, on the other hand, are decoder-only, that is, they focus only on text generation. Strictly speaking, they also need an encoder, but it is not the star of the show. Likewise, a text classification model such as BERT is commonly called an encoder-only model, since it mostly focuses on encoding some text into that value matrix. However, similar to generative models, BERT, too, has a decoder stage, even though that is often just a simple feed-forward network that produces the classifications.
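To make the encoder/decoder split in the footnote a bit more concrete, here is a deliberately toy sketch in plain Python. Everything in it is a hypothetical simplification: the “encoder” just maps tokens to fixed vectors instead of running self-attention layers, and the “decoder” pools those matrices and scores a tiny vocabulary instead of using cross-attention. The names (`encoder`, `decoder`, `VOCAB`, `D`) are mine, not from any real library; the only point is the data flow described above, where two encoders feed one decoder that emits a next-word distribution.

```python
import math
import random

D = 4                                      # hypothetical embedding size
VOCAB = ["<eos>", "hello", "world", "hallo", "welt"]

def embed(tokens):
    """Stand-in for learned embeddings: each token gets a fixed pseudo-random vector."""
    rng = random.Random(" ".join(tokens))  # deterministic per token sequence
    return [[rng.random() for _ in range(D)] for _ in tokens]

def encoder(tokens):
    """Encoder: turn text into a matrix meant to carry that text's information.
    A real transformer encoder would apply self-attention layers here."""
    return embed(tokens)

def decoder(matrices):
    """Decoder: consume the encoders' matrices and emit a next-word distribution.
    A real decoder would use (cross-)attention; here we just pool and score."""
    pooled = [sum(row[i] for m in matrices for row in m) for i in range(D)]
    scores = [sum(p * e for p, e in zip(pooled, embed([w])[0])) for w in VOCAB]
    # Softmax (shifted by the max score for numerical stability)
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Translation setup: two encoder stages (source text + text translated so far),
# one decoder stage predicting the next word.
src = encoder(["hello", "world"])
tgt_so_far = encoder(["hallo"])
probs = decoder([src, tgt_so_far])
prediction = VOCAB[max(range(len(VOCAB)), key=probs.__getitem__)]
```

A decoder-only model, in this picture, would simply drop the separate source encoder and feed the decoder nothing but the text generated so far.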

Suggested Citation

Erz, Hendrik (2024). “The Transformer Architecture: A Visual Guide”, 7 Apr 2024.

