Understanding Backpropagation A Bit Better

Hendrik Erz

Abstract: I work with neural networks on a daily basis. But one thing has always escaped my understanding: how backpropagation actually works. By coding an SGNS network from scratch, I forced myself to implement a backpropagation mechanism all by myself. This helped me understand this crucial part of neural networks. In this article, I show you what I learned, and hopefully you, too, will understand the backpropagation algorithm better.

Published on Saturday, September 28th, 2024 by Hendrik | 19 min reading time

Table of Contents

Over the past four years, I have continuously sought to achieve two goals. First, of course, to write a PhD dissertation on economic policymaking in U.S. Congress. But, secondly, I wanted to understand the methods that I use properly. Because one of the issues with AI methods – regardless of whether they are simple word2vec models or immensely large language models (LLM) – is that there is a lot of linguistic fluff around them; a.k.a.: marketing. And this hampers a public understanding of these neural network architectures, even though it is essentially just a big amount of math thrown onto software engineering.

The engineering of such neural networks is so simple, in fact, that most people could do it with a bit of training. A fortnight ago I saw a post by someone claiming that anybody could code a neural network from scratch, including their tech-illiterate mother. The big issue rather circles around who controls the data and trained models. I totally agree with the general premise, although I believe that my mother would not be able to code a neural network from scratch.

In today’s article, I want to share a relatively recent attempt by me to understand how neural network architectures work even better. Specifically, what I was sorely lacking until now was a deeper understanding of the central learning mechanism of neural networks: backpropagation using gradient descent. I do have a fairly good grasp on neural networks, but there are still a few things that I am missing. I have never coded an entire neural network completely from scratch, which would’ve forced me to do everything that’s required for a fully functional neural network, without the help of libraries — including backpropagation.

Accidentally proving the abovementioned post right, I managed to code a functioning word2vec model over the weekend within a few hours. I want to spend the following paragraphs to highlight a few things that I found interesting, being used to working with PyTorch a lot, and amend the explanations for the backpropagation mechanism commonly found online. Because I believe, these explanations are lacking a crucial (simple) detail.

Note: This article does not explain much of the underlying intuition of how to run a word2vec model. So to understand this in full, it helps knowing how word2vec works. For this, there are a ton of amazing resources out there.

An SGNS Model From Scratch

First, here’s the link to the GitHub repository with all the code: https://github.com/nathanlesage/node-sgns. I used JavaScript mainly because I didn’t want to have to think about data structures or syntax, and instead focus on understanding the mechanisms inside the network. The implementation itself is still somewhat buggy, and it is mostly a proof of concept, but it does work. And it works relatively well.

A skip-gram negative sampling (SGNS) model is a word2vec implementation that aims to produce a set of word embeddings, that is, numerical vectors that encode proximity and distance relationships of words. It does so by using a shallow neural network without any hidden layers (it is literally just a big fat matrix multiplication) and using as the prediction target (the loss function) a simple binary cross entropy loss. Therefore, the prediction target of the SGNS model is simply: “Do the two words that I provided you with occur close to each other in the corpus?” A simple, binary target, where the loss can be calculated straightforward (the difference between the true label, i.e., 0 or 1, and the model prediction). The “negative sampling” part of the name simply means that, for every positive sample – that is, two words that actually do co-occur – you add a number of negative samples – that is, randomly chosen words that do not co-occur.

At this point, I want to acknowledge the amazing work of Prof. Marco Kuhlmann from the IDA department at Linköping University. He has provided the Python implementation for the SGNS model that I followed, and who has provided the general scaffolding for the various necessary functions. I merely translated this from Python into JavaScript.

The Benefits of a Library

This already brings me to the first point that I found very interesting while I coded the network: The quality of life improvements that libraries – in this case, PyTorch and numpy – can bring. The first thing I was missing were some utility functions for working with vectors and matrix operations that exist in Python. In vanilla JavaScript, there is no such thing.

Vector Instantiation

When working with neural networks, you frequently have to instantiate vectors, e.g., with zeros or ones. In Python, this is simple: np.zeros((dim1, dim2)). In JavaScript, I had to provide a function implementation that … well, to put it this way: doesn’t look neat. It works, however, so this was simply an exercise in code golf:

function zeros (shape) {
  if (Array.isArray(shape)) {
    const thisDim = shape.shift()
    const emptyArr = Array(thisDim).fill(0.0)
    if (shape.length > 0) {
      return emptyArr.map(x => zeros([...shape]))
    } else {
      return emptyArr
    }
  } else {
    return Array(shape).fill(0.0)
  }
}

Broadcasting

The biggest gripe I had with vanilla JavaScript, however, was the lack of broadcasting. Broadcasting is something that numpy implements like so:

Imagine you have a long vector of numbers, and you want to add a simple scalar value to all of them; mathematically $\vec{x} + \alpha$. In Python, you literally write vec + my_num, and the library will ensure to run some efficient for-loop for you that does what you want. In plain JavaScript … well, you have to write the loops yourself. And, oh boy, I haven’t seen such a prevalence of nested for-loops three layers deep in a long time. But that’s what you have to do. Sure, there is likely room for optimization, but for a working sample, this is how you do it.

In short, without broadcasting, the code looks way messier than it has to, and I really recognized how nice this concept in numpy is.

Utility Functions

A more common benefit of a library, however, is the abundance of utility functions. Need to calculate a dot-product? Sure, here’s a function for it. Cosine similarity? It’s literally the next function. Or the simple availability of probability distributions. It turns out that writing algorithms to draw random numbers from distributions in JavaScript is hard. I now understand that you need to know the distribution’s probability density function (PDF) and somehow turn that into an algorithm, but I decided that this was not the focus of my exercise and used a “library” for that.

The Training Loop & Backpropagation

The core piece of what I wanted to understand better, however, are the training loop and backpropagation. Training a neural network always follows the same recipe. There is so little difference between training some word2vec model and a GPT-style generative LLM, in fact, that the huggingface library even provides helper functions for that. You simply define two functions, set some hyperparameters, and off you go. It’s ridiculous how straightforward this is.

The same goes for backpropagation. The actual training of a neural network consist of a forward run where you let the network predict some values, and a backward run, in which you adapt all the model’s weights based on how far off the predictions were. Again, PyTorch has got us covered here: You define an optimizer, such as an Adam, give it a learning rate, then you define a loss function, and then you follow three simple steps: (1) Predict with the model; (2) Calculate the loss; (3) Use this loss to update the model weights. Rinse and repeat. In code, it looks extremely straightforward:

predictions = model.forward(X) # Let the model predict y-hat
loss = loss_function.calculate(predictions, Y) # Compare y-hat with the true Y
loss.backward() # Calculate by how much each weight needs to be updated
optimizer.step() # Execute the update function

But what happens if you don’t have any optimizer or loss function? Well, you have to define it yourself.

The good thing about the SGNS model is that this is as simple as neural networks can get. It consists of one big matrix multiplication, has an insanely simple loss function and is general very straightforward. (Well, straightforward if you already know a lot about math and neural networks.)

Understanding SGNS Back-to-Back

So let’s understand what the SGNS does back to back, including the training loop and, most importantly, the backpropagation.

Let us first start with why SGNS looks like it does. It all starts with an almost 70-year-old insight: “You shall know a word by the company it keeps” (Firth 1957, cited in Eisenstein 2018, p. 326). The insight that John Rupert Firth uttered in this statement is that the meaning of words is defined in relation of words in whose context they usually occur. In simple terms: Two words that often co-occur in everyday language have more similar meaning than words that rarely co-occur.

What Mikolov et al. (2013) did in their paper that introduced word2vec was simply put this insight into a mathematical model. So to understand SGNS, we first have to understand what it’s aim is: Its aim is to assign a vector to every word in such a way that similar words have also similar vectors. The next issue is: How do you find these vectors? For this, Mikolov et al. first had to convert this insight into math. Mathematically, two vectors are close if their dot product is close to 1, and they are distant if their dot product is close to 0. So what you need is to assign, for example, two vectors to the words “cat” and “kitten” whose dot product is similar to 1, mathematically: $\vec{\text{cat}} \odot \vec{\text{kitten}} \approx 1$. Similarly, the vectors of “cat” and “dog” should be close to zero, mathematically: $\vec{\text{cat}} \odot \vec{\text{dog}} \approx 0$.

Next, when you create a neural network, you need to start off with random vectors that don’t encode anything. Then, you need to gradually adapt the numbers in these vectors until they fulfill these criteria. Creating such vectors is straight forward, even in JavaScript (note that there are better random distributions that one could use):

export function makeRandomEmbedding (dim) {
  return zeros(dim).map(x => Math.random() - 0.5)
}

Then, the first thing you do is let this neural network predict something based on these random vectors. This is then the so-called “forward” pass through the network. Again, this is merely a big matrix multiplication (called “bmm” for batch matrix multiplication):

forward (w, c) {
  const target_embeddings = w.map(idx => this.w[idx])
  const context_embeddings = c.map(context_row => {
    return context_row.map(idx => this.c[idx])
  })

  const D = bmm(context_embeddings, target_embeddings)

  for (let i = 0; i < D.length; i++) {
    D[i] = softmax(D[i])
  }

  return [D, target_embeddings, context_embeddings]
}

What this function does is, first, in a very awkward way, retrieve the (random) vectors for the words passed in vector $\vec{w}$, and the accompanying context words in vector $\vec{c}$. The aim for the network will be to calculate a dot-product of one for each target-context-word pair that actually co-occur, and zero for every pair that doesn’t actually co-occur in the data. (This is the “negative sampling”: To make the word embeddings more precise, you want to also train the network to correctly detect if two words never co-occur in your data. So you add a few randomly selected word-pairs to the training data.)

Then, it calculates the bmm of the various target-context word pairs and stores the result in matrix $D$. It is a matrix because in every row we ask it to calculate multiple dot-products for multiple context-words and a single target word: one actually co-occurring context word, and (usually) five of such negative samples.

Finally, we turn each of these “prediction rows” into a probability distribution by calculating a softmax on top of them. Then, we return the matrix as well as the involved target and context embeddings. (The latter is only necessary in my clunky implementation.)

At this point, we have a set of dot-products-turned-probabilities for whether two words co-occur in the training data. Now we have to check if that is actually true. This is where the loss-function comes into play. The loss-function is basically just a way to mathematically define what we think is a correct prediction, and what we think is an incorrect prediction. For SGNS, this is always the binary cross entropy loss. The function looks like so:

export function binaryCrossEntropy (predicted, trueLabels) {
  const loss = zeros(predicted.length)
  const batchSize = predicted.length
  for (let i = 0; i < batchSize; i++) {
    const p = predicted[i]
    const q = trueLabels[i]

    let tally = 0
    for (let j = 0; j < p.length; j++) {
      tally += q[j] * Math.log(p[j] + 1e-6) + (1 - q[j]) * Math.log(1 - p[j] + 1e-6)
    }
    loss[i] = -tally / batchSize
  }

  return loss
}

What this does is calculate the divergence of the predicted probability distribution vis-à-vis the correct one. In the standard SGNS-implementation, the correct probability distribution is always ${1, 0, 0, 0, 0, 0}$, and if the predicted distribution diverges from this one, the binary cross entropy loss quantifies the amount by which the network was wrong. So we calculate the loss on the predicted probabilities. In total, a forward-pass during training looks like this:

// Step 1: Forward pass through the model, retrieve D
const [D, t, c] = model.forward(targetVec, contextMat)

// Step 2: Prepare the true probability distribution: [[1, 0, 0, 0, 0, 0], ...]
const labels = zeros([D.length, 1 + numNS])
for (let i = 0; i < D.length; i++) {
  labels[i][0] = 1
}

// Step 3: Calculate the binary cross entropy loss -> divergence of "D" to "labels"
const loss = binaryCrossEntropy(D, labels)

So, you calculate the loss and then … well, yes: What do you do with it then?

This was the first realization that got me while understanding backpropagation: To train a neural network, you don’t have to actually calculate the loss. It is just a convention (and a good one at that) to calculate and visualize the loss of a neural network. When you manually implement backpropagation, you could completely skip the loss calculation.

It took me some while to understand it, but it makes sense. And this brings us to the missing piece in my understanding of the backpropagation mechanism.

Here’s how backpropagation is usually explained in online-tutorials: You have some neural network and, provided some input, you want it to correctly predict the true labels of it. Because you instantiate all the model weights randomly (because you don’t know what they need to be), you want to adapt them over and over until the network is good at predicting. You do so by calculating the gradient descent. You need to imagine that there is an optimal magnitude for all the model weights at which the network works flawlessly. To find it, you adapt the weights gradually. This is how far most explainers go.

The intermediate explainers explain this further: You have to imagine the way from the model’s input to its predictions as a function. By calculating that function’s derivative, you can calculate partial updates for each of the layers of the neural network that get applied one after another until you are back at the start of the network. So you can imagine the gradient descent as a trip down to the valley of a mountain. You start with a full set of something – let’s say flowers –, and in regular intervals on your way down you place one on the wayside. The bouquet of flowers is the entire difference between the predictions and the true labels (the entire “entropy” you can distribute in your network), and you divide this amount of “difference” up between the layers of the network. So let’s say your loss is 3, and you have 3 layers; then all the weights in the network are updated with 1. In other words, you distribute the “magnitude” of loss you have with these partial derivations.

But how does it actually work in a bare metal application? This took me quite a while, but I finally got it. In the case of the SGNS it works like so:

You define that the network becomes better if the binary cross entropy loss becomes smaller. This is your definition, this is why you have to always choose the loss function.
Calculating the loss then gives you an idea of how good or bad the network actually is, but you don’t need that calculation for the backpropagation. Instead, you need to know the formula for calculating that loss. And from that function you then need to know the partial derivative. Someone smarter than me has done this (I have forgotten how derivatives work) and has come up with this: $(y - \hat{y}) X$. That’s it: the partial derivative of the binary cross entropy loss is the difference between true and predicted label times the input matrix.
This is why you need the loss function: It determines what the gradient for the backpropagation looks like. This is what happens when you type loss.backward() in PyTorch.

Now armed with this understanding, how does calculating the gradient actually work in the case of this SGNS? As a reminder, it’s as basic as it gets and is actually just a big matrix multiplication. To calculate the values by which to adapt the model weights, you need the difference in predicted and true labels ($y - \hat{y}$), which is easy to calculate:

const diff = zeros([D.length, D[0].length])
for (let i = 0; i < D.length; i++) {
    for (let j = 0; j < D[0].length; j++) {
        diff[i][j] = D[i][j] - labels[i][j]
    }
}

Second, you need to know what $X$ is. This is the next tricky part: There are actually two $X$, and to find this, you need to know that it is convention to denote the input to a model with $X$, and the model’s weights by $W$. Further, you need to know that the basic operation of a neural network is to multiply the input with the weights. In the case of SGNS, you have two matrices: A set of word embeddings (usually $w$), and a set of context word embeddings (usually $c$). Which of those is the input? Well, that’s the trick: both!

So you have to calculate two gradients: First one to update the context embeddings, treating the actual word embeddings as the input, and then one to update the word embeddings, treating the context embeddings as input. So you take the embeddings that you have actually used in the given forward pass, multiply them with a matrix of all the differences between true and predicted labels, then scale them using the learning rate (because we want to do mini-steps instead of one big step) and subtract them from the other set of embeddings:

// a1 == the update matrix for the word embeddings, using the context embeddings as inputs.
const a1 = zeros([D.length, embeddingDim])
for (let i = 0; i < D.length; i++) {          // For each batch
  for (let j = 0; j < diff[i].length; j++) {  // For each num_ns/loss diff
    for (let k = 0; k < embeddingDim; k++) {  // For each embedding scalar
      a1[i][k] = c[i][j][k] * diff[i][j] * lr // Embedding scalar times num_ns diff times lr
    }
  }
}

// a0 == the update matrix for the context embeddings, using the word embeddings as inputs.
const a0 = zeros([D.length, numNS + 1, embeddingDim])
for (let i = 0; i < D.length; i++) {          // For each batch
  for (let j = 0; j < diff[i].length; j++) {  // For each num_ns/loss diff
    for (let k = 0; k < embeddingDim; k++) {  // For each embedding scalar
      a0[i][j][k] = t[i][k] * diff[i][j] * lr // Embedding scalar times num_ns diff times lr
    }
  }
}

// Adapt the context embeddings using a0
for (let i = 0; i < D.length; i++) {
  for (let j = 0; j < diff[i].length; j++) {
    for (let k = 0; k < embeddingDim; k++) {
      c[i][j][k] = c[i][j][k] - a0[i][j][k]
    }
  }
}

// Adapt the word embeddings using a1
for (let i = 0; i < D.length; i++) {
  for (let k = 0; k < embeddingDim; k++) {
    t[i][k] = t[i][k] - a1[i][k]
  }
}

And then you simply repeat this procedure for all the batches and all the epochs you want to run the network for, and then you end up with properly trained word embeddings. A final word to the concept of a learning rate: You usually scale down these “update matrices” that define by how much the embeddings need to be adapted to avoid the gradient descent from accidentally “jumping over” some maximum.

This is it: The next time you run the forward-pass with the updated word embeddings, they will be a bit more aligned. Repeat this a few million times and your word embeddings will encode the accurate proximity-distance-relationships of the words in your training data.

Final Thoughts

Implementing a neural network is both conceptually easy and engineering-wise hard to do. The concept of a neural network is indeed easy, but if you also want to implement them, you need to know how training such things work. And this is where you need much more math than with simply using it.

Implementing the SGNS model in vanilla JavaScript was definitely a fun and insightful exercise. I was a bit surprised at how well it worked and how fast it was. Sure, I know a lot about neural networks already, but still. I had to implement everything from scratch, and the result is a fully functioning neural network that runs on the CPU, is still pretty fast, and comes with a benefit almost no JavaScript code nowadays has: It has zero dependencies. And the one it has (the Node.JS runtime) could arguably be removed, too, so that it can run in the browser.

The next step is to improve upon this implementation, fiddle around a bit to understand the mechanisms still a little better, and then maybe I will try to understand how backpropagation works with multiple layers. Because that’s the next fun thing: How do I calculate the partial derivatives of each layer, given that their linearity is broken with the activation functions in between? And then, maybe, at some point in the future, I can write an LSTM from scratch. But that’s a story for another day.