Autobots, roll out! - Building a Transformer
This post is a detailed explanation of the Transformer model, as described in the paper Attention is All You Need by Vaswani et al. (2017). I have tried to be concise with my mathematical notation, but if you find any errors, please let me know. This post is meant to be a reference for myself, and I hope it helps you too.
Preprocessing
Tokenization
Define the units of your model - a unit can be a character, a subword, or a word, depending on your use case.
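For example, a minimal character-level tokenizer might look like the sketch below (the corpus string and the `stoi`/`itos`, `encode`/`decode` names are all my own assumptions, not part of the paper):

```python
# Minimal character-level tokenizer (illustrative sketch).
text = "hello transformer"                     # stand-in for the full training corpus
chars = sorted(set(text))                      # unique units -> the vocabulary
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}   # unit -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> unit

encode = lambda s: [stoi[c] for c in s]        # string -> list of ids
decode = lambda ids: "".join(itos[i] for i in ids)

print(encode("hello"))           # list of integer ids (values depend on the corpus)
print(decode(encode("hello")))   # "hello"
```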
Further, we perform a few housekeeping steps to standardize the inputs to the model, including (a minimal sketch follows this list):
- Train-test-split
- Creating batches of size $B$, each containing $T$ units (the block size)
- Inputs $x$ and targets $y$ created through next-unit offsets, where each target is the unit that follows its input (as in a bigram model)
- Token embedding layer created: a $V \times d_{\text{model}}$ token embedding matrix that takes in the vocabulary of $V$ unique units and learns a $d_{\text{model}}$-dimensional vector for each
- Positional embedding layer created: a $T \times d_{\text{model}}$ positional embedding matrix that takes in the $T$ possible positions and learns a $d_{\text{model}}$-dimensional vector for each. This is necessary because the self-attention mechanism in transformers doesn't have any inherent sense of position or order of the tokens.
- Final embedding layer: the sum of the two, $X = E_{\text{token}} + E_{\text{position}}$
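A minimal sketch of these steps in PyTorch, using the notation above ($B$ = batch size, $T$ = block size, $d_{\text{model}}$ = embedding dimensionality); the `data` tensor stands in for the encoded corpus produced by the tokenizer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.randint(0, 65, (10_000,))   # stand-in for the encoded corpus (vocabulary of 65 units)

B, T, d_model, vocab_size = 4, 8, 32, 65

def get_batch(data):
    # Sample B random starting points; y is x shifted one unit to the right.
    ix = torch.randint(len(data) - T, (B,))
    x = torch.stack([data[i:i + T] for i in ix])
    y = torch.stack([data[i + 1:i + T + 1] for i in ix])
    return x, y

token_emb = nn.Embedding(vocab_size, d_model)   # V x d_model table
pos_emb = nn.Embedding(T, d_model)              # T x d_model table

x, y = get_batch(data)
tok = token_emb(x)                              # (B, T, d_model)
pos = pos_emb(torch.arange(T))                  # (T, d_model), broadcast over the batch
X = tok + pos                                   # final embedding fed into the encoder
print(X.shape)                                  # torch.Size([4, 8, 32])
```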
Encoder layer
Attention
The self-attention mechanism allows each token to look at and learn from a number of other tokens (in theory, infinitely many). Self-attention is the method the Transformer uses to bake the “understanding” of other relevant units into the unit that we are currently processing.
A single head (Query-Key-Value to Score)
The first step is to generate three vectors from each of the input vectors contained in $X$.
A head size $d_k$ is chosen as a hyperparameter, which signifies the dimensionality of each head's output.
Three weight layers are defined, each projecting from $d_{\text{model}}$ to $d_k$ dimensions:
- Query weights ($W^Q$)
- Key weights ($W^K$)
- Value weights ($W^V$)
Then, the individual vectors are created for each input $x_i$:
- Query: $q_i = x_i W^Q$
- Key: $k_i = x_i W^K$
- Value: $v_i = x_i W^V$
The second step is to calculate a score for each query against every key. So, for example, the query vector $q_1$ scored against the keys $k_1, k_2, \dots, k_T$ gives the score vector $s_1 = (q_1 \cdot k_1, q_1 \cdot k_2, \dots, q_1 \cdot k_T)$.
Similarly, every query $q_i$ will get its own score vector $s_i$.
Then, divide by the square root of the head size and pass through a softmax function: $a_i = \mathrm{softmax}\left(\frac{s_i}{\sqrt{d_k}}\right)$
These weights are then multiplied with the value vectors, so each position receives the weighted sum $z_i = \sum_j a_{ij} v_j$, and overall a head computes: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
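A minimal sketch of a single attention head in PyTorch, following the formulas above; the class and variable names are my own, not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of scaled dot-product self-attention (illustrative sketch)."""
    def __init__(self, d_model, head_size):
        super().__init__()
        self.query = nn.Linear(d_model, head_size, bias=False)  # W^Q
        self.key = nn.Linear(d_model, head_size, bias=False)    # W^K
        self.value = nn.Linear(d_model, head_size, bias=False)  # W^V
        self.head_size = head_size

    def forward(self, x):                                        # x: (B, T, d_model)
        q, k, v = self.query(x), self.key(x), self.value(x)      # each (B, T, head_size)
        scores = q @ k.transpose(-2, -1) / self.head_size ** 0.5  # (B, T, T)
        weights = F.softmax(scores, dim=-1)                      # rows sum to 1
        return weights @ v                                       # (B, T, head_size)

head = Head(d_model=32, head_size=16)
print(head(torch.randn(4, 8, 32)).shape)   # torch.Size([4, 8, 16])
```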
Multi-headed attention
Now, this can be further refined by adding multiple such heads and letting them attend in parallel. This will help the model learn even more about relationships between positions by expanding the set of representation subspaces, i.e., each head projects the input into its own subspace and can capture a different kind of relationship.
Thus, with $h$ heads, each head $i$ can independently learn its own weight matrices $W_i^Q$, $W_i^K$, $W_i^V$ and produce its own output $Z_i$.
Then, the set of head outputs is concatenated to create a single matrix $Z = \mathrm{Concat}(Z_1, \dots, Z_h)$.
The output of this layer is then multiplied with another weight matrix $W^O$ to produce the final output matrix: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(Z_1, \dots, Z_h)\,W^O$
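Extending the sketch, the heads are run in parallel, concatenated, and projected with $W^O$; this reuses the hypothetical `Head` class from the previous snippet:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """h heads in parallel; outputs are concatenated and projected back to d_model."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        head_size = d_model // num_heads
        self.heads = nn.ModuleList([Head(d_model, head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(d_model, d_model)   # W^O

    def forward(self, x):                                  # x: (B, T, d_model)
        z = torch.cat([h(x) for h in self.heads], dim=-1)  # Concat(Z_1, ..., Z_h)
        return self.proj(z)                                # multiply by W^O

mha = MultiHeadAttention(d_model=32, num_heads=4)
print(mha(torch.randn(4, 8, 32)).shape)   # torch.Size([4, 8, 32])
```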
Residual pathways and LayerNorm
After the self-attention is computed, it passes through an "Add and Normalize" layer which does: $\mathrm{LayerNorm}(x + \mathrm{SelfAttention}(x))$
The above (post-norm) formulation is taken directly from the original paper. However, in Andrej Karpathy's video (YouTube), he mentions changing the formula to the more recent pre-norm version: $x + \mathrm{SelfAttention}(\mathrm{LayerNorm}(x))$
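To make the difference between the two orderings concrete, here is a minimal sketch; `self_attn` is a stand-in for the multi-head attention sub-layer (an `nn.Identity` here, just so the snippet runs) and `ln` is a `LayerNorm`:

```python
import torch
import torch.nn as nn

d_model = 32
ln = nn.LayerNorm(d_model)
self_attn = nn.Identity()          # stand-in for the multi-head attention sub-layer
x = torch.randn(4, 8, d_model)

# Post-norm, as written in the original paper: normalize after the residual add.
post = ln(x + self_attn(x))

# Pre-norm, the more recent variant used in Karpathy's video: normalize before the sub-layer.
pre = x + self_attn(ln(x))
```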
Feed-forward layer
The model takes the self-attention output for each vector in the input and passes it through a feed-forward layer, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, which consists of:
- Linear layer: $d_{\text{model}} \rightarrow d_{ff}$
- ReLU activation
- Linear layer: $d_{ff} \rightarrow d_{\text{model}}$
After this, the output passes through another add-and-norm step: $\mathrm{LayerNorm}(x + \mathrm{FFN}(x))$, or $x + \mathrm{FFN}(\mathrm{LayerNorm}(x))$ in the pre-norm variant. A sketch of the feed-forward sub-layer follows.
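A minimal sketch of this feed-forward sub-layer in PyTorch, with the inner dimensionality set to $4 \times d_{\text{model}}$ (the ratio used in the paper, 512 → 2048); the class name is my own:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network: Linear -> ReLU -> Linear."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # d_model -> d_ff
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # d_ff -> d_model
        )

    def forward(self, x):               # x: (B, T, d_model)
        return self.net(x)

ffn = FeedForward(d_model=32, d_ff=4 * 32)
print(ffn(torch.randn(4, 8, 32)).shape)   # torch.Size([4, 8, 32])
```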
Stacking
The block defined above can now be stacked $N$ times, such that the outputs from one block become the inputs for the next one, increasing the number of parameters available to the model.
Once done, the output of the final encoder block passes into the decoder layers.
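Putting the pieces together, here is a sketch of one encoder block and a stack of $N$ of them, in the pre-norm arrangement; it reuses the hypothetical `MultiHeadAttention` and `FeedForward` classes from the earlier sketches:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention + feed-forward, each wrapped in a pre-norm residual connection."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from the earlier sketch
        self.ffn = FeedForward(d_model, 4 * d_model)         # from the earlier sketch
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # attention sub-layer with residual connection
        x = x + self.ffn(self.ln2(x))    # feed-forward sub-layer with residual connection
        return x

N = 6   # number of stacked blocks
encoder = nn.Sequential(*[EncoderBlock(d_model=32, num_heads=4) for _ in range(N)])
print(encoder(torch.randn(4, 8, 32)).shape)   # torch.Size([4, 8, 32])
```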
Decoder layer
The first part inside a decoder layer is almost identical to the encoder layer, with the small change of processing output vectors instead of input vectors. In other words, the decoder layer starts by taking in the outputs and passing them through the "self-attention + feed-forward" block.
Self-attention
One key distinction in the self-attention mechanism between the encoder and decoder block lies in the treatment of future tokens. The decoder block is only allowed to attend to earlier positions in the output sequence, since those are all that will be available during inference. This is achieved by masking the future positions using a lower triangular matrix: their scores are set to $-\infty$ before the softmax.
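A small sketch of the masking trick: scores at future positions are filled with $-\infty$ before the softmax, so their attention weights come out as exactly zero.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)            # raw q·k scores for one head
mask = torch.tril(torch.ones(T, T))   # lower triangular matrix of ones

# Future positions (the upper triangle) get -inf, so softmax assigns them zero weight.
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)
print(weights)   # each row attends only to itself and earlier positions
```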
Cross-attention
For the "encoder-decoder" attention block, the idea remains the same. However, here the key and value matrices are taken as an input from the encoder block, while the query matrix is taken from the decoder block below (the one with self-attention)
Residual pathways and LayerNorm
Again, this section of the block remains the same. Each attention layer is preceded by a layer normalization, its output is added back to the original input via a residual connection, and the result is then sent into the feed-forward layer.
Stacking
The decoder block (with self-attention and cross-attention) can be stacked $N$ times in a similar fashion to increase the capacity of the model. The final layer will output a vector of size $d_{\text{model}}$ for each position.
Final linear and softmax layer
With the output of the decoder stack, this final linear layer projects each position back into a vector of logits whose length equals the size of the vocabulary $V$.
Then, a softmax layer normalizes the values to probabilities (all adding up to 1.0), and the element with the highest probability is chosen as the next output unit.
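A sketch of the final projection and the greedy pick at the last position (names such as `lm_head` are my own assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 32, 65
lm_head = nn.Linear(d_model, vocab_size)       # final linear layer: d_model -> V

dec_out = torch.randn(4, 8, d_model)           # output of the decoder stack, (B, T, d_model)
logits = lm_head(dec_out)                      # (B, T, vocab_size)

probs = F.softmax(logits[:, -1, :], dim=-1)    # probabilities over the vocabulary at the last position
next_unit = torch.argmax(probs, dim=-1)        # greedy pick of the next output unit
print(next_unit.shape)                         # torch.Size([4]), one id per batch element
```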
Training
When training a model, it is important to have a function that keeps track of how well the model is performing. In other words, this means giving the model an objective function that it should try to optimize, instead of mindlessly assigning random values to its parameters. This is called the "loss function".
Given the $x$ and $y$ batches, the data flows through the encoder/decoder blocks and into the final softmax layer to produce a probability distribution. Since we know the actual unit that should come next (from $y$), we can compute the loss using the cross-entropy function, and instruct the model to choose the parameter weights such that the loss is minimized.
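A sketch of a training step with cross-entropy loss. The `model` here is a trivial stand-in (embedding plus linear head) so the snippet stays short; in practice it would be the full encoder/decoder stack, and the `get_batch` helper and `data` tensor are assumed from the preprocessing sketch above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 65, 32
# Trivial stand-in model; the real model is the full encoder/decoder stack.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    x, y = get_batch(data)                 # (B, T) inputs and next-unit targets
    logits = model(x)                      # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(loss.item())   # should decrease as training progresses
```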
Inference
Once the model is trained with the chosen hyperparameters and reaches an acceptable loss value, the next step is to actually use it to generate some outputs.
This is done by giving the model a starting context and then letting it generate outputs up to a specified number of units or characters, depending on the use case.
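A sketch of the generation loop, assuming a trained `model` like the stand-in above; unit ids are sampled from the predicted distribution one at a time and appended to the context:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, context, max_new_units, block_size=8):
    # context: (1, T0) tensor of unit ids that seeds the generation
    for _ in range(max_new_units):
        x = context[:, -block_size:]                          # crop to the last block_size units
        logits = model(x)                                     # (1, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)           # distribution over the next unit
        next_unit = torch.multinomial(probs, num_samples=1)   # sample one unit id
        context = torch.cat([context, next_unit], dim=1)      # append and continue
    return context

start = torch.zeros((1, 1), dtype=torch.long)                 # start from unit id 0 as context
out = generate(model, start, max_new_units=50)
print(out[0].tolist())   # decode(...) from the tokenization step would map these ids back to text
```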
Hyperparameter summary
To summarize, a non-exhaustive list of the hyperparameters to choose when building a transformer (collected into a small configuration sketch after this list):
- Vocabulary size ($V$; usually decided through the tokenization mechanism and the input data)
- Batch size ($B$)
- Block size ($T$)
- Number of embedding dimensions ($d_{\text{model}}$)
- Head size ($d_k$)
- Number of heads ($h$)
- Number of blocks ($N$)
- Inner-layer dimensionality of the feed-forward network ($d_{ff}$)
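These choices can be gathered into a small configuration object; the values below are illustrative (roughly those of a small character-level model), not the paper's base settings:

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    vocab_size: int = 65     # V, from the tokenizer
    batch_size: int = 64     # B
    block_size: int = 256    # T
    d_model: int = 384       # number of embedding dimensions
    num_heads: int = 6       # h
    head_size: int = 64      # d_model // num_heads
    num_blocks: int = 6      # N
    d_ff: int = 4 * 384      # inner-layer dimensionality of the feed-forward network

config = TransformerConfig()
print(config)
```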
Conclusion
In this post, we've explored the intricate architecture of the Transformer model, delving into the mechanics of its encoder and decoder layers, and the pivotal role of the self-attention mechanism.
As we've seen, the Transformer's ability to handle sequential data without the constraints of sequential computation, together with its parallelization capabilities, makes it a powerful tool in the machine learning toolkit. Its influence is evident in the numerous models that have since been built upon its architecture (most notably GPT-3.5 and GPT-4), pushing the boundaries of what's possible in machine learning.
I've learned a lot from writing this post, and I hope you've found it useful too. If you have any questions or suggestions, please feel free to reach out to me on Twitter.
References
The following were of immense help in understanding the architecture: