Transformer Architecture
Tokenization
Byte Pair Encoding (BPE)
The dominant subword tokenization algorithm, used by GPT, LLaMA, and most modern LLMs.
Starts with individual characters as the initial vocabulary. Iteratively counts all adjacent character pairs in the training corpus and merges the most frequent pair into a new token. Repeats until the vocabulary reaches a target size (typically 30K to 100K tokens). At inference, words already in the vocabulary are kept whole. Words not in the vocabulary are segmented by replaying the learned merges in priority order, falling back to individual characters (or bytes) for anything completely unknown. For example, "unhappiness" might split into "un" + "happiness" if "happiness" is a known token but "unhappiness" is not.
BPE handles any input text without "unknown token" failures by decomposing unfamiliar words into known subword pieces. This balances vocabulary size against sequence length: a small vocabulary means more tokens per sentence (slower), while a large vocabulary means fewer tokens but a bigger embedding table. BPE finds a practical middle ground that works across languages and domains.
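The training loop above (count pairs, merge the most frequent, repeat) can be sketched in a few lines. This is an illustrative toy, with an assumed corpus format and function name; a production tokenizer adds byte-level fallback, pre-tokenization, and special tokens:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a word-frequency corpus.

    corpus: dict mapping words (tuples of symbols) to counts.
    Toy sketch only -- ties are broken by insertion order.
    """
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab
```

Running this on a corpus like {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("n","e","w"): 3} for two merges produces the token "low", since "lo"/"ow" and then the full word are the most frequent pairs.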
WordPiece
A subword algorithm used by BERT and related encoder models, similar to BPE but using a likelihood-based merge criterion.
Like BPE, WordPiece starts with individual characters and iteratively merges pairs. However, instead of selecting the most frequent pair, it selects the merge that maximises the likelihood of the training corpus. Subword tokens that continue a word (rather than starting one) are prefixed with ##, so "playing" might tokenize as ["play", "##ing"]. The vocabulary is typically around 30K tokens.
The likelihood-based criterion tends to produce linguistically meaningful subwords, because it favours merges that make the overall corpus more probable. The ## prefix convention makes it straightforward to reconstruct original words from token sequences, which is important for tasks like named entity recognition where word boundaries matter.
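The ## convention makes word reconstruction a single pass over the token sequence. A minimal sketch (function name assumed, whitespace-separated words only):

```python
def wordpiece_detokenize(tokens):
    """Rebuild words from WordPiece tokens using the ## continuation prefix."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]      # continuation piece: glue to previous
        else:
            words.append(tok)         # start of a new word
    return " ".join(words)
```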
SentencePiece & Unigram
A language-agnostic tokenization framework that operates directly on raw text without pre-tokenization, used by T5, LLaMA, and many multilingual models.
Treats the input as a raw byte or character stream with no language-specific rules such as whitespace splitting. The Unigram model starts with a large candidate vocabulary (hundreds of thousands of tokens) and iteratively prunes tokens whose removal least increases the overall corpus loss. At inference, it finds the most probable segmentation of the input using the Viterbi algorithm. Spaces are encoded as a special Unicode character (▁, U+2581) so the tokenizer is fully reversible.
Fully language-agnostic, so one tokenizer handles English, Chinese, code, and emoji without special pre-processing rules. The probabilistic framework also enables sampling multiple valid segmentations of the same text, which can serve as a form of data augmentation during training.
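The Viterbi segmentation step is dynamic programming over character positions: the best score for a prefix is the best score of a shorter prefix plus the log-probability of the token that bridges the gap. A sketch with toy token log-probabilities standing in for a trained Unigram vocabulary:

```python
import math

def viterbi_segment(text, logprobs):
    """Most probable segmentation of text under a unigram model over subwords.

    logprobs: dict of token -> log probability (toy stand-in for a trained
    SentencePiece Unigram vocabulary).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i] = best score of text[:i]
    back = [0] * (n + 1)           # backpointer to the segment start
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    # Recover the tokens by walking backpointers from the end.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]
```

With logprobs {"un": -2.0, "happy": -3.0, "unhappy": -6.0}, "unhappy" segments as ["un", "happy"], because -5.0 beats the -6.0 of the whole-word token.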
Architecture Overview
Input Embeddings
Converts raw tokens into dense vector representations that the model can process.
Token embedding lookup table maps each vocabulary token to a learned vector. These vectors capture semantic relationships, so similar words end up with similar representations. The embedding dimension (typically 768 to 12,288) determines the expressiveness of each token's representation.
First layer of the transformer. Converts discrete token IDs into continuous vectors that subsequent layers can manipulate through linear algebra.
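Mechanically, the lookup is just row indexing into a learned matrix. A sketch with toy sizes (real models use 32K+ vocabularies and 768+ dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64        # toy sizes for illustration
embedding_table = rng.normal(0, 0.02, size=(vocab_size, d_model))

token_ids = np.array([5, 42, 917])    # output of the tokenizer
x = embedding_table[token_ids]        # lookup = row indexing, shape [3, 64]
```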
Positional Encoding
Injects sequence order information into token representations, since attention has no inherent notion of position.
Original transformers used fixed sinusoidal functions at different frequencies. Modern models use learned position embeddings or Rotary Position Embeddings (RoPE), which encode position through rotation matrices applied to query and key vectors.
Added to (or combined with) input embeddings before the first transformer layer. Without this, the model would treat "the cat sat on the mat" and "mat the on sat cat the" identically.
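The original fixed sinusoidal scheme can be sketched directly from its definition, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Fixed sinusoidal positional encodings (assumes even d_model)."""
    pos = np.arange(seq_len)[:, None]              # [seq_len, 1]
    i = np.arange(0, d_model, 2)[None, :]          # [1, d_model/2]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe
```

These are added element-wise to the token embeddings, e.g. x = embeddings + sinusoidal_pe(seq_len, d_model).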
Feed-Forward Network
Two-layer neural network applied independently to each token position after attention.
First linear layer projects up to a larger dimension (typically 4x the model dimension), applies a non-linearity (originally ReLU; modern models often use GELU or gated variants such as SwiGLU), then a second linear layer projects back down. Each token is processed independently.
Follows the attention sub-layer in each transformer block. While attention mixes information across positions, the FFN processes each position's aggregated representation, adding non-linear transformation capacity.
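The up-project, activate, down-project pattern is two matrix multiplications with a ReLU in between. A sketch with assumed toy dimensions, showing that each position is processed independently:

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward: up-project, ReLU, down-project.

    x: [seq_len, d_model]; every row (token) is transformed independently.
    """
    h = np.maximum(x @ w1 + b1, 0.0)   # [seq_len, d_ff], ReLU activation
    return h @ w2 + b2                 # back to [seq_len, d_model]

d_model, d_ff = 8, 32                  # d_ff is typically 4 * d_model
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
```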
Layer Normalisation & Residuals
Stabilises training and enables deep stacking of transformer blocks.
Layer normalisation rescales each token's representation to zero mean and unit variance across the feature dimension, followed by a learned scale and shift. Residual connections add the input of each sub-layer to its output (x + sublayer(x)). Pre-norm (normalise before the sub-layer) is now standard over post-norm.
Wraps both the attention and feed-forward sub-layers. Residual connections allow gradients to flow directly through the network, enabling models with hundreds of layers to train effectively.
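A pre-norm residual wrapper is a one-liner once layer normalisation is defined. Sketch only (the learned scale and shift parameters are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    """Pre-norm residual wrapping: x + sublayer(LayerNorm(x)).

    The residual path carries x through untouched, which is what lets
    gradients flow directly through deep stacks.
    """
    return x + sublayer(layer_norm(x))
```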
Self-Attention Mechanism
Query, Key, Value Projections
Three learned linear transformations that create the inputs to the attention computation.
Each token's embedding is multiplied by three separate weight matrices (W_Q, W_K, W_V) to produce a query vector, a key vector, and a value vector. The query represents "what am I looking for", the key represents "what do I contain", and the value represents "what information do I provide if selected".
This separation allows the model to learn different representations for matching (Q and K) versus information retrieval (V). A token can be a good match for a query without the retrieved information being identical to the matching signal.
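Concretely, the three projections are three matrix multiplications of the same input. A toy-sized sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 16
x = rng.normal(size=(seq_len, d_model))   # token representations

# Three independent learned projections of the same input.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V       # each [seq_len, d_k]
```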
Scaled Dot-Product Attention
The core computation that determines how much each token attends to every other token.
Multiply queries by the transpose of keys to get raw attention scores. Divide by the square root of the key dimension to prevent the dot products from growing too large. Apply softmax to convert scores into a probability distribution. Multiply by values to produce the weighted output. The formula is Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V.
The scaling factor prevents softmax from saturating into hard one-hot distributions as dimensions increase. Without scaling, gradients become vanishingly small and training stalls.
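The four steps map directly onto the formula. A sketch assuming Q, K, V are already projected:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # scaled raw scores
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V, weights
```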
Multi-Head Attention
Runs multiple attention operations in parallel, each learning to focus on different types of relationships.
Split Q, K, V into multiple heads (typically 8 to 128). Each head operates on a smaller dimension (d_model / num_heads). Run attention independently per head. Concatenate the outputs and project through a final linear layer. Different heads learn to attend to different things: syntactic relationships, semantic similarity, positional patterns, or coreference.
A single attention head can only focus on one type of relationship per position. Multiple heads allow the model to simultaneously capture syntax in one head, semantics in another, and long-range dependencies in a third.
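The split, attend per head, concatenate, project sequence can be sketched as follows (single weight matrices reshaped per head; batching omitted):

```python
import numpy as np

def multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    """Minimal multi-head attention sketch over x: [seq_len, d_model]."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(m):   # [seq, d_model] -> [heads, seq, d_head]
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(x @ W_Q), split(x @ W_K), split(x @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scores
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax per row
    heads = weights @ V                                   # [heads, seq, d_head]
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                   # final projection
```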
Causal Masking
Prevents tokens from attending to future positions during autoregressive generation.
Apply a triangular mask to the attention scores before softmax, setting future positions to negative infinity. After softmax, these positions receive zero attention weight. Token at position t can only attend to positions 0 through t. This ensures the model can only use information available at generation time.
Without causal masking, the model would "see the answer" during training by attending to future tokens. The mask forces left-to-right learning, which is essential for text generation where each token is predicted from only the preceding context.
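The triangular mask is a one-liner with numpy, and applying softmax to uniform masked scores makes the effect visible, since each row only distributes weight over past and current positions:

```python
import numpy as np

def causal_mask(scores):
    """Set scores for future positions to -inf so softmax zeroes them out."""
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # above diag
    return np.where(mask, -np.inf, scores)

scores = np.zeros((3, 3))                  # uniform scores for illustration
masked = causal_mask(scores)
w = np.exp(masked - masked.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
# Row t now spreads weight only over positions 0..t.
```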
Encoder-Decoder Variants
Encoder-Only
Processes the full input bidirectionally, where every token can attend to every other token.
Stack of transformer blocks with bidirectional self-attention. No causal mask. The model builds a rich contextual representation of the entire input. Typically used with pre-training objectives like masked language modelling (predict missing tokens) or next sentence prediction. The output is a contextualised representation per token rather than generated text.
BERT, RoBERTa, ELECTRA. Classification, named entity recognition, sentiment analysis, semantic similarity, extractive question answering.
Decoder-Only
Processes tokens left-to-right with causal masking, generating one token at a time.
Stack of transformer blocks with causal (masked) self-attention. Each token can only attend to itself and preceding tokens. Pre-trained with a next-token prediction objective. At inference, generates autoregressively: predict the next token, append it to the sequence, and repeat. The dominant architecture for modern LLMs.
GPT family, LLaMA, Mistral, Claude. Text generation, chat, code generation, reasoning, general-purpose language modelling.
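The predict-append-repeat inference loop can be sketched independently of any particular model. Here next_token_logits_fn is an assumed stand-in for a decoder-only model, and decoding is greedy (real systems also use sampling, top-k, or beam search):

```python
import numpy as np

def greedy_decode(next_token_logits_fn, prompt_ids, max_new_tokens, eos_id):
    """Autoregressive loop: predict the next token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits_fn(ids)   # logits over the vocabulary
        nxt = int(np.argmax(logits))         # greedy: most likely token
        ids.append(nxt)
        if nxt == eos_id:                    # stop at end-of-sequence
            break
    return ids
```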
Encoder-Decoder
Separate encoder processes the full input bidirectionally; decoder generates output autoregressively while cross-attending to the encoder's representations.
Encoder stack with bidirectional self-attention reads the input. Decoder stack uses causal self-attention plus cross-attention layers that attend to encoder outputs. Cross-attention allows the decoder to selectively focus on relevant parts of the input at each generation step. The original transformer architecture from "Attention Is All You Need".
T5, BART, mBART, Flan-T5. Machine translation, summarisation, question answering with long inputs, sequence-to-sequence tasks.
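Cross-attention is ordinary scaled dot-product attention where the queries come from the decoder and the keys and values come from the encoder, so the output length matches the decoder side. A sketch:

```python
import numpy as np

def cross_attention(dec_x, enc_out, W_Q, W_K, W_V):
    """Decoder states query the encoder's representations.

    dec_x: [dec_len, d_model], enc_out: [enc_len, d_model].
    Q comes from the decoder; K and V come from the encoder.
    """
    Q = dec_x @ W_Q
    K = enc_out @ W_K
    V = enc_out @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # [dec_len, enc_len]
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                  # [dec_len, d_k]
```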
Fine-Tuning Approaches
Instruction Fine-Tuning
Trains the model to follow natural language instructions by learning from instruction and answer pairs.
Start with a pre-trained base model. Train on a labelled dataset where each example pairs an instruction (or prompt) with the expected response. The model learns to generalise instruction-following behaviour, not just memorise specific answers. Often followed by alignment techniques (RLHF, DPO) to further refine behaviour.
Each example is an instruction-response pair. The instruction describes a task in natural language ("Summarise this article", "Translate to French", "Write a function that..."). The response is the expected output. Datasets may include multi-turn conversations. Examples: FLAN, Dolly, Alpaca, OpenAssistant.
Turning a base model into an assistant. Enabling zero-shot generalisation to new tasks. Building chat-capable models. When you need the model to follow diverse instructions it has never seen before.
Classification Fine-Tuning
Adapts a pre-trained model to predict discrete class labels for input texts.
Add a classification head (typically a linear layer) on top of the model's output representations. Train on a labelled dataset of texts paired with class labels. For encoder models, use the [CLS] token or pooled output. For decoder models, use the final token's representation. The classification head maps the representation to the number of classes. Cross-entropy loss drives training.
Each example pairs an input text with a class label. Binary classification: spam/not-spam, positive/negative. Multi-class: topic categories, intent detection, language identification. Multi-label: multiple tags per text. Examples: SST-2 (sentiment), AG News (topic), MNLI (natural language inference).
Sentiment analysis, content moderation, intent classification, document categorisation, any task requiring discrete label prediction from text input.
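The classification head itself is small: one linear map from the pooled representation to class logits, with softmax and cross-entropy for training. A sketch (function names and toy sizes assumed):

```python
import numpy as np

def classify(pooled, W, b):
    """Linear head over a pooled representation (e.g. the [CLS] vector
    or a decoder's final-token state). Returns logits and probabilities."""
    logits = pooled @ W + b                              # [batch, num_classes]
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return logits, probs

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
```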
