Transformer Architecture

Tokenization

Byte Pair Encoding (BPE)

The dominant subword tokenization algorithm, used by GPT, LLaMA, and most modern LLMs.

How it works

Starts with individual characters (or bytes) as the initial vocabulary. Iteratively counts all adjacent symbol pairs in the training corpus and merges the most frequent pair into a new token. Repeats until the vocabulary reaches a target size (typically 30K to 100K tokens). At inference, the learned merges are replayed in the same order on each word, so frequent words collapse into single tokens while rare words remain split into subword pieces, with individual characters (or bytes) as the fallback for anything completely unknown. For example, "unhappiness" might split into "un" and "happiness" if both became tokens during training but no merge ever joined them.
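
The training loop above can be sketched in a few lines of Python. The toy corpus follows the classic worked example from the BPE literature; all names here are illustrative, not a real library API.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def apply_merge(corpus, pair):
    """Rewrite every word, joining each occurrence of the pair into one symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[" ".join(out)] = freq
    return new_corpus

# Toy corpus: words pre-split into characters, with occurrence counts.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(4):                      # four merge steps, for illustration
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = apply_merge(corpus, pair)
```

The recorded merge list is exactly what a BPE tokenizer stores: at inference it replays these merges, in order, on each new word.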

Why it matters

BPE handles any input text without "unknown token" failures by decomposing unfamiliar words into known subword pieces. This balances vocabulary size against sequence length: a small vocabulary means more tokens per sentence (slower), while a large vocabulary means fewer tokens but a bigger embedding table. BPE finds a practical middle ground that works across languages and domains.

WordPiece

A subword algorithm used by BERT and related encoder models, similar to BPE but using a likelihood-based merge criterion.

How it works

Like BPE, WordPiece starts with individual characters and iteratively merges pairs. However, instead of selecting the most frequent pair, it selects the merge that maximises the likelihood of the training corpus. Subword tokens that continue a word (rather than starting one) are prefixed with ##, so "playing" might tokenize as ["play", "##ing"]. The vocabulary is typically around 30K tokens.
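
WordPiece's inference-time segmentation is a greedy longest-match-first scan. A minimal sketch with a toy vocabulary (the vocabulary contents here are made up for illustration):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, as in BERT's tokenizer.
    Continuation pieces carry the ## prefix; unsegmentable words map to [UNK]."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # not the first piece of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # shrink the match and retry
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "un", "##happy"}
```

For example, `wordpiece_tokenize("playing", vocab)` yields `["play", "##ing"]`, and a word with no viable segmentation collapses to `["[UNK]"]`.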

Why it matters

The likelihood-based criterion tends to produce linguistically meaningful subwords, because it favours merges that make the overall corpus more probable. The ## prefix convention makes it straightforward to reconstruct original words from token sequences, which is important for tasks like named entity recognition where word boundaries matter.

SentencePiece & Unigram

A language-agnostic tokenization framework that operates directly on raw text without pre-tokenization, used by T5, LLaMA, and many multilingual models.

How it works

Treats the input as a raw byte or character stream with no language-specific rules such as whitespace splitting. The Unigram model starts with a large candidate vocabulary (hundreds of thousands of tokens) and iteratively prunes tokens whose removal least increases the overall corpus loss. At inference, it finds the most probable segmentation of the input using the Viterbi algorithm. Spaces are encoded as a visible meta-symbol (▁, U+2581) so the tokenizer is fully reversible.
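
The Viterbi step can be sketched directly: over all ways to cut the input into known tokens, pick the segmentation with the highest total log-probability. The token log-probabilities below are hypothetical values chosen for illustration.

```python
import math

def viterbi_segment(text, logprobs):
    """Most probable segmentation under a unigram model.
    best[i] holds the highest log-probability of any segmentation of text[:i]."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    # Recover the winning segmentation by walking the backpointers.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical log-probabilities; ▁ marks a word boundary, SentencePiece-style.
logprobs = {"▁": -3.0, "▁un": -4.0, "▁unhapp": -9.0,
            "happy": -4.5, "y": -2.0, "▁unhappy": -9.5}
```

Here `viterbi_segment("▁unhappy", logprobs)` prefers the two-piece split `["▁un", "happy"]` (total −8.5) over the single token `"▁unhappy"` (−9.5).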

Why it matters

Fully language-agnostic, so one tokenizer handles English, Chinese, code, and emoji without special pre-processing rules. The probabilistic framework also enables sampling multiple valid segmentations of the same text, which can serve as a form of data augmentation during training.

Architecture Overview

Input Embeddings

Converts raw tokens into dense vector representations that the model can process.

Key components

Token embedding lookup table maps each vocabulary token to a learned vector. These vectors capture semantic relationships, so similar words end up with similar representations. The embedding dimension (typically 768 to 12,288) determines the expressiveness of each token's representation.
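
Mechanically, the lookup table is just an array indexed by token ID. A minimal sketch (the table is random here; in a real model it is learned):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
embedding_table = rng.standard_normal((vocab_size, d_model)) * 0.02  # learned in practice

token_ids = np.array([5, 17, 17, 42])
embeddings = embedding_table[token_ids]   # (4, d_model): one vector per token
```

Note that repeated token IDs fetch identical vectors — context only enters later, through attention.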

Role in the stack

First layer of the transformer. Converts discrete token IDs into continuous vectors that subsequent layers can manipulate through linear algebra.

Positional Encoding

Injects sequence order information into token representations, since attention has no inherent notion of position.

Key components

Original transformers used fixed sinusoidal functions at different frequencies. Modern models use learned position embeddings or Rotary Position Embeddings (RoPE), which encode position through rotation matrices applied to query and key vectors.
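
The original fixed sinusoidal scheme is easy to write down: even dimensions use sine and odd dimensions cosine, at geometrically spaced frequencies. A sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal encodings from the original transformer paper."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model)) # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims: cosine
    return pe
```

The result is simply added element-wise to the token embeddings before the first transformer layer.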

Role in the stack

Added to (or combined with) input embeddings before the first transformer layer. Without this, the model would treat "the cat sat on the mat" and "mat the on sat cat the" identically.

Feed-Forward Network

Two-layer neural network applied independently to each token position after attention.

Key components

First linear layer projects up to a larger dimension (typically 4x the model dimension), applies an activation function (originally ReLU; GELU or gated variants such as SwiGLU in modern models), then a second linear layer projects back down. Each token is processed independently.
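
A minimal sketch of the up-project / activation / down-project pattern, using ReLU for simplicity (weights here are random placeholders):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: project up, apply ReLU, project back.
    Applied identically and independently to every token position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # (seq_len, 4 * d_model)
    return hidden @ W2 + b2                 # (seq_len, d_model)

d_model = 8
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, 4 * d_model)) * 0.02
b1 = np.zeros(4 * d_model)
W2 = rng.standard_normal((4 * d_model, d_model)) * 0.02
b2 = np.zeros(d_model)
x = rng.standard_normal((5, d_model))       # 5 token positions
out = ffn(x, W1, b1, W2, b2)
```

Because each row is processed independently, running the FFN on a single position gives the same result as that position's row in the batched output.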

Role in the stack

Follows the attention sub-layer in each transformer block. While attention mixes information across positions, the FFN processes each position's aggregated representation, adding non-linear transformation capacity.

Layer Normalisation & Residuals

Stabilises training and enables deep stacking of transformer blocks.

Key components

Layer normalisation rescales each token's representation to zero mean and unit variance, then applies a learned gain and bias. Residual connections add the input of each sub-layer to its output (x + sublayer(x)). Pre-norm (normalise before the sub-layer) is now standard over post-norm.
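
A sketch of layer normalisation and the pre-norm residual wrapping, with `sublayer` standing in for either attention or the FFN:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token's vector to zero mean / unit variance,
    then apply a learned gain (gamma) and bias (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_norm_sublayer(x, sublayer, gamma, beta):
    """Pre-norm residual wrapping: x + sublayer(norm(x))."""
    return x + sublayer(layer_norm(x, gamma, beta))
```

The residual addition is what lets gradients skip straight past each sub-layer: if the sub-layer contributes nothing, the input passes through unchanged.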

Role in the stack

Wraps both the attention and feed-forward sub-layers. Residual connections allow gradients to flow directly through the network, enabling models with hundreds of layers to train effectively.

Self-Attention Mechanism

Query, Key, Value Projections

Three learned linear transformations that create the inputs to the attention computation.

How it works

Each token's embedding is multiplied by three separate weight matrices (W_Q, W_K, W_V) to produce a query vector, a key vector, and a value vector. The query represents "what am I looking for", the key represents "what do I contain", and the value represents "what information do I provide if selected".

Why it matters

This separation allows the model to learn different representations for matching (Q and K) versus information retrieval (V). A token can be a good match for a query without the retrieved information being identical to the matching signal.

Scaled Dot-Product Attention

The core computation that determines how much each token attends to every other token.

How it works

Multiply queries by the transpose of keys to get raw attention scores. Divide by the square root of the key dimension to prevent the dot products from growing too large. Apply softmax to convert scores into a probability distribution. Multiply by values to produce the weighted output. The formula is Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V.
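
The formula translates almost line-for-line into code. A minimal sketch for a single attention head:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) raw scores
    weights = softmax(scores)         # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1: position i's output is a convex combination of all value vectors.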

Why it matters

The scaling factor prevents softmax from saturating into hard one-hot distributions as dimensions increase. Without scaling, gradients become vanishingly small and training stalls.

Multi-Head Attention

Runs multiple attention operations in parallel, each learning to focus on different types of relationships.

How it works

Split Q, K, V into multiple heads (typically 8 to 128). Each head operates on a smaller dimension (d_model / num_heads). Run attention independently per head. Concatenate the outputs and project through a final linear layer. Different heads learn to attend to different things: syntactic relationships, semantic similarity, positional patterns, or coreference.
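
The split / attend / concatenate / project sequence above can be sketched as follows (weights are random placeholders; a real layer learns them):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Project to Q/K/V, split into heads, attend per head,
    concatenate the head outputs, and apply the output projection."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    heads = weights @ V                                    # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

d_model, num_heads = 16, 4
rng = np.random.default_rng(0)
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
x = rng.standard_normal((5, d_model))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
```

Note each head attends over the full sequence but only in its own d_model / num_heads slice of the representation.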

Why it matters

A single attention head can only focus on one type of relationship per position. Multiple heads allow the model to simultaneously capture syntax in one head, semantics in another, and long-range dependencies in a third.

Causal Masking

Prevents tokens from attending to future positions during autoregressive generation.

How it works

Apply a triangular mask to the attention scores before softmax, setting future positions to negative infinity. After softmax, these positions receive zero attention weight. Token at position t can only attend to positions 0 through t. This ensures the model can only use information available at generation time.
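
A sketch of the mask itself and its effect after softmax:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf strictly above the diagonal blocks attention to future positions;
    softmax turns those entries into exactly zero weight."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = masked_softmax(rng.standard_normal((4, 4)))
```

Row t of `weights` is zero beyond column t, so position 0 can only attend to itself, while the last position sees the whole prefix.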

Why it matters

Without causal masking, the model would "see the answer" during training by attending to future tokens. The mask forces left-to-right learning, which is essential for text generation where each token is predicted from only the preceding context.

Encoder-Decoder Variants

Encoder-Only

Processes the full input bidirectionally, where every token can attend to every other token.

Architecture

Stack of transformer blocks with bidirectional self-attention. No causal mask. The model builds a rich contextual representation of the entire input. Typically used with pre-training objectives like masked language modelling (predict missing tokens) or next sentence prediction. The output is a contextualised representation per token rather than generated text.

Examples & use cases

BERT, RoBERTa, ELECTRA. Classification, named entity recognition, sentiment analysis, semantic similarity, extractive question answering.

Decoder-Only

Processes tokens left-to-right with causal masking, generating one token at a time.

Architecture

Stack of transformer blocks with causal (masked) self-attention. Each token can only attend to itself and preceding tokens. Pre-trained with a next-token prediction objective. At inference, generates autoregressively: predict the next token, append it to the sequence, and repeat. The dominant architecture for modern LLMs.

Examples & use cases

GPT family, LLaMA, Mistral, Claude. Text generation, chat, code generation, reasoning, general-purpose language modelling.

Encoder-Decoder

Separate encoder processes the full input bidirectionally; decoder generates output autoregressively while cross-attending to the encoder's representations.

Architecture

Encoder stack with bidirectional self-attention reads the input. Decoder stack uses causal self-attention plus cross-attention layers that attend to encoder outputs. Cross-attention allows the decoder to selectively focus on relevant parts of the input at each generation step. The original transformer architecture from "Attention Is All You Need".

Examples & use cases

T5, BART, mBART, Flan-T5. Machine translation, summarisation, question answering with long inputs, sequence-to-sequence tasks.

Fine-Tuning Approaches

Instruction Fine-Tuning

Trains the model to follow natural language instructions by learning from instruction and answer pairs.

How it works

Start with a pre-trained base model. Train on a labelled dataset where each example pairs an instruction (or prompt) with the expected response. The model learns to generalise instruction-following behaviour, not just memorise specific answers. Often followed by alignment techniques (RLHF, DPO) to further refine behaviour.

Dataset format

Each example is an instruction-response pair. The instruction describes a task in natural language ("Summarise this article", "Translate to French", "Write a function that..."). The response is the expected output. Datasets may include multi-turn conversations. Examples: FLAN, Dolly, Alpaca, OpenAssistant.
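
A hypothetical single-turn example in the common JSON-style layout (field names vary between datasets such as Alpaca and Dolly; this one is illustrative):

```python
example = {
    "instruction": "Summarise this article in one sentence.",
    "input": "The transformer architecture replaced recurrence with attention, "
             "enabling parallel training over entire sequences.",
    "output": "Transformers model sequences with attention instead of recurrence, "
              "which allows fully parallel training.",
}
```

During fine-tuning the instruction (and optional input) are rendered into the prompt, and the loss is typically computed only on the response tokens.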

When to use

Turning a base model into an assistant. Enabling zero-shot generalisation to new tasks. Building chat-capable models. When you need the model to follow diverse instructions it has never seen before.

Classification Fine-Tuning

Adapts a pre-trained model to predict discrete class labels for input texts.

How it works

Add a classification head (typically a linear layer) on top of the model's output representations. Train on a labelled dataset of texts paired with class labels. For encoder models, use the [CLS] token or pooled output. For decoder models, use the final token's representation. The classification head maps the representation to the number of classes. Cross-entropy loss drives training.
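
A minimal sketch of the head and its loss, starting from an already-pooled representation (weights and the pooled vector are random placeholders):

```python
import numpy as np

def classification_head(pooled, W, b):
    """Linear layer mapping a pooled representation to class probabilities."""
    logits = pooled @ W + b                    # (num_classes,)
    exp = np.exp(logits - logits.max())        # stable softmax
    return exp / exp.sum()

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -np.log(probs[label])

d_model, num_classes = 8, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, num_classes)) * 0.02
b = np.zeros(num_classes)
pooled = rng.standard_normal(d_model)  # e.g. the [CLS] token's final vector
probs = classification_head(pooled, W, b)
loss = cross_entropy(probs, label=1)
```

Training backpropagates this loss through the head and (usually) the whole pre-trained model.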

Dataset format

Each example pairs an input text with a class label. Binary classification: spam/not-spam, positive/negative. Multi-class: topic categories, intent detection, language identification. Multi-label: multiple tags per text. Examples: SST-2 (sentiment), AG News (topic), MNLI (natural language inference).

When to use

Sentiment analysis, content moderation, intent classification, document categorisation, any task requiring discrete label prediction from text input.