GPT
Architecture Fundamentals
Decoder-Only Stack
A transformer that keeps only the decoder stack, processing tokens strictly left-to-right.
Each layer contains causal self-attention followed by a feed-forward network, wrapped with layer normalisation and residual connections. There is no encoder and no cross-attention. The same architecture handles both understanding and generation by treating every task as text completion. Modern variants use pre-norm (normalise before each sub-layer) and SwiGLU activations.
The decoder-only design is remarkably versatile. By framing every task as "continue this text", a single architecture handles translation, summarisation, question answering, code generation, and reasoning without task-specific modifications.
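The per-layer structure described above can be sketched directly. Below is a minimal single-head NumPy version with toy sizes, ReLU in place of GPT's GELU, and no biases or dropout; it is an illustrative sketch, not a faithful implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalise each token vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_attention(x, Wq, Wk, Wv, Wo):
    # single-head self-attention with a lower-triangular (causal) mask
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ v) @ Wo

def ffn(x, W1, W2):
    # position-wise feed-forward network (ReLU stands in for GELU here)
    return np.maximum(x @ W1, 0) @ W2

def decoder_block(x, params):
    # pre-norm: normalise before each sub-layer, add the residual after
    x = x + causal_attention(layer_norm(x), *params["attn"])
    x = x + ffn(layer_norm(x), *params["ffn"])
    return x
```

Because the mask is lower-triangular, editing a later token cannot change the block's output at earlier positions, which is the causality property the next section relies on.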
Causal Self-Attention
Restricts each token to only attend to itself and preceding tokens, enforcing left-to-right processing.
A triangular mask is applied to the attention scores before softmax, setting all positions to the right of the current token to negative infinity. After softmax, these positions receive zero weight. At position t, the model sees tokens 0 through t and nothing beyond. This mask is applied identically during both training and inference.
Causal masking makes autoregressive generation possible. The model learns to predict the next token using only preceding context, which matches the generation setting exactly. Without this constraint, the model would "cheat" during training by looking at future tokens.
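The masking mechanics can be seen in isolation. With all raw scores equal, each row of the post-softmax weights spreads uniformly over positions 0 through t and gives exactly zero weight to the future:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

T = 4
scores = np.zeros((T, T))                         # pretend all scores are equal
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
scores[mask] = -np.inf                            # future positions get -inf
w = softmax(scores)
# row t now spreads weight uniformly over positions 0..t; future weights are 0
```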
Token & Position Embeddings
Maps discrete tokens and their positions into continuous vector representations.
GPT-1 and GPT-2 use learned position embeddings (a separate embedding table indexed by position), and GPT-3 follows the same approach. Each token's input is the sum of its token embedding and its position embedding. Context length is fixed at training time: 512 for GPT-1, 1024 for GPT-2, 2048 for GPT-3. GPT-4 shipped with 8K and 32K windows (128K in GPT-4 Turbo), though its positional scheme has not been disclosed.
Learned position embeddings are simple and effective but limit the model to the context length seen during training. This motivated later innovations like ALiBi and RoPE in non-GPT models, though GPT-4 substantially extended context length through engineering.
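The sum of the two lookups is straightforward; a NumPy sketch with toy sizes (GPT-2's actual values would be roughly 50257, 1024, 768):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_ctx, d = 100, 8, 16          # toy sizes, not real GPT dimensions
tok_emb = rng.normal(size=(vocab_size, d)) * 0.02
pos_emb = rng.normal(size=(max_ctx, d)) * 0.02

def embed(token_ids):
    # input to the first layer: token embedding + learned position embedding
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions]

x = embed(np.array([5, 17, 42]))             # (3, d) input matrix
```

Note that `embed` fails for sequences longer than `max_ctx`: the position table simply has no rows beyond the training context, which is exactly the limitation described above.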
Language Modelling Head
The final layer that converts hidden representations into next-token predictions.
The last transformer layer's output is projected through a linear layer with dimensions (hidden_size x vocabulary_size). Softmax converts the logits into a probability distribution over the entire vocabulary. During training, cross-entropy loss compares this distribution to the actual next token. Weight tying shares the embedding matrix with this output projection, reducing parameters.
The simplicity of next-token prediction as the sole training objective is what makes GPT's approach powerful. A single objective, applied to massive text corpora, produces models that learn grammar, world knowledge, reasoning patterns, and even code without any task-specific engineering.
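With weight tying, the head needs no parameters of its own; the transposed embedding matrix serves as the output projection. A minimal sketch:

```python
import numpy as np

def lm_head(hidden, tok_emb):
    # weight tying: the (vocab_size, d) embedding matrix, transposed,
    # is reused as the output projection
    logits = hidden @ tok_emb.T                      # (T, vocab_size)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)              # per-position next-token distribution
```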
Pre-Training & Scaling
Next-Token Prediction
The self-supervised objective that drives all GPT pre-training.
Given a sequence of tokens, predict the next token at every position. The loss is the average negative log-likelihood across all positions and all sequences in the batch. Training data is massive web text (Common Crawl, books, code, Wikipedia). No labelled data is required. The model learns to compress and represent the statistical structure of human language.
This objective implicitly teaches the model grammar, facts, reasoning, style, and even basic arithmetic. Predicting the next token in "The capital of France is ___" requires world knowledge. Predicting the next token in code requires understanding programming logic. The objective is universal.
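The objective itself reduces to a few lines: average the negative log-probability assigned to each actual next token. A sketch, assuming `probs[t]` is the model's predicted distribution for position t+1:

```python
import numpy as np

def next_token_loss(probs, token_ids):
    # probs: (T-1, vocab_size) predicted distributions, where probs[t]
    # scores the token at position t+1; token_ids: (T,) observed sequence
    targets = token_ids[1:]
    picked = probs[np.arange(len(targets)), targets]   # probability of each true next token
    return -np.mean(np.log(picked))                    # average negative log-likelihood
```

A model that guesses uniformly over a vocabulary of size V incurs a loss of ln V per token, which is the baseline pre-training drives the loss down from.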
Scaling Laws
Predictable relationships between model size, data, compute, and performance.
Research (Kaplan et al., 2020; Hoffmann et al., 2022) showed that loss decreases as a power law with increases in parameters, dataset size, and compute budget. The Chinchilla scaling law demonstrated that most large models were over-parameterised and under-trained, recommending roughly 20 tokens per parameter. This shifted the field toward training smaller models on more data rather than simply making models larger.
Scaling laws allow researchers to predict model performance before training, optimise compute allocation, and make informed decisions about model size versus training duration. They transformed LLM development from trial-and-error into a more principled engineering discipline.
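The Chinchilla rule of thumb turns into a one-line compute-allocation check, using the standard approximation that training costs about 6ND FLOPs for N parameters and D tokens:

```python
def chinchilla_optimal_tokens(n_params):
    # Hoffmann et al. (2022) rule of thumb: ~20 training tokens per parameter
    return 20 * n_params

def training_flops(n_params, n_tokens):
    # standard approximation: training cost ~ 6 * N * D floating-point operations
    return 6 * n_params * n_tokens

n = 70_000_000_000                       # a 70B-parameter model
d = chinchilla_optimal_tokens(n)         # ~1.4 trillion tokens
budget = training_flops(n, d)
```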
Emergent Capabilities
Abilities that appear at certain model scales without being explicitly trained.
As models scale past certain parameter and data thresholds, they develop capabilities not present in smaller models. In-context learning (performing tasks from examples in the prompt without gradient updates) emerged strongly at GPT-3 scale. Chain-of-thought reasoning (solving multi-step problems by generating intermediate steps) became reliable in larger models. Few-shot prompting effectiveness scales with model size.
Emergent capabilities mean that scaling is not just about doing the same thing better. Larger models gain qualitatively new abilities. This is both exciting (new capabilities without new training objectives) and concerning (harder to predict what a model will or will not be able to do).
From Base Model to Assistant
Supervised Fine-Tuning (SFT)
Trains the base model on curated instruction-response pairs to produce helpful outputs.
Collect a dataset of prompts paired with ideal responses, often written by human annotators. Fine-tune the pre-trained model on this data using the same next-token prediction objective, but only on the response tokens (the prompt tokens contribute to context but not to the loss). Typical datasets range from tens of thousands to hundreds of thousands of examples.
SFT bridges the gap between a base model (which completes text) and an assistant (which follows instructions). A base model given "What is photosynthesis?" might continue with another question. After SFT, it provides a clear, helpful explanation.
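The "loss only on response tokens" detail is implemented with a mask over positions. A sketch, assuming per-position log-probabilities have already been computed:

```python
import numpy as np

def sft_loss(log_probs, token_ids, prompt_len):
    # log_probs: (T-1, vocab_size), where log_probs[t] scores the token at
    # position t+1; prompt tokens provide context but contribute no loss
    targets = token_ids[1:]
    nll = -log_probs[np.arange(len(targets)), targets]
    response = np.arange(1, len(token_ids)) >= prompt_len  # True at response positions
    return nll[response].mean()
```

Frameworks often express the same masking by setting prompt-position labels to an ignore index, but the effect is identical: gradients flow only through the response.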
RLHF (Reinforcement Learning from Human Feedback)
Aligns the model's behaviour with human preferences using a reward signal.
First, train a reward model on human preference data (annotators compare two model outputs and pick the better one). Then use the reward model's scores as a signal to optimise the language model with Proximal Policy Optimisation (PPO). A KL penalty prevents the model from drifting too far from the SFT baseline. The reward model learns nuanced preferences that are hard to specify as rules.
RLHF teaches models to be helpful, harmless, and honest in ways that SFT alone cannot capture. It handles subjective preferences (tone, detail level, safety) where there is no single "correct" answer. InstructGPT showed that RLHF with a small model could outperform a 100x larger base model.
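The reward-plus-KL-penalty objective can be sketched at the level of a single response; this is a simplified sequence-level version (InstructGPT applies the penalty per token):

```python
import numpy as np

def rlhf_reward(reward_score, logp_policy, logp_sft, beta=0.1):
    # total reward = reward-model score minus a KL penalty that keeps the
    # policy close to the SFT baseline; beta controls the trade-off
    kl = np.sum(logp_policy - logp_sft)   # log-ratio summed over response tokens
    return reward_score - beta * kl
```

If the policy's token probabilities match the SFT model's, the penalty is zero; the further the policy drifts, the more of the reward-model score it has to give back.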
Direct Preference Optimisation (DPO)
A simpler alternative to RLHF that eliminates the need for a separate reward model.
Use the same human preference data (pairs of outputs with a preferred choice) but optimise the language model directly. DPO reformulates the RLHF objective into a classification loss on preference pairs, proving mathematically that the optimal policy can be extracted without training a reward model. The model learns to increase the probability of preferred responses relative to dispreferred ones.
DPO significantly simplifies the alignment pipeline: no reward model training and no PPO instability, with a single β coefficient playing the role of the KL penalty directly in the loss. It produces comparable results to RLHF with less compute, fewer hyperparameters, and more stable training, and has become the preferred approach for many open-source model developers.
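The classification loss on a single preference pair is compact. A sketch, where each argument is a summed log-probability of a whole response (w = chosen, l = rejected) under the policy or the frozen reference model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # how much more the policy favours the chosen response over the rejected
    # one, relative to the reference model; beta scales the implicit KL control
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))
```

When the policy has not moved from the reference, the margin is zero and the loss is ln 2; the gradient then pushes probability toward the chosen response and away from the rejected one.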
GPT Lineage
GPT-1 (2018)
117M parameters. Demonstrated that unsupervised pre-training followed by supervised fine-tuning produces strong results across NLP tasks.
First large-scale application of transformer decoders to language modelling. Pre-trained on BooksCorpus (7,000 books). Showed that a single pre-trained model could be fine-tuned for classification, entailment, similarity, and question answering with minimal architecture changes.
Established the pre-train then fine-tune paradigm that became the foundation for all subsequent language models. Proved that generative pre-training (next-token prediction) learns useful representations, challenging the prevailing view that discriminative objectives were necessary.
GPT-2 (2019)
1.5B parameters. Demonstrated zero-shot task transfer without any fine-tuning.
Scaled to 1.5B parameters and trained on WebText (40GB of filtered Reddit links). Showed that language models can perform tasks like translation, summarisation, and question answering zero-shot by framing them as text completion. The largest model generated remarkably coherent long-form text.
The "too dangerous to release" framing sparked public debate about AI safety and responsible disclosure. More importantly, it proved that scale alone could unlock capabilities that smaller models could not achieve regardless of fine-tuning.
GPT-3 (2020)
175B parameters. Introduced in-context learning and few-shot prompting as a practical paradigm.
175B parameters trained on 300B tokens. Demonstrated that providing a few examples in the prompt (few-shot) enables the model to perform new tasks without any gradient updates. Performance scales smoothly with the number of examples (zero-shot, one-shot, few-shot). Introduced the API-based model-as-a-service paradigm.
Fundamentally changed how people interact with language models. Instead of fine-tuning for each task, users could describe what they wanted in natural language. This made LLMs accessible to non-researchers and launched the prompt engineering discipline.
GPT-4 (2023)
Multimodal model accepting both text and image inputs with substantially improved reasoning.
Multimodal input (text and images). Significantly improved performance on professional and academic benchmarks (bar exam, GRE, AP exams). Longer context windows (8K and 32K tokens). Better calibration and fewer hallucinations than GPT-3.5. System message support for customising model behaviour.
Demonstrated that LLMs could approach expert-level performance on specialised tasks. Multimodal capabilities expanded use cases beyond text. Became the backbone for ChatGPT, Copilot, and thousands of applications, establishing LLMs as general-purpose reasoning engines.
