GPT

Architecture Fundamentals

Decoder-Only Stack

A simplified transformer that uses only the decoder portion, processing tokens left-to-right.

How it works

Each layer contains causal self-attention followed by a feed-forward network, wrapped with layer normalisation and residual connections. There is no encoder and no cross-attention. The same architecture handles both understanding and generation by treating every task as text completion. Modern variants use pre-norm (normalise before each sub-layer) and SwiGLU activations.
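
The wiring above can be sketched in a few lines of NumPy. This is a toy single-head block with identity attention projections (no learned W_q/W_k/W_v) and a ReLU feed-forward stand-in, purely to show the pre-norm, attention, residual, FFN ordering; real implementations use learned projections, multiple heads, and GELU or SwiGLU activations.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_self_attention(x):
    # Toy single-head attention: W_q = W_k = W_v = identity, for brevity.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # mask the future
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def decoder_block(x, W1, W2):
    # Pre-norm: normalise *before* each sub-layer, then add the residual.
    x = x + causal_self_attention(layer_norm(x))
    x = x + np.maximum(0.0, layer_norm(x) @ W1) @ W2  # feed-forward network
    return x
```

Stacking N such blocks between the embeddings and the language-modelling head gives the full decoder-only model.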

Why it matters

The decoder-only design is remarkably versatile. By framing every task as "continue this text", a single architecture handles translation, summarisation, question answering, code generation, and reasoning without task-specific modifications.

Causal Self-Attention

Restricts each token to attending only to itself and preceding tokens, enforcing left-to-right processing.

How it works

A triangular mask is applied to the attention scores before softmax, setting all positions to the right of the current token to negative infinity. After softmax, these positions receive zero weight. At position t, the model sees tokens 0 through t and nothing beyond. This mask is applied identically during both training and inference.
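
Concretely, the mask and its effect on the attention weights can be demonstrated in a few lines (a sketch, not an optimised implementation):

```python
import numpy as np

def causal_mask(T):
    # True above the diagonal = future positions that must be hidden.
    return np.triu(np.ones((T, T), dtype=bool), k=1)

def masked_softmax(scores):
    scores = scores.copy()
    scores[causal_mask(scores.shape[0])] = -np.inf  # future -> -inf before softmax
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# With uniform scores, row t spreads its attention evenly over tokens 0..t
# and gives exactly zero weight to every future token.
A = masked_softmax(np.zeros((4, 4)))
```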

Why it matters

Causal masking makes autoregressive generation possible. The model learns to predict the next token using only preceding context, which matches the generation setting exactly. Without this constraint, the model would "cheat" during training by looking at future tokens.

Token & Position Embeddings

Maps discrete tokens and their positions into continuous vector representations.

How it works

GPT-1 and GPT-2 use learned position embeddings (a separate embedding table indexed by position). Each token's input is the sum of its token embedding and position embedding. GPT-3 uses the same approach. Context length is fixed at training time: 512 tokens for GPT-1, 1024 for GPT-2, 2048 for GPT-3, and 8K/32K for GPT-4 (extended to 128K in GPT-4 Turbo).
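
In code, the lookup-and-sum is a one-liner over two tables. The sizes below are toy values and the tables are randomly initialised; in a real model both are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 50, 16, 8  # toy sizes, not GPT's real ones

tok_emb = rng.normal(size=(vocab_size, d_model))  # one learned vector per token id
pos_emb = rng.normal(size=(max_len, d_model))     # one learned vector per position

def embed(token_ids):
    T = len(token_ids)
    # Learned positions only exist up to the trained context length.
    assert T <= max_len, "sequence exceeds trained context length"
    return tok_emb[token_ids] + pos_emb[:T]

x = embed([3, 7, 7])  # token id 7 gets different inputs at positions 1 and 2
```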

Why it matters

Learned position embeddings are simple and effective but limit the model to the context length seen during training. This motivated later innovations like ALiBi and RoPE in non-GPT models, though GPT-4 substantially extended context length through engineering.

Language Modelling Head

The final layer that converts hidden representations into next-token predictions.

How it works

The last transformer layer's output is projected through a linear layer with dimensions (hidden_size x vocabulary_size). Softmax converts the logits into a probability distribution over the entire vocabulary. During training, cross-entropy loss compares this distribution to the actual next token. Weight tying shares the embedding matrix with this output projection, reducing parameters.
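
A minimal version with weight tying, where the output projection reuses the embedding matrix transposed, looks like this (toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4
W_emb = rng.normal(size=(vocab_size, d_model))  # shared embedding matrix

def lm_head(h):
    # Weight tying: project hidden states with the transposed embedding matrix.
    logits = h @ W_emb.T                                # (T, vocab_size)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)                 # softmax over the vocabulary

def next_token_loss(h, targets):
    # Cross-entropy: mean negative log-probability of each true next token.
    probs = lm_head(h)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))
```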

Why it matters

The simplicity of next-token prediction as the sole training objective is what makes GPT's approach powerful. A single objective, applied to massive text corpora, produces models that learn grammar, world knowledge, reasoning patterns, and even code without any task-specific engineering.

Pre-Training & Scaling

Next-Token Prediction

The self-supervised objective that drives all GPT pre-training.

How it works

Given a sequence of tokens, predict the next token at every position. The loss is the average negative log-likelihood across all positions and all sequences in the batch. Training data is massive web text (Common Crawl, books, code, Wikipedia). No labelled data is required. The model learns to compress and represent the statistical structure of human language.
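
The loss itself is just an average negative log-likelihood. A quick illustration with GPT-2's vocabulary size shows the scale: a model that knows nothing pays about 10.8 nats per token, a confident model far less. (The 0.9 probability is an arbitrary illustration, not a measured value.)

```python
import numpy as np

def nll(probs_of_true_next_token):
    # Average negative log-likelihood over every position in the sequence.
    return -np.mean(np.log(probs_of_true_next_token))

vocab = 50257                          # GPT-2's vocabulary size
clueless = nll([1.0 / vocab] * 8)      # uniform guessing: log(50257), about 10.8
confident = nll([0.9] * 8)             # mostly right: -log(0.9), about 0.105
```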

Why it matters

This objective implicitly teaches the model grammar, facts, reasoning, style, and even basic arithmetic. Predicting the next token in "The capital of France is ___" requires world knowledge. Predicting the next token in code requires understanding programming logic. The objective is universal.

Scaling Laws

Predictable relationships between model size, data, compute, and performance.

How it works

Research (Kaplan et al., 2020; Hoffmann et al., 2022) showed that loss decreases as a power law with increases in parameters, dataset size, and compute budget. The Chinchilla scaling law demonstrated that most large models were over-parameterised and under-trained, recommending roughly 20 tokens per parameter. This shifted the field toward training smaller models on more data rather than simply making models larger.
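
A back-of-envelope version of the Chinchilla recipe: with the common approximation that training cost is C ≈ 6·N·D FLOPs and the recommendation D ≈ 20·N, a compute budget determines both sizes. This is a rough heuristic, not the fitted law itself.

```python
def chinchilla_optimal(compute_flops):
    # C ≈ 6 * N * D and D ≈ 20 * N  =>  C ≈ 120 * N**2
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla itself (70B params, 1.4T tokens) cost roughly
# 6 * 7e10 * 1.4e12 ≈ 5.9e23 FLOPs:
N, D = chinchilla_optimal(5.88e23)  # about 7.0e10 params, 1.4e12 tokens
```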

Why it matters

Scaling laws allow researchers to predict model performance before training, optimise compute allocation, and make informed decisions about model size versus training duration. They transformed LLM development from trial-and-error into a more principled engineering discipline.

Emergent Capabilities

Abilities that appear at certain model scales without being explicitly trained.

How it works

As models scale past certain parameter and data thresholds, they develop capabilities not present in smaller models. In-context learning (performing tasks from examples in the prompt without gradient updates) emerged strongly at GPT-3 scale. Chain-of-thought reasoning (solving multi-step problems by generating intermediate steps) became reliable in larger models. Few-shot prompting effectiveness scales with model size.
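
In-context learning needs nothing more than formatting examples into the prompt. A translation task, for instance (the English/French framing is just an illustration):

```python
def few_shot_prompt(examples, query):
    # The "training set" lives entirely in the prompt; no gradient updates occur.
    shots = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    shots.append(f"English: {query}\nFrench:")
    return "\n\n".join(shots)

prompt = few_shot_prompt([("cheese", "fromage"), ("cat", "chat")], "dog")
# The model is expected to continue the pattern with the French translation.
```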

Why it matters

Emergent capabilities mean that scaling is not just about doing the same thing better. Larger models gain qualitatively new abilities. This is both exciting (new capabilities without new training objectives) and concerning (harder to predict what a model will or will not be able to do).

From Base Model to Assistant

Supervised Fine-Tuning (SFT)

Trains the base model on curated instruction-response pairs to produce helpful outputs.

How it works

Collect a dataset of prompts paired with ideal responses, often written by human annotators. Fine-tune the pre-trained model on this data using the same next-token prediction objective, but only on the response tokens (the prompt tokens contribute to context but not to the loss). Typical datasets range from tens of thousands to hundreds of thousands of examples.

Why it matters

SFT bridges the gap between a base model (which completes text) and an assistant (which follows instructions). A base model given "What is photosynthesis?" might continue with another question. After SFT, it provides a clear, helpful explanation.

RLHF (Reinforcement Learning from Human Feedback)

Aligns the model's behaviour with human preferences using a reward signal.

How it works

First, train a reward model on human preference data (annotators compare two model outputs and pick the better one). Then use the reward model's scores as a signal to optimise the language model with Proximal Policy Optimisation (PPO). A KL penalty prevents the model from drifting too far from the SFT baseline. The reward model learns nuanced preferences that are hard to specify as rules.
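
The KL-penalised objective can be sketched at the level of a single sampled response. The scores and log-probabilities below are made-up numbers; real PPO applies the penalty per token, with clipping and value baselines.

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    # Reward model score minus a KL penalty that anchors the policy to the
    # SFT reference; the KL is estimated from the sampled response itself.
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate

# A response the reward model loves but that drifted far from the SFT model
# can score worse than a slightly lower-reward response that stayed close:
drifted = shaped_reward(2.0, logp_policy=-10.0, logp_ref=-18.0)  # 2.0 - 0.8 = 1.2
close = shaped_reward(1.5, logp_policy=-12.0, logp_ref=-13.0)    # 1.5 - 0.1 = 1.4
```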

Why it matters

RLHF teaches models to be helpful, harmless, and honest in ways that SFT alone cannot capture. It handles subjective preferences (tone, detail level, safety) where there is no single "correct" answer. InstructGPT showed that RLHF with a small model could outperform a 100x larger base model.

Direct Preference Optimisation (DPO)

A simpler alternative to RLHF that eliminates the need for a separate reward model.

How it works

Use the same human preference data (pairs of outputs with a preferred choice) but optimise the language model directly. DPO reformulates the RLHF objective into a classification loss on preference pairs, proving mathematically that the optimal policy can be extracted without training a reward model. The model learns to increase the probability of preferred responses relative to dispreferred ones.
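
The per-pair loss can be written directly. Each argument is the summed log-probability of a whole response under either the trainable policy or the frozen reference; the numbers in the test below are illustrative.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Margin: how much more the policy (relative to the frozen reference)
    # up-weights the chosen response over the rejected one.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Before training, the policy equals the reference: margin 0, loss log(2).
# As the policy learns to prefer the chosen response, the loss falls.
```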

Why it matters

DPO significantly simplifies the alignment pipeline: no reward model training, no PPO instability, and a single beta hyperparameter controlling the strength of the implicit KL constraint. It produces comparable results to RLHF with less compute, fewer moving parts, and more stable training. It has become the preferred approach for many open-source model developers.

GPT Lineage

GPT-1 (2018)

117M parameters. Demonstrated that unsupervised pre-training followed by supervised fine-tuning produces strong results across NLP tasks.

Key innovations

First large-scale application of transformer decoders to language modelling. Pre-trained on BooksCorpus (7,000 books). Showed that a single pre-trained model could be fine-tuned for classification, entailment, similarity, and question answering with minimal architecture changes.

Impact

Established the pre-train then fine-tune paradigm that became the foundation for all subsequent language models. Proved that generative pre-training (next-token prediction) learns useful representations, challenging the prevailing view that discriminative objectives were necessary.

GPT-2 (2019)

1.5B parameters. Demonstrated zero-shot task transfer without any fine-tuning.

Key innovations

Scaled to 1.5B parameters and trained on WebText (40GB of text scraped from pages linked on Reddit, using karma as a quality filter). Showed that language models can perform tasks like translation, summarisation, and question answering zero-shot by framing them as text completion. The largest model generated remarkably coherent long-form text.

Impact

The "too dangerous to release" framing sparked public debate about AI safety and responsible disclosure. More importantly, it proved that scale alone could unlock capabilities that smaller models could not achieve regardless of fine-tuning.

GPT-3 (2020)

175B parameters. Introduced in-context learning and few-shot prompting as a practical paradigm.

Key innovations

175B parameters trained on 300B tokens. Demonstrated that providing a few examples in the prompt (few-shot) enables the model to perform new tasks without any gradient updates. Performance scales smoothly with the number of examples (zero-shot, one-shot, few-shot). Introduced the API-based model-as-a-service paradigm.

Impact

Fundamentally changed how people interact with language models. Instead of fine-tuning for each task, users could describe what they wanted in natural language. This made LLMs accessible to non-researchers and launched the prompt engineering discipline.

GPT-4 (2023)

Multimodal model accepting both text and image inputs with substantially improved reasoning.

Key innovations

Multimodal input (text and images). Significantly improved performance on professional and academic benchmarks (bar exam, GRE, AP exams). Longer context windows (8K and 32K tokens). Better calibration and fewer hallucinations than GPT-3.5. System message support for customising model behaviour.

Impact

Demonstrated that LLMs could approach expert-level performance on specialised tasks. Multimodal capabilities expanded use cases beyond text. Became the backbone for ChatGPT, Copilot, and thousands of applications, establishing LLMs as general-purpose reasoning engines.