BERT

Architecture Fundamentals

Input Representation

Converts raw text into a structured input the model can process, combining token, segment, and position information.

How it works

Text is tokenised with WordPiece, splitting unknown words into subword units. Three embeddings are summed for each token: a token embedding (vocabulary lookup), a segment embedding (marking sentence A vs sentence B), and a position embedding (encoding sequence order). Special tokens [CLS] (classification) and [SEP] (separator) mark sentence boundaries.
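The embedding sum can be sketched with toy tables; the dimensions, values, and the `embed` helper below are illustrative, not BERT's real weights:

```python
def embed(token_ids, segment_ids, tok_emb, seg_emb, pos_emb):
    """Input representation: for each position, elementwise-sum the token,
    segment, and position embeddings."""
    return [
        [t + s + p for t, s, p in zip(tok_emb[tid], seg_emb[sid], pos_emb[i])]
        for i, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]

# Tiny 2-dimensional embedding tables (BERT-Base uses 768 dimensions).
tok_emb = {0: [0.1, 0.2], 1: [0.3, 0.1], 2: [0.5, 0.4], 3: [0.2, 0.2]}
seg_emb = {0: [0.01, 0.0], 1: [0.02, 0.0]}     # sentence A vs sentence B
pos_emb = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]

# ids for "[CLS] w1 [SEP] w2", with the final token belonging to segment B
vectors = embed([0, 1, 2, 3], [0, 0, 0, 1], tok_emb, seg_emb, pos_emb)
```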

Why it matters

The segment embedding allows BERT to reason about pairs of sentences in a single forward pass, which is critical for tasks like natural language inference and question answering where two text inputs must be compared.

Bidirectional Self-Attention

Every token attends to every other token in the input, with no directional restriction.

How it works

Unlike GPT's causal masking, BERT uses full (unmasked) self-attention. Each token can attend to all positions, both left and right. This produces deeply contextualised representations where a word's meaning is shaped by its entire surrounding context simultaneously.
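The contrast can be sketched with single-head scaled dot-product attention in NumPy (the `attention` helper and sizes are illustrative, and real BERT applies learned query/key/value projections first):

```python
import numpy as np

def attention(x, causal=False):
    """Single-head scaled dot-product self-attention. With causal=True,
    each token may only attend to itself and earlier positions (GPT-style);
    with causal=False, attention is fully bidirectional (BERT-style)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Mask out upper-triangular (future) positions with a large negative.
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))               # 4 tokens, 8-dim states
_, w_full = attention(x)                  # every token sees every token
_, w_causal = attention(x, causal=True)   # future positions masked out
```

In `w_full`, the first token assigns non-zero weight to later positions; under the causal mask those weights vanish.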

Why it matters

Bidirectionality lets BERT disambiguate words that depend on right-side context. For example, "bank" in "river bank" versus "bank account" requires seeing the full sentence. Unidirectional models must commit to a representation before seeing the disambiguating context.

Encoder Stack

The core of BERT is a stack of identical transformer encoder layers.

How it works

BERT-Base uses 12 layers, 768 hidden dimensions, and 12 attention heads (110M parameters). BERT-Large uses 24 layers, 1024 hidden dimensions, and 16 attention heads (340M parameters). Each layer applies multi-head self-attention followed by a feed-forward network, with layer normalisation and residual connections around each sub-layer.
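One encoder layer can be sketched as follows, using single-head attention and ReLU in place of BERT's multi-head attention and GELU for brevity; all names and sizes are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Normalise each token vector to zero mean / unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def encoder_layer(x, wq, wk, wv, wo, w1, w2):
    """One post-LN encoder layer: self-attention, then a feed-forward
    network, each wrapped in a residual connection + LayerNorm."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    x = layer_norm(x + (a @ v) @ wo)        # sub-layer 1: attention
    ffn = np.maximum(0.0, x @ w1) @ w2      # ReLU stand-in for BERT's GELU
    return layer_norm(x + ffn)              # sub-layer 2: feed-forward

d, d_ff, n_tokens = 8, 32, 4                # toy sizes; BERT-Base: 768 / 3072
rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.1, size=s)
           for s in [(d, d)] * 4 + [(d, d_ff), (d_ff, d)]]
out = encoder_layer(rng.normal(size=(n_tokens, d)), *weights)
```

The full model simply applies 12 (Base) or 24 (Large) such layers in sequence.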

Why it matters

Deeper stacks build increasingly abstract representations. Lower layers capture surface-level syntax, middle layers capture syntactic structure, and upper layers capture task-relevant semantics. This hierarchical feature extraction is what makes fine-tuning effective across diverse tasks.

Pooling & Output Representations

BERT produces two types of output used for different downstream tasks.

How it works

The [CLS] token's final hidden state serves as a pooled representation of the entire input sequence, used for classification tasks. Individual token hidden states provide per-token representations for tasks like named entity recognition or extractive QA. Some implementations add mean pooling or attention-weighted pooling over all tokens as an alternative to [CLS].
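The three options can be sketched over toy final hidden states (shapes, values, and the masked-mean formula are illustrative):

```python
import numpy as np

# Toy final hidden states: 5 positions ([CLS] first), hidden size 4.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))
attention_mask = np.array([1, 1, 1, 1, 0])    # last position is padding

cls_pooled = hidden[0]          # sentence-level vector for classification
token_states = hidden           # per-token vectors for NER / extractive QA
# Alternative: masked mean pooling, averaging only non-padding positions.
mean_pooled = (hidden * attention_mask[:, None]).sum(axis=0) / attention_mask.sum()
```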

Why it matters

This dual output design makes BERT flexible. The same pre-trained model can be adapted for both sentence-level tasks (sentiment, entailment) and token-level tasks (NER, POS tagging) by choosing which output to use.

Pre-Training Objectives

Masked Language Modelling (MLM)

The primary pre-training objective that teaches BERT to understand language bidirectionally.

How it works

Randomly select 15% of input tokens. Of those, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The model predicts the original token at each selected position using the surrounding bidirectional context. The 80/10/10 split reduces the mismatch between pre-training and fine-tuning: [MASK] never appears in downstream inputs, so the model must build useful representations for ordinary tokens too rather than treating [MASK] as the only signal to predict.
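The corruption procedure can be sketched directly; the `mlm_mask` helper is a toy implementation, not from any library:

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", p=0.15, rng=None):
    """Apply BERT-style MLM corruption: select ~p of the tokens; of those,
    80% become [MASK], 10% become a random vocabulary token, 10% stay
    unchanged. Returns corrupted tokens and per-position targets
    (None = position not selected, so no loss is computed there)."""
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= p:
            continue
        targets[i] = tok                      # predict the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token         # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token
        # else 10%: leave the original token in place
    return corrupted, targets

text = "[CLS] the cat sat on the mat [SEP]".split()
corrupted, targets = mlm_mask(text, vocab=["dog", "tree", "ran"])
```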

Why it matters

MLM forces the model to build representations that encode deep contextual understanding. Unlike next-token prediction, the model cannot simply memorise left-to-right patterns. It must learn to use context from both directions, producing richer representations.

Next Sentence Prediction (NSP)

A secondary objective that teaches BERT to understand relationships between sentences.

How it works

Given two sentences A and B, predict whether B is the actual next sentence after A in the corpus (IsNext) or a random sentence (NotNext). The [CLS] token's representation is used for this binary classification. Training uses 50% positive and 50% negative pairs.
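Pair construction can be sketched as below; the `make_nsp_pairs` helper is illustrative, and real pre-training samples the negative sentence from elsewhere in the corpus rather than the same short list:

```python
import random

def make_nsp_pairs(sentences, rng=None):
    """Build NSP examples from an ordered list of sentences: for each
    adjacent pair, keep the true successor half the time (IsNext) and
    swap in a random other sentence otherwise (NotNext)."""
    rng = rng or random.Random(0)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            candidates = [s for j, s in enumerate(sentences) if j != i + 1]
            pairs.append((sentences[i], rng.choice(candidates), "NotNext"))
    return pairs

pairs = make_nsp_pairs([f"sentence {i}" for i in range(6)])
```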

Why it matters

NSP was designed to improve performance on sentence-pair tasks like question answering and natural language inference. Later work (RoBERTa) showed NSP may not be necessary, and some variants replace it with sentence order prediction or remove it entirely.

Fine-Tuning Patterns

Sequence Classification

Maps an entire input sequence to a single class label.

How it works

Take the [CLS] token's final hidden state and pass it through a task-specific linear layer that projects to the number of classes. Apply softmax for single-label classification or sigmoid for multi-label. Fine-tune all BERT parameters plus the classification head end-to-end with cross-entropy loss. Typical learning rates are 2e-5 to 5e-5 with 2 to 4 epochs.
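The head itself is just a linear projection plus softmax over the [CLS] state, sketched here with toy sizes (the `classify` helper and its weights are illustrative):

```python
import numpy as np

def classify(cls_hidden, w, b):
    """Task head: project the [CLS] hidden state to class logits,
    then softmax into class probabilities."""
    logits = cls_hidden @ w + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

hidden_size, num_classes = 8, 3           # toy sizes (BERT-Base uses 768)
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(hidden_size, num_classes))
b = np.zeros(num_classes)
probs = classify(rng.normal(size=hidden_size), w, b)
```

During fine-tuning, the gradient of the cross-entropy loss flows through this head into every BERT parameter.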

Example tasks

Sentiment analysis (SST-2), topic classification (AG News), natural language inference (MNLI), paraphrase detection (QQP), toxicity detection.

Token Classification

Assigns a label to each individual token in the sequence.

How it works

Take each token's final hidden state and pass it through a shared linear layer that projects to the label set. For WordPiece subwords, typically only the first subword token of each word is classified and the rest are ignored or labelled with a continuation tag. CRF layers can be added on top for structured prediction.
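The first-subword rule can be sketched as a label-alignment helper. `align_labels` and the word-id convention follow common practice (assigning -100 so the loss function skips those positions) but are illustrative:

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Map word-level labels onto subword tokens: label only the first
    subword of each word; continuation subwords and special tokens
    (word id None) get ignore_index so the loss skips them."""
    labels, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            labels.append(ignore_index)
        else:
            labels.append(word_labels[wid])
        prev = wid
    return labels

# "HuggingFace rocks" → subwords [CLS] Hug ##ging ##Face rocks [SEP]
word_ids = [None, 0, 0, 0, 1, None]
labels = align_labels(["B-ORG", "O"], word_ids)
# labels == [-100, "B-ORG", -100, -100, "O", -100]
```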

Example tasks

Named entity recognition (CoNLL-2003), part-of-speech tagging (Penn Treebank), chunking, slot filling in dialogue systems.

Extractive Question Answering

Identifies an answer span within a given context passage.

How it works

Encode the question and passage as a sentence pair: [CLS] question [SEP] passage [SEP]. Two linear layers predict the start and end positions of the answer span independently. The start layer scores each passage token as a potential answer start; the end layer does the same for the answer end. The valid span (start ≤ end, within a maximum answer length) with the highest combined score is selected.
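Span selection over the two sets of logits can be sketched as follows (toy logits; the `best_span` helper is illustrative):

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair maximising start_logit + end_logit,
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start = np.array([0.1, 2.0, 0.3, 0.1])   # made-up logits over 4 passage tokens
end = np.array([0.0, 0.2, 1.5, 0.4])
span = best_span(start, end)             # → (1, 2)
```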

Example tasks

SQuAD 1.1 and 2.0, Natural Questions, TriviaQA, reading comprehension benchmarks.

Sentence Pair Tasks

Processes two sentences jointly to predict their relationship.

How it works

Encode both sentences as [CLS] sentence_A [SEP] sentence_B [SEP]. Segment embeddings distinguish the two sentences. The [CLS] representation captures the cross-sentence relationship and is used for classification. Because both sentences share one input sequence, the self-attention layers also let every token in one sentence attend directly to tokens in the other, giving fine-grained cross-sentence interaction.
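The packing format can be sketched as below (illustrative helper; real tokenisers also truncate the pair to a maximum sequence length):

```python
def encode_pair(tokens_a, tokens_b):
    """Pack two sentences into one BERT input:
    [CLS] A [SEP] B [SEP], with segment id 0 for A's half and 1 for B's."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segments

tokens, segments = encode_pair(["a", "cat"], ["it", "purrs"])
# tokens   == ['[CLS]', 'a', 'cat', '[SEP]', 'it', 'purrs', '[SEP]']
# segments == [0, 0, 0, 0, 1, 1, 1]
```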

Example tasks

Natural language inference (MNLI, SNLI), semantic textual similarity (STS-B), paraphrase identification (MRPC), answer sentence selection.

BERT Variants & Evolution

RoBERTa

A robustly optimised BERT that improves pre-training by removing questionable design choices.

Key changes

Removes Next Sentence Prediction entirely. Uses dynamic masking (a new mask pattern each time a sequence is seen) instead of static masking. Trains with much larger batches (8K sequences), more data (160GB vs BERT's 16GB), and many more steps. Uses byte-level byte-pair encoding instead of WordPiece.

Trade-offs

Consistently outperforms BERT across GLUE, SQuAD, and RACE, establishing that BERT was significantly under-trained. Requires substantially more compute for pre-training, but since the architecture is unchanged, inference cost is identical and the resulting model is simply better.

DistilBERT

A smaller, faster version of BERT produced through knowledge distillation.

Key changes

6 layers instead of 12, initialised by taking every other layer from BERT. Trained to match BERT's output distributions using a combination of distillation loss (soft targets from the teacher), masked language modelling loss, and cosine embedding loss on the hidden states.

Trade-offs

40% smaller, 60% faster inference, retains 97% of BERT's performance. An excellent choice when deployment constraints matter more than squeezing out the last percentage point of accuracy.

ALBERT

Reduces parameter count dramatically through embedding factorisation and cross-layer parameter sharing.

Key changes

Factorised embedding parameterisation splits the large vocabulary embedding matrix into two smaller matrices. Cross-layer parameter sharing uses the same parameters for all transformer layers. Replaces NSP with Sentence Order Prediction (SOP): predict whether two consecutive segments appear in their original order, a harder task that targets discourse coherence rather than topic cues.
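The saving from the factorised embedding is easy to quantify (BERT-Base-like sizes with a 30K vocabulary; the `embedding_params` helper is illustrative):

```python
def embedding_params(vocab_size, hidden_size, factor_dim=None):
    """Input-embedding parameter count: a direct V x H lookup table,
    vs ALBERT's factorised V x E table plus E x H projection (E << H)."""
    if factor_dim is None:
        return vocab_size * hidden_size
    return vocab_size * factor_dim + factor_dim * hidden_size

direct = embedding_params(30000, 768)         # BERT-style: 23,040,000
factored = embedding_params(30000, 768, 128)  # ALBERT E=128: 3,938,304
```

The factorisation cuts the embedding parameters by roughly a factor of six at these sizes.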

Trade-offs

ALBERT-xxlarge achieves state-of-the-art results with fewer parameters than BERT-Large, but inference is not faster because the same parameters are applied sequentially across layers. Parameter reduction is about memory, not speed.

ELECTRA

Replaces masked token prediction with a more sample-efficient discriminative objective.

Key changes

Uses a generator-discriminator setup. A small generator (like a small masked LM) produces plausible replacement tokens. The main model (discriminator) predicts for every token whether it is original or replaced. This "replaced token detection" provides a training signal for all tokens, not just the 15% that are masked.
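The labelling scheme can be sketched as follows (toy `rtd_example` helper; note that a sampled token which happens to equal the original is labelled "original", as in the paper):

```python
import random

def rtd_example(tokens, sample_replacement, p=0.15, rng=None):
    """Build a replaced-token-detection example: corrupt ~p of the tokens
    with generator samples; the discriminator's target for EVERY position
    is 1 if the token was replaced, 0 if it is the original."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            new = sample_replacement(tok)      # stand-in for the generator
            corrupted.append(new)
            labels.append(int(new != tok))     # equal sample → still "original"
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# Trivial "generator" that always proposes "the".
corrupted, labels = rtd_example("the cat sat on the mat".split(), lambda t: "the")
```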

Trade-offs

Significantly more sample-efficient than BERT. ELECTRA-Small trained on one GPU for 4 days matches GPT trained on a much larger compute budget. The discriminative objective means every token contributes to learning, not just masked positions.