Understanding Self-Attention in Transformers
March 28, 2026 · Intelligent Computing Lab
The Transformer architecture, introduced in Attention Is All You Need (Vaswani et al., 2017), discarded recurrence entirely in favour of a mechanism that directly models pairwise relationships between all positions in a sequence. At its core sits self-attention — a deceptively simple operation with profound consequences for representation learning.
The Attention Function
Self-attention maps a query $Q$, a set of keys $K$, and a set of values $V$ to an output. All three are linear projections of the same input sequence. The output is a weighted sum of the values, where the weight of each value is determined by the compatibility of the corresponding key with the query:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Here $d_k$ is the dimensionality of the key vectors. The division by $\sqrt{d_k}$ is a crucial stabilisation step: without it, the dot products grow large in magnitude as $d_k$ increases, pushing the softmax into regions of near-zero gradient.
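The formula above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the function name is my own, and it omits masking and batching.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k) compatibility scores
    scores -= scores.max(axis=-1, keepdims=True) # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of the values

# Tiny self-attention example: Q = K = V = X, 3 tokens of dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (3, 4)
```

Note that a zero query scores every key equally, so the output collapses to the plain mean of the values, which is a handy sanity check.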
Why Does Scaling Help?
Assume the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. A single dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ then has mean $0$ and variance $d_k$, since

$$\mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i)\,\mathrm{Var}(k_i) = d_k.$$

Dividing by $\sqrt{d_k}$ restores unit variance, keeping gradients well-conditioned throughout training.
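The variance argument is easy to verify empirically. A quick Monte Carlo check (sample sizes and the choice of $d_k = 256$ are arbitrary):

```python
import numpy as np

# Dot products of d_k-dimensional unit-variance vectors have variance ~ d_k;
# dividing by sqrt(d_k) restores variance ~ 1.
rng = np.random.default_rng(42)
d_k = 256
q = rng.normal(size=(100_000, d_k))  # components with mean 0, variance 1
k = rng.normal(size=(100_000, d_k))
dots = (q * k).sum(axis=1)           # 100,000 sampled dot products

print(dots.var())                    # close to d_k = 256
print((dots / np.sqrt(d_k)).var())   # close to 1
```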
Multi-Head Attention
Rather than computing a single attention function, the Transformer projects $Q$, $K$, and $V$ into $h$ different lower-dimensional subspaces and runs attention in parallel:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$. In the original paper, $h = 8$ and $d_k = d_v = d_{\mathrm{model}}/h = 64$.
Each head can specialise in a different type of relationship — syntactic dependencies, coreference, long-range semantics — while the concatenation recovers the full model dimension.
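The projection-split-concatenate pipeline above can be sketched as follows. This is an illustrative sketch under simplifying assumptions (self-attention only, $d_k = d_v = d_{\mathrm{model}}/h$, no masking or biases); the function name and weight initialisation are my own.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Minimal multi-head self-attention sketch.

    X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    The projections are split into h heads of width d_k = d_model // h.
    """
    n, d_model = X.shape
    d_k = d_model // h

    def project_and_split(W):
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)

    Q, K, V = map(project_and_split, (W_q, W_k, W_v))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)              # stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    heads = A @ V                                             # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)     # concatenate heads
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * d_model**-0.5 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=h)
print(out.shape)  # (5, 16)
```

In practice the per-head projections are stored as one fused matrix per input, exactly as the reshape here implies, rather than as $h$ separate weight matrices.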
Positional Encoding
Self-attention is permutation-equivariant: shuffling the input sequence produces the same output (also shuffled). To give the model a sense of order, fixed sinusoidal positional encodings are added to the input embeddings before any processing:

$$PE_{(pos,\, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)$$
The sinusoidal basis was chosen so that the model can generalise to sequence lengths not seen during training, and so that $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ for any fixed offset $k$.
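The encoding table is straightforward to generate. A small sketch (the function name is my own; it assumes an even $d_{\mathrm{model}}$):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / (10000 ** (i / d_model))  # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(50, 8)
print(pe.shape)  # (50, 8)
```

At position 0 the table is $[0, 1, 0, 1, \ldots]$, since $\sin 0 = 0$ and $\cos 0 = 1$, which makes a convenient spot check.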
Complexity
The time complexity of self-attention is $O(n^2 \cdot d)$, and the attention matrix alone requires $O(n^2)$ memory, where $n$ is the sequence length and $d$ is the model dimension. This quadratic dependence on $n$ is the principal bottleneck when scaling to long contexts. A large body of subsequent work — sparse attention, linear attention, and state-space models — has sought to reduce this cost while preserving the expressive power of the full attention matrix.
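To make the quadratic memory cost concrete, here is a back-of-the-envelope calculation (assuming the $n \times n$ attention matrix is materialised in float32, per head):

```python
# Doubling the context length quadruples the attention matrix's footprint.
for n in (1_024, 2_048, 4_096, 8_192):
    bytes_needed = n * n * 4  # one n x n float32 matrix
    print(f"n = {n:>5}: {bytes_needed / 2**20:8.1f} MiB per head")
```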
From Attention to Understanding
The attention weight matrix $A = \mathrm{softmax}(QK^\top/\sqrt{d_k})$ has a pleasing interpretation: entry $A_{ij}$ quantifies how much token $i$ attends to token $j$. Visualising $A$ reveals that different heads capture qualitatively distinct patterns — heads that track subject–verb agreement, others that follow coreferent mentions across a paragraph.
This combination of mathematical simplicity, parallelisability, and interpretability is why self-attention has become the central primitive of modern deep learning. Understanding it from first principles is the starting point for almost every frontier research direction in the field.