Understanding Self-Attention in Transformers

March 28, 2026 · Intelligent Computing Lab


The Transformer architecture, introduced in Attention Is All You Need (Vaswani et al., 2017), discarded recurrence entirely in favour of a mechanism that directly models pairwise relationships between all positions in a sequence. At its core sits self-attention — a deceptively simple operation with profound consequences for representation learning.

The Attention Function

Self-attention maps a query $Q$, a set of keys $K$, and a set of values $V$ to an output. All three are linear projections of the same input sequence. The output is a weighted sum of the values, where the weight of each value is determined by the compatibility of the corresponding key with the query:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $d_k$ is the dimensionality of the key vectors. The division by $\sqrt{d_k}$ is a crucial stabilisation step: without it, the dot products grow large in magnitude as $d_k$ increases, pushing the softmax into regions of near-zero gradient.
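The formula above can be sketched directly in NumPy. This is a minimal illustration for a single sequence, not the paper's implementation; the function name and toy shapes are my own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q: (n, d_k), K: (n, d_k), V: (n, d_v) -- toy shapes for illustration.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) compatibility matrix
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Each row of the weight matrix sums to one, so every output vector is a convex combination of the value vectors.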

Why Does Scaling Help?

Assume the components of $Q$ and $K$ are independent random variables with mean $0$ and variance $1$. A single dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ then has mean $0$ and variance $d_k$, since

$$\operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k.$$

Dividing by $\sqrt{d_k}$ restores unit variance, keeping gradients well-conditioned throughout training.
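The variance argument is easy to check empirically. A quick simulation (sample size and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
n_samples = 20_000

# Many independent dot products of vectors with unit-variance components
q = rng.normal(size=(n_samples, d_k))
k = rng.normal(size=(n_samples, d_k))
dots = (q * k).sum(axis=1)

print(dots.var())                   # close to d_k = 512
print((dots / np.sqrt(d_k)).var())  # close to 1 after scaling
```

Without scaling, a standard deviation of $\sqrt{512} \approx 22.6$ means the largest logit dominates the softmax almost completely, which is exactly the saturated regime the paper warns about.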

Multi-Head Attention

Rather than computing a single attention function, the Transformer projects $Q$, $K$, and $V$ into $h$ different lower-dimensional subspaces and runs attention in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$

$$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In the original paper, $h = 8$ and $d_k = d_v = d_{\text{model}}/h = 64$.

Each head can specialise in a different type of relationship — syntactic dependencies, coreference, long-range semantics — while the concatenation recovers the full model dimension.
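A naive NumPy sketch of the multi-head computation follows. Production code batches the heads into one tensor contraction rather than looping; the per-head loop here is for clarity only, and all names are mine:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention for one sequence X: (n, d_model).

    W_Q, W_K: lists of h matrices (d_model, d_k); W_V: h matrices (d_model, d_v);
    W_O: (h * d_v, d_model).
    """
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        scores = scores - scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A /= A.sum(axis=-1, keepdims=True)
        heads.append(A @ V)                       # (n, d_v) per head
    return np.concatenate(heads, axis=-1) @ W_O  # (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
d_k = d_v = d_model // h  # 64, as in the original paper
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (6, 512)
```

Because each head works in a $d_k = d_{\text{model}}/h$ subspace, the total cost is comparable to single-head attention at full dimension.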

Positional Encoding

Self-attention is permutation-equivariant: shuffling the input sequence produces the same output (also shuffled). To give the model a sense of order, fixed sinusoidal positional encodings are added to the input embeddings before any processing:

$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

The sinusoidal basis was chosen so that the model can generalise to sequence lengths not seen during training, and so that $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ for any fixed offset $k$.
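The two formulas translate into a few lines of NumPy. This is a straightforward sketch of the definition above, assuming an even $d_{\text{model}}$; the function name is mine:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same).

    Assumes d_model is even.
    """
    positions = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = positions / (10000 ** (2 * i / d_model))  # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```

Each dimension pair traces a sinusoid of a different wavelength, from $2\pi$ up to $10000 \cdot 2\pi$, which is what makes the fixed-offset relationship a rotation and hence linear.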

Complexity

The time and memory complexity of self-attention is $O(n^2 d)$, where $n$ is the sequence length and $d$ is the model dimension. This quadratic dependence on $n$ is the principal bottleneck when scaling to long contexts. A large body of subsequent work — sparse attention, linear attention, and state-space models — has sought to reduce this cost while preserving the expressive power of the full attention matrix.
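A back-of-envelope calculation makes the quadratic memory cost concrete. Materialising the $n \times n$ score matrix for a single head in float32 (an illustrative assumption; real implementations vary) costs:

```python
def attention_matrix_bytes(n, dtype_bytes=4):
    """Memory for one dense (n x n) attention score matrix, one head."""
    return n * n * dtype_bytes

for n in (1_024, 4_096, 32_768):
    print(f"n={n}: {attention_matrix_bytes(n) / 2**20:.0f} MiB")
# n=1024: 4 MiB
# n=4096: 64 MiB
# n=32768: 4096 MiB
```

Quadrupling the context length multiplies this term by sixteen, which is why long-context work focuses on avoiding the dense matrix entirely.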

From Attention to Understanding

The attention weight matrix $A = \text{softmax}(QK^\top / \sqrt{d_k})$ has a pleasing interpretation: entry $A_{ij}$ quantifies how much token $i$ attends to token $j$. Visualising $A$ reveals that different heads capture qualitatively distinct patterns — heads that track subject–verb agreement, others that follow coreferent mentions across a paragraph.

This combination of mathematical simplicity, parallelisability, and interpretability is why self-attention has become the central primitive of modern deep learning. Understanding it from first principles is the starting point for almost every frontier research direction in the field.
