Understanding Self-Attention in Transformers

March 28, 2026 · Intelligent Computing Lab


The Transformer architecture, introduced in Attention Is All You Need (Vaswani et al., 2017), discarded recurrence entirely in favour of a mechanism that directly models pairwise relationships between all positions in a sequence. At its core sits self-attention — a deceptively simple operation with profound consequences for representation learning.

The Attention Function

Self-attention maps a query $Q$, a set of keys $K$, and a set of values $V$ to an output. All three are linear projections of the same input sequence. The output is a weighted sum of the values, where the weight of each value is determined by the compatibility of the corresponding key with the query:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $d_k$ is the dimensionality of the key vectors. The division by $\sqrt{d_k}$ is a crucial stabilisation step: without it, the dot products grow large in magnitude as $d_k$ increases, pushing the softmax into regions of near-zero gradient.
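The formula above can be sketched directly in NumPy. This is a minimal illustration for a single sequence, not the paper's implementation; the function name and toy shapes are my own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q: (n, d_k), K: (n, d_k), V: (n, d_v) -- toy shapes for illustration.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) compatibility matrix
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Each row of the weight matrix sums to one, so every output vector is a convex combination of the value vectors.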

Why Does Scaling Help?

Assume the components of $Q$ and $K$ are independent random variables with mean $0$ and variance $1$. A single dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ then has mean $0$ and variance $d_k$, since

$$\operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k.$$

Dividing by $\sqrt{d_k}$ restores unit variance, keeping gradients well-conditioned throughout training.
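The variance argument is easy to check empirically. A quick simulation (sample size and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
n_samples = 20_000

# Many independent dot products of vectors with unit-variance components
q = rng.normal(size=(n_samples, d_k))
k = rng.normal(size=(n_samples, d_k))
dots = (q * k).sum(axis=1)

print(dots.var())                   # close to d_k = 512
print((dots / np.sqrt(d_k)).var())  # close to 1 after scaling
```

Without scaling, a standard deviation of $\sqrt{512} \approx 22.6$ means the largest logit dominates the softmax almost completely, which is exactly the saturated regime the paper warns about.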

Multi-Head Attention

Rather than computing a single attention function, the Transformer projects $Q$, $K$, and $V$ into $h$ different lower-dimensional subspaces and runs attention in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$

$$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In the original paper, $h = 8$ and $d_k = d_v = d_{\text{model}}/h = 64$.

Each head can specialise in a different type of relationship — syntactic dependencies, coreference, long-range semantics — while the concatenation recovers the full model dimension.
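A naive NumPy sketch of the multi-head computation follows. Production code batches the heads into one tensor contraction rather than looping; the per-head loop here is for clarity only, and all names are mine:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention for one sequence X: (n, d_model).

    W_Q, W_K: lists of h matrices (d_model, d_k); W_V: h matrices (d_model, d_v);
    W_O: (h * d_v, d_model).
    """
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        scores = scores - scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A /= A.sum(axis=-1, keepdims=True)
        heads.append(A @ V)                       # (n, d_v) per head
    return np.concatenate(heads, axis=-1) @ W_O  # (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
d_k = d_v = d_model // h  # 64, as in the original paper
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (6, 512)
```

Because each head works in a $d_k = d_{\text{model}}/h$ subspace, the total cost is comparable to single-head attention at full dimension.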

Positional Encoding

Self-attention is permutation-equivariant: shuffling the input sequence produces the same output (also shuffled). To give the model a sense of order, fixed sinusoidal positional encodings are added to the input embeddings before any processing:

$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

The sinusoidal basis was chosen so that the model can generalise to sequence lengths not seen during training, and so that $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ for any fixed offset $k$.
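The two formulas translate into a few lines of NumPy. This is a straightforward sketch of the definition above, assuming an even $d_{\text{model}}$; the function name is mine:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same).

    Assumes d_model is even.
    """
    positions = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = positions / (10000 ** (2 * i / d_model))  # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```

Each dimension pair traces a sinusoid of a different wavelength, from $2\pi$ up to $10000 \cdot 2\pi$, which is what makes the fixed-offset relationship a rotation and hence linear.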

Complexity

The time and memory complexity of self-attention is $O(n^2 d)$, where $n$ is the sequence length and $d$ is the model dimension. This quadratic dependence on $n$ is the principal bottleneck when scaling to long contexts. A large body of subsequent work — sparse attention, linear attention, and state-space models — has sought to reduce this cost while preserving the expressive power of the full attention matrix.
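A back-of-envelope calculation makes the quadratic memory cost concrete. Materialising the $n \times n$ score matrix for a single head in float32 (an illustrative assumption; real implementations vary) costs:

```python
def attention_matrix_bytes(n, dtype_bytes=4):
    """Memory for one dense (n x n) attention score matrix, one head."""
    return n * n * dtype_bytes

for n in (1_024, 4_096, 32_768):
    print(f"n={n}: {attention_matrix_bytes(n) / 2**20:.0f} MiB")
# n=1024: 4 MiB
# n=4096: 64 MiB
# n=32768: 4096 MiB
```

Quadrupling the context length multiplies this term by sixteen, which is why long-context work focuses on avoiding the dense matrix entirely.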

From Attention to Understanding

The attention weight matrix $A = \text{softmax}(QK^\top / \sqrt{d_k})$ has a pleasing interpretation: entry $A_{ij}$ quantifies how much token $i$ attends to token $j$. Visualising $A$ reveals that different heads capture qualitatively distinct patterns — heads that track subject–verb agreement, others that follow coreferent mentions across a paragraph.

This combination of mathematical simplicity, parallelisability, and interpretability is why self-attention has become the central primitive of modern deep learning. Understanding it from first principles is the starting point for almost every frontier research direction in the field.
