Anupa Adikary

On the Inductive Biases of Vision Transformers

Convolutional networks bake locality and translation equivariance into the architecture. Vision Transformers throw that away — and win at scale. Here’s why, and what we quietly lose along the way.

Attention isn’t free

Self-attention compares every token with every other token, so its cost is quadratic in sequence length $n$:

$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
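
To see where the quadratic comes from, it helps to write out the shapes. This is a sketch for a single head with dense matrix products, assuming $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$; the symbol $d_v$ is introduced here for illustration and is not used elsewhere in the post.

$$QK^\top \in \mathbb{R}^{n \times n}, \qquad \text{cost}\left(QK^\top\right) = O(n^2 d_k), \qquad \text{cost}\left(\text{softmax}(\cdot)\, V\right) = O(n^2 d_v)$$

Both products touch every entry of the $n \times n$ score matrix, so time and memory for that matrix grow with $n^2$ regardless of the head dimension.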

For an image split into $n$ patches, that $O(n^2)$ term is what makes high-resolution ViTs expensive.

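def attention(q, k, v, d_k):
    # q, k: tensors of shape (..., n, d_k); scores has shape (..., n, n)
    scores = (q @ k.transpose(-2, -1)) / d_k ** 0.5
    # Softmax over the key axis, then a weighted sum of the value rows
    return scores.softmax(dim=-1) @ v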
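
To put rough numbers on the quadratic term, here is a small back-of-the-envelope sketch in plain Python. The helper names `num_patches` and `score_matrix_entries`, the 16-pixel patch size, and the three square resolutions are illustrative assumptions, not values from this post. It counts non-overlapping patches and ignores any extra class token.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patches (tokens), ignoring any class token."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def score_matrix_entries(n):
    """Entries in the n x n attention score matrix for a single head."""
    return n * n


# Illustrative resolutions with an assumed 16-pixel patch size.
for side in (224, 448, 896):
    n = num_patches(side, side, patch_size=16)
    print(f"{side}x{side}: n={n}, score entries per head={score_matrix_entries(n)}")
```

Doubling the side length quadruples $n$, so the score matrix grows by a factor of 16 per head and per layer.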

More to come.