On the Inductive Biases of Vision Transformers
Convolutional networks bake locality and translation equivariance into the architecture. Vision Transformers throw that away — and win at scale. Here’s why, and what we quietly lose along the way.
Attention isn’t free
The cost of self-attention is quadratic in the sequence length $n$, because the score matrix $QK^\top$ holds one entry for every pair of tokens:
$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
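For an image split into $n$ patches, the $n \times n$ score matrix is the $O(n^2)$ term that makes high-resolution ViTs expensive: at a fixed patch size, doubling the image's side length quadruples $n$ and grows the score matrix 16-fold.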
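```python
def attention(q, k, v, d_k):
    # (..., n, d_k) @ (..., d_k, n) -> (..., n, n) scores: the quadratic term
    scores = (q @ k.transpose(-2, -1)) / d_k ** 0.5
    # Normalize each query's row over the keys, then take a weighted sum of values
    return scores.softmax(dim=-1) @ v
```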
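To put rough numbers on the quadratic growth, here is a minimal sketch. It assumes square, non-overlapping 16×16 patches (the patch size is an assumption, not something fixed by the text above), and the helper names `num_patches` and `score_entries` are hypothetical. It counts only the entries of a single $n \times n$ score matrix, ignoring any class token and the rest of the model.

```python
def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens (patch=16 is an assumed default)."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch) * (width // patch)


def score_entries(n: int) -> int:
    """Entries in one n x n attention score matrix (per head, per layer)."""
    return n * n


# Each doubling of the side length quadruples n and multiplies the entries by 16.
for side in (224, 448, 896):
    n = num_patches(side, side)
    print(f"{side}x{side}: n = {n}, score entries = {score_entries(n)}")
```

Under these assumptions, going from 224×224 to 448×448 takes $n$ from 196 to 784 and the score matrix from 38,416 to 614,656 entries.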
More to come.