\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
The self-attention function can be considered one of the seminal mathematical breakthroughs in computing and artificial intelligence, akin to other transformative formulas in history. Like those earlier discoveries, self-attention reshaped its field by solving fundamental problems and opening new avenues for innovation.
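To ground the formula before the historical comparisons, here is a minimal NumPy sketch of scaled dot-product attention. The shapes, seed, and the helper name scaled_dot_product_attention are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors.
    return weights @ V

# Toy example: 3 tokens, model dimension 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```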
Here are similar examples of major mathematical formulas or concepts that revolutionized their respective domains:
1. Calculus by Newton and Leibniz (17th Century)
Key Formula: Derivatives and Integrals.
Example: The derivative f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
Impact:
Provided a framework to describe change and motion.
Enabled breakthroughs in physics (Newtonian mechanics), engineering, and more.
Relation to Self-Attention:
Just as calculus unified the study of change, self-attention unified how models handle relationships in sequences.
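As a small aside, the limit definition above can be checked numerically; the following sketch (the test function and step sizes are illustrative assumptions) shows the forward-difference quotient approaching the exact derivative.

```python
import math

def forward_difference(f, x, h):
    """Difference quotient (f(x + h) - f(x)) / h from the limit definition."""
    return (f(x + h) - f(x)) / h

# As h shrinks, the quotient approaches the exact derivative cos(1.0).
for h in (1e-1, 1e-3, 1e-5):
    approx = forward_difference(math.sin, 1.0, h)
    print(f"h={h:g}  approx={approx:.6f}  exact={math.cos(1.0):.6f}")
```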
2. Fourier Transform (Jean-Baptiste Fourier, Early 19th Century)
Key Formula: Decomposition of functions into sinusoids. F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i \omega t}\, dt
Impact:
Revolutionized signal processing, enabling advancements in telecommunications, image processing, and audio technology.
Made it possible to analyze complex systems in the frequency domain.
Relation to Self-Attention:
Fourier Transform captured hidden periodicities, while self-attention captures hidden relationships in data sequences.
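To make the frequency-domain idea concrete, here is a small NumPy sketch (the signal and sample rate are illustrative assumptions) that recovers the two frequencies hidden in a composite signal via the discrete Fourier transform.

```python
import numpy as np

# A signal built from 5 Hz and 12 Hz sinusoids, sampled at 100 Hz for 1 second.
fs = 100
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# The DFT decomposes the signal into its sinusoidal components.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two largest peaks sit at the original frequencies.
peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(peaks.tolist()))  # [5.0, 12.0]
```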
3. Schrödinger Equation in Quantum Mechanics (1926)
Key Formula: i \hbar \frac{\partial}{\partial t} \Psi(\mathbf{r}, t) = \hat{H} \Psi(\mathbf{r}, t), where \Psi is the wave function and \hat{H} is the Hamiltonian operator.
Impact:
Laid the foundation of quantum mechanics, describing the behavior of particles at microscopic scales.
Relation to Self-Attention:
Just as the Schrödinger equation governs the probability distribution of a particle's state through the wave function, self-attention produces a probability distribution (via the softmax) over token relationships in a sequence.
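As an illustrative aside on the key formula above (not part of the original comparison): for a free particle in one dimension, \hat{H} = -\frac{\hbar^2}{2m}\,\partial_x^2, and the plane wave \Psi(x, t) = e^{i(kx - \omega t)} satisfies the equation, since i\hbar\,\partial_t \Psi = \hbar\omega\,\Psi and -\frac{\hbar^2}{2m}\,\partial_x^2 \Psi = \frac{\hbar^2 k^2}{2m}\,\Psi, which agree exactly when \hbar\omega = \frac{\hbar^2 k^2}{2m}.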
4. General Relativity by Einstein (1915)
Key Formula: G_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8 \pi G}{c^4} T_{\mu\nu}, where G_{\mu\nu} describes spacetime curvature.
Impact:
Revolutionized our understanding of gravity, time, and space.
Predicted phenomena like black holes and gravitational waves.
Relation to Self-Attention:
General Relativity revealed the fabric of spacetime, while self-attention unveiled the structure of relationships in data.
5. Logistic Regression in Statistics (Mid-20th Century)
Key Formula: P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
Impact:
A cornerstone of classification problems, foundational in statistics, machine learning, and decision-making.
Relation to Self-Attention:
Logistic regression introduced probabilistic reasoning in predictions, a concept extended and refined by the softmax and self-attention mechanisms.
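To illustrate that connection, here is a minimal sketch (the coefficients and input are illustrative assumptions) showing that the logistic sigmoid is exactly the two-class special case of the softmax used inside self-attention.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: P(Y=1 | X) for a linear score z = b0 + b1 * x."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """General K-class normalization, the same operation used for attention weights."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# With two classes scored (z, 0), softmax reduces to the sigmoid of z.
b0, b1, x = -1.0, 2.0, 0.75
z = b0 + b1 * x
print(sigmoid(z))            # 0.6224...
print(softmax([z, 0.0])[0])  # same value
```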
6. Fast Fourier Transform (Cooley and Tukey, 1965)
Key Insight: Computationally efficient implementation of Fourier Transform.
Impact:
Transformed fields like image compression (JPEG), audio coding (MP3), and modern digital communication.
Relation to Self-Attention:
The FFT reduced the discrete Fourier transform from O(N²) to O(N log N) operations; self-attention replaced sequential recurrence with parallelizable matrix operations. Both made previously impractical computations routine.
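A small, hedged illustration of that speedup (the input size is an arbitrary choice): a naive O(N²) DFT compared against NumPy's FFT on the same input.

```python
import time
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of the discrete Fourier transform."""
    N = len(x)
    n = np.arange(N)
    # N x N matrix of complex exponentials e^{-2*pi*i*k*n/N}.
    M = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return M @ x

x = np.random.default_rng(0).normal(size=2048)

t0 = time.perf_counter(); X_naive = naive_dft(x); t1 = time.perf_counter()
X_fast = np.fft.fft(x); t2 = time.perf_counter()

print(np.allclose(X_naive, X_fast))                    # same result
print(f"naive: {t1 - t0:.3f}s  fft: {t2 - t1:.5f}s")   # FFT is orders of magnitude faster
```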
7. PageRank Algorithm (Google, Late 1990s)
Key Formula: PR(A) = (1 - d) + d \sum_{i \in \text{In}(A)} \frac{PR(i)}{\text{Out}(i)}, where PR(A) is the rank of a webpage A and d is the damping factor.
Impact:
Revolutionized web search by ranking pages based on link importance.
Relation to Self-Attention:
PageRank identifies relationships between web pages, while self-attention identifies relationships between tokens in sequences.
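For concreteness, here is a minimal power-iteration sketch of the formula above on a tiny hypothetical four-page link graph (the graph, damping factor, and iteration count are illustrative assumptions).

```python
# Tiny hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d = 0.85  # damping factor

# Start all ranks at 1 and iterate PR(A) = (1 - d) + d * sum(PR(i) / Out(i)).
ranks = {p: 1.0 for p in pages}
for _ in range(50):
    new_ranks = {}
    for p in pages:
        inbound = [q for q in pages if p in links[q]]
        new_ranks[p] = (1 - d) + d * sum(ranks[q] / len(links[q]) for q in inbound)
    ranks = new_ranks

print({p: round(ranks[p], 3) for p in pages})  # pages with more inbound weight rank higher
```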
8. Backpropagation Algorithm (Rumelhart et al., 1986)
Key Concept: Efficient computation of gradients for neural networks.
Impact:
Made deep learning practical, enabling multi-layer networks.
Relation to Self-Attention:
Backpropagation is essential for training transformers and self-attention-based architectures.
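As a hedged illustration of the core idea, gradient computation by the chain rule, here is a single backward pass for a one-neuron logistic model (the data and architecture are toy assumptions), with the analytic gradient checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)   # toy input
y = 1.0                  # toy target
w = rng.normal(size=3)   # weights to learn

def loss(w):
    """Squared error of a single sigmoid neuron on (x, y)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return 0.5 * (p - y) ** 2

# Backward pass via the chain rule: dL/dw = (p - y) * p * (1 - p) * x.
p = 1.0 / (1.0 + np.exp(-(w @ x)))
grad = (p - y) * p * (1 - p) * x

# Check against a numerical gradient to confirm the chain rule was applied correctly.
eps = 1e-6
num_grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(grad, num_grad, atol=1e-6))  # True
```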
Why Self-Attention is Revolutionary
Simplified Complexity: Solved sequence processing problems without recurrence or convolution.
Broadened Applicability: Works across NLP, vision, and multimodal tasks.
Scalability: Enables training on massive datasets due to parallelizability.
Key Takeaway
Self-attention, like the above formulas, has reshaped its field by providing an elegant, scalable, and highly effective solution to a long-standing problem. It is likely to be remembered as one of the cornerstones of AI and machine learning, just as calculus, Fourier transforms, and general relativity transformed their respective domains.