Self-attention function
- Arturo Devesa
- Dec 26, 2024
- 3 min read
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
The self-attention function can be considered one of the seminal mathematical breakthroughs in computing and artificial intelligence, akin to other transformative formulas in history. Like those earlier discoveries, self-attention reshaped its field by solving fundamental problems and opening new avenues for innovation.
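To make the formula above concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The function name and toy dimensions are my own illustration, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity of queries and keys
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability before exponentiating
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row is a probability distribution
    return weights @ V                              # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional queries/keys/values
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q, K, V come from the same input
print(out.shape)  # (4, 8)
```

Each row of the softmax output is a probability distribution over the tokens, which is what lets every position weigh every other position when forming its output.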
Here are other major mathematical formulas and concepts that revolutionized their respective domains in a similar way:
1. Calculus by Newton and Leibniz (17th Century)
Key Formula: Derivatives and Integrals.
Example: The derivative f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
Impact:
Provided a framework to describe change and motion.
Enabled breakthroughs in physics (Newtonian mechanics), engineering, and more.
Relation to Self-Attention:
Just as calculus unified the study of change, self-attention unified how models handle relationships in sequences.
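As a quick numerical illustration of that limit definition (a sketch with arbitrary step sizes), shrinking h drives the difference quotient toward the true derivative:

```python
# Approximate f'(x) for f(x) = x^2 at x = 3; the exact derivative is 6
f = lambda x: x ** 2
x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, (f(x + h) - f(x)) / h)  # ~7.0, 6.1, 6.01, 6.001 -> approaches 6
```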
2. Fourier Transform (Jean-Baptiste Fourier, Early 19th Century)
Key Formula: Decomposition of functions into sinusoids. F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i \omega t}\, dt
Impact:
Revolutionized signal processing, enabling advancements in telecommunications, image processing, and audio technology.
Made it possible to analyze complex systems in the frequency domain.
Relation to Self-Attention:
The Fourier transform captures hidden periodicities, while self-attention captures hidden relationships in data sequences.
3. Schrödinger Equation in Quantum Mechanics (1926)
Key Formula: i \hbar \frac{\partial}{\partial t} \Psi(\mathbf{r}, t) = \hat{H} \Psi(\mathbf{r}, t), where \Psi is the wave function and \hat{H} is the Hamiltonian operator.
Impact:
Laid the foundation of quantum mechanics, describing the behavior of particles at microscopic scales.
Relation to Self-Attention:
Just as the Schrödinger equation models the probability distribution of particles, self-attention models the probability distribution of token relationships in sequences.
4. General Relativity by Einstein (1915)
Key Formula: G_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8 \pi G}{c^4} T_{\mu\nu}, where G_{\mu\nu} describes spacetime curvature.
Impact:
Revolutionized our understanding of gravity, time, and space.
Predicted phenomena like black holes and gravitational waves.
Relation to Self-Attention:
General Relativity revealed the fabric of spacetime, while self-attention unveiled the structure of relationships in data.
5. Logistic Regression in Statistics (Mid-20th Century)
Key Formula: P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
Impact:
A cornerstone of classification problems, foundational in statistics, machine learning, and decision-making.
Relation to Self-Attention:
Logistic regression introduced probabilistic reasoning in predictions, a concept extended and refined by the softmax and self-attention mechanisms.
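As a small sketch of that lineage (the logit value is arbitrary), the two-class softmax reduces exactly to the logistic sigmoid, which is why softmax is often described as its multi-class generalization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

z = 1.3  # an arbitrary logit
# Two-class softmax over logits [z, 0] gives the same probability as sigmoid(z)
print(sigmoid(z))                       # ~0.7858
print(softmax(np.array([z, 0.0]))[0])   # ~0.7858
```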
6. Fast Fourier Transform (Cooley and Tukey, 1965)
Key Insight: Computationally efficient implementation of Fourier Transform.
Impact:
Transformed fields like image compression (JPEG), audio coding (MP3), and modern digital communication.
Relation to Self-Attention:
Both the FFT and self-attention optimize computational tasks, enabling previously infeasible calculations.
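For a concrete sense of the speed-up, here is a minimal sketch (assuming NumPy) comparing a naive O(n²) DFT with np.fft.fft; the two agree to floating-point precision, but the FFT needs only O(n log n) operations.

```python
import numpy as np

def naive_dft(x):
    """Direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    k = np.arange(n)
    # DFT matrix: exp(-2*pi*i*j*k/n) for every (frequency, sample) pair
    W = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return W @ x

x = np.random.default_rng(0).normal(size=256)
print(np.allclose(naive_dft(x), np.fft.fft(x)))  # True: same result, very different cost
```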
7. PageRank Algorithm (Google, Late 1990s)
Key Formula: PR(A) = (1 - d) + d \sum_{i \in \text{In}(A)} \frac{PR(i)}{\text{Out}(i)}, where PR(A) is the rank of a webpage A and d is the damping factor.
Impact:
Revolutionized web search by ranking pages based on link importance.
Relation to Self-Attention:
PageRank identifies relationships between web pages, while self-attention identifies relationships between tokens in sequences.
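A minimal power-iteration sketch of the formula above; the toy link graph, function name, and iteration count are illustrative assumptions, not Google's implementation.

```python
def pagerank(out_links, d=0.85, iters=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(i) / Out(i)) over pages i linking to A."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}  # start with uniform rank
    for _ in range(iters):
        new_pr = {}
        for p in pages:
            incoming = [q for q in pages if p in out_links[q]]
            new_pr[p] = (1 - d) + d * sum(pr[q] / len(out_links[q]) for q in incoming)
        pr = new_pr
    return pr

# Toy web: A links to B and C, B links to C, C links back to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))  # C accumulates the most rank in this graph
```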
8. Backpropagation Algorithm (Rumelhart et al., 1986)
Key Concept: Efficient computation of gradients for neural networks.
Impact:
Made deep learning practical, enabling multi-layer networks.
Relation to Self-Attention:
Backpropagation is essential for training transformers and self-attention-based architectures.
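A minimal sketch of what backpropagation computes, applying the chain rule by hand to a tiny made-up two-layer example:

```python
import numpy as np

# Tiny network: y = w2 * tanh(w1 * x), loss = (y - target)^2
x, target = 0.5, 1.0
w1, w2 = 0.8, 1.5

# Forward pass, saving intermediates needed for the backward pass
h = np.tanh(w1 * x)
y = w2 * h
loss = (y - target) ** 2

# Backward pass: apply the chain rule layer by layer
dloss_dy = 2 * (y - target)
dloss_dw2 = dloss_dy * h                 # dy/dw2 = h
dloss_dh = dloss_dy * w2                 # dy/dh = w2
dloss_dw1 = dloss_dh * (1 - h ** 2) * x  # d tanh(u)/du = 1 - tanh(u)^2, du/dw1 = x

print(loss, dloss_dw1, dloss_dw2)
```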
Why Self-Attention is Revolutionary
Simplified Complexity: Solved sequence processing problems without recurrence or convolution.
Broadened Applicability: Works across NLP, vision, and multimodal tasks.
Scalability: Enables training on massive datasets due to parallelizability.
Key Takeaway
Self-attention, like the above formulas, has reshaped its field by providing an elegant, scalable, and highly effective solution to a long-standing problem. It is likely to be remembered as one of the cornerstones of AI and machine learning, just as calculus, Fourier transforms, and general relativity transformed their respective domains.