
Self-attention function

  • Writer: Arturo Devesa
  • Dec 26, 2024
  • 3 min read

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V, where Q, K, and V are the query, key, and value matrices and d_k is the dimensionality of the keys.


The self-attention function can be considered one of the seminal mathematical breakthroughs in computing and artificial intelligence, akin to other transformative formulas in history. Like those earlier discoveries, self-attention reshaped its field by solving fundamental problems and opening new avenues for innovation.
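
In code, the formula above is just a few matrix operations. Below is a minimal NumPy sketch of scaled dot-product attention; in a real transformer, Q, K, and V come from learned linear projections of the input, which are omitted here to keep the sketch short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise similarity of queries and keys
    weights = softmax(scores, axis=-1)  # each row is an attention distribution summing to 1
    return weights @ V                  # weighted sum of value vectors

# Toy example: 4 tokens with embedding dimension 8, reusing the embeddings as Q, K, V.
X = np.random.default_rng(0).normal(size=(4, 8))
print(self_attention(X, X, X).shape)  # (4, 8)
```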

Here are other major mathematical formulas and concepts that revolutionized their respective domains:

1. Calculus by Newton and Leibniz (17th Century)

  • Key Formula: Derivatives and Integrals.

    • Example: The derivative f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

  • Impact:

    • Provided a framework to describe change and motion.

    • Enabled breakthroughs in physics (Newtonian mechanics), engineering, and more.

  • Relation to Self-Attention:

    • Just as calculus unified the study of change, self-attention unified how models handle relationships in sequences.

2. Fourier Transform (Jean-Baptiste Joseph Fourier, Early 19th Century)

  • Key Formula: Decomposition of functions into sinusoids: F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i \omega t}\, dt

  • Impact:

    • Revolutionized signal processing, enabling advancements in telecommunications, image processing, and audio technology.

    • Made it possible to analyze complex systems in the frequency domain.

  • Relation to Self-Attention:

    • The Fourier transform captures hidden periodicities in signals, while self-attention captures hidden relationships in data sequences.

3. Schrödinger Equation in Quantum Mechanics (1926)

  • Key Formula: i\hbar \frac{\partial}{\partial t} \Psi(\mathbf{r}, t) = \hat{H} \Psi(\mathbf{r}, t), where \Psi is the wave function and \hat{H} is the Hamiltonian operator.

  • Impact:

    • Laid the foundation of quantum mechanics, describing the behavior of particles at microscopic scales.

  • Relation to Self-Attention:

    • Just as the Schrödinger equation models the probability distribution of particles, self-attention models a probability distribution over token relationships in a sequence.

4. General Relativity by Einstein (1915)

  • Key Formula: G_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8 \pi G}{c^4} T_{\mu\nu}, where G_{\mu\nu} describes the curvature of spacetime.

  • Impact:

    • Revolutionized our understanding of gravity, time, and space.

    • Predicted phenomena like black holes and gravitational waves.

  • Relation to Self-Attention:

    • General Relativity revealed the fabric of spacetime, while self-attention unveiled the structure of relationships in data.

5. Logistic Regression in Statistics (Mid-20th Century)

  • Key Formula: P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}

  • Impact:

    • A cornerstone of classification problems, foundational in statistics, machine learning, and decision-making.

  • Relation to Self-Attention:

    • Logistic regression introduced probabilistic reasoning in predictions, a concept extended and refined by the softmax and self-attention mechanisms.
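
To make that connection concrete, here is a small sketch with made-up coefficients showing that the logistic sigmoid is exactly the two-class special case of the softmax used inside self-attention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # stabilized softmax
    return e / e.sum()

# Logistic regression score for one hypothetical input x with illustrative coefficients.
beta0, beta1, x = -1.0, 2.0, 0.75
z = beta0 + beta1 * x

print(sigmoid(z))                      # P(Y = 1 | X = x)
print(softmax(np.array([0.0, z]))[1])  # identical value: softmax over the logits [0, z]
```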

6. Fast Fourier Transform (Cooley and Tukey, 1965)

  • Key Insight: Computationally efficient implementation of Fourier Transform.

  • Impact:

    • Transformed fields like image compression (JPEG), audio coding (MP3), and modern digital communication.

  • Relation to Self-Attention:

    • Both the FFT and self-attention made previously impractical computations feasible: the FFT by reducing the cost of the transform from O(n²) to O(n log n), and self-attention by replacing sequential recurrence with parallelizable matrix operations.
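
As a small illustration with an invented two-tone signal, NumPy's FFT recovers the hidden frequencies directly; a naive O(n²) transform would do the same job far more slowly as the signal grows.

```python
import numpy as np

# A made-up signal containing 5 Hz and 12 Hz components, sampled at 100 Hz for 1 second.
fs = 100
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)                  # computed in O(n log n) via the FFT
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two largest spectral peaks sit at the hidden frequencies.
peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(np.sort(peaks))  # [ 5. 12.]
```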

7. PageRank Algorithm (Google, Late 1990s)

  • Key Formula: PR(A) = (1-d) + d \sum_{i \in \text{In}(A)} \frac{PR(i)}{\text{Out}(i)}, where PR(A) is the rank of webpage A and d is the damping factor.

  • Impact:

    • Revolutionized web search by ranking pages based on link importance.

  • Relation to Self-Attention:

    • PageRank identifies relationships between web pages, while self-attention identifies relationships between tokens in sequences.
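
A toy implementation over an invented four-page link graph shows how the formula is iterated until the ranks settle to a fixed point.

```python
import numpy as np

# Hypothetical link graph: links[i, j] = 1 means page i links to page j.
links = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

d = 0.85                        # damping factor
n = links.shape[0]
out_degree = links.sum(axis=1)  # Out(i): number of outgoing links per page
ranks = np.ones(n)              # start every page with rank 1

# Repeatedly apply PR(A) = (1 - d) + d * sum over in-links of PR(i) / Out(i).
for _ in range(100):
    incoming = links.T @ (ranks / out_degree)
    new_ranks = (1 - d) + d * incoming
    if np.allclose(new_ranks, ranks):
        break
    ranks = new_ranks

print(ranks)  # page 2, which every other page links to, ends up with the highest rank
```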

8. Backpropagation Algorithm (Rumelhart et al., 1986)

  • Key Concept: Efficient computation of gradients for neural networks.

  • Impact:

    • Made deep learning practical, enabling multi-layer networks.

  • Relation to Self-Attention:

    • Backpropagation is essential for training transformers and self-attention-based architectures.
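
As a reminder of what the algorithm does mechanically, here is a minimal hand-written example on made-up data: a single sigmoid unit whose gradients are obtained by applying the chain rule from the loss back to the parameters, followed by a gradient-descent update.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0           # one made-up input and its target
w, b, lr = rng.normal(size=3), 0.0, 1.0  # random weights, zero bias, learning rate

for _ in range(500):
    # Forward pass: y_hat = sigmoid(w·x + b), squared-error loss.
    z = w @ x + b
    y_hat = 1.0 / (1.0 + np.exp(-z))
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: chain rule from the loss back to w and b.
    dz = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w, grad_b = dz * x, dz

    # Gradient-descent update.
    w -= lr * grad_w
    b -= lr * grad_b

print(loss)  # a small fraction of its starting value after a few hundred steps
```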

Why Self-Attention is Revolutionary

  • Simplified Complexity: Solved sequence processing problems without recurrence or convolution.

  • Broadened Applicability: Works across NLP, vision, and multimodal tasks.

  • Scalability: Enables training on massive datasets due to parallelizability.

Key Takeaway

Self-attention, like the above formulas, has reshaped its field by providing an elegant, scalable, and highly effective solution to a long-standing problem. It is likely to be remembered as one of the cornerstones of AI and machine learning, just as calculus, Fourier transforms, and general relativity transformed their respective domains.
