In a previous blog post, I showed the relationship between the calculus of variations and non-linear regression. In this post, I will delve into the link between the resulting regression model, Nadaraya-Watson regression, and the attention mechanism, which is central to state-of-the-art large language models like GPT, LLaMA, or Mistral.
Typically, solutions derived from the calculus of variations reveal deeper and more intriguing aspects of the problem at hand, rather than merely providing an answer. Consider, for instance, the arclength functional of a function:
\[\begin{align} J[y] = \int \sqrt{1 + (y')^2} \, dx \end{align}\]whose solution yields the equation of a straight line, revealing that it represents the shortest path between two points in the absence of additional constraints; or, delving into a more complex realm, the Einstein-Hilbert action:
\[\begin{align} J[g] = \int \left(\frac{1}{16\pi G}R + \mathcal{L}_{\text{matter}} \right) \sqrt{-g} \, d^4x \end{align}\]whose solution yields the Einstein field equations, revealing the relationship between the geometry of spacetime and the distribution of matter (1), a key insight for the general theory of relativity.
In the case of the equation obtained in the previous blog post, something deeper is going on, as shown relatively recently in an article by Utkin & Konstantinov (2022) (2). Consider the prediction for the instance $\mu_{1}$ of the testing set:
\[\begin{align} f(\mu_{1}) & = \displaystyle \frac{ \sum_{i = 1}^{n} y_i \,\delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1)}{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) } \\ & = \begin{bmatrix} \displaystyle \frac{ \delta_{\sigma}(\mathbf{x}_1 - \mathbf{\mu}_1)}{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) } & \cdots & \displaystyle \frac{ \delta_{\sigma}(\mathbf{x}_n - \mathbf{\mu}_1)}{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) } \end{bmatrix} \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix} \end{align}\]If we take element $c$ of the row vector of weights and apply some minor cancellations and transformations, we obtain:
\[\begin{align} \frac{ \delta_{\sigma}( \mathbf{x}_c - \mathbf{\mu}_1) }{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) } & = \frac{ \displaystyle \frac{1}{(2\pi)^{p/2}} \frac{1}{|\sigma^2 \mathbf{I}|^{1/2}} \exp\{ -\frac{1}{2\sigma^2} (\mathbf{x}_c - \mathbf{\mu}_1)^\top (\mathbf{x}_c - \mathbf{\mu}_1) \} }{ \displaystyle \sum_{i = 1}^n \frac{1}{(2\pi)^{p/2}} \frac{1}{|\sigma^2 \mathbf{I}|^{1/2}} \exp\{ -\frac{1}{2\sigma^2} (\mathbf{x}_i - \mathbf{\mu}_1)^\top (\mathbf{x}_i - \mathbf{\mu}_1) \} }\\ & = \frac{ \displaystyle \exp\{ -\frac{1}{2\sigma^2} \lVert \mathbf{x}_c - \mathbf{\mu}_1 \rVert_2^2 \} }{ \displaystyle \sum_{i = 1}^n \exp\{ -\frac{1}{2\sigma^2} \lVert \mathbf{x}_i - \mathbf{\mu}_1 \rVert_2^2\} }\\ & = \frac{ \displaystyle \exp\{ \frac{1}{2\sigma^2} s(\mathbf{x}_c, \mathbf{\mu}_1) \} }{ \displaystyle \sum_{i = 1}^n \exp\{ \frac{1}{2\sigma^2} s(\mathbf{x}_i, \mathbf{\mu}_1)\} } \\ \end{align}\]In the penultimate line, in the exponents of both the numerator and the denominator, we obtain the negative of the squared $\ell^2$-norm distance between a training instance (\(\mathbf{x}_c\) or \(\mathbf{x}_i\)) and the testing instance $\mathbf{\mu}_{1}$, multiplied by a constant. This negative squared distance can be seen as a function $s: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, with $s(\mathbf{x}, \mathbf{\mu}) = -\lVert \mathbf{x} - \mathbf{\mu} \rVert_2^2$, that measures similarity, since it increases as the two vectors get closer.
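To make the cancellation concrete, here is a minimal NumPy sketch (the variable names `X`, `mu`, and `sigma`, and the toy sizes, are my own illustrative choices, not from the original derivation) comparing the normalized Gaussian-kernel weights with a softmax over the scaled similarities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3                      # n training instances of dimension p
X = rng.normal(size=(n, p))      # training instances x_1, ..., x_n
mu = rng.normal(size=p)          # a single testing instance mu_1
sigma = 1.5

# Normalized Gaussian kernel weights: delta_sigma(x_i - mu) / sum_j delta_sigma(x_j - mu)
kernel = np.exp(-np.sum((X - mu) ** 2, axis=1) / (2 * sigma ** 2))
weights_kernel = kernel / kernel.sum()

# Softmax over the scaled similarities s(x_i, mu) = -||x_i - mu||_2^2
z = -np.sum((X - mu) ** 2, axis=1) / (2 * sigma ** 2)
weights_softmax = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # numerically stable softmax

print(np.allclose(weights_kernel, weights_softmax))  # True: the normalization constants cancel
```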
Applying this transformation to every element of the weight vector in the prediction for the instance $\mu_{1}$:
\[\begin{align} f(\mu_{1}) & = \begin{bmatrix} \frac{ \displaystyle \exp\{ \frac{1}{2\sigma^2} s(\mathbf{x}_1, \mathbf{\mu}_1) \} }{ \displaystyle \sum_{i = 1}^n \exp\{ \frac{1}{2\sigma^2} s(\mathbf{x}_i, \mathbf{\mu}_1)\} } & \cdots & \frac{ \displaystyle \exp\{ \frac{1}{2\sigma^2} s(\mathbf{x}_n, \mathbf{\mu}_1) \} }{ \displaystyle \sum_{i = 1}^n \exp\{ \frac{1}{2\sigma^2} s(\mathbf{x}_i, \mathbf{\mu}_1)\} } \end{bmatrix} \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix} \\ & = \text{softmax} \left( \frac{1}{2\sigma^2} \begin{bmatrix} s(\mathbf{x}_1, \mathbf{\mu}_1) & \cdots & s(\mathbf{x}_n, \mathbf{\mu}_1) \end{bmatrix} \right) \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix} \end{align}\]From here we can already see that the prediction of the given instance is simply a weighted average of the labels.
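Continuing the sketch above (and reusing `X`, `mu`, and `sigma` from it), the full Nadaraya-Watson prediction for $\mu_{1}$ is just this softmax vector dotted with the label vector; `nw_predict` below is an illustrative helper of my own, not something defined in the original post:

```python
def nw_predict(X, y, mu, sigma):
    """Nadaraya-Watson prediction for a single testing instance mu:
    a softmax-weighted average of the training labels y."""
    z = -np.sum((X - mu) ** 2, axis=1) / (2 * sigma ** 2)   # scaled similarities s(x_i, mu)
    w = np.exp(z - z.max())
    w /= w.sum()                                            # softmax weights
    return w @ y                                            # weighted average of the labels

y = rng.normal(size=n)                                      # training labels y_1, ..., y_n
print(nw_predict(X, y, mu, sigma))
```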
One way to measure the similarity between two vectors is via their inner product. More generally, we can define the following generalized inner product:
\[s(\mathbf{x}_i, \mathbf{\mu}_j)=\mathbf{\mu}_j^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_i \in \mathbb{R}^{1 \times p} \times \mathbb{R}^{p \times d_k} \times \mathbb{R}^{d_k \times p} \times \mathbb{R}^{p \times 1}\]where $d_k$ is the dimension of the projected (query/key) space and $p$ is the original vector dimension of \(\mathbf{\mu}_j\) and \(\mathbf{x}_i\). The matrices \(\mathbf{W}_{q} \in \mathbb{R}^{d_k \times p}\) and \(\mathbf{W}_{k} \in \mathbb{R}^{d_k \times p}\) effectively project the vectors \(\mathbf{\mu}_j\) and \(\mathbf{x}_i\) into a new space of dimension $d_k$. We will come back later to how the elements of these matrices are obtained.
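As an illustrative sketch of this similarity (reusing the variables from the previous snippets; `Wq`, `Wk`, and the projected dimension `d_k` are filled with random placeholder values here, since we have not yet said how their entries are obtained):

```python
d_k = 4                                  # dimension of the projected query/key space
Wq = rng.normal(size=(d_k, p))           # W_q in R^{d_k x p} (random placeholder)
Wk = rng.normal(size=(d_k, p))           # W_k in R^{d_k x p} (random placeholder)

def bilinear_similarity(x, mu, Wq, Wk):
    """s(x, mu) = mu^T Wq^T Wk x: an ordinary inner product between the projected vectors."""
    return (Wq @ mu) @ (Wk @ x)

print(bilinear_similarity(X[0], mu, Wq, Wk))
```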
So far, we have been working on the prediction for a single instance, $\mathbf{\mu}_{1}$. Now we are ready to work with the whole set of $t$ testing instances. Plugging the above definition of the similarity function into the whole matrix and applying the softmax function row-wise (with a slight abuse of notation), we have:
\[\begin{align} \begin{bmatrix} f(\mu_{1}) \\ \vdots \\ f(\mu_{t}) \end{bmatrix} & = \text{softmax} \left( \frac{1}{2\sigma^2} \begin{bmatrix} \mathbf{\mu}_1^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_1 & \cdots & \mathbf{\mu}_1^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_n \\ \vdots & \ddots & \vdots \\ \mathbf{\mu}_t^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_1 & \cdots & \mathbf{\mu}_t^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_n \end{bmatrix} \right) \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix} \\ & = \text{softmax} \left( \frac{1}{2\sigma^2} \begin{bmatrix} - & \mathbf{\mu}_1^{\top}\mathbf{W}_q^{\top} & -\\ & \vdots & \\ - & \mathbf{\mu}_t^{\top}\mathbf{W}_q^{\top} & - \end{bmatrix} \begin{bmatrix} | & & | \\ \mathbf{W}_k\mathbf{x}_1 & \cdots & \mathbf{W}_k\mathbf{x}_n \\ | & & | \end{bmatrix} \right) \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix} \\ & = \text{softmax} \left( \frac{1}{2\sigma^2} \underbrace{\mathbf{\mu} \mathbf{W}_q^{\top}}_{\mathbf{Q}} \underbrace{\mathbf{W}_k \mathbf{X}^{\top}}_{\mathbf{K}^{\top}} \right) \underbrace{ \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix} }_{\mathbf{V}} \\ & = \text{softmax} \left( \frac{\mathbf{Q}\mathbf{K}^{\top}}{2\sigma^2} \right) \mathbf{V} \end{align}\]If we replace the constant $1/2\sigma^2$ with another constant, $1/\sqrt{d_{k}}$, where $d_{k}$ is the dimension of the key vectors (the projected dimension introduced above), it becomes more apparent that the Nadaraya-Watson prediction turns into the equation of the attention mechanism when the function $s$ is used as the similarity function:
\[\begin{align} \begin{bmatrix} f(\mu_{1}) \\ \vdots \\ f(\mu_{t}) \end{bmatrix} & = \text{softmax} \left( \frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{ d_{k} }} \right) \mathbf{V} = \text{Attention} (\mathbf{Q}, \mathbf{K}, \mathbf{V}) \end{align}\]In the context of deep neural networks, the matrices \(\mathbf{W}_{q}\) and \(\mathbf{W}_{k}\) are parameters of the Transformer architecture (3), and their elements are learned via gradient descent.
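Putting everything together, here is a minimal sketch of the matrix form (again reusing the toy variables from the earlier snippets; `Mu` stacks a couple of hypothetical testing instances, and the random `Wq`, `Wk` stand in for parameters that would be learned). It also checks that each row of the output is a softmax-weighted average of the labels, as in the Nadaraya-Watson form above:

```python
m = 2                                     # number of testing instances
Mu = rng.normal(size=(m, p))              # testing instances mu_1, ..., mu_m stacked row-wise

Q = Mu @ Wq.T                             # queries, shape (m, d_k)
K = X @ Wk.T                              # keys,    shape (n, d_k)
V = y.reshape(-1, 1)                      # values: the training labels, shape (n, 1)

scores = Q @ K.T / np.sqrt(d_k)           # 1/sqrt(d_k) plays the role of 1/(2 sigma^2)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)         # row-wise softmax: the attention weights
predictions = A @ V                       # one prediction per testing instance

# Each row is indeed a softmax-weighted average of the labels y
w0 = np.exp(Q[0] @ K.T / np.sqrt(d_k))
w0 /= w0.sum()
print(np.allclose(predictions[0, 0], w0 @ y))  # True
```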
This intriguing connection contributes to the notion that a deeper understanding of kernel machines is crucial for comprehending how neural networks function (4). Perhaps we are getting closer to the ‘master algorithm’ that unifies all factions within the machine learning community (5).
References
- Carroll, S. (2022). The biggest ideas in the universe: Space, time, and motion. Penguin.
- Utkin, L. V., & Konstantinov, A. V. (2022). Attention-based random forest and contamination model. Neural Networks, 154, 346-359.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
- Domingos, P. (2020). Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152.
- Domingos, P. (2015). The master algorithm: How the quest for the ultimate learning machine will remake our world. Basic Books.