<h1>More attention to the Nadaraya-Watson regression</h1>
<p>In a <a href="https://ulises-rosas.github.io/jekyll/update/nadayara/">previous blog post</a>, I showed the relationship between the calculus of variations and non-linear regression. In this post, I will delve into the link between the resulting regression model, Nadaraya-Watson regression, and the attention mechanism, central to the Transformer architecture behind state-of-the-art large language models like GPTs, LLaMA, or Mistral.</p>
<p>Typically, solutions derived from the calculus of variations reveal deeper and more intriguing aspects of the problem at hand, rather than merely providing a solution. Consider, for instance, the arclength functional of a function:</p>
\[\begin{align}
J[y] = \int \sqrt{1 + (y')^2} \, dx
\end{align}\]
<p>whose solution yields the equation of a straight line, revealing that it represents the shortest path between two points in the absence of additional constraints; or, delving into a more complex realm, the Einstein-Hilbert action:</p>
\[\begin{align}
S[g] = \int \left(\frac{1}{16\pi G}R + \mathcal{L}_{\text{matter}} \right) \sqrt{-g} \, d^4x
\end{align}\]
<p>whose solution yields the Einstein field equations, revealing the relationship between the geometry of spacetime and the distribution of matter (1), a key insight of the general theory of relativity.</p>
<p>In the case of the equation obtained in the previous blog post, something deeper is going on, as shown relatively recently in an article by <a href="https://arxiv.org/abs/2201.02880">Utkin & Konstantinov 2022</a> (2). Consider the prediction for the instance $\mu_{1}$ of the testing set:</p>
\[\begin{align}
f(\mu_{1}) & = \displaystyle \frac{ \sum_{i = 1}^{n} y_i \,\delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1)}{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) } \\
& = \begin{bmatrix}
\displaystyle \frac{ \delta_{\sigma}(\mathbf{x}_1 - \mathbf{\mu}_1)}{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) } & \cdots &
\displaystyle \frac{ \delta_{\sigma}(\mathbf{x}_n - \mathbf{\mu}_1)}{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) }
\end{bmatrix}
\begin{bmatrix}
y_{1} \\
\vdots \\
y_{n}
\end{bmatrix}
\end{align}\]
<p>If we take the $c$-th element of the row vector on the left-hand side and apply some minor cancellations and transformations, we obtain:</p>
\[\begin{align}
\frac{ \delta_{\sigma}( \mathbf{x}_c - \mathbf{\mu}_1) }{\sum_{i = 1}^{n} \delta_{\sigma}(\mathbf{x}_i - \mathbf{\mu}_1) }
& =
\frac{
\displaystyle
\frac{1}{(2\pi)^{p/2}} \frac{1}{|\sigma^2 \mathbf{I}|^{1/2}}
\exp\{ -\frac{1}{2\sigma^2}
(\mathbf{x}_c - \mathbf{\mu}_1)^\top
(\mathbf{x}_c - \mathbf{\mu}_1) \}
}{
\displaystyle
\sum_{i = 1}^n
\frac{1}{(2\pi)^{p/2}} \frac{1}{|\sigma^2 \mathbf{I}|^{1/2}}
\exp\{ -\frac{1}{2\sigma^2}
(\mathbf{x}_i - \mathbf{\mu}_1)^\top
(\mathbf{x}_i - \mathbf{\mu}_1) \}
}\\
& =
\frac{
\displaystyle
\exp\{ -\frac{1}{2\sigma^2}
\lVert \mathbf{x}_c - \mathbf{\mu}_1 \rVert_2^2 \}
}{
\displaystyle
\sum_{i = 1}^n
\exp\{ -\frac{1}{2\sigma^2}
\lVert \mathbf{x}_i - \mathbf{\mu}_1 \rVert_2^2\}
}\\
& =
\frac{
\displaystyle
\exp\{ \frac{1}{2\sigma^2}
s(\mathbf{x}_c, \mathbf{\mu}_1) \}
}{
\displaystyle
\sum_{i = 1}^n
\exp\{ \frac{1}{2\sigma^2}
s(\mathbf{x}_i, \mathbf{\mu}_1)\}
}
\\
\end{align}\]
<p>In the penultimate line, in the exponent of the denominator, we obtain the negative of the squared $\ell^2$-norm distance between the training instance \(\mathbf{x}_i\) and the testing instance $\mathbf{\mu}_{1}$, multiplied by a constant. This negative squared distance can also be seen as a function $s: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ that measures similarity, as distance and similarity are inversely related. The same transformation can also be applied to the numerator.</p>
<p>Plugging the above result into the remaining entries of the vector in the prediction for the instance $\mu_{1}$:</p>
\[\begin{align}
f(\mu_{1}) & = \begin{bmatrix}
\frac{
\displaystyle
\exp\{ \frac{1}{2\sigma^2}
s(\mathbf{x}_1, \mathbf{\mu}_1) \}
}{
\displaystyle
\sum_{i = 1}^n
\exp\{ \frac{1}{2\sigma^2}
s(\mathbf{x}_i, \mathbf{\mu}_1)\}
}
& \cdots &
\frac{
\displaystyle
\exp\{ \frac{1}{2\sigma^2}
s(\mathbf{x}_n, \mathbf{\mu}_1) \}
}{
\displaystyle
\sum_{i = 1}^n
\exp\{ \frac{1}{2\sigma^2}
s(\mathbf{x}_i, \mathbf{\mu}_1)\}
}
\end{bmatrix}
\begin{bmatrix}
y_{1} \\
\vdots \\
y_{n}
\end{bmatrix} \\
& =
\text{softmax} \left(
\frac{1}{2\sigma^2}
\begin{bmatrix}
s(\mathbf{x}_1, \mathbf{\mu}_1)
& \cdots &
s(\mathbf{x}_n, \mathbf{\mu}_1)
\end{bmatrix} \right)
\begin{bmatrix}
y_{1} \\
\vdots \\
y_{n}
\end{bmatrix}
\end{align}\]
<p>From here we can already see that the prediction for a given instance is simply a weighted average of the training labels, with the weights produced by a softmax over similarities.</p>
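<p>To make this concrete, here is a minimal numpy sketch (with made-up one-dimensional data) checking that the Gaussian-kernel weights above are exactly a softmax over negative squared distances scaled by $1/2\sigma^2$:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 5, size=10)   # toy training inputs
y_train = np.sin(x_train)              # toy training labels
mu, sigma = 2.5, 0.5                   # one testing instance and the bandwidth

# Gaussian-kernel Nadaraya-Watson weights (the normalizing constants cancel)
k = np.exp(-(x_train - mu)**2 / (2 * sigma**2))
w_kernel = k / k.sum()

# the same weights as a softmax over similarities s(x, mu) = -(x - mu)^2
w_softmax = softmax(-(x_train - mu)**2 / (2 * sigma**2))

print(np.allclose(w_kernel, w_softmax))  # True
print(w_kernel @ y_train)                # prediction: weighted average of labels
</code></pre></div></div>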
<p>One way to measure the similarity between two vectors is via an inner product. More generally, we can define the following bilinear similarity:</p>
\[s(\mathbf{x}_i, \mathbf{\mu}_j)=\mathbf{\mu}_j^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_i \in
\mathbb{R}^{1 \times p} \times \mathbb{R}^{p \times d_k} \times
\mathbb{R}^{d_k \times p} \times \mathbb{R}^{p \times 1}\]
<p>where $d_k$ is the dimension of the projected vectors and $p$ is the original vector dimension of \(\mathbf{\mu}_j\) and \(\mathbf{x}_i\). The matrices \(\mathbf{W}_{q} \in \mathbb{R}^{d_k \times p}\) and \(\mathbf{W}_{k} \in \mathbb{R}^{d_k \times p}\) effectively transform the vectors \(\mathbf{\mu}_j\) and \(\mathbf{x}_i\) into a new space of dimension $d_k$. We will come back later to how the elements of these matrices are obtained.</p>
<p>So far we have been working on the prediction for the single instance $\mathbf{\mu}_{1}$. Now we are prepared to work with the whole set of $t$ testing instances. Plugging the above definition of the similarity function into the whole matrix and applying the softmax function row-wise (a slight abuse of notation), we have:</p>
\[\begin{align}
\begin{bmatrix}
f(\mu_{1}) \\
\vdots \\
f(\mu_{t})
\end{bmatrix}
& =
\text{softmax} \left(
\frac{1}{2\sigma^2}
\begin{bmatrix}
\mathbf{\mu}_1^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_1 & \cdots & \mathbf{\mu}_1^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_n \\
\vdots & \ddots & \vdots \\
\mathbf{\mu}_t^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_1 & \cdots & \mathbf{\mu}_t^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\mathbf{x}_n
\end{bmatrix}
\right)
\begin{bmatrix}
y_{1} \\
\vdots \\
y_{n}
\end{bmatrix} \\
& =
\text{softmax} \left(
\frac{1}{2\sigma^2}
\begin{bmatrix}
- & \mathbf{\mu}_1^{\top}\mathbf{W}_q^{\top} & -\\
& \vdots & \\
- & \mathbf{\mu}_t^{\top}\mathbf{W}_q^{\top} & -
\end{bmatrix}
\begin{bmatrix}
| & & | \\
\mathbf{W}_k\mathbf{x}_1 & \cdots & \mathbf{W}_k\mathbf{x}_n \\
| & & |
\end{bmatrix}
\right)
\begin{bmatrix}
y_{1} \\
\vdots \\
y_{n}
\end{bmatrix} \\
& =
\text{softmax} \left(
\frac{1}{2\sigma^2}
\underbrace{\mathbf{\mu} \mathbf{W}_q^{\top}}_{\mathbf{Q}}
\underbrace{\mathbf{W}_k \mathbf{X}^{\top}}_{\mathbf{K}^{\top}}
\right)
\underbrace{
\begin{bmatrix}
y_{1} \\
\vdots \\
y_{n}
\end{bmatrix}
}_{\mathbf{V}} \\
& =
\text{softmax} \left(
\frac{\mathbf{Q}\mathbf{K}^{\top}}{2\sigma^2}
\right)
\mathbf{V}
\end{align}\]
<p>if we replace the constant $1/2\sigma^2$ with another constant, $1/\sqrt{d_{k}}$, where $d_{k}$ represents the dimension of the projected vectors, it becomes more apparent that the Nadaraya-Watson prediction turns into the equation of the attention mechanism when the function $s$ is utilized as the similarity function:</p>
\[\begin{align}
\begin{bmatrix}
f(\mu_{1}) \\
\vdots \\
f(\mu_{t})
\end{bmatrix}
& =
\text{softmax} \left(
\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{ d_{k} }}
\right) \mathbf{V} = \text{Attention} (\mathbf{Q}, \mathbf{K}, \mathbf{V})
\end{align}\]
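<p>As a sanity check of the shapes involved, here is a small numpy sketch of this attention form; the data are made up, and the randomly initialized matrices stand in for the learned \(\mathbf{W}_{q}\) and \(\mathbf{W}_{k}\) discussed below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(1)
n, t, p, d_k = 8, 3, 4, 6          # train size, test size, input dim, projection dim

X  = rng.normal(size=(n, p))       # training instances (keys)
y  = rng.normal(size=(n, 1))       # training labels (values)
Mu = rng.normal(size=(t, p))       # testing instances (queries)

W_q = rng.normal(size=(d_k, p))    # stand-ins for the learned projections
W_k = rng.normal(size=(d_k, p))

Q = Mu @ W_q.T                     # (t, d_k)
K = X @ W_k.T                      # (n, d_k)

scores = Q @ K.T / np.sqrt(d_k)    # (t, n) similarity matrix
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax

predictions = weights @ y          # (t, 1): one weighted average of labels per test instance
print(predictions.ravel())
</code></pre></div></div>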
<p>In the context of deep neural networks, the elements of the matrices \(\mathbf{W}_{q}\) and \(\mathbf{W}_{k}\) are part of the <a href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">Transformer architecture</a>(3) and they are learned via gradient descent.</p>
<p>This intriguing connection contributes to the notion that a deeper understanding of kernel machines is crucial for comprehending how neural networks function (4). Perhaps, we are getting closer to the ‘master algorithm’ that unifies all factions within the machine learning community (5).</p>
<h1 id="references">References</h1>
<ol>
<li>Carroll, S. (2022). <em>The biggest ideas in the universe: Space, time, and motion</em>. Penguin.</li>
<li>Utkin, L. V., & Konstantinov, A. V. (2022). Attention-based random forest and contamination model. <em>Neural Networks</em>, 154, 346-359.</li>
<li>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. <em>Advances in neural information processing systems</em>, 30.</li>
<li>Domingos, P. (2020). Every model learned by gradient descent is approximately a kernel machine. <em>arXiv preprint arXiv:2012.00152</em>.</li>
<li>Domingos, P. (2015). <em>The master algorithm: How the quest for the ultimate learning machine will remake our world</em>. Basic Books.</li>
</ol>
<h1>Does Pagel's lambda create a positive semi-definite covariance matrix?</h1>
<p>In phylogenetic regression, which is a form of linear regression involving observations from multiple species, Pagel's lambda indicates the extent to which covariance between observations is necessary to maximize the data's likelihood. This concept is known as the phylogenetic signal in this context and has links to linear mixed models, as we will explore. However, since it is a parameter that affects only the covariances (i.e., the off-diagonal values of the covariance matrix) through multiplication, it was initially unclear to me why the resulting new covariance matrix would still be a positive semi-definite (PSD) matrix. Being PSD confers great properties, such as real non-negative eigenvalues, a spectral decomposition, and, when the matrix is positive definite, invertibility. With this in mind, I would like to present a brief proof of this property and follow up with some remarks on how to estimate this parameter.</p>
<p><strong>Proposition</strong>. <em>Let $\textbf{C}_{\lambda} \in \mathbb{R}^{n \times n}$ be the covariance matrix $\Omega \in \mathbb{R}^{n \times n}$ transformed by multiplying its off-diagonal elements by the scalar $\lambda \in [0,1]$, such that</em>:</p>
\[\textbf{C}_{\lambda} =
\begin{bmatrix}
\sigma^2_{11} & \lambda\sigma_{12} & \dots & \lambda \sigma_{1n} \\
\lambda\sigma_{21} & \sigma^2_{22} & \dots & \lambda\sigma_{2n}\\
\vdots & \vdots & \ddots & \vdots \\
\lambda\sigma_{n1} & \lambda\sigma_{n2} & \dots & \sigma^2_{nn} \\
\end{bmatrix} \text{.}\]
<p><em>Then, $\textbf{C}_{\lambda}$ is a positive semi-definite matrix</em>.</p>
<p><strong>Proof</strong>. We can re-write $\mathbf{C}_{\lambda}$ as the following:</p>
\[\begin{align}
\textbf{C}_{\lambda}& =
\lambda\begin{bmatrix}
\sigma^2_{11} &\sigma_{12} & \dots & \sigma_{1n} \\
\sigma_{21} & \sigma^2_{22} & \dots & \sigma_{2n}\\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \dots & \sigma^2_{nn} \\
\end{bmatrix} +
(1 - \lambda)\begin{bmatrix}
\sigma^2_{11} & 0 & \dots & 0 \\
0 & \sigma^2_{22} & \dots & 0\\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sigma^2_{nn} \\
\end{bmatrix}
\\\\
& = \lambda \Omega + (1-\lambda) \mathbf{W} \, \text{.}
\end{align}\]
<p>Now, $\Omega$ can be defined as the covariance matrix of some random variable $\mathbf{x} \in \mathbb{R}^{n \times 1}$, with probability density $p(\mathbf{x})$, and expectation $\mu$:</p>
\[\begin{align}
\textbf{C}_{\lambda} & = \lambda\text{E}[ (\mathbf{x} - \mu)(\mathbf{x} - \mu)^\top] + (1-\lambda)\mathbf{W} \\
& = \lambda\int_{\mathbb{R}^n} p(\mathbf{x}) (\mathbf{x} - \mu)(\mathbf{x} - \mu)^\top \, d\mathbf{x} + (1-\lambda)\mathbf{W} \, \text{.}
\end{align}\]
<p>Finally, let $\mathbf{u} \in \mathbb{R}^{n \times 1}$ be any vector different from 0; then:</p>
\[\begin{align}
\mathbf{u}^\top\textbf{C}_{\lambda}\mathbf{u} & = \lambda\int_{\mathbb{R}^n} p(\mathbf{x}) \, \mathbf{u}^\top(\mathbf{x} - \mu)(\mathbf{x} - \mu)^\top\mathbf{u} \, d\mathbf{x} + (1-\lambda)\mathbf{u}^\top\mathbf{W}\mathbf{u}\\
& = \lambda\int_{\mathbb{R}^n} p(\mathbf{x}) \, \{(\mathbf{x} - \mu)^\top\mathbf{u}\}^\top(\mathbf{x} - \mu)^\top\mathbf{u} \, d\mathbf{x} + (1-\lambda)\mathbf{u}^\top\mathbf{W}^{1/2}\mathbf{W}^{1/2}\mathbf{u} \\
& = \lambda\int_{\mathbb{R}^n} p(\mathbf{x}) \, ||(\mathbf{x} - \mu)^\top\mathbf{u}||^2 \, d\mathbf{x} + (1-\lambda) ||\mathbf{W}^{1/2}\mathbf{u}||^2 \geq 0 \text{ .}
\end{align}\]
<p>Since $\mathbf{u}^\top\textbf{C}_{\lambda} \mathbf{u} \geq 0$ for any $\mathbf{u} \neq 0$ (note that $\mathbf{W}^{1/2}$ exists because $\mathbf{W}$ is diagonal with non-negative variances), \(\textbf{C}_{\lambda}\) is a positive semi-definite matrix. $\,\,\square$</p>
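<p>As a quick numerical check of the proposition, the following sketch (with an arbitrary PSD matrix playing the role of $\Omega$) verifies that the smallest eigenvalue of $\mathbf{C}_{\lambda}$ stays non-negative across values of $\lambda$:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Omega = A @ A.T              # an arbitrary PSD covariance matrix

W = np.diag(np.diag(Omega))  # its diagonal part

for lam in np.linspace(0, 1, 11):
    # the same decomposition used in the proof
    C_lam = lam * Omega + (1 - lam) * W
    min_eig = np.linalg.eigvalsh(C_lam).min()
    assert min_eig >= -1e-10  # PSD up to floating-point error
print("C_lambda is PSD for every tested lambda")
</code></pre></div></div>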
<p><em>Remark 1</em>. If we consider the following linear regression model:</p>
\[\mathbf{Y} = \mathbf{X}\beta + \varepsilon, \, \varepsilon \sim \mathcal{N}(0,\sigma^2\mathbf{C}_{\lambda}) \text{ ,}\]
<p>where $\textbf{X} \in \mathbb{R}^{n \times p}$ is the matrix of observations, $\textbf{Y} \in \mathbb{R}^{n \times 1}$ is the response variable, $\beta$ is the vector of coefficients, and $\mathbf{C}_{\lambda}$ is the covariance matrix, then the likelihood of the data can be expressed as:</p>
\[p(\mathbf{Y}|\mathbf{X},\beta) = \frac{1}{(2\pi)^{n/2}} \frac{1}{|\sigma^2 \mathbf{C}_{\lambda}|^{1/2}}
\exp\{ -\frac{1}{2\sigma^2} (\mathbf{Y} - \mathbf{X}\beta)^\top \mathbf{C}_{\lambda}^{-1} (\mathbf{Y} - \mathbf{X}\beta) \} \text{ .}\]
<p>Here is where expressions such as $\mathbf{C}_{\lambda} = \lambda\Omega + (1 - \lambda)\mathbf{W}$ start singing and reveal something profound and interesting about the equation above. It implies that if we allow $\lambda$ to be $0$, the covariance between species observations does not influence the likelihood, and only the individual species variances matter. Conversely, if $\lambda$ is $1$, the full covariance structure between species is retained. This point is where the connection with linear mixed models becomes more apparent, as the random intercepts and slopes in mixed models also strive to find a balance between variation between individuals and variation within individuals, which usually involves repeated measurements. Lynch (1991) proposed the phylogenetic version of mixed models, called the Phylogenetic Linear Mixed Model (PLMM). Historically, however, it was less popular than Felsenstein’s (1985) Phylogenetic Independent Contrast (PIC), despite the fact that PIC can be seen as a special case of PLMM when there is not much variation within individuals (Garamszegi 2014), or, if we attempt to make a connection with $\lambda$, when $\lambda = 1$.</p>
<h3 id="obtaining-a-closed-form-for-lambda-is-difficult">Obtaining a closed form for $\lambda$ is difficult</h3>
<p>Since we talked about maximizing, let’s first define our objective function. We can obtain it from the likelihood equation:</p>
\[\begin{align}
\ln p(\textbf{Y} \mid \textbf{X}, \beta ) =
\ln \left\{ \frac{1}{(2\pi)^{n/2}} \right\} +
\ln \left\{ \frac{1}{|\sigma^2 \mathbf{C}_{\lambda}|^{1/2}} \right\}
-\frac{1}{2\sigma^2}
(\textbf{Y} - \textbf{X}\beta)^\top
\mathbf{C}_{\lambda}^{-1}
(\textbf{Y} - \textbf{X}\beta) \\
= \ln \left\{ \frac{1}{(2\pi)^{n/2}} \right\} - \frac{n}{2} \ln \sigma^{2} -\frac{1}{2} \ln |\mathbf{C}_{\lambda}| -\frac{1}{2\sigma^2}
(\textbf{Y} - \textbf{X}\beta)^\top
\mathbf{C}_{\lambda}^{-1}
(\textbf{Y} - \textbf{X}\beta) \\
\Rightarrow \arg \max_{\beta,\lambda,\sigma^2}\,\, \ln p(\textbf{Y} \mid \textbf{X}, \beta ) = \arg \min_{\beta,\lambda,\sigma^2}\,\,
n\ln \sigma^2
+ \ln |\mathbf{C}_{\lambda}|
+ \frac{1}{\sigma^2}
(\textbf{Y} - \textbf{X}\beta)^\top
\mathbf{C}_{\lambda}^{-1}
(\textbf{Y} - \textbf{X}\beta) \text{ .}
\end{align}\]
<p>From the above we can define our objective function $J(\beta,s,\mathbf{C}_{\lambda})$ as:</p>
\[\begin{align}
J(\beta,s,\mathbf{C}_{\lambda}) & =
n\ln \sigma^2
+ \ln |\mathbf{C}_{\lambda}|
+ \frac{1}{\sigma^2}
(\textbf{Y} - \textbf{X}\beta)^\top
\mathbf{C}_{\lambda}^{-1}
(\textbf{Y} - \textbf{X}\beta) \\
& = n\ln s
+ \ln |\mathbf{C}_{\lambda}|
+ \frac{1}{s}
\varepsilon^\top
\mathbf{C}_{\lambda}^{-1}
\varepsilon
\end{align}\]
<p>Recall that $\varepsilon$ is the error from our initial linear regression model, as defined in <em>Remark 1</em>, and we replaced $\sigma^2$ simply with $s$. We will use either the first or second line of the objective function definition interchangeably, depending on the variable to be differentiated.</p>
<p>Differentiating $J(\beta,s,\mathbf{C}_{\lambda})$ with respect to $s$ and setting it to zero we have:</p>
\[\begin{align}
\frac{\partial}{\partial s} J(\beta,s,\mathbf{C}_{\lambda}) & = n \frac{1}{s} - (s)^{-2} \varepsilon^\top\mathbf{C}_{\lambda}^{-1} \varepsilon = 0 \\
\implies s & = \frac{1}{n} \varepsilon^\top\mathbf{C}_{\lambda}^{-1} \varepsilon
\end{align}\]
<p>Now, differentiating $J(\beta,s,\mathbf{C}_{\lambda})$ with respect to $\beta$ and setting it to zero, we should obtain the generalized (weighted) least squares solution:</p>
\[\begin{align}
\frac{\partial}{\partial \beta} J(\beta,s,\mathbf{C}_{\lambda}) & = \frac{\partial}{\partial \beta} \left( \frac{1}{s}(\textbf{Y} - \textbf{X}\beta)^\top
\mathbf{C}_{\lambda}^{-1}
(\textbf{Y} - \textbf{X}\beta) \right) = 0 \\
\implies \beta & = (\mathbf{X}^\top\mathbf{C}_{\lambda}^{-1}\mathbf{X})^{-1}
(\mathbf{X}^\top\mathbf{C}_{\lambda}^{-1}\mathbf{Y})
\end{align}\]
<p>Finally, differentiating $J(\beta,s,\mathbf{C}_{\lambda})$ with respect to $\lambda$ and setting it to zero:</p>
\[\begin{align}
\frac{\partial}{\partial \lambda} J(\beta,s,\mathbf{C}_{\lambda}) & = \frac{\partial}{\partial \lambda} \ln |\mathbf{C}_{\lambda}|
+ \frac{1}{s} \frac{\partial}{\partial \lambda}
\left( \varepsilon^\top\mathbf{C}_{\lambda}^{-1} \varepsilon
\right) = 0 \\
\implies \, & \mathrm{Tr} \left( \mathbf{C}_{\lambda}^{-1} \frac{\partial}{\partial \lambda} \mathbf{C}_{\lambda} \right)
+ \frac{1}{s}
\varepsilon^\top
\left( \frac{\partial}{\partial \lambda} \mathbf{C}_{\lambda}^{-1} \right)
\varepsilon = 0
\end{align}\]
<p>We used Jacobi’s formula, $\partial \ln |\mathbf{A}| = \mathrm{Tr}(\mathbf{A}^{-1} \, \partial \mathbf{A})$, for the left-hand side differentiation. From here we can start seeing trouble. How can we get the $\lambda$ out of the trace or of the matrix inverse? For the sake of completeness, let’s finish the differentiation:</p>
\[\begin{align}
\mathrm{Tr} \left( \mathbf{C}_{\lambda}^{-1} \,(\Omega - \mathbf{W}) \right)
- \frac{1}{s}
\varepsilon^\top
\mathbf{C}_{\lambda}^{-1}
(\Omega - \mathbf{W})\mathbf{C}_{\lambda}^{-1}\,
\varepsilon & = 0
\end{align}\]
<p>For the right-hand side differentiation, we used this property: $\partial (\mathbf{A}^{-1}) = - \mathbf{A}^{-1} (\partial \mathbf{A})\mathbf{A}^{-1}$. Considering $\mathbf{C}_{\lambda}^{-1} = (\lambda\Omega + (1 - \lambda)\mathbf{W})^{-1}$, it becomes apparent that isolating $\lambda$ from the expression above is challenging. Furthermore, we have the constraints that $\lambda \geq 0$ and $\lambda \leq 1$. While multiple inequality constraints can be addressed using Lagrange multipliers followed by the Karush–Kuhn–Tucker (KKT) conditions, which identify the active constraints (e.g., non-negative multipliers, feasible inequalities), this usually requires isolating $\lambda$, a task that is not trivial.</p>
<h3 id="but-we-can-use-numerical-optimization">But we can use numerical optimization</h3>
<p>From the above derivations, we can propose the following optimization problem:</p>
\[\begin{align*}
\text{minimize} \quad & n \ln s + \ln | \textbf{C}_{\lambda} | + \frac{1}{s}\varepsilon^\top \textbf{C}_{\lambda}^{-1}\varepsilon \\
\text{subject to}
\quad & \textbf{C}_{\lambda} = \lambda\Omega + (1 - \lambda)\textbf{W}\\
\quad & \beta = (\textbf{X}^\top\textbf{C}_{\lambda}^{-1}\textbf{X})^{-1}\textbf{X}^\top\textbf{C}_{\lambda}^{-1}\textbf{Y}\\
\quad & \varepsilon = \textbf{Y} - \textbf{X}\beta\\
\quad & s = \frac{1}{n}\varepsilon^\top \textbf{C}_{\lambda}^{-1}\varepsilon \\
\quad & \lambda \geq 0\\
\quad & \lambda \leq 1\\
\end{align*}\]
<p>where $\lambda$ is the sole decision variable.</p>
<p>Here is a set of functions to optimize the above problem:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scipy.optimize</span> <span class="kn">import</span> <span class="n">minimize</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">def</span> <span class="nf">objective_function</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">Sigma</span><span class="p">,</span> <span class="n">lambda_val</span><span class="p">):</span>
<span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">Sigma</span><span class="p">)</span> <span class="p">)</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">lambda_val</span> <span class="o">*</span> <span class="n">Sigma</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">lambda_val</span><span class="p">)</span> <span class="o">*</span> <span class="n">W</span>
<span class="n">C_inv</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">C</span><span class="p">)</span>
<span class="n">beta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">C_inv</span> <span class="o">@</span> <span class="n">X</span><span class="p">)</span> <span class="o">@</span> <span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">C_inv</span> <span class="o">@</span> <span class="n">y</span><span class="p">)</span>
<span class="n">e</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="n">X</span> <span class="o">@</span> <span class="n">beta</span>
<span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">n</span><span class="p">)</span><span class="o">*</span><span class="n">e</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">C_inv</span> <span class="o">@</span> <span class="n">e</span>
<span class="n">objective</span> <span class="o">=</span> <span class="n">n</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">det</span><span class="p">(</span><span class="n">C</span><span class="p">))</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">s</span><span class="p">)</span><span class="o">*</span><span class="n">e</span><span class="p">.</span><span class="n">T</span><span class="o">@</span> <span class="n">C_inv</span> <span class="o">@</span><span class="n">e</span>
<span class="k">return</span> <span class="n">objective</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float64</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">optimize_problem</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">Sigma</span><span class="p">):</span>
<span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="n">n</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">X</span><span class="p">))</span>
<span class="c1"># Define the bounds for lambda
</span> <span class="n">bounds</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="c1"># Initial guess for lambda
</span> <span class="n">initial_lambda</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># Define the optimization problem
</span> <span class="n">fnc</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">lambda_val</span><span class="p">:</span> <span class="n">objective_function</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">Sigma</span><span class="p">,</span> <span class="n">lambda_val</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">minimize</span><span class="p">(</span><span class="n">fnc</span><span class="p">,</span> <span class="n">x0</span> <span class="o">=</span> <span class="n">initial_lambda</span><span class="p">,</span> <span class="n">bounds</span><span class="o">=</span><span class="n">bounds</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">'SLSQP'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">result</span>
</code></pre></div></div>
<p>Testing on two trees:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">5</span>
<span class="c1"># tree 1
</span><span class="n">C1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">7</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">7</span><span class="p">]])</span>
<span class="n">X1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">n</span><span class="p">,</span> <span class="n">cov</span> <span class="o">=</span> <span class="n">C1</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">).</span><span class="n">T</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">n</span><span class="p">,</span> <span class="n">cov</span> <span class="o">=</span> <span class="n">C1</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">1</span><span class="p">).</span><span class="n">T</span>
<span class="k">print</span><span class="p">(</span><span class="n">optimize_problem</span><span class="p">(</span><span class="n">X1</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">C1</span><span class="p">).</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># 1.0
</span>
<span class="c1"># tree 2
</span><span class="n">C2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">1</span><span class="p">,.</span><span class="mi">5</span><span class="p">,.</span><span class="mi">5</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,.</span><span class="mi">5</span><span class="p">,.</span><span class="mi">5</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,.</span><span class="mi">5</span><span class="p">,.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,.</span><span class="mi">5</span><span class="p">,.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">]])</span>
<span class="n">X2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">n</span><span class="p">,</span> <span class="n">cov</span> <span class="o">=</span> <span class="n">C2</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">).</span><span class="n">T</span>
<span class="n">y2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">n</span><span class="p">,</span> <span class="n">cov</span> <span class="o">=</span> <span class="n">C2</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">1</span><span class="p">).</span><span class="n">T</span>
<span class="k">print</span><span class="p">(</span><span class="n">optimize_problem</span><span class="p">(</span><span class="n">X2</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span class="n">C2</span><span class="p">).</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># 0.0
</span></code></pre></div></div>
<h1 id="references">References</h1>
<ul>
<li>
<p>Garamszegi, L. Z. (Ed.). (2014). <em>Modern phylogenetic comparative methods and their application in evolutionary biology: concepts and practice</em>. Springer.</p>
</li>
<li>
<p>Housworth, E. A., Martins, E. P., & Lynch, M. (2004). The phylogenetic mixed model. <em>The American Naturalist</em>, 163(1), 84-96.</p>
</li>
<li>
<p>Lynch, M. (1991). Methods for the analysis of comparative data in evolutionary biology. <em>Evolution</em>, 45(5), 1065-1080.</p>
</li>
<li>
<p>Felsenstein, J. (1985). Phylogenies and the comparative method. <em>The American Naturalist</em>, 125(1), 1-15.</p>
</li>
</ul>
<h1>Regressions and variational calculus</h1>
<p>I found this <a href="https://towardsdatascience.com/regularized-kernel-regression-from-a-variational-principle-d2b0c03eb919">cool post</a> and I thought it could be fun to implement the solution as well as go over my own understanding of the derived equations.</p>
<p>In machine learning methods we minimize the following loss function:</p>
\[\text{arg min}_{f} \sum_{i = 1}^{n} \{ f(x_i) - y_i \}^2\]
<p>where the function $f$ is an estimator of $y$ . Different machine learning models propose different strategies to minimize the above loss function. However, I always wondered how to convert this optimization problem into a calculus of variations problem. It turns out that you can use the Dirac delta function.</p>
<p>Let $\delta_{\sigma}(x_i - \mu)$ be a function that concentrates all of its (unit) mass around $x_i = \mu$, approaching the Dirac delta as $\sigma \to 0$:</p>
\[\delta_{\sigma}(x_i - \mu) \approx \begin{cases}
\text{large}, & \text{if $x_i = \mu$ },\\
0, & \text{otherwise}.
\end{cases}\]
<p>The way this function works is by placing a distribution around $\mu$ that peaks as $x_i$ approaches $\mu$ . This peak can be depicted by a Gaussian distribution. That is,</p>
\[\delta_{\sigma}(x_i - \mu) = \frac{1}{\sqrt{2\pi \sigma^2 }} \exp\{ -\frac{1}{2\sigma^2} (x_i - \mu)^2 \} \text{ .}\]
<p>Then, the key insight is to make the loss function stop depending on $x_i$ and make it depend on $\mu$ instead, such that we define the following functional:</p>
\[\begin{align}
J[f] & = \sum_{i = 1}^{n} \{ f(x_i) - y_i \}^2 \\
& \approx \int_{\mathbb{R}} \sum_{i = 1}^{n} \left\{ f(\mu) - y_i \right\}^2 \, \delta_{\sigma}(x_i - \mu) \,d\mu \\
& = c + \int_{\mathbb{R}} \sum_{i = 1}^{n} \left\{ (f(\mu))^2 - 2y_if(\mu) \right\} \, \delta_{\sigma}(x_i - \mu) \,d\mu \text{ .}
\end{align}\]
<p>In the last step, I just expanded the square. The term $c$ collects everything that does not depend on $f$, and since this term is constant we can take it out of the functional. Furthermore, to make the functional equation look more straightforward, we can re-write it as:</p>
\[\begin{align}
J[f] & = \int_{\mathbb{R}} \sum_{i = 1}^{n} \left\{ f^2 - 2y_if \right\} \, \delta_{\sigma}(x_i - \mu) \,d\mu \text{ .}
\end{align}\]
<p>Now, it is more apparent that the Lagrangian for the above functional is:</p>
\[L(\mu, f, f') = \sum_{i = 1}^{n} \left\{ f^2 - 2y_if \right\} \, \delta_{\sigma}(x_i - \mu) \text{ ,}\]
<p>and we can obtain its extremum (i.e., in this case, the minimum) with the Euler-Lagrange (EL) equation:</p>
\[\frac{\partial}{\partial f}L(\mu, f, f') - \frac{d}{d\mu} \frac{\partial}{\partial f'} L(\mu, f, f') = 0 \text{ .}\]
<p>Since our Lagrangian does not depend on $f’$, we can re-write the EL equation simply as:</p>
\[\frac{\partial}{\partial f}L(\mu, f, f') = 0\]
<p>Finally, we can plug in the actual Lagrangian so that we can obtain $f$ :</p>
\[\begin{align}
\frac{\partial}{\partial f} \left( \sum_{i = 1}^{n} \left\{ f^2 - 2y_if \right\} \, \delta_{\sigma}(x_i - \mu) \right) = 0\\
\sum_{i = 1}^{n} \left\{ 2f - 2y_i \right\} \, \delta_{\sigma}(x_i - \mu) = 0 \\
\sum_{i = 1}^{n} f\delta_{\sigma}(x_i - \mu) - \sum_{i = 1}^{n} y_i \,\delta_{\sigma}(x_i - \mu) = 0 \\
f \sum_{i = 1}^{n} \delta_{\sigma}(x_i - \mu) = \sum_{i = 1}^{n} y_i \,\delta_{\sigma}(x_i - \mu) \\
f = \displaystyle \frac{ \sum_{i = 1}^{n} y_i \,\delta_{\sigma}(x_i - \mu)}{\sum_{i = 1}^{n} \delta_{\sigma}(x_i - \mu) } \text{ ,}
\end{align}\]
<p>and that, my friends, is the so-called <a href="https://en.wikipedia.org/wiki/Kernel_regression">Nadaraya-Watson kernel regression</a>.</p>
<h2 id="python-implementation">Python implementation</h2>
<p>Making the functions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="k">def</span> <span class="nf">gauss</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="n">sig</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">):</span>
<span class="s">"""
delta dirac
"""</span>
<span class="n">c1</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="p">(</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span> <span class="n">np</span><span class="p">.</span><span class="n">pi</span> <span class="p">)</span> <span class="o">*</span> <span class="n">sig</span><span class="p">)</span>
<span class="n">c2</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="o">/</span><span class="p">(</span> <span class="mi">2</span><span class="o">*</span><span class="p">(</span><span class="n">sig</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="p">)</span>
<span class="k">return</span> <span class="n">c1</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span> <span class="n">c2</span><span class="o">*</span><span class="p">(</span> <span class="n">u</span><span class="o">**</span><span class="mi">2</span> <span class="p">)</span> <span class="p">)</span>
<span class="k">def</span> <span class="nf">phi1</span><span class="p">(</span><span class="n">u</span><span class="p">,</span><span class="n">xs</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">gauss</span><span class="p">(</span><span class="n">u</span> <span class="o">-</span> <span class="n">xs</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">phi2</span><span class="p">(</span><span class="n">u</span><span class="p">,</span><span class="n">xs</span><span class="p">,</span><span class="n">ys</span><span class="p">):</span>
<span class="k">return</span> <span class="n">ys</span> <span class="o">@</span> <span class="n">gauss</span><span class="p">(</span> <span class="n">u</span> <span class="o">-</span> <span class="n">xs</span> <span class="p">)</span>
<span class="k">def</span> <span class="nf">nadayara_watson</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span><span class="n">x_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">):</span>
<span class="n">fu</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">x_test</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">u</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">x_test</span><span class="p">):</span>
<span class="n">fu</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">phi2</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span><span class="o">/</span><span class="n">phi1</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="n">X_train</span><span class="p">)</span>
<span class="k">return</span> <span class="n">fu</span>
</code></pre></div></div>
<p>Making the dataset and plotting results</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">500</span>
<span class="n">num_test</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="mf">0.30</span><span class="o">*</span><span class="n">n</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="mi">5</span> <span class="o">*</span> <span class="n">rng</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X</span><span class="p">).</span><span class="n">ravel</span><span class="p">()</span>
<span class="c1"># Add noise to targets
</span><span class="n">y</span><span class="p">[::</span><span class="mi">5</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">3</span> <span class="o">*</span> <span class="p">(</span><span class="mf">0.5</span> <span class="o">-</span> <span class="n">rng</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">//</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">test_idx</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="n">k</span> <span class="o">=</span> <span class="n">num_test</span><span class="p">)</span>
<span class="n">train_idx</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span> <span class="nb">set</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">))</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">test_idx</span><span class="p">)</span> <span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">train_idx</span><span class="p">,:],</span><span class="n">X</span><span class="p">[</span><span class="n">test_idx</span><span class="p">,:]</span>
<span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="n">train_idx</span><span class="p">]</span> <span class="p">,</span><span class="n">y</span><span class="p">[</span><span class="n">test_idx</span><span class="p">]</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">nadayara_watson</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Training set'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_pred</span> <span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">"black"</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Nadayara-Watson predictions'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<p style="text-align: center;"><img src="/assets/images/nada_2.png" alt="foo" width="70%" /></p>In progressThe likelihood of a tree2021-02-05T00:00:00+00:002021-02-05T00:00:00+00:00https://ulises-rosas.github.io/jekyll/update/MLtree<p>Consider the following tree:</p>
<p style="text-align: center;"><img src="/assets/images/ml_tree.png" alt="foo" width="40%" /></p>
<p style="text-align: center;"><em>Image from Yang 2014</em></p>
<!-- <center><img src="" ...></center>
<center>This is an image</center> -->
<!-- {:.image-caption style="text-align: center;"} -->
<p>Nodes 1 to 5 depict the nucleotides of each species at a given alignment site, whose states are known (\(T, C, A ,C, C\)), and nodes 0, 6, 7, and 8 depict internal nodes, whose ancestral states are unknown (\(x_{0}, x_{6}, x_{7}, x_{8}\)). Assuming that evolution is independent across both sites and lineages, the likelihood of the whole tree is given by the product of the likelihoods of each site \(\textbf{x}_{i}\) (i.e., the observed states at site \(i\)) given the model parameters \(\theta\) (e.g., branch lengths, transition/transversion rate ratios):</p>
\[\begin{equation}
L(\theta) = f(X|\theta) = \prod_{i = 1}^{n}f(\textbf{x}_{i}|\theta)
\end{equation}\]
<p>where \(f(\textbf{x}_{i}|\theta)\) is the function that calculates the likelihood of the node states at site \(i\). If we apply logarithms to both sides of the above equation we have:</p>
\[\begin{equation}
\ell = \ln{ L(\theta) } = \sum_{i}^{n} \ln{ f(\textbf{x}_{i}|\theta) }
\end{equation}\]
<p>Then, we just need to define \(f\) and sum all the terms on the right-hand side of the above equation to get the overall log-likelihood of the tree.</p>
<h2 id="the-pruning-algorithm">The pruning algorithm</h2>
<p>The function \(f\) can be defined as the sum, over all possible combinations of ancestral states, of the probabilities of nucleotide change from the tips to \(x_{0}\):</p>
\[\begin{equation}
f(\textbf{x}_{i}| \theta) = \sum_{x_{0}} \sum_{x_{6}} \sum_{x_{7}} \sum_{x_{8}} \left[
\pi_{x_{0}} P_{x_{0},x_{6}}(t_{6}) P_{x_{0},x_{8}}(t_{8})\\
\hspace{125mm} P_{C,x_{8}}(t_{4}) P_{C,x_{8}}(t_{5}) \\
\hspace{80mm} P_{A,x_{6}}(t_{3}) P_{x_{6},x_{7}}(t_{7}) \\
\hspace{125mm} P_{T,x_{7}}(t_{1}) P_{C,x_{7}}(t_{2}) \right]
\end{equation}\]
<p>Where \(\pi_{x_{0}}\) is the stationary probability (e.g., 0.25 for the K80 model), and \(P_{i,j}(t)\) is the probability of base change \(i \rightarrow j\) (or \(i \leftarrow j\), due to the reversibility of most DNA evolution models) along the branch length \(t\), obtained from the \(Q\) matrix (see below). While the above equation is enough to start estimating site likelihoods, it is not computationally efficient. For example, let us expand the summation over all states of \(x_{8}\) (i.e., \(A, G, C, T\)):</p>
\[\begin{equation}
f(\textbf{x}_{i}| \theta) = \sum_{x_{0}} \sum_{x_{6}} \sum_{x_{7}} \left[
\left(\pi_{x_{0}} P_{x_{0},x_{6}}(t_{6}) P_{x_{0},A}(t_{8})\\
\hspace{115mm} P_{C,A}(t_{4}) P_{C,A}(t_{5}) \\
\hspace{73mm} P_{A,x_{6}}(t_{3}) P_{x_{6},x_{7}}(t_{7}) \\
\hspace{127mm} P_{T,x_{7}}(t_{1}) P_{C,x_{7}}(t_{2}) \right)+ \\
\hspace{60mm} \vdots\\
\hspace{60mm}\left(\pi_{x_{0}} P_{x_{0},x_{6}}(t_{6}) P_{x_{0},T}(t_{8})\\
\hspace{115mm} P_{C,T}(t_{4}) P_{C,T}(t_{5}) \\
\hspace{73mm} P_{A,x_{6}}(t_{3}) P_{x_{6},x_{7}}(t_{7}) \\
\hspace{124mm} P_{T,x_{7}}(t_{1}) P_{C,x_{7}}(t_{2}) \right)\right]
\end{equation}\]
<p>We should have 8 multiplications (i.e., number of nodes - 1) for each base (i.e., 4) and 3 additions (i.e., number of bases - 1). If we denote by \(s\) the number of species and by \(b\) the number of bases, then the number of operations after completely expanding the above equation is \((2sb - b - 1)b^{s - 2}\). For our tree the number of operations would then be \((2(5)(4) - 4 - 1)4^{5 - 2} = 2240\) per site; for \(s = 20\) it is 10,651,518,894,080.</p>
<p>The number of operations is huge due to the number of repeated calculations. However, these repeated calculations can be avoided by factorizing the summation:</p>
\[\begin{equation}
f(\textbf{x}_{i}| \theta) = \sum_{x_0} \pi_{x_0} \left\{
\sum_{x_6} P_{x_0,x_6}(t_6) \left[
\left( \sum_{x_7} P_{x_6,x_7}(t_7)P_{x_7,T}(t_1)P_{x_7,C}(t_2) \right)
P_{x_6,A}(t_3)
\right]\\
\times \left[ \sum_{x_8} P_{x_0,x_8}(t_8)P_{x_8,C}(t_4)P_{x_8,C}(t_5) \right]
\right\}
\end{equation}\]
<p>The above factorization is analogous to the factorization of \(x^2 + x\) into \(x(x + 1)\). It might not be apparent at first sight, but the summation above follows the (recursive) shape of the tree. This is the Felsenstein pruning algorithm.</p>
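<p>As a preview of the Python implementation at the end of this post, here is a minimal sketch of this recursion for the tree in the figure. It assumes a helper <code>P(t)</code> that returns the 4x4 transition probability matrix for a branch of length <code>t</code> (derived in the following sections), and the branch lengths used are illustrative (0.2 for the terminal branches, as in the numerical example below, and 0.1 for the internal ones):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def tip(base, order='TCAG'):
    # one-hot conditional likelihood vector for an observed tip base
    L = np.zeros(4)
    L[order.index(base)] = 1.0
    return L

def node(children, P):
    # conditional likelihoods of an internal node: the element-wise product,
    # over daughters, of each daughter's vector propagated up its branch
    out = np.ones(4)
    for L_child, t in children:
        out = out * (P(t) @ L_child)
    return out

def site_likelihood(P, pi=np.full(4, 0.25)):
    # tree of the figure: (((1:T, 2:C)7, 3:A)6, (4:C, 5:C)8)0
    L7 = node([(tip('T'), 0.2), (tip('C'), 0.2)], P)  # branches t1, t2
    L6 = node([(L7, 0.1), (tip('A'), 0.2)], P)        # branches t7, t3
    L8 = node([(tip('C'), 0.2), (tip('C'), 0.2)], P)  # branches t4, t5
    L0 = node([(L6, 0.1), (L8, 0.1)], P)              # branches t6, t8
    return pi @ L0  # sum over root states weighted by the stationary probabilities
</code></pre></div></div>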
<h2 id="transition-probability-matrix">Transition probability matrix</h2>
<p>The probability of a base change takes the form of a Markov matrix (i.e., all elements $\geq 0$, all rows add to 1):</p>
\[\textbf{p}(t) = \begin{array}{c c c} &
\begin{array}{c c c c c} T &&& C &&& A &&& G\\ \end{array}
\\
\begin{array}{c c c c} T\\ C\\ A\\ G\\ \end{array}
&
\left[
\begin{array}{c c c c}
P_{T,T}(t) & P_{T,C}(t) & P_{T,A}(t) & P_{T,G}(t)\\
P_{C,T}(t) & P_{C,C}(t) & P_{C,A}(t) & P_{C,G}(t)\\
P_{A,T}(t) & P_{A,C}(t) & P_{A,A}(t) & P_{A,G}(t)\\
P_{G,T}(t) & P_{G,C}(t) & P_{G,A}(t) & P_{G,G}(t)
\end{array}
\right]
\end{array}\]
<p>And a property of the Markov process is \(\textbf{p}(t_{1} + t_{2}) = \textbf{p}(t_{1})\textbf{p}(t_{2}) = \textbf{p}(t_{2})\textbf{p}(t_{1})\) (Chapman-Kolmogorov equation). If we subtract \(\textbf{p}(t_{1})\) from both sides we have:</p>
\[\begin{equation}
\textbf{p}(t_{1} + t_{2}) - \textbf{p}(t_{1}) = \textbf{p}(t_{2})\textbf{p}(t_{1}) - \textbf{p}(t_{1})\\
\textbf{p}(t_{1} + t_{2}) - \textbf{p}(t_{1}) = (\textbf{p}(t_{2}) - \textbf{I})\textbf{p}(t_{1})
\end{equation}\]
<p>Notice that the identity matrix \(\textbf{I}\) is equivalent to \(\textbf{p}(0)\), as it is a diagonal of ones, thus no base change. Then, substituting \(\textbf{p}(0)\) into the above equation and applying limits with respect to \(t_{2}\):</p>
\[\begin{equation}
\textbf{p}(t _1 + t_2) - \textbf{p}(t_1) = (\textbf{p}(t_2) - \textbf{p}(0))\textbf{p}(t_1)\\
\lim_{t_2 \to 0} \frac{ \textbf{p}(t_1 + t_2) - \textbf{p}(t_1) }{t_2} = \lim_{t_2 \to 0} \frac{\textbf{p}(t_2) - \textbf{p}(0)}{t_2}\textbf{p}(t_1)
\end{equation}\]
<p>Letting \(t_2\) be \(\Delta t\), and writing the right-hand side of the above equation in a more familiar form:</p>
\[\begin{equation}
\lim_{\Delta t \to 0} \frac{ \textbf{p}(t_1 + \Delta t) - \textbf{p}(t_1) }{\Delta t} = \lim_{\Delta t \to 0}\frac{\textbf{p}( 0 + \Delta t) - \textbf{p}(0)}{\Delta t} \textbf{p}(t_1)\\
\textbf{p}^{\prime}(t_1) = \textbf{p}^{\prime}(0)\textbf{p}(t_1)
\end{equation}\]
<p>The rate of change of the probability matrix at \(t = 0\), \(\textbf{p}^{\prime}(0)\), is also known as the \(\textbf{Q}\) matrix or instantaneous rate matrix. We end up with the general form of a matrix differential equation:</p>
\[\begin{equation}
\textbf{p}^{\prime}(t) = \textbf{Q}\textbf{p}(t)
\end{equation}\]
<h3 id="approximated-solution">Approximated solution</h3>
<p>The general solution of this matrix differential equation for any \(t\) is (see Box 1 for more details):</p>
\[\begin{equation}
\textbf{p}(t) = e^{ \textbf{Q}t }\textbf{p}(0) = e^{ \textbf{Q}t }\textbf{I} = e^{ \textbf{Q}t }
\end{equation}\]
<p>Then, we can approximate \(\textbf{p}(t)\) with the series expansion of the matrix exponential:</p>
\[\begin{equation}
\textbf{p}(t) = e^{ \textbf{Q}t } = \frac{1}{0!}(\textbf{Q}t)^0 + \frac{1}{1!}(\textbf{Q}t)^1 + \frac{1}{2!}(\textbf{Q}t)^2 + \cdots\\
\textbf{p}(t) = \textbf{I} + \textbf{Q}t + \frac{1}{2!}(\textbf{Q}t)^2 + \cdots\\
\end{equation}\]
<p>Observations: i) \(\textbf{p}(t)\) is a function of time and the rate matrix \(\textbf{Q}\), ii) \(\textbf{Q}\) contains the information of the evolutionary model, iii) while the above summation converges to a specific matrix as a function of the number of terms considered, eigenvalue decomposition could also have been used.</p>
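<p>For instance, here is a short sketch comparing the truncated series against <code>scipy.linalg.expm</code> for a K80-style \(\textbf{Q}\) matrix; the rates \(\alpha = 0.2\) and \(\beta = 0.1\) are arbitrary values chosen just for the check:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.linalg import expm

a, b = 0.2, 0.1  # arbitrary transition (alpha) and transversion (beta) rates
# K80-style Q matrix; row/column order T, C, A, G
Q = np.array([[-(a + 2*b), a, b, b],
              [a, -(a + 2*b), b, b],
              [b, b, -(a + 2*b), a],
              [b, b, a, -(a + 2*b)]])
t = 1.0

# truncated series: I + Qt + (Qt)^2/2! + ...
P_series = np.eye(4)
term = np.eye(4)
for k in range(1, 20):
    term = term @ (Q * t) / k
    P_series = P_series + term

print(np.allclose(P_series, expm(Q * t)))  # True
print(P_series.sum(axis=1))                # each row sums to 1
</code></pre></div></div>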
<div class="warning" style="background-color:#E9D8FD; color: #69337A; border-left: solid #805AD5 4px; border-radius: 4px; padding:0.7em;">
<span>
<p style="margin-top:1em; text-align:center">
<b>Box 1: Glimpse on matrix differentiation</b></p>
<p style="margin-left:1em;">
Let $ \textbf{x} $ be a column vector and $ \textbf{A} $ be an invertible matrix. If we have the following form of vector differentiation: $ \textbf{x}^{\prime} = \textbf{A}\textbf{x} $, we know its solution is $ \textbf{x}(t) = e^{ \textbf{A}t } \textbf{x}(0) $. However, the above solution form also holds when we have a square matrix such as $ \textbf{p}^{\prime} $. For example, if we represent $ \textbf{p}^{\prime} = \textbf{Q}\textbf{p} $ as:
$$
\begin{bmatrix} | & & | \\
\textbf{p}^{\prime}_1 & \cdots & \textbf{p}^{\prime}_n\\
| & & | \end{bmatrix} =
\begin{bmatrix} - & \textbf{q}_1 & - \\
& \vdots & \\
- & \textbf{q}_n & - \end{bmatrix}
\begin{bmatrix} | & & | \\
\textbf{p}_1 & \cdots & \textbf{p}_n\\
| & & | \end{bmatrix}\\
\begin{bmatrix} | & & | \\
\textbf{p}^{\prime}_1 & \cdots & \textbf{p}^{\prime}_n\\
| & & | \end{bmatrix} =
\begin{bmatrix} \textbf{q}_1\textbf{p}_1 & \cdots & \textbf{q}_1\textbf{p}_n \\
\vdots & \ddots & \vdots \\
\textbf{q}_n\textbf{p}_1 & \cdots & \textbf{q}_n\textbf{p}_n \end{bmatrix}
$$
and taking only the first column from above matrices:
$$
\begin{bmatrix} | \\
\textbf{p}^{\prime}_1 \\
| \end{bmatrix} =
\begin{bmatrix} \textbf{q}_1\textbf{p}_1 \\
\vdots \\
\textbf{q}_n\textbf{p}_1 \end{bmatrix} =
\begin{bmatrix} - & \textbf{q}_1 & - \\
& \vdots & \\
- & \textbf{q}_n & - \end{bmatrix}
\begin{bmatrix} | \\
\textbf{p}_1 \\
| \end{bmatrix} = \textbf{Q}\textbf{p}_1
$$
The column vector $\textbf{p}^{\prime}_1$ can then be solved directly:
$$
\textbf{p}^{\prime}_1 = \textbf{Q}\textbf{p}_1 \Rightarrow \textbf{p}_1(t) = e^{ \textbf{Q}t } \textbf{p}_1(0)
$$
Plugging back all solved columns into the original matrix:
$$
\begin{bmatrix} | & & | \\
\textbf{p}_1(t) & \cdots & \textbf{p}_n(t)\\
| & & | \end{bmatrix} =
\begin{bmatrix} | & & | \\
e^{ \textbf{Q}t } \textbf{p}_1(0) & \cdots & e^{ \textbf{Q}t } \textbf{p}_n(0)\\
| & & | \end{bmatrix}\\
\begin{bmatrix} | & & | \\
\textbf{p}_1(t) & \cdots & \textbf{p}_n(t)\\
| & & | \end{bmatrix} =
e^{ \textbf{Q}t }\begin{bmatrix} | & & | \\
\textbf{p}_1(0) & \cdots & \textbf{p}_n(0)\\
| & & | \end{bmatrix}\\
\Rightarrow \textbf{p}(t) = e^{ \textbf{Q}t }\textbf{p}(0)
$$
</p>
</span>
</div>
<h3 id="exact-solution">Exact solution</h3>
<p>In some cases an exact solution can be obtained in closed form, without numerically exponentiating the \(\textbf{Q}\) matrix. This is the case for the Jukes-Cantor (JC) and Kimura two-parameter (K2P) models. Here is the solution for the K2P model:</p>
\[\begin{equation}
P_{i,j}(d)=\begin{cases}
\frac{1}{4}( 1 + e^{-4d/(k+2)} + 2e^{-2d(k+1)/(k+2)}) & \text{if } i = j\\
\frac{1}{4}( 1 + e^{-4d/(k+2)} - 2e^{-2d(k+1)/(k+2)}) & \text{if transition}\\
\frac{1}{4}( 1 - e^{-4d/(k+2)} ) & \text{if transversion}\\
\end{cases}
\end{equation}\]
<p>Where the transition/transversion ratio is given by \(k = \alpha/\beta\), and the coefficient \(d = (\alpha + 2\beta)t\) is used as a proxy for the time/distance between two bases. To obtain the full solution for the K2P model, the so-called ‘integrating factors’ trick is needed in the middle of the derivation.</p>
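<p>This closed form can be checked against the numerical matrix exponential. Below is a sketch that assumes a K2P rate matrix written in the base order T, C, A, G, with transition rate \(\alpha = 2\) and transversion rate \(\beta = 1\) (so \(k = 2\) and \(t = d/(\alpha + 2\beta)\)):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.linalg import expm

alpha, beta = 2.0, 1.0    # k = alpha/beta = 2
r = alpha + 2 * beta      # total substitution rate per site
# K2P rate matrix in the order T, C, A, G
Q = np.array([
    [-r,    alpha, beta,  beta ],
    [alpha, -r,    beta,  beta ],
    [beta,  beta,  -r,    alpha],
    [beta,  beta,  alpha, -r   ],
])
d = 0.2
print(np.round(expm(Q * d / r), 3))   # entries 0.825, 0.084, and 0.045, as below
</code></pre></div></div>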
<h2 id="python-implementation">Python implementation</h2>
<p>To obtain the partial likelihoods at node 7 from its two daughter tips (bases T and C), we know that:</p>
<p>\(L_{7}(T) = P_{TT}(0.2) \times P_{TC}(0.2) = 0.825 \times 0.084 = 0.069\)
\(L_{7}(C) = P_{CT}(0.2) \times P_{CC}(0.2) = 0.084 \times 0.825 = 0.069\)
\(L_{7}(A) = P_{AT}(0.2) \times P_{AC}(0.2) = 0.045 \times 0.045 = 0.002\)
\(L_{7}(G) = P_{GT}(0.2) \times P_{GC}(0.2) = 0.045 \times 0.045 = 0.002\)</p>
<p>But this can also be re-written as:</p>
\[L_{7} = \begin{bmatrix} 0.069 & 0.069 & 0.002 & 0.002 \end{bmatrix} =\\
\begin{bmatrix} 0.825 & 0.084 & 0.045 & 0.045 \end{bmatrix} \odot
\begin{bmatrix} 0.084 & 0.825 & 0.045 & 0.045 \end{bmatrix} =\\
X \odot Y\]
<p>In turn, \(X\) and \(Y\) can be re-written as:</p>
\[X = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} \cdot
\begin{bmatrix} 0.825 & 0.084 & 0.045 & 0.045\\
0.084 & 0.825 & 0.045 & 0.045\\
0.045 & 0.045 & 0.825 & 0.084\\
0.045 & 0.045 & 0.084 & 0.825 \end{bmatrix}\]
\[Y = \begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix} \cdot
\begin{bmatrix} 0.825 & 0.084 & 0.045 & 0.045\\
0.084 & 0.825 & 0.045 & 0.045\\
0.045 & 0.045 & 0.825 & 0.084\\
0.045 & 0.045 & 0.084 & 0.825 \end{bmatrix}\]
<p>Where \(L_{1} = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}\) represents the base T at the terminal node 1
and \(L_{2} = \begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix}\) represents the base C at the terminal node 2. Then,
\(L_{7}\) can be expressed in terms of the daughters’ likelihoods and the \(P\) matrix:</p>
\[L_{7} = X \odot Y = L_{1} \cdot P_{(0.2)} \odot L_{2} \cdot P_{(0.2)}\]
<p>From there follows the Felsenstein pruning algorithm. Here is a possible implementation of it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<h3 id="kimuras-two-parameters">Kimura’s two-parameters</h3>
<p>Kimura’s two-parameters equation for comparing two bases (\(i, j\)):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">k2p</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
<span class="n">_lib</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'A'</span><span class="p">:</span> <span class="s">'G'</span><span class="p">,</span> <span class="s">'G'</span><span class="p">:</span> <span class="s">'A'</span><span class="p">,</span>
<span class="s">'T'</span><span class="p">:</span> <span class="s">'C'</span><span class="p">,</span> <span class="s">'C'</span><span class="p">:</span> <span class="s">'T'</span>
<span class="p">}</span>
<span class="n">first_t</span> <span class="o">=</span><span class="k">lambda</span> <span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">:</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="o">*</span><span class="p">(</span> <span class="n">d</span><span class="o">/</span><span class="p">(</span><span class="n">k</span><span class="o">+</span><span class="mi">2</span><span class="p">)</span> <span class="p">))</span>
<span class="n">second_t</span><span class="o">=</span><span class="k">lambda</span> <span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">:</span><span class="mi">2</span><span class="o">*</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="o">*</span><span class="p">(</span> <span class="n">d</span><span class="o">*</span><span class="p">(</span><span class="n">k</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">k</span><span class="o">+</span><span class="mi">2</span><span class="p">)</span> <span class="p">))</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="n">j</span><span class="p">:</span>
<span class="c1"># equal base
</span> <span class="k">return</span> <span class="mf">0.25</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">first_t</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">)</span> <span class="o">+</span> <span class="n">second_t</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">if</span> <span class="n">_lib</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">j</span><span class="p">:</span>
<span class="c1"># transition
</span> <span class="k">return</span> <span class="mf">0.25</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">first_t</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">)</span> <span class="o">-</span> <span class="n">second_t</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># transversion
</span> <span class="k">return</span> <span class="mf">0.25</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">first_t</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">))</span>
</code></pre></div></div>
<p>P matrix implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">P_mat</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="n">k</span><span class="p">):</span>
<span class="s">"""
P matrix
"""</span>
<span class="n">bases</span> <span class="o">=</span> <span class="p">[</span><span class="s">'T'</span><span class="p">,</span> <span class="s">'C'</span><span class="p">,</span> <span class="s">'A'</span><span class="p">,</span> <span class="s">'G'</span><span class="p">]</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span>
<span class="p">[</span><span class="n">k2p</span><span class="p">(</span><span class="n">bases</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">i</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">bases</span><span class="p">],</span>
<span class="p">[</span><span class="n">k2p</span><span class="p">(</span><span class="n">bases</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">i</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">bases</span><span class="p">],</span>
<span class="p">[</span><span class="n">k2p</span><span class="p">(</span><span class="n">bases</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">i</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">bases</span><span class="p">],</span>
<span class="p">[</span><span class="n">k2p</span><span class="p">(</span><span class="n">bases</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="n">i</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">bases</span><span class="p">]</span>
<span class="p">])</span>
</code></pre></div></div>
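<p>As a quick check, the node-7 likelihoods computed above can be reproduced with <code class="language-plaintext highlighter-rouge">P_mat</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L1 = np.array([1, 0, 0, 0])   # base T at terminal node 1
L2 = np.array([0, 1, 0, 0])   # base C at terminal node 2
L7 = L1.dot(P_mat(0.2, 2)) * L2.dot(P_mat(0.2, 2))
print(np.round(L7, 3))        # [0.069 0.069 0.002 0.002]
</code></pre></div></div>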
<p>The example tree, encoded as a dictionary where each node stores its likelihood vector (<code class="language-plaintext highlighter-rouge">data</code>), its daughters, and its parent:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tree</span> <span class="o">=</span> <span class="p">{</span>
<span class="mi">0</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="bp">None</span><span class="p">},</span>
<span class="mi">1</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">7</span><span class="p">},</span>
<span class="mi">2</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">7</span><span class="p">},</span>
<span class="mi">3</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">6</span><span class="p">},</span>
<span class="mi">4</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">8</span><span class="p">},</span>
<span class="mi">5</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">8</span><span class="p">},</span>
<span class="mi">6</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">7</span><span class="p">],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">0</span><span class="p">},</span>
<span class="mi">7</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">6</span><span class="p">},</span>
<span class="mi">8</span><span class="p">:</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span> <span class="s">'daughters'</span><span class="p">:</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span> <span class="s">'parent'</span><span class="p">:</span> <span class="mi">0</span><span class="p">}</span> <span class="p">}</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">terminal_q</span> <span class="o">=</span> <span class="n">P_mat</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">internal_q</span> <span class="o">=</span> <span class="n">P_mat</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">update_probs</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">node</span><span class="p">):</span>
<span class="n">probs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="n">node</span><span class="p">][</span><span class="s">'data'</span><span class="p">])</span>
<span class="n">daughters</span> <span class="o">=</span> <span class="n">tree</span><span class="p">[</span><span class="n">node</span><span class="p">][</span><span class="s">'daughters'</span><span class="p">]</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">daughters</span><span class="p">:</span>
<span class="k">return</span> <span class="n">probs</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">terminal_q</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">probs</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">internal_q</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="recursive-function">Recursive function</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">update_recursively</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
<span class="k">if</span> <span class="n">node</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">daughters</span> <span class="o">=</span> <span class="n">tree</span><span class="p">[</span><span class="n">node</span><span class="p">][</span><span class="s">'daughters'</span><span class="p">]</span>
<span class="n">no_data</span> <span class="o">=</span> <span class="p">[</span> <span class="n">d</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">daughters</span> <span class="k">if</span> <span class="n">tree</span><span class="p">[</span><span class="n">d</span><span class="p">][</span><span class="s">'data'</span><span class="p">]</span> <span class="ow">is</span> <span class="bp">None</span> <span class="p">]</span>
<span class="k">if</span> <span class="n">no_data</span><span class="p">:</span>
<span class="k">return</span> <span class="n">update_recursively</span><span class="p">(</span><span class="n">no_data</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">new_prob</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">4</span><span class="p">,))</span>
<span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">daughters</span><span class="p">:</span>
<span class="n">new_prob</span> <span class="o">*=</span> <span class="n">update_probs</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span>
<span class="n">tree</span><span class="p">[</span><span class="n">node</span><span class="p">][</span><span class="s">'data'</span><span class="p">]</span> <span class="o">=</span> <span class="n">new_prob</span>
<span class="n">anc_node</span> <span class="o">=</span> <span class="n">tree</span><span class="p">[</span><span class="n">node</span><span class="p">][</span><span class="s">'parent'</span><span class="p">]</span>
<span class="k">return</span> <span class="n">update_recursively</span><span class="p">(</span><span class="n">anc_node</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s run this function from node 0. The printed node labels trace the traversal: the recursion descends until it reaches a node whose daughters all have data, fills that node in, and then climbs back to its parent:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">update_recursively</span><span class="p">(</span><span class="n">node</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="mi">0</span>
<span class="mi">6</span>
<span class="mi">7</span>
<span class="mi">6</span>
<span class="mi">0</span>
<span class="mi">8</span>
<span class="mi">0</span>
<span class="bp">None</span>
</code></pre></div></div>
<p>Now, if we inspect the tree again, the internal nodes have been filled in:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">tree</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
0: {'data': array([1.120e-04, 1.838e-03, 7.500e-05, 1.400e-05]),
'daughters': [6, 8],
'parent': None},
1: {'data': [1, 0, 0, 0], 'daughters': [], 'parent': 7},
2: {'data': [0, 1, 0, 0], 'daughters': [], 'parent': 7},
3: {'data': [0, 0, 1, 0], 'daughters': [], 'parent': 6},
4: {'data': [0, 1, 0, 0], 'daughters': [], 'parent': 8},
5: {'data': [0, 1, 0, 0], 'daughters': [], 'parent': 8},
6: {'data': array([0.00300556, 0.00300556, 0.00434364, 0.00044365]),
'daughters': [3, 7],
'parent': 0},
7: {'data': array([0.06953344, 0.06953344, 0.00205366, 0.00205366]),
'daughters': [1, 2],
'parent': 6},
8: {'data': array([0.00710204, 0.68077648, 0.00205366, 0.00205366]),
'daughters': [4, 5],
'parent': 0}}
</code></pre></div></div>
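<p>Finally, the root vector can be collapsed into a single tree likelihood by weighting each base by its prior frequency. A minimal sketch, assuming uniform base frequencies at the root:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root_partials = np.array(tree[0]['data'])
lik = 0.25 * root_partials.sum()   # uniform prior over T, C, A, G (an assumption)
print(np.log(lik))                 # log-likelihood of the tree, about -7.58
</code></pre></div></div>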
<p>References:</p>
<ol>
<li>Yang, Z. (2014). Molecular evolution: a statistical approach. Oxford University Press.</li>
<li>Felsenstein, J. (2004). Inferring phylogenies. Sunderland, MA: Sinauer Associates.</li>
<li>Pupko, T., & Mayrose, I. (2020). A gentle introduction to probabilistic evolutionary models.</li>
</ol>A very fast view and Python implementationBranches and dendropy2020-12-04T00:00:00+00:002020-12-04T00:00:00+00:00https://ulises-rosas.github.io/jekyll/update/terminalbranchlength<p>Dendropy nests many subclasses inside each object, so extracting a piece of information often
ends up being a task of finding the correct subclass. Here I present the calls to those subclasses
used to get some common branch metrics.</p>
<h2 id="terminal-branch-lengths">Terminal branch lengths</h2>
<p>Given the following mock tree:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dendropy</span>
<span class="n">str_tree</span> <span class="o">=</span> <span class="s">'((A:1,B:2):3,(C:4,D:5):6);'</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">dendropy</span><span class="p">.</span><span class="n">Tree</span><span class="p">.</span><span class="n">get_from_string</span><span class="p">(</span><span class="n">str_tree</span><span class="p">,</span> <span class="s">'newick'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span> <span class="n">tree</span><span class="p">.</span><span class="n">as_ascii_plot</span><span class="p">(</span><span class="n">plot_metric</span> <span class="o">=</span> <span class="s">'length'</span><span class="p">)</span> <span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> /-- A
/------+
| \---- B
+
| /-------- C
\------------+
\---------- D
</code></pre></div></div>
<p>We can get the terminal branch lengths by iterating over the tree’s edges in postorder:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">tree</span><span class="p">.</span><span class="n">postorder_edge_iter</span><span class="p">():</span>
<span class="k">if</span> <span class="n">nd</span><span class="p">.</span><span class="n">length</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">taxn</span> <span class="o">=</span> <span class="n">nd</span><span class="p">.</span><span class="n">head_node</span><span class="p">.</span><span class="n">taxon</span>
<span class="k">if</span> <span class="n">taxn</span><span class="p">:</span>
<span class="n">df</span><span class="p">[</span><span class="n">taxn</span><span class="p">.</span><span class="n">label</span><span class="p">]</span> <span class="o">=</span> <span class="n">nd</span><span class="p">.</span><span class="n">length</span>
<span class="n">df</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'A': 1.0, 'B': 2.0, 'C': 4.0, 'D': 5.0}
</code></pre></div></div>
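<p>The same dictionary can also be built by iterating over leaves directly; a short sketch, assuming DendroPy 4’s <code class="language-plaintext highlighter-rouge">leaf_node_iter</code> is available:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># each leaf node holds its taxon and the edge (branch) subtending it
{lf.taxon.label: lf.edge.length for lf in tree.leaf_node_iter()}
</code></pre></div></div>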
<h2 id="collapse-nodes-into-a-polytomy">Collapse nodes into a polytomy</h2>
<p>Between two internal nodes there is an edge (i.e., a branch length), and when this edge is
very short you might want to collapse the two nodes, thus creating a polytomy.</p>
<p>Given the following mock tree:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str_tree</span> <span class="o">=</span> <span class="s">'((A:1,B:2):3,((C:0,D:0):0,E:7):8);'</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">dendropy</span><span class="p">.</span><span class="n">Tree</span><span class="p">.</span><span class="n">get_from_string</span><span class="p">(</span><span class="n">str_tree</span><span class="p">,</span> <span class="s">'newick'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span> <span class="n">tree</span><span class="p">.</span><span class="n">as_string</span><span class="p">(</span><span class="s">'newick'</span><span class="p">)</span> <span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((A:1.0,B:2.0):3.0,((C:0.0,D:0.0):0.0,E:7.0):8.0);
</code></pre></div></div>
<p>We can see from the above string that ‘C’ and ‘D’ form a single clade and ‘E’ is their sister taxon. Now, let’s try to collapse these three taxa into a single clade by using a threshold (i.e., <code class="language-plaintext highlighter-rouge">min_len = 0</code>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">min_len</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">tree</span><span class="p">.</span><span class="n">postorder_edge_iter</span><span class="p">():</span>
<span class="k">if</span> <span class="n">nd</span><span class="p">.</span><span class="n">length</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="n">nd</span><span class="p">.</span><span class="n">is_internal</span><span class="p">()</span> <span class="ow">and</span> <span class="n">nd</span><span class="p">.</span><span class="n">length</span> <span class="o"><=</span> <span class="n">min_len</span><span class="p">:</span>
<span class="n">nd</span><span class="p">.</span><span class="n">collapse</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span> <span class="n">tree</span><span class="p">.</span><span class="n">as_string</span><span class="p">(</span><span class="s">'newick'</span><span class="p">)</span> <span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((A:1.0,B:2.0):3.0,(C:0.0,D:0.0,E:7.0):8.0);
</code></pre></div></div>
<p>Notice that now ‘C’, ‘D’, and ‘E’ are inside the same parenthesis.</p>Code gist