# Posts on Mathematics

## An interpretation of the Lagrange multipliers method

Posted by Diego Assencio on 2015.11.14 under Mathematics (Calculus)

The Lagrange multipliers method is a technique which solves the following problem: find the local maxima and minima of a function $f(x_1, \ldots, x_n)$ over values of $(x_1, \ldots, x_n)$ which satisfy $m$ constraint equations: $$g_k(x_1, \ldots, x_n) = 0, \quad k = 1, 2, \ldots, m. \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints}$$ Each constraint equation defines an $(n-1)$-dimensional surface. The intersection of all these surfaces defines the set of points over which we must find local maxima and minima of $f({\bf x})$, where ${\bf x} = (x_1, \ldots, x_n)$. In what follows, we will refer to this set of points as the feasible surface. Without the constraints defined in \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints}, we could use standard techniques from calculus to solve the problem by finding and analyzing the stationary points of $f({\bf x})$. With the imposed constraints, however, things are not as simple.

Our approach here will be to study the properties of the local maxima and minima of $f({\bf x})$ subject to the constraints above. This will lead us naturally to the Lagrange multipliers method and therefore illustrate why it works.

To start, notice that each constraint equation $g_k({\bf x}) = 0$ allows us to express one coordinate $x_j$ in terms of the others. In principle, we can start with $g_1({\bf x}) = 0$ and determine $x_n$ in terms of $(x_1, \ldots, x_{n-1})$, then use this equation for $x_n$ on $g_2({\bf x}) = 0$ to determine $x_{n-1}$ in terms of $(x_1, \ldots, x_{n-2})$ and so on. With $m$ constraints, we can determine $(x_{n-m+1}, \ldots, x_n)$ in terms of $(x_1, \ldots, x_{n-m})$, so the feasible surface can be parameterized using $(n-m)$ parameters, one for each of the coordinates $x_j$ for $j = 1, 2, \ldots, (n-m)$. To be more precise, we can set $x_j = \alpha_j$ for $j = 1, 2, \ldots, (n-m)$ and express the remaining coordinates $x_j$ for which $j = (n-m+1), \ldots, n$ in terms of $(\alpha_1, \ldots, \alpha_{n-m})$. In the general case, we might not be able to proceed exactly as described above (for instance, if $g_1({\bf x})$ does not explicitly depend on $x_n$, we cannot use it to determine $x_n$ in terms of $(x_1, \ldots, x_{n-1})$), but the main idea remains valid: ignoring pathological cases, we can express one of the free variables in terms of the other free variables for each constraint equation, so with $m$ constraints, only $(n-m)$ variables will remain free. This means the feasible surface is an $(n-m)$-dimensional surface.

Let's give an example to clarify what we just discussed. Consider a function $f(x,y,z)$ and two constraints $g_1(x,y,z) = 0$ and $g_2(x,y,z) = 0$. We can use $g_1(x,y,z) = 0$ to determine $z$ as a function of $x$ and $y$: $z = z(x,y)$. By replacing $z$ with $z(x,y)$ on $g_2(x,y,z) = 0$, we remain with only two free variables $x$ and $y$, so we can determine $y$ as a function of $x$: $y = y(x)$, meaning the feasible surface is a curve which can be parameterized using $x = \alpha$. This curve is the set of points $(\alpha, y(\alpha), z(\alpha, y(\alpha)))$ for all valid values of $\alpha$. Therefore, with $n = 3$ dimensions and $m = 2$ constraints, we get a feasible surface with $(n-m) = 1$ dimension.

Now consider a point ${\bf x}$ on the feasible surface such that $f({\bf x})$ is a local maximum or minimum. Consider also an infinitesimal displacement $\delta{\bf x}$ away from ${\bf x}$, with $\delta{\bf x}$ being tangent to the feasible surface. We have that: $$f({\bf x} + \delta{\bf x}) \approx f({\bf x}) + \nabla f({\bf x})\cdot \delta{\bf x} \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_f_x_plus_dx}$$ Since $\nabla f({\bf x})$ points along the direction at which $f({\bf x})$ increases the fastest, $\nabla f({\bf x})$ must be locally perpendicular to the feasible surface and therefore $\nabla f({\bf x})\cdot \delta{\bf x} = 0$. If this would not be the case, then either $\nabla f({\bf x})\cdot \delta{\bf x} \gt 0$ or $\nabla f({\bf x})\cdot \delta{\bf x} \lt 0$. Assuming $\nabla f({\bf x})\cdot \delta{\bf x} \gt 0$, we see from equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_f_x_plus_dx} that $f({\bf x} + \delta{\bf x}) \gt f({\bf x})$. But for such a displacement $\delta{\bf x}$, we would also have $\nabla f({\bf x})\cdot (-\delta{\bf x}) \lt 0$ and equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_f_x_plus_dx} with $\delta{\bf x}$ replaced by $(-\delta{\bf x})$ implies $f({\bf x} - \delta{\bf x}) \lt f({\bf x})$. In other words, $f$ increases if we move to ${\bf x} + \delta{\bf x}$ but decreases if we move to ${\bf x} - \delta{\bf x}$, meaning $f({\bf x})$ cannot be a local maximum or minimum. The argument is similar for $\nabla f({\bf x})\cdot \delta{\bf x} \lt 0$. To summarize, $\nabla f({\bf x})$ is locally perpendicular to the feasible surface at all points ${\bf x}$ of the surface where $f({\bf x})$ is a local maximum or minimum.

This leads us to the question: how can we mathematically determine what "perpendicular to the feasible surface" is? The answer lies in the constraints from \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints}: for each function $g_k({\bf x})$, the constraint $g_k({\bf x}) = 0$ defines a level set of $g_k({\bf x})$, i.e., a set of points over which $g_k({\bf x})$ is equal to a constant value. Since the constant in our case is zero, we will refer to this level set as the zero level set of $g_k({\bf x})$. Using the fact that $\nabla g_k({\bf x})$ points along the direction of fastest growth of $g_k({\bf x})$ at ${\bf x}$, we can infer that it must be locally perpendicular to the zero level set of $g_k({\bf x})$ since, for a given point ${\bf x}$ on the zero level set and an infinitesimal displacement vector $\delta{\bf x}$ which is tangent to this level set, we have that ${\bf x} + \delta{\bf x}$ is still on the zero level set and therefore: $$g_k({\bf x} + \delta{\bf x}) = 0 \approx g_k({\bf x}) + \nabla g_k({\bf x})\cdot \delta{\bf x} = 0 + \nabla g_k({\bf x})\cdot \delta{\bf x}$$ so $\nabla g_k({\bf x})\cdot \delta{\bf x} = 0$. In other words, for every ${\bf x}$ in the zero level set of $g_k({\bf x})$, $\nabla g_k({\bf x})$ is locally perpendicular to this surface at ${\bf x}$.

Therefore, for a point ${\bf x}$ on the zero level set of $g_k({\bf x})$, $\nabla g_k({\bf x})$ defines an $(n-1)$-dimensional plane which is tangent to this level set at ${\bf x}$. For any infinitesimal displacement $\delta{\bf x}$ which is parallel to this plane, ${\bf x} + \delta{\bf x}$ is still on the zero level set of $g_k({\bf x})$. So if ${\bf x}$ belongs to the feasible surface, it is on the zero level set of all functions $g_k({\bf x})$ and therefore there are $m$ planes which pass through ${\bf x}$ and are perpendicular to the zero level sets of each $g_k({\bf x})$ for $k = 1, 2, \ldots m$ respectively. The intersection of these $m$ planes defines an $(n-m)$-dimensional surface which is locally equal (tangent) to the feasible surface. Indeed, for an infinitesimal displacement $\delta{\bf x}$ away from ${\bf x}$ such that ${\bf x} + \delta{\bf x}$ remains on the intersection of the $m$ planes, ${\bf x} + \delta{\bf x}$ remains on the zero level set of every function $g_k({\bf x})$. In other words, ${\bf x} + \delta{\bf x}$ is on the feasible surface, so the $(n-m)$-dimensional surface defined by the intersection of the $m$ planes at ${\bf x}$ is locally tangent to the feasible surface.

As an example, if $n = 3$ and $m = 2$, for any point $(x,y,z)$ on the feasible surface, there are two planes perpendicular to $\nabla g_1(x,y,z)$ and $\nabla g_2(x,y,z)$ respectively which pass through $(x,y,z)$; the intersection of these planes defines a line passing through $(x,y,z)$. For small displacements around $(x,y,z)$ on this line, we stay on both of these planes and therefore on the zero level sets of both $g_1(x,y,z)$ and $g_2(x,y,z)$, i.e., on the feasible surface. This implies that points on this line within an infinitesimal neighborhood of ${\bf x}$ are part of the feasible surface as well.

For the displacement vector $\delta{\bf x}$ considered above, since ${\bf x} + \delta{\bf x}$ is on each of the $m$ planes which pass through ${\bf x}$ and are perpendicular to $\nabla g_k({\bf x})$ for $k = 1, 2, \ldots m$ respectively, then $\delta{\bf x}$ is perpendicular to all $\nabla g_k({\bf x})$ and therefore to any linear combination of these vectors, i.e., for any set of values $(\lambda_1, \ldots, \lambda_m)$, we have that: $$\sum_{k=1}^m \lambda_k \nabla g_k({\bf x})\cdot\delta{\bf x} = 0$$ As a concrete example, consider again the case $n = 3$ and $m = 2$. The feasible surface is a one-dimensional curve: at each point $(x,y,z)$ on this curve, there are two distinct planes passing through $(x,y,z)$ which are locally tangent to the zero level sets of $g_1(x,y,z)$ and $g_2(x,y,z)$ and perpendicular to $\nabla g_1(x,y,z)$ and $\nabla g_2(x,y,z)$ respectively. The intersection of these planes forms a line which is locally tangent to the feasible surface at $(x,y,z)$. An infinitesimal displacement vector $\delta{\bf x}$ which is parallel to this line is parallel to both planes and therefore simultaneously perpendicular to $\nabla g_1(x,y,z)$ and $\nabla g_2(x,y,z)$. This means $\delta{\bf x}$ is perpendicular to $\lambda_1 \nabla g_1(x,y,z) + \lambda_2 \nabla g_2(x,y,z)$ for any values of $\lambda_1$ and $\lambda_2$.

As we showed above, for a point ${\bf x}$ on the feasible surface at which $f({\bf x})$ is a local maximum or minimum, $\nabla f({\bf x})$ is locally perpendicular to any infinitesimal displacement vector $\delta{\bf x}$ which is tangent to the feasible surface at ${\bf x}$. Since the feasible surface has dimension $(n-m)$ and the set of $m$ perpendicular vectors $\nabla g_k({\bf x})$ spans locally an $m$-dimensional vector space (assuming they are linearly independent, which is the case here since we agreed on ignoring pathological cases), then we must be able to express $\nabla f({\bf x})$ as a linear combination of any $(n-m)$ linearly independent vectors which are tangent to the feasible surface and the $m$ vectors $\nabla g_k({\bf x})$ for $k = 1, 2, \ldots m$ which are locally perpendicular to it. With $\nabla f({\bf x})$ being perpendicular to the feasible surface, we can only have: $$\nabla f({\bf x}) = \sum_{k=1}^m \lambda_k \nabla g_k({\bf x}) \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f}$$ In other words, if we can determine all points ${\bf x}$ for which both \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints} and \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f} hold for any set of values $(\lambda_1, \ldots, \lambda_m)$, then all maxima and minima of $f({\bf x})$ over the feasible surface must be contained in this set of points because, as we showed above, every maximum or minimum of $f({\bf x})$ on the feasible surface must satisfy equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f} for some set of values $(\lambda_1, \ldots, \lambda_m)$. But this is exactly what the Lagrange multiplier method computes! Indeed, the Lagrangian multiplier method tells us to solve the following problem: compute all stationary points of the following function (commonly called the "Lagrangian" for the problem): $$L(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m) = f(x_1, \ldots, x_n) - \sum_{k=1}^m \lambda_k g_k(x_1, \ldots, x_n) \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_L}$$ Notice that this problem has no constraints: we just have to determine the stationary points of $L$, i.e, solve a standard calculus problem. At the stationary points of $L$, all its first order partial derivatives vanish: $$\begin{eqnarray} \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_1}\displaystyle\frac{\partial L}{\partial x_i} &=& \frac{\partial f}{\partial x_i} - \sum_{k=1}^m \lambda_k \frac{\partial g_k}{\partial x_i} = 0 & \quad \textrm{ for } i = 1, 2, \ldots, n \\[5pt] \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_2}\displaystyle\frac{\partial L}{\partial \lambda_k} &=& g_k(x_1, \ldots, x_n) = 0 &\quad \textrm{ for } k = 1, 2, \ldots, m \end{eqnarray}$$ Equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_2} shows that the stationary points $(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m)$ of $L$ are such that $(x_1, \ldots, x_n)$ satisfy all the constraints $g_k(x_1, \ldots, x_n) = 0$ for $k = 1, 2, \ldots, m$, so these points fall inside the feasible surface defined by equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints}. Additionally, equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_1} can be expressed as below: $$\nabla_{\bf x} L({\bf x}, {\pmb\lambda}) = \nabla f({\bf x}) - \sum_{k=1}^m \lambda_k \nabla g_k({\bf x}) = 0 \Longrightarrow \nabla f({\bf x}) = \sum_{k=1}^m \lambda_k \nabla g_k({\bf x})$$ where above $\nabla_{\bf x} L$ is used to indicate that it is the gradient of $L$ computed only over the coordinates ${\bf x} = (x_1, \ldots, x_n)$, with $\pmb\lambda = (\lambda_1, \ldots, \lambda_m)$. To summarize, the Lagrangian multiplier method determines all points $({\bf x}, {\pmb\lambda})$ such that the constraints \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints} are satisfied and such that \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f} also holds. By studying the properties of each of these points, we can then determine which ones are local maxima, minima or just saddle points, but this task is outside the scope of this post and will be left to the reader. Notice, however, that if all we care about is finding the global maximum/minimum of $f({\bf x})$ over the feasible surface, all we need to do is compute $f({\bf x})$ for every point ${\bf x}$ obtained from the Lagrange multipliers method and only keep the largest/smallest one.

### References

 [1] Dan Klein, Lagrange Multipliers without Permanent Scarring [2] T. M. Apostol, Calculus, Volume II, John Wiley & Sons; 2nd edition (1969)

## How many coin flips until you get a head?

Posted by Diego Assencio on 2014.05.21 under Mathematics (Statistics and probability)

Suppose you have a coin which, when flipped, yields a head with probability $p$ and a tail with probability $q = 1-p$ (if the coin is fair, then $p = q = 1/2$). How many times do you need to flip this coin on average until you get a head?

To answer this question, consider the probability $P(n)$ of obtaining a head only on the $n$-th coin flip. In other words, $P(n)$ is the probability of obtaining $n-1$ consecutive tails and then a head: $$P(n) = q^{n-1} p$$ The average number of coin flips we need until we get a head is then: $$E_p[n] = \sum_{n=1}^\infty nP(n) = \sum_{n=1}^\infty nq^{n-1} p \label{post_6c1cfa313064329046317358d2aa22c0_mean_n}$$ All we need to do now is to compute the infinite sum on the right-hand side of equation \eqref{post_6c1cfa313064329046317358d2aa22c0_mean_n}. To achieve that goal, consider the following sequence: $$e_k = \sum_{n=1}^k nq^{n-1} p = p\sum_{n=1}^k nq^{n-1} \label{post_6c1cfa313064329046317358d2aa22c0_ek}$$ The sequence $e_k$ is such that $E_p[n] = \lim_{k \rightarrow \infty} e_k$. It is closely related to the geometric series. In fact, consider the sequence $g_k$ which is obtained by taking the first $k$ terms of the geometric series as its $k$-th element (below $z$ is any real number of our choice): $$g_k = \sum_{n=0}^k z^n = \displaystyle\frac{1 - z^{k+1}}{1 - z}$$ Taking the derivative of $g_k$ with respect to $z$ yields: $$\displaystyle\frac{d g_k}{dz} = \sum_{n=1}^k nz^{n-1} = -\frac{(k+1)z^k}{1 - z} + \frac{1 - z^{k+1}}{(1 - z)^2} \label{post_6c1cfa313064329046317358d2aa22c0_dgk_dz}$$ where the term with $n = 0$ was omitted since it is identically zero. Comparing equations \eqref{post_6c1cfa313064329046317358d2aa22c0_ek} and \eqref{post_6c1cfa313064329046317358d2aa22c0_dgk_dz}, we see that: $$e_k = p\sum_{n=1}^k nq^{n-1} = p\displaystyle\frac{d g_k}{dz}\Bigg|_{z = q}$$ In other words, $e_k$ is equal to $p$ times the derivative of $g_k$ with respect to $z$ computed at $z = q$: $$e_k = p\left(-\frac{(k+1)q^k}{p} + \frac{1 - q^{k+1}}{p^2}\right) = -(k+1)q^k + \frac{1 - q^{k+1}}{p} \label{post_6c1cfa313064329046317358d2aa22c0_ek2}$$ If $p = 0$, we will never get a head and the problem is therefore of little interest. For $p \gt 0$, $q \lt 1$ and therefore $q^k \rightarrow 0$ as $k \rightarrow \infty$. Using this fact on equation \eqref{post_6c1cfa313064329046317358d2aa22c0_ek2}, we obtain: $$\boxed{ E_p[n] = \lim_{k\rightarrow \infty} e_k = \displaystyle\frac{1}{p} }$$ Interestingly, the answer for this problem is extremely simple: the average number of times we need to flip the coin until we get a head is just $1/p$. If $p = 1$, every coin flip results in a head, so $E_{p=1}[n] = 1$. As $p$ approaches $0$, the coin becomes more and more biased towards tails, so the average number of times we need to flip it until we get a head increases and, not surprisingly, diverges as $p \rightarrow 0$. Since for a fair coin we have $p = 1/2$, then: $$\boxed{ E_{p=1/2}[n] = 2 }$$ so we need on average to flip the coin only twice to get a head.

## The Fibonacci sequence and the golden ratio

Posted by Diego Assencio on 2014.05.01 under Mathematics (Real analysis)

The Fibonacci sequence is a very famous sequence which is defined as below: $$F_0 = 0, \quad F_1 = 1, \quad F_n = F_{n-1} + F_{n-2} \;\; \textrm{for} \; n \geq 2$$ The first values of the sequence are $0,1,1,2,3,5,8,13,\ldots$

### Obtaining $F_n$ in a non-recursive manner

Since the sequence is defined in a recursive manner, computing its $n$-th element without using recursion is not an immediately obvious task. It can, however, be done if we look at the elements of the Fibonacci sequence in pairs: $$\begin{eqnarray} F_{n+1} &=& F_{n} + F_{n-1} \nonumber \\[5pt] F_{n} &=& F_{n} \label{post_c772250d88e35665d3a793882a7b70e5_fibonacci_pairs} \end{eqnarray}$$ The second equality is trivial, but it is useful as it helps us interpret equation \eqref{post_c772250d88e35665d3a793882a7b70e5_fibonacci_pairs} using matrices as shown below: $$\left(\begin{matrix} F_{n+1} \\ F_n \end{matrix}\right) = \left(\begin{matrix} F_{n} + F_{n-1} \\ F_{n} \end{matrix}\right) = \left(\begin{matrix} 1 & 1 \\ 1 & 0\end{matrix}\right) \left(\begin{matrix} F_{n} \\ F_{n-1} \end{matrix}\right) \label{post_c772250d88e35665d3a793882a7b70e5_fibonacci_pairs_matrix}$$ Equation \eqref{post_c772250d88e35665d3a793882a7b70e5_fibonacci_pairs_matrix} is very interesting since it yields a recursive formula which associates the pairs $(F_{n+1},F_n)$ and $(F_n,F_{n-1})$. We can use this equation to relate $(F_{n+1},F_n)$ directly to $(F_1,F_0)$. Defining: $$M = \left(\begin{matrix} 1 & 1 \\ 1 & 0\end{matrix}\right)$$ we have that: $$\left(\begin{matrix} F_{n+1} \\ F_n \end{matrix}\right) = M\left(\begin{matrix} F_{n} \\ F_{n-1} \end{matrix}\right) = M^2\left(\begin{matrix} F_{n-1} \\ F_{n-2} \end{matrix}\right) = \ldots = M^n\left(\begin{matrix} F_{1} \\ F_{0} \end{matrix}\right)$$ Since $F_0 = 0$ and $F_1 = 1$, we obtain: $$\boxed{ \left(\begin{matrix} F_{n+1} \\ F_n \end{matrix}\right) = M^n\left(\begin{matrix} 1 \\ 0 \end{matrix}\right) } \label{post_c772250d88e35665d3a793882a7b70e5_eq_Fnp1_Fn_power_of_M}$$ Computing $M^n$ can be done easily if we first diagonalize it ($M$ is symmetric and therefore diagonalizable). Let's first compute its eigenvalues: $$\det(M - \lambda I) = 0 \Longrightarrow \det\left(\begin{matrix} 1 - \lambda & 1 \\ 1 & - \lambda\end{matrix}\right) = \lambda^2 - \lambda - 1 = 0 \label{post_c772250d88e35665d3a793882a7b70e5_eq_det_M_minus_lambdaI}$$ The solutions of equation \eqref{post_c772250d88e35665d3a793882a7b70e5_eq_det_M_minus_lambdaI} are the eigenvalues of $M$: $$\displaystyle \lambda_{\pm} = \frac{1 \pm \sqrt{5}}{2} \label{post_c772250d88e35665d3a793882a7b70e5_lambda_pm}$$ The eigenvalue $\lambda_+$ is known as the golden ratio and is usually represented as the Greek letter $\phi$ (we will however use $\lambda_+$ in this discussion since it will be more convenient).

Let's now compute the eigenvectors of $M$ associated with these eigenvalues: $$\left(\begin{matrix} 1 & 1 \\ 1 & 0\end{matrix}\right) \left(\begin{matrix}\alpha_{\pm} \\ \beta_{\pm} \end{matrix}\right) = \left(\begin{matrix}\alpha_{\pm} + \beta_{\pm}\\ \alpha_{\pm} \end{matrix}\right) = \lambda_{\pm}\left(\begin{matrix}\alpha_{\pm} \\ \beta_{\pm} \end{matrix}\right) \label{post_c772250d88e35665d3a793882a7b70e5_eigenvec_M_general}$$ Equation \eqref{post_c772250d88e35665d3a793882a7b70e5_eigenvec_M_general} implies that the eigenvectors of $M$ satisfy: $$\alpha_{\pm} = \lambda_{\pm}\beta_{\pm} \label{post_c772250d88e35665d3a793882a7b70e5_alpha_beta_pm}$$ Since we can choose $\beta_+$ and $\beta_-$ arbitrarily, we will take $\beta_+ = \beta_- = 1$ to compute two eigenvectors ${\bf w}_+$ and ${\bf w}_-$ of $M$ associated with $\lambda_+$ and $\lambda_-$ respectively. With these choices of values for $\beta_{\pm}$ and from equation \eqref{post_c772250d88e35665d3a793882a7b70e5_alpha_beta_pm}, we get: $${\bf w}_+ = \left(\begin{matrix}\lambda_+ \\ 1 \end{matrix}\right) \quad\quad {\bf w}_- = \left(\begin{matrix}\lambda_- \\ 1 \end{matrix}\right) \label{post_c772250d88e35665d3a793882a7b70e5_eigenvectors_M}$$ NOTE: since $M$ is symmetric, its eigenvectors must be orthogonal. Indeed: $${\bf w}_+^T{\bf w}_- = \lambda_+\lambda_- + 1 = \displaystyle \left(\frac{1 + \sqrt{5}}{2}\right)\left(\frac{1 - \sqrt{5}}{2}\right) + 1 =\frac{1 - 5}{4} + 1 = 0$$

Equation \eqref{post_c772250d88e35665d3a793882a7b70e5_eigenvectors_M} allows us to diagonalize $M$. Letting $P = ({\bf w}_+, {\bf w}_-)$ be the $2 \times 2$ matrix whose columns are ${\bf w}_+$ and ${\bf w}_-$: $$P = \left(\begin{matrix} \lambda_+ & \lambda_- \\ 1 & 1 \end{matrix}\right)$$ we have that: $$MP = (M{\bf w}_+, M{\bf w}_-) = (\lambda_+{\bf w}_+, \lambda_-{\bf w}_-) = ({\bf w}_+, {\bf w}_-)\left(\begin{matrix} \lambda_+ & 0 \\ 0 & \lambda_-\end{matrix}\right) = PD$$ and therefore: $$M = PDP^{-1}, \quad D = \left(\begin{matrix} \lambda_+ & 0 \\ 0 & \lambda_-\end{matrix}\right), \quad P^{-1} = \frac{1}{\lambda_+ - \lambda_-}\left(\begin{matrix} 1 & -\lambda_- \\ -1 & \lambda_+\end{matrix}\right) \label{post_c772250d88e35665d3a793882a7b70e5_diagonalization_M}$$ where $P^{-1}$ was computed using the standard formula for the inverse of $2 \times 2$ matrices ($P$ is invertible since $\det(P) = \lambda_+ - \lambda_- \neq 0$).

Equation \eqref{post_c772250d88e35665d3a793882a7b70e5_diagonalization_M} is the diagonalization of $M$. This equation is very useful since it turns the computation of $M^n$ into a trivial task: $$M^n = (PDP^{-1})^n = PD^nP^{-1} \label{%INDEX_eq_Mn}$$ where $$D^n = \left(\begin{matrix} \lambda_+^n & 0 \\ 0 & \lambda_-^n\end{matrix}\right) \label{%INDEX_eq_Dn}$$ Using equations \eqref{%INDEX_eq_Mn} and \eqref{%INDEX_eq_Dn} on equation \eqref{post_c772250d88e35665d3a793882a7b70e5_eq_Fnp1_Fn_power_of_M}, we obtain: $$\left(\begin{matrix} F_{n+1} \\ F_n \end{matrix}\right) = PD^nP^{-1}\left(\begin{matrix} 1 \\ 0 \end{matrix}\right) = \frac{1}{\lambda_+ - \lambda_-}\left(\begin{matrix} \lambda_+^{n+1} - \lambda_-^{n+1} \\ \lambda_+^{n} - \lambda_-^{n} \end{matrix}\right) \label{post_c772250d88e35665d3a793882a7b70e5_Fnp1_and_Fn_final}$$ Using the values of $\lambda_{\pm}$ given on equation \eqref{post_c772250d88e35665d3a793882a7b70e5_lambda_pm}, we obtain the final expression for $F_n$: $$\boxed{ F_n = \displaystyle\frac{1}{\sqrt{5}}\left[\left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n\right] } \label{post_c772250d88e35665d3a793882a7b70e5_final_Fn}$$

### The golden ratio

Since $F_{n+1} = F_{n} + F_{n-1} \geq F_{n} + 1$ for all $n \geq 2$, the sequence $F_n$ diverges as $n \rightarrow \infty$. However, the sequence: $$a_n = \displaystyle\frac{F_{n+1}}{F_n} \;\; \textrm{for} \; n \geq 1$$ does not. As a matter of fact, it converges to the the golden ratio. Proving this fact is easy (and there are many ways to do it). From equation \eqref{post_c772250d88e35665d3a793882a7b70e5_Fnp1_and_Fn_final}, we have that: $$\displaystyle\frac{F_{n+1}}{F_n} = \frac{\lambda_+^{n+1} - \lambda_-^{n+1}}{\lambda_+^{n} - \lambda_-^{n}} = \lambda_+\left(\frac{1 - (\lambda_-/\lambda_+)^{n+1}} {1 - (\lambda_-/\lambda_+)^{n}}\right)$$ Since $|\lambda_+| \approx 1.62 \gt |\lambda_-| \approx 0.62$, both $(\lambda_-/\lambda_+)^{n+1}$ and $(\lambda_-/\lambda_+)^{n}$ vanish as $n \rightarrow \infty$. This implies that: $$\boxed{ \lim_{n \rightarrow \infty}\displaystyle\frac{F_{n+1}}{F_n} = \lambda_+ = \displaystyle\frac{1 + \sqrt{5}}{2} = 1.61803\ldots }$$

### Computing $F_n$ for large values of $n$

Although equation \eqref{post_c772250d88e35665d3a793882a7b70e5_final_Fn} is mathematically correct, it is not adequate for computing $F_n$ using a computer. The problem is that the evaluation of $F_n$ will always have issues due to the accuracy errors which are inherent in floating point arithmetic.

One can, however, use arbitrary precision arithmetic to compute $F_n$ using equation \eqref{post_c772250d88e35665d3a793882a7b70e5_eq_Fnp1_Fn_power_of_M} and exponentiation by squaring. A Python script which does exactly that can be found here.

Notice, however, that the time complexity of the method just described is not $O(\log_2 n)$! In spite of the fact that the time complexity of exponentiation by squaring is logarithmic, the use of arbitrary precision makes the time complexity of the overall algorithm become $O(n)$. This is because $\textrm{digits}(F_n)$, the number of digits in $F_n$, is proportional to $n$ as $n$ becomes large and arithmetic with arbitrary precision involves operations which are $O(n)$ or worse if $n$ is the typical length of the numbers involved (consider adding two large numbers with $n$ digits: the time complexity of that operation is $O(n)$). To see why $\textrm{digits}(F_n)$ grows linearly with $n$, notice first that since $|\lambda_+| \gt |\lambda_-|$, $F_n \approx \lambda_+^n / \sqrt{5}$ for large values of $n$. Therefore, when $n$ becomes large, we have that: $$\textrm{digits}(F_n) \approx \log_{10}(F_n) \approx \log_{10}(\lambda_+^n) - \log_{10}(\sqrt{5}) \approx n\log_{10}(\lambda_+) \propto n$$ To better illustrate the time complexity of computing $F_n$ using exponentiation by squaring and arbitrary precision arithmetic, I used the Python script mentioned above to compute $F_n$ for $n = 2^i$, $i = 20, 21, \ldots, 30$ (those are quite large values of $n$). Figure 1 shows that the time taken for computing $F_n$ grows linearly with $n$.

 Fig. 1: $\log_2(T_n)$ vs. $\log_2(n)$, where $T_n$ is the time (in seconds) required for computing $F_n$. A linear fit for the given data is also shown to illustrate the fact that $T_n$ grows linearly with $n$.