The Lagrange multipliers method is a technique which solves the following problem: find the local maxima and minima of a function $f(x_1, \ldots, x_n)$ over values of $(x_1, \ldots, x_n)$ which satisfy $m$ constraint equations: $$ g_k(x_1, \ldots, x_n) = 0, \quad k = 1, 2, \ldots, m. \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints} $$ Each constraint equation defines an $(n-1)$-dimensional surface. The intersection of all these surfaces defines the set of points over which we must find local maxima and minima of $f({\bf x})$, where ${\bf x} = (x_1, \ldots, x_n)$. In what follows, we will refer to this set of points as the feasible surface. Without the constraints defined in \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints}, we could use standard techniques from calculus to solve the problem by finding and analyzing the stationary points of $f({\bf x})$. With the imposed constraints, however, things are not as simple.

Our approach here will be to study the properties of the local maxima and minima of $f({\bf x})$ subject to the constraints above. This will lead us naturally to the Lagrange multipliers method and therefore illustrate why it works.

To start, notice that each constraint equation $g_k({\bf x}) = 0$ allows us to express one coordinate $x_j$ in terms of the others. In principle, we can start with $g_1({\bf x}) = 0$ and determine $x_n$ in terms of $(x_1, \ldots, x_{n-1})$, then use this equation for $x_n$ on $g_2({\bf x}) = 0$ to determine $x_{n-1}$ in terms of $(x_1, \ldots, x_{n-2})$ and so on. With $m$ constraints, we can determine $(x_{n-m+1}, \ldots, x_n)$ in terms of $(x_1, \ldots, x_{n-m})$, so the feasible surface can be parameterized using $(n-m)$ parameters, one for each of the coordinates $x_j$ for $j = 1, 2, \ldots, (n-m)$. To be more precise, we can set $x_j = \alpha_j$ for $j = 1, 2, \ldots, (n-m)$ and express the remaining coordinates $x_j$ for which $j = (n-m+1), \ldots, n$ in terms of $(\alpha_1, \ldots, \alpha_{n-m})$. In the general case, we might not be able to proceed exactly as described above (for instance, if $g_1({\bf x})$ does not explicitly depend on $x_n$, we cannot use it to determine $x_n$ in terms of $(x_1, \ldots, x_{n-1})$), but the main idea remains valid: ignoring pathological cases, we can express one of the free variables in terms of the other free variables for each constraint equation, so with $m$ constraints, only $(n-m)$ variables will remain free. This means the feasible surface is an $(n-m)$-dimensional surface.

Let's give an example to clarify what we just discussed. Consider a function $f(x,y,z)$ and two constraints $g_1(x,y,z) = 0$ and $g_2(x,y,z) = 0$. We can use $g_1(x,y,z) = 0$ to determine $z$ as a function of $x$ and $y$: $z = z(x,y)$. By replacing $z$ with $z(x,y)$ on $g_2(x,y,z) = 0$, we remain with only two free variables $x$ and $y$, so we can determine $y$ as a function of $x$: $y = y(x)$, meaning the feasible surface is a curve which can be parameterized using $x = \alpha$. This curve is the set of points $(\alpha, y(\alpha), z(\alpha, y(\alpha)))$ for all valid values of $\alpha$. Therefore, with $n = 3$ dimensions and $m = 2$ constraints, we get a feasible surface with $(n-m) = 1$ dimension.

Now consider a point ${\bf x}$ on the feasible surface such that $f({\bf x})$ is a local maximum or minimum. Consider also an infinitesimal displacement $\delta{\bf x}$ away from ${\bf x}$, with $\delta{\bf x}$ being tangent to the feasible surface. We have that: $$ f({\bf x} + \delta{\bf x}) \approx f({\bf x}) + \nabla f({\bf x})\cdot \delta{\bf x} \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_f_x_plus_dx} $$ Since $\nabla f({\bf x})$ points along the direction at which $f({\bf x})$ increases the fastest, $\nabla f({\bf x})$ must be locally perpendicular to the feasible surface and therefore $\nabla f({\bf x})\cdot \delta{\bf x} = 0$. If this would not be the case, then either $\nabla f({\bf x})\cdot \delta{\bf x} \gt 0$ or $\nabla f({\bf x})\cdot \delta{\bf x} \lt 0$. Assuming $\nabla f({\bf x})\cdot \delta{\bf x} \gt 0$, we see from equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_f_x_plus_dx} that $f({\bf x} + \delta{\bf x}) \gt f({\bf x})$. But for such a displacement $\delta{\bf x}$, we would also have $\nabla f({\bf x})\cdot (-\delta{\bf x}) \lt 0$ and equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_f_x_plus_dx} with $\delta{\bf x}$ replaced by $(-\delta{\bf x})$ implies $f({\bf x} - \delta{\bf x}) \lt f({\bf x})$. In other words, $f$ increases if we move to ${\bf x} + \delta{\bf x}$ but decreases if we move to ${\bf x} - \delta{\bf x}$, meaning $f({\bf x})$ cannot be a local maximum or minimum. The argument is similar for $\nabla f({\bf x})\cdot \delta{\bf x} \lt 0$. To summarize, $\nabla f({\bf x})$ is locally perpendicular to the feasible surface at all points ${\bf x}$ of the surface where $f({\bf x})$ is a local maximum or minimum.

This leads us to the question: how can we mathematically determine what "perpendicular to the feasible surface" is? The answer lies in the constraints from \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints}: for each function $g_k({\bf x})$, the constraint $g_k({\bf x}) = 0$ defines a level set of $g_k({\bf x})$, i.e., a set of points over which $g_k({\bf x})$ is equal to a constant value. Since the constant in our case is zero, we will refer to this level set as the zero level set of $g_k({\bf x})$. Using the fact that $\nabla g_k({\bf x})$ points along the direction of fastest growth of $g_k({\bf x})$ at ${\bf x}$, we can infer that it must be locally perpendicular to the zero level set of $g_k({\bf x})$ since, for a given point ${\bf x}$ on the zero level set and an infinitesimal displacement vector $\delta{\bf x}$ which is tangent to this level set, we have that ${\bf x} + \delta{\bf x}$ is still on the zero level set and therefore: $$ g_k({\bf x} + \delta{\bf x}) = 0 \approx g_k({\bf x}) + \nabla g_k({\bf x})\cdot \delta{\bf x} = 0 + \nabla g_k({\bf x})\cdot \delta{\bf x} $$ so $\nabla g_k({\bf x})\cdot \delta{\bf x} = 0$. In other words, for every ${\bf x}$ in the zero level set of $g_k({\bf x})$, $\nabla g_k({\bf x})$ is locally perpendicular to this surface at ${\bf x}$.

Therefore, for a point ${\bf x}$ on the zero level set of $g_k({\bf x})$, $\nabla g_k({\bf x})$ defines an $(n-1)$-dimensional plane which is tangent to this level set at ${\bf x}$. For any infinitesimal displacement $\delta{\bf x}$ which is parallel to this plane, ${\bf x} + \delta{\bf x}$ is still on the zero level set of $g_k({\bf x})$. So if ${\bf x}$ belongs to the feasible surface, it is on the zero level set of all functions $g_k({\bf x})$ and therefore there are $m$ planes which pass through ${\bf x}$ and are perpendicular to the zero level sets of each $g_k({\bf x})$ for $k = 1, 2, \ldots m$ respectively. The intersection of these $m$ planes defines an $(n-m)$-dimensional surface which is locally equal (tangent) to the feasible surface. Indeed, for an infinitesimal displacement $\delta{\bf x}$ away from ${\bf x}$ such that ${\bf x} + \delta{\bf x}$ remains on the intersection of the $m$ planes, ${\bf x} + \delta{\bf x}$ remains on the zero level set of every function $g_k({\bf x})$. In other words, ${\bf x} + \delta{\bf x}$ is on the feasible surface, so the $(n-m)$-dimensional surface defined by the intersection of the $m$ planes at ${\bf x}$ is locally tangent to the feasible surface.

As an example, if $n = 3$ and $m = 2$, for any point $(x,y,z)$ on the feasible surface, there are two planes perpendicular to $\nabla g_1(x,y,z)$ and $\nabla g_2(x,y,z)$ respectively which pass through $(x,y,z)$; the intersection of these planes defines a line passing through $(x,y,z)$. For small displacements around $(x,y,z)$ on this line, we stay on both of these planes and therefore on the zero level sets of both $g_1(x,y,z)$ and $g_2(x,y,z)$, i.e., on the feasible surface. This implies that points on this line within an infinitesimal neighborhood of ${\bf x}$ are part of the feasible surface as well.

For the displacement vector $\delta{\bf x}$ considered above, since ${\bf x} + \delta{\bf x}$ is on each of the $m$ planes which pass through ${\bf x}$ and are perpendicular to $\nabla g_k({\bf x})$ for $k = 1, 2, \ldots m$ respectively, then $\delta{\bf x}$ is perpendicular to all $\nabla g_k({\bf x})$ and therefore to any linear combination of these vectors, i.e., for any set of values $(\lambda_1, \ldots, \lambda_m)$, we have that: $$ \sum_{k=1}^m \lambda_k \nabla g_k({\bf x})\cdot\delta{\bf x} = 0 $$ As a concrete example, consider again the case $n = 3$ and $m = 2$. The feasible surface is a one-dimensional curve: at each point $(x,y,z)$ on this curve, there are two distinct planes passing through $(x,y,z)$ which are locally tangent to the zero level sets of $g_1(x,y,z)$ and $g_2(x,y,z)$ and perpendicular to $\nabla g_1(x,y,z)$ and $\nabla g_2(x,y,z)$ respectively. The intersection of these planes forms a line which is locally tangent to the feasible surface at $(x,y,z)$. An infinitesimal displacement vector $\delta{\bf x}$ which is parallel to this line is parallel to both planes and therefore simultaneously perpendicular to $\nabla g_1(x,y,z)$ and $\nabla g_2(x,y,z)$. This means $\delta{\bf x}$ is perpendicular to $\lambda_1 \nabla g_1(x,y,z) + \lambda_2 \nabla g_2(x,y,z)$ for any values of $\lambda_1$ and $\lambda_2$.

As we showed above, for a point ${\bf x}$ on the feasible surface at which $f({\bf x})$ is a local maximum or minimum, $\nabla f({\bf x})$ is locally perpendicular to any infinitesimal displacement vector $\delta{\bf x}$ which is tangent to the feasible surface at ${\bf x}$. Since the feasible surface has dimension $(n-m)$ and the set of $m$ perpendicular vectors $\nabla g_k({\bf x})$ spans locally an $m$-dimensional vector space (assuming they are linearly independent, which is the case here since we agreed on ignoring pathological cases), then we must be able to express $\nabla f({\bf x})$ as a linear combination of any $(n-m)$ linearly independent vectors which are tangent to the feasible surface and the $m$ vectors $\nabla g_k({\bf x})$ for $k = 1, 2, \ldots m$ which are locally perpendicular to it. With $\nabla f({\bf x})$ being perpendicular to the feasible surface, we can only have: $$ \nabla f({\bf x}) = \sum_{k=1}^m \lambda_k \nabla g_k({\bf x}) \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f} $$ In other words, if we can determine all points ${\bf x}$ for which both \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints} and \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f} hold for any set of values $(\lambda_1, \ldots, \lambda_m)$, then all maxima and minima of $f({\bf x})$ over the feasible surface must be contained in this set of points because, as we showed above, every maximum or minimum of $f({\bf x})$ on the feasible surface must satisfy equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f} for some set of values $(\lambda_1, \ldots, \lambda_m)$. But this is exactly what the Lagrange multiplier method computes! Indeed, the Lagrangian multiplier method tells us to solve the following problem: compute all stationary points of the following function (commonly called the "Lagrangian" for the problem): $$ L(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m) = f(x_1, \ldots, x_n) - \sum_{k=1}^m \lambda_k g_k(x_1, \ldots, x_n) \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_L} $$ Notice that this problem has no constraints: we just have to determine the stationary points of $L$, i.e, solve a standard calculus problem. At the stationary points of $L$, all its first order partial derivatives vanish: $$ \begin{eqnarray} \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_1}\displaystyle\frac{\partial L}{\partial x_i} &=& \frac{\partial f}{\partial x_i} - \sum_{k=1}^m \lambda_k \frac{\partial g_k}{\partial x_i} = 0 & \quad \textrm{ for } i = 1, 2, \ldots, n \\[5pt] \label{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_2}\displaystyle\frac{\partial L}{\partial \lambda_k} &=& g_k(x_1, \ldots, x_n) = 0 &\quad \textrm{ for } k = 1, 2, \ldots, m \end{eqnarray} $$ Equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_2} shows that the stationary points $(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m)$ of $L$ are such that $(x_1, \ldots, x_n)$ satisfy all the constraints $g_k(x_1, \ldots, x_n) = 0$ for $k = 1, 2, \ldots, m$, so these points fall inside the feasible surface defined by equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints}. Additionally, equation \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_partial_L_1} can be expressed as below: $$ \nabla_{\bf x} L({\bf x}, {\pmb\lambda}) = \nabla f({\bf x}) - \sum_{k=1}^m \lambda_k \nabla g_k({\bf x}) = 0 \Longrightarrow \nabla f({\bf x}) = \sum_{k=1}^m \lambda_k \nabla g_k({\bf x}) $$ where above $\nabla_{\bf x} L$ is used to indicate that it is the gradient of $L$ computed only over the coordinates ${\bf x} = (x_1, \ldots, x_n)$, with $\pmb\lambda = (\lambda_1, \ldots, \lambda_m)$. To summarize, the Lagrangian multiplier method determines all points $({\bf x}, {\pmb\lambda})$ such that the constraints \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_constraints} are satisfied and such that \eqref{post_63f2ce1a5c3bb291485c0ef5fb8223f2_eq_nabla_f} also holds. By studying the properties of each of these points, we can then determine which ones are local maxima, minima or just saddle points, but this task is outside the scope of this post and will be left to the reader. Notice, however, that if all we care about is finding the global maximum/minimum of $f({\bf x})$ over the feasible surface, all we need to do is compute $f({\bf x})$ for every point ${\bf x}$ obtained from the Lagrange multipliers method and only keep the largest/smallest one.

### References

[1] | Dan Klein, Lagrange Multipliers without Permanent Scarring |

[2] | T. M. Apostol, Calculus, Volume II, John Wiley & Sons; 2nd edition (1969) |