Optimization and Convergence
============================
This chapter will give an overview of the derivations for different optimization algorithms.
In contrast to other texts, we'll start with _the_ classic optimization algorithm, Newton's method,
derive several widely used variants from it, before coming back full circle to deep learning (DL) optimizers.
The main goal is the put DL into the context of these classical methods. While we'll focus on DL, we will also revisit
the classical algorithms for improved learning algorithms later on in this book. Physics simulations exaggerate the difficulties caused by neural networks, which is why the topics below have a particular relevance for physics-based learning tasks.
```{note}
*Deep-dive Chapter*: This chapter is a deep dive for those interested in the theory of different optimizers. It will skip evaluations as well as source code, and instead focus on theory. The chapter is highly recommend as a basis for the chapters of {doc}`physgrad`. However, it is not "mandatory" for getting started with topics like training via _differentiable physics_. If you'd rather quickly get started with practical aspects, feel free to skip ahead to {doc}`supervised`.
```
## Notation
This chapter uses a custom notation that was carefully chosen to give a clear and
brief representation of all methods under consideration.
We have a scalar loss function $L(x): \mathbb R^N \rightarrow \mathbb R$, the optimum (the minimum of $L$) is at location $x^*$,
and $\Delta$ denotes a step in $x$. Different intermediate update steps in $x$ are denoted by a subscript,
e.g., as $x_n$ or $x_k$.
In the following, we often need inversions, i.e. a division by a certain quantity.
For matrices $A$ and $B$, we define $\frac{A}{B} \equiv B^{-1} A$.
When $a$ and $b$ are vectors, the result is a matrix obtained with one of the two formulations below.
We'll specify which one to use:
$$
\frac{a}{b} \equiv \frac{a a^T}{a^T b } \text{ or } \frac{a}{b} \equiv \frac{a b^T }{b^T b}
$$ (vector-division)
% note, both divide by a scalar ; a a^T gives a symmetric matrix, a b^T not
% Only in this one chapter, the derivatives of $L(x)$ w.r.t. $x$ will be denoted by $'$. We won't use this in any other chapter, but as we're only dealing with derivatives of $x$ here, it will make some parts below a lot clearer.
%So, applying $\partial / \partial x$ once yields $L' \equiv J(x)$. As $J$ is a row vector, the gradient (column vector) $\nabla L$ is given by $J^T$. Applying $\partial / \partial x$ again gives the Hessian matrix $L'' \equiv H(x)$ And the third $\partial / \partial x$ gives the third derivative tensor $L''' \equiv K(x)$, which we luckily never need to compute as a full tensor, but which is required for some of the derivations below.
Applying $\partial / \partial x$ once to $L$ yields the Jacobian $J(x)$. As $L$ is scalar, $J$ is a row vector, and the gradient (column vector) $\nabla L$ is given by $J^T$.
Applying $\partial / \partial x$ again gives the Hessian matrix $H(x)$,
and another application of $\partial / \partial x$ gives the third derivative tensor denoted by $K(x)$. We luckily never need to compute $K$ as a full tensor, but it is needed for some of the derivations below. To shorten the notation below,
we'll typically drop the $(x)$ when a function or derivative is evaluated at location $x$, e.g., $J$ will denote $J(x)$.
The following image gives an overview of the resulting matrix shapes for some of the commonly used quantities.
We don't really need it afterwards, but for this figure $N$ denotes the dimension of $x$, i.e. $x \in \mathbb R^N$.
![opt conv shapes pic](resources/overview-optconv-shapes.png)
## Preliminaries
We'll need a few tools for the derivations below, which are summarized here for reference.
Not surprisingly, we'll need some Taylor-series expansions. With the notation above it reads:
$$L(x+\Delta) = L + J^T \Delta + \frac{1}{2} H \Delta^2 + \cdots$$
Then we also need the _Lagrange form_, which yields an exact solution for a $\xi$ from the interval $[x, x+\Delta]$:
$$L(x+\Delta) = L + J^T \Delta + \frac{1}{2} H(\xi) \Delta^2$$
In several instances we'll make use of the fundamental theorem of calculus, repeated here for completeness:
$$f(x+\Delta) = f(x) + \int_0^1 \text{d}s ~ f'(x+s \Delta) \Delta \ . $$
In addition, we'll make use of Lipschitz-continuity with constant $\mathcal L$:
$|f(x+\Delta) + f(x)|\le \mathcal L \Delta$, and the well-known Cauchy-Schwartz inequality:
$ u^T v < |u| \cdot |v| $.
## Newton's method
Now we can start with arguably the most classic algorithm for optimization: _Newton's method_. It is derived
by approximating the function we're interested in as a parabola. This can be motivated by the fact that pretty much every minimum looks like a parabola close up.
![parabola linear](resources/overview-optconv-parab.png)
So we can represent $L$ near an optimum $x^*$ by a parabola of the form $L(x) = \frac{1}{2} H(x-x^*)^2 + c$,
where $c$ denotes a constant offset. At location $x$ we observe $H$ and
$J^T=H \cdot (x_k-x^*)$. Re-arranging this directly yields an equation to compute the minimum: $x^* = x_k - \frac{J^T}{H}$.
Newton's method by default computes $x^*$ in a single step, and
hence the update in $x$ of Newton's method is given by:
$$
\Delta = - \frac{J^T}{H}
$$ (opt-newton)
Let's look at the order of convergence of Newton's method.
For an optimum $x^*$ of $L$, let
$\Delta_n^* = x^* - x_n$ denote the step from a current $x_n$ to the optimum, as illustrated below.
![newton x-* pic](resources/overview-optconv-minimum.png)
Assuming differentiability of $J$,
we can perform the Lagrange expansion of $J^T$ at $x^*$:
$$\begin{aligned}
0 = J^T(x^*) &= J^T(x_n) + H(x_n) \Delta^*_n + \frac{1}{2} K (\xi_n ){\Delta^*_n}^2
\\
\frac{J^T}{H} &= -\frac{K}{2H}{\Delta^*_n}^2 - \Delta^*_n
\end{aligned}$$
In the second line, we've already divided by $H$, and dropped $(x_n)$ and $(\xi_n )$ to shorten the notation.
When we insert this into $\Delta_n^*$ we get:
$$\begin{aligned}
{\Delta^*_{n+1}} &= x^* - x_{n+1} \\
&= x^* - \big( x_n - \frac{J^T}{H} \big) \\
&= \Delta^*_n - \frac{K}{2H} {\Delta^*_n}^2 - \Delta^*_n \\
&= - \frac{K}{2H} {\Delta^*_n}^2
\end{aligned}$$
% simple: insert $\frac{J}{H}$ from above, re-arrange
Thus, the distance to the optimum changes by ${\Delta^*_n}^2$, which means once we're close enough
we have quadratic convergence. This is great, of course,
but it still depends on the pre-factor $\frac{K}{2H}$, and will diverge if its $>1$.
Note that this is an exact expression, there's no truncation thanks to the Lagrange expansion.
And so far we have quadratic convergence, but the convergence to the optimum is not guaranteed.
For this we have to allow for a variable step size.
## Adaptive step size
Thus, as a next step for Newton's method we
introduce a variable step size $\lambda$ which gives the iteration
$x_{n+1} = x_n + \lambda \Delta = x_n - \lambda \frac{J^T}{H}$.
As illustrated in the picture below, this is especially helpful if $L$ is not exactly
a parabola, and a small $H$ might overshoot in undesirable ways. The far left in this example:
![newton lambda step pic](resources/overview-optconv-adap.png)
To make statements about convergence, we need some fundamental assumptions: convexity and smoothness
of our loss function. Then we'll focus on showing that the loss decreases, and
that we move along a sequence of smaller sets $\forall x ~ L(x)