Gradient Descent
If we initialize the parameters at some $w_0$, the gradient descent algorithm updates the parameters as follows:
$$w_{t+1} = w_t - \alpha \nabla L(w_t),$$
where $\alpha$ is the learning rate and $\nabla L(w_t)$ is the gradient of the loss function at $w_t$.
Each update decreases the value of the loss, as long as the learning rate is small enough and the gradient is nonzero. Eventually, this should bring gradient descent close to a point where the gradient is zero.
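To make the update concrete, here is a minimal Python sketch of this rule applied to a toy quadratic loss $L(w) = \frac{1}{2} w^T A w - b^T w$; the particular $A$, $b$, learning rate, and iteration count are arbitrary choices for the example.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w - b^T w (an arbitrary example objective).
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def loss(w):
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

# Gradient descent: w_{t+1} = w_t - alpha * grad(w_t)
alpha = 0.1      # learning rate
w = np.zeros(2)  # initial parameters w_0
for t in range(100):
    w = w - alpha * grad(w)

print("final w:", w)           # approaches the minimizer A^{-1} b = [1/3, 1]
print("final loss:", loss(w))
```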
We can prove that gradient descent works under the assumption that the second derivative of the objective is bounded. Suppose that for some constant $M > 0$, for all $w$ in the space and for any vector $u$ in $\mathbb{R}^d$,
$$u^T \nabla^2 L(w)\, u \le M \|u\|^2.$$
Here, $\nabla^2 L(w)$ denotes the matrix of second partial derivatives (the Hessian) of the loss function $L$. This is equivalent to the condition that every eigenvalue of $\nabla^2 L(w)$ is at most $M$, i.e., $\nabla^2 L(w) \preceq M I$.
Starting from this condition, let's look at how the objective changes over time as we run gradient descent with a fixed step size. From Taylor's theorem, there exists a point $\xi_t$ on the segment between $w_t$ and $w_{t+1}$ such that
$$L(w_{t+1}) = L(w_t) + \nabla L(w_t)^T (w_{t+1} - w_t) + \frac{1}{2} (w_{t+1} - w_t)^T \nabla^2 L(\xi_t) (w_{t+1} - w_t).$$
Substituting the update $w_{t+1} - w_t = -\alpha \nabla L(w_t)$ and applying the curvature bound gives
$$L(w_{t+1}) \le L(w_t) - \alpha \|\nabla L(w_t)\|^2 + \frac{\alpha^2 M}{2} \|\nabla L(w_t)\|^2.$$
If we choose our step size to be small enough that $\alpha \le \frac{1}{M}$, then
$$L(w_{t+1}) \le L(w_t) - \frac{\alpha}{2} \|\nabla L(w_t)\|^2.$$
The objective is guaranteed to decrease at each iteration.
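As a quick numerical check of this per-step bound, the sketch below runs gradient descent with step size $\alpha = 1/M$ on the same kind of toy quadratic and asserts that every step decreases the loss by at least $\frac{\alpha}{2}\|\nabla L(w_t)\|^2$; the specific quadratic is again an arbitrary example.

```python
import numpy as np

# Toy quadratic objective L(w) = 0.5 * w^T A w - b^T w; its maximum curvature M
# is the largest eigenvalue of A (the Hessian).
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def loss(w):
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

M = np.linalg.eigvalsh(A).max()
alpha = 1.0 / M   # step size chosen so that alpha <= 1/M

w = np.zeros(2)
for t in range(20):
    g = grad(w)
    w_next = w - alpha * g
    # Descent lemma: L(w_{t+1}) <= L(w_t) - (alpha / 2) * ||grad L(w_t)||^2
    assert loss(w_next) <= loss(w) - 0.5 * alpha * (g @ g) + 1e-12
    w = w_next

print("the per-step decrease bound held at every iteration")
```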
Now, if we sum this up across $T$ iterations of gradient descent, we get
$$\frac{\alpha}{2} \sum_{t=0}^{T-1} \|\nabla L(w_t)\|^2 \le \sum_{t=0}^{T-1} \left( L(w_t) - L(w_{t+1}) \right) = L(w_0) - L(w_T) \le L(w_0) - L^*,$$
where $L^*$ is the global minimum value of the loss function $L$. From here, we can get
$$\min_{0 \le t < T} \|\nabla L(w_t)\|^2 \le \frac{1}{T} \sum_{t=0}^{T-1} \|\nabla L(w_t)\|^2 \le \frac{2 \left( L(w_0) - L^* \right)}{\alpha T}.$$
This means that the smallest squared gradient norm we observe over $T$ iterations shrinks proportionally to $1/T$. So gradient descent converges to a point with small gradient (as long as we look at the smallest observed gradient).
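The sketch below illustrates this rate on the same toy quadratic, comparing the smallest squared gradient norm observed so far against the bound $\frac{2(L(w_0) - L^*)}{\alpha T}$; for such a simple objective the observed minimum is typically far below the bound, which is only an upper limit.

```python
import numpy as np

# Toy quadratic objective used to check the 1/T bound on the smallest squared gradient norm.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def loss(w):
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

M = np.linalg.eigvalsh(A).max()
alpha = 1.0 / M
w0 = np.array([5.0, -5.0])
L_star = loss(np.linalg.solve(A, b))   # global minimum value of this quadratic

w = w0.copy()
min_sq_grad = np.inf
for T in range(1, 201):
    g = grad(w)                         # gradient at w_{T-1}
    min_sq_grad = min(min_sq_grad, g @ g)
    w = w - alpha * g
    if T % 50 == 0:
        bound = 2.0 * (loss(w0) - L_star) / (alpha * T)
        print(f"T={T:4d}  min ||grad||^2 = {min_sq_grad:.3e}  bound = {bound:.3e}")
```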
We have a metric to measure the rate of this convergence, called the condition number ($\kappa$) of the function. The condition number is the ratio of the largest eigenvalue of the Hessian matrix to its smallest eigenvalue: if $M$ bounds the largest eigenvalue (as above) and $m > 0$ bounds the smallest eigenvalue from below, then
$$\kappa = \frac{M}{m}.$$
If the condition number is large, the function is ill-conditioned and the convergence of gradient descent is slow.
If we take the step size $\alpha$ to be $1/M$, the inverse of the maximum curvature of the Hessian, and the objective is strongly convex (so $m > 0$), gradient descent converges exponentially fast:
$$L(w_T) - L^* \le \left( 1 - \frac{1}{\kappa} \right)^T \left( L(w_0) - L^* \right).$$
A large condition number means the contraction factor $1 - 1/\kappa$ is close to $1$, so the loss decreases only slightly at each step; a condition number close to $1$ means the loss shrinks substantially at each step.
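The sketch below illustrates this effect with the same step-size rule $\alpha = 1/M$ on two toy quadratics, one well-conditioned and one ill-conditioned; the eigenvalue choices are arbitrary and only meant to show the gap in convergence speed.

```python
import numpy as np

def run_gd(eigs, steps=100):
    """Run gradient descent with alpha = 1/M on a quadratic whose Hessian has the given eigenvalues."""
    A = np.diag(eigs)
    b = np.ones(len(eigs))

    def grad(w):
        return A @ w - b

    w_star = np.linalg.solve(A, b)      # exact minimizer of the quadratic
    alpha = 1.0 / max(eigs)             # step size = inverse of the maximum curvature
    w = np.zeros(len(eigs))
    for _ in range(steps):
        w = w - alpha * grad(w)
    gap = 0.5 * (w - w_star) @ A @ (w - w_star)   # L(w) - L^* for a quadratic
    kappa = max(eigs) / min(eigs)       # condition number
    return gap, kappa

# Well-conditioned vs. ill-conditioned quadratic (arbitrary eigenvalue choices).
for eigs in ([1.0, 2.0], [1.0, 100.0]):
    gap, kappa = run_gd(eigs)
    print(f"kappa = {kappa:6.1f}  suboptimality after 100 steps = {gap:.3e}")
```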
The proof of this result is more involved; take a look here.