Gradient Descent

If we initialize the parameters at some point $w_0$, the gradient descent algorithm updates the parameters as follows:

$$w_{t+1} = w_t - \alpha \nabla f(w_t),$$

where $\alpha$ is the learning rate and $\nabla f(w_t)$ is the gradient of the loss function $f$ at $w_t$.

Each iteration decreases the value of the loss, as long as the learning rate is small enough and the gradient is nonzero. Eventually, this should bring gradient descent close to a point where the gradient is zero.
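As a concrete illustration, here is a minimal sketch of this update rule on a toy least-squares loss. The quadratic objective, the matrix `A`, and the step size are illustrative choices, not anything prescribed above.

```python
# Minimal gradient descent sketch on a toy least-squares loss
# f(w) = 0.5 * ||A w - b||^2, whose gradient is A^T (A w - b).
# A, b, and the learning rate are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def loss(w):
    r = A @ w - b
    return 0.5 * r @ r

def grad(w):
    return A.T @ (A @ w - b)

alpha = 0.01      # learning rate
w = np.zeros(5)   # initial parameters w_0
for t in range(500):
    w = w - alpha * grad(w)   # w_{t+1} = w_t - alpha * grad f(w_t)

print(f"final loss = {loss(w):.4f}, gradient norm = {np.linalg.norm(grad(w)):.2e}")
```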

We can prove that gradient descent works under the assumption that the second derivative of the objective is bounded. Suppose that for some constant $L > 0$, for all $w$ in the space and for any vector $u$ in $\mathbb{R}^d$,

$$\left| u^T \nabla^2 f(w) \, u \right| \le L \|u\|^2.$$

Here, $\nabla^2 f(w)$ denotes the matrix of second partial derivatives (the Hessian) of the loss function $f$. This is equivalent to the condition that the gradient of $f$ is $L$-Lipschitz:

$$\|\nabla f(w) - \nabla f(v)\| \le L \|w - v\| \quad \text{for all } w, v.$$
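As a quick sanity check (a sketch, not part of the argument itself): for the least-squares loss from the previous snippet the Hessian is constant, so we can compute $L$ exactly and verify the Lipschitz inequality on random pairs of points.

```python
# Numerically verifying the gradient-Lipschitz condition for the quadratic
# f(w) = 0.5 * ||A w - b||^2: its Hessian is A^T A everywhere, so L is the
# largest eigenvalue of A^T A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

H = A.T @ A                       # Hessian (constant in w for a quadratic)
L = np.linalg.eigvalsh(H).max()   # smoothness constant

def grad(w):
    return A.T @ (A @ w - b)

for _ in range(5):
    w, v = rng.normal(size=5), rng.normal(size=5)
    lhs = np.linalg.norm(grad(w) - grad(v))
    rhs = L * np.linalg.norm(w - v)
    assert lhs <= rhs + 1e-9      # the Lipschitz bound holds
print("Lipschitz bound verified on random pairs.")
```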
Starting from this condition, let’s look at how the objective changes over time as we run gradient descent with a fixed step size $\alpha$. From Taylor’s theorem, there exists a point $\xi$ on the segment between $w_t$ and $w_{t+1}$ such that

$$f(w_{t+1}) = f(w_t) - \alpha \|\nabla f(w_t)\|^2 + \frac{\alpha^2}{2} \, \nabla f(w_t)^T \nabla^2 f(\xi) \, \nabla f(w_t) \le f(w_t) - \left( \alpha - \frac{L \alpha^2}{2} \right) \|\nabla f(w_t)\|^2.$$

If we choose our step size to be small enough that $\alpha \le 1/L$, then

$$f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \|\nabla f(w_t)\|^2.$$
The objective is guaranteed to decrease at each iteration.
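To see this guarantee, and its failure when the step size is too large, here is a small experiment on the same kind of toy quadratic; the specific matrix and the multiples of $1/L$ are illustrative choices.

```python
# With alpha = 1/L every iteration decreases the quadratic loss; with a step
# size well beyond 2/L (here 3/L) gradient descent diverges instead.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
L = np.linalg.eigvalsh(A.T @ A).max()   # smoothness constant of this f

def f(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

def grad(w):
    return A.T @ (A @ w - b)

for alpha in (1.0 / L, 3.0 / L):
    w = np.zeros(5)
    monotone = True
    for _ in range(100):
        w_next = w - alpha * grad(w)
        monotone &= bool(f(w_next) <= f(w))
        w = w_next
    print(f"alpha = {alpha:.4f}: loss decreased every step = {monotone}")
```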

Now, if we sum the descent inequality above across $T$ iterations of gradient descent, we get

$$\frac{\alpha}{2} \sum_{t=0}^{T-1} \|\nabla f(w_t)\|^2 \le f(w_0) - f(w_T) \le f(w_0) - f^*,$$

where $f^*$ is the global minimum value of the loss function $f$. From here, we can get

$$\min_{0 \le t < T} \|\nabla f(w_t)\| \le \sqrt{\frac{2 \left( f(w_0) - f^* \right)}{\alpha T}}.$$
This means that the smallest gradient norm we observe after $T$ iterations shrinks in proportion to $1/\sqrt{T}$. So gradient descent converges (as long as we look at the smallest observed gradient).
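This bound can be checked empirically. The sketch below uses the same least-squares setup, where $f^*$ happens to be computable exactly via a least-squares solve; the checkpoints are arbitrary.

```python
# Tracking the smallest observed gradient norm against the bound
# sqrt(2 (f(w_0) - f*) / (alpha * T)) from the argument above.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
L = np.linalg.eigvalsh(A.T @ A).max()
alpha = 1.0 / L

def f(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

def grad(w):
    return A.T @ (A @ w - b)

w_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact minimizer of this toy loss
f_star = f(w_star)

w = np.zeros(5)
f0 = f(w)
smallest = np.linalg.norm(grad(w))
for t in range(1, 1001):
    w = w - alpha * grad(w)
    smallest = min(smallest, np.linalg.norm(grad(w)))
    if t in (10, 100, 1000):
        bound = np.sqrt(2 * (f0 - f_star) / (alpha * t))
        print(f"T = {t:4d}: min grad norm = {smallest:.3e}, bound = {bound:.3e}")
```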

We have a metric to measure the rate of this convergence: the condition number ($\kappa$) of the function. The condition number is the ratio of the largest eigenvalue of the Hessian matrix to its smallest eigenvalue, $\kappa = \lambda_{\max} / \lambda_{\min}$. If the condition number is large, the function is ill-conditioned and the convergence of gradient descent is slow.

If we set the step size $\alpha$ to the inverse of the maximum curvature of the Hessian matrix, $\alpha = 1/\lambda_{\max}$, gradient descent converges exponentially fast on a strongly convex objective, at a rate governed by the condition number.

A large condition number means the loss barely decreases at each step along the directions of low curvature, so many iterations are needed; a small condition number (close to 1) means each step makes substantial progress toward the minimum.
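A toy experiment makes this concrete (the two diagonal quadratics below are illustrative choices): with $\alpha = 1/\lambda_{\max}$, a well-conditioned quadratic is solved almost immediately, while an ill-conditioned one still has most of its error left in the low-curvature direction after many steps.

```python
# Gradient descent with alpha = 1/lambda_max on f(w) = 0.5 * w^T H w for two
# diagonal Hessians: kappa = 1 converges in one step, while kappa = 100 shrinks
# the low-curvature coordinate only by a factor (1 - 1/kappa) per step.
import numpy as np

for kappa in (1.0, 100.0):
    H = np.diag([kappa, 1.0])   # eigenvalues: lambda_max = kappa, lambda_min = 1
    alpha = 1.0 / kappa         # step size = 1 / lambda_max
    w = np.array([1.0, 1.0])
    for _ in range(100):
        w = w - alpha * (H @ w) # gradient of 0.5 * w^T H w is H w
    print(f"kappa = {kappa:5.0f}: distance to optimum after 100 steps = {np.linalg.norm(w):.2e}")
```

The per-step contraction factor $(1 - 1/\kappa)$ in the slowest direction is exactly the exponential rate mentioned above.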

The proof of this convergence rate is fairly involved; take a look here.

Stochastic Gradient Descent (SGD)

Mini-Batch Gradient Descent

Adaptive Learning Rates

AdaGrad

RMSProp

Adam