Layer code

A layer basically represents the equation $y = Wx + b$, where $W$ is the weight matrix and $b$ is the bias.

import numpy as np

class Layer:

    def __init__(self, M, N):
        # M outputs, N inputs; weights drawn from the standard normal distribution
        self.weight = np.random.randn(M, N)
        self.bias = np.zeros(M)

    def __forward(self, x):
        # inner product of the (M, N) weight matrix with the length-N input, plus the bias
        return np.inner(self.weight, x) + self.bias


The forward pass takes the inner product of the weight matrix with the input and adds the bias. The problem is that the weights are initialized from the standard normal distribution.

This is a problem because:

  • Vanishing Gradient Problem
    • If the weights are too small, the gradients shrink toward zero and the network stops learning.
  • Exploding Gradient Problem
    • If the weights are too large, the gradients blow up and the outputs stop making sense.

Detailed understanding of the problem

If we initialize the weights from a standard normal distribution, we can see that the distribution of the values at each layer keeps getting flatter as we go deeper, because the variance keeps growing.

Mathematical proof of the problem:

We write $\mathrm{Var}[W^l]$ for the shared scalar variance of all weights at layer $l$, and $n_l$ for the number of inputs to that layer. Then for a network with $d$ layers,

$$\mathrm{Var}[z^d] = \mathrm{Var}[x] \prod_{l=1}^{d} n_l \, \mathrm{Var}[W^l].$$

As you can see, each layer's variance gets multiplied by yet another factor. This is why the graph keeps flattening out: the variance grows at every layer.
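
We can check this multiplicative behaviour numerically. The sketch below is my own illustration (the width of 256, depth of 10, and the plain linear stack with no activation are arbitrary demo choices): it pushes one input through a stack of layers with standard-normal weights and compares the empirical variance with the product formula above.

import numpy as np

np.random.seed(0)

n, depth = 256, 10                   # layer width and number of layers (demo choices)
z = np.random.randn(n)               # input with roughly unit variance
var_pred = z.var()

for l in range(depth):
    W = np.random.randn(n, n)        # standard normal weights, Var[W] = 1
    z = np.inner(W, z)               # one linear layer, no bias, no activation
    var_pred *= n * 1.0              # product formula: multiply by n_l * Var[W^l]
    print(f"layer {l+1}: empirical Var[z] = {z.var():.2e}, predicted = {var_pred:.2e}")

The empirical variance tracks the prediction up to statistical noise and grows by roughly a factor of n at every layer, which is exactly the compounding the formula describes.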

To fix this, we want the variance of the activations (and, symmetrically, of the gradients) to stay the same across layers.

From a forward-propagation point of view, to keep information flowing we would like that

$$\forall (l, l'), \quad \mathrm{Var}[z^l] = \mathrm{Var}[z^{l'}].$$

From a back-propagation point of view we would similarly like to have

$$\forall (l, l'), \quad \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial z^l}\right] = \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial z^{l'}}\right].$$
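
The forward condition works out to $n_l \, \mathrm{Var}[W^l] = 1$ and the backward condition to $n_{l+1} \, \mathrm{Var}[W^l] = 1$. Unless consecutive layers have the same width, both cannot hold at once, so Xavier (Glorot) initialization uses the compromise $\mathrm{Var}[W^l] = \frac{2}{n_l + n_{l+1}}$. Below is a minimal sketch of that compromise, keeping the (M, N) = (outputs, inputs) convention from the code above; the function name xavier_init and the use of a normal rather than a uniform distribution are my choices.

import numpy as np

def xavier_init(M, N):
    # compromise between the forward and backward conditions:
    # Var[W] = 2 / (fan_in + fan_out) = 2 / (N + M)
    std_dev = np.sqrt(2.0 / (N + M))
    return np.random.normal(0, std_dev, size=(M, N))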

Kaiming initialization

In popular terms, we use Kaiming and Xavier initialization. The Kaiming paper takes the activation function (ReLU) into account, whereas Xavier initialization does not. The derivation from the Kaiming paper is summarized below.

Summary:

  • Let us assume that $w_l$ is symmetric around 0 and the bias is zero.
  • ReLU only cares about the positive part of its input: $x_l = \max(0, y_{l-1})$.
  • So we only care about half of the distribution; because $y_{l-1}$ is symmetric around 0, we can calculate $E[x_l^2] = \frac{1}{2}\mathrm{Var}[y_{l-1}]$.
  • We also know that $\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l x_l]$ (where $y_l = W_l x_l + b_l$ is the pre-activation of a layer with $n_l$ inputs) and that $\mathrm{Var}[w_l x_l] = E[w_l^2]E[x_l^2] - (E[w_l])^2(E[x_l])^2$; we set the weights to have zero mean, so the right-hand term vanishes. We also know that $E[w_l^2] = \mathrm{Var}[w_l]$, as the mean is zero. The final equation is $\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l] \, E[x_l^2]$.
  • Let us combine the previous two points; we get $\mathrm{Var}[y_l] = \frac{1}{2} n_l \, \mathrm{Var}[w_l] \, \mathrm{Var}[y_{l-1}]$, so across $L$ layers $\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \prod_{l=2}^{L} \frac{1}{2} n_l \, \mathrm{Var}[w_l]$.
  • For these factors not to scale the signal up or down, we need $\frac{1}{2} n_l \, \mathrm{Var}[w_l] = 1$ for every layer.
  • Therefore, we need $\mathrm{Var}[w_l] = \frac{2}{n_l}$.

Kaiming initialization sets the variance to

$$\mathrm{Var}[w_l] = \frac{2}{n_l}, \qquad \text{i.e. a standard deviation of } \sigma = \sqrt{\frac{2}{n_l}},$$

where $n_l$ is the fan-in (number of inputs) of the layer.
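
A quick numerical sanity check of the key step $E[x_l^2] = \frac{1}{2}\mathrm{Var}[y_{l-1}]$ (the sample size and the variance of 9 are arbitrary demo choices):

import numpy as np

np.random.seed(0)
y = np.random.normal(0, 3.0, size=1_000_000)  # symmetric around 0, Var[y] = 9
x = np.maximum(y, 0)                           # ReLU keeps only the positive half

print(np.mean(x ** 2))    # approximately 4.5
print(0.5 * np.var(y))    # approximately 4.5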

Modified code for the layer

class Layer:

    def __init__(self, M, N, bias=None):
        # M outputs, N inputs
        self.weight = self.__kaiming_init(M, N)
        self.bias = bias if bias is not None else np.zeros(M)

    def __forward(self, x):
        output = np.inner(self.weight, x)
        if self.bias is not None:
            return output + self.bias
        return output

    def __kaiming_init(self, M, N):
        # Var[w] = 2 / fan_in; the fan-in of this layer is N, the input dimension
        std_dev = np.sqrt(2.0 / N)
        return np.random.normal(0, std_dev, size=(M, N))


This explanation assumes a vanilla neural network with no normalization layers. Much of the value of careful initialization is lost once batch normalization or layer normalization is used.
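
Still, for the vanilla setting, here is a quick usage check of the Layer class above (a sketch: the width, depth, and the explicit ReLU between layers are my additions, since the class itself does not apply an activation). With Kaiming initialization the activation scale stays roughly constant across depth instead of exploding or vanishing:

import numpy as np

np.random.seed(0)
n, depth = 256, 10
layers = [Layer(n, n) for _ in range(depth)]

x = np.random.randn(n)
for i, layer in enumerate(layers, start=1):
    # __forward is name-mangled, so apply the same computation through the
    # public attributes and add the ReLU that the derivation assumes
    x = np.maximum(np.inner(layer.weight, x) + layer.bias, 0)
    print(f"std after layer {i}: {x.std():.3f}")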

References: