Project author: pulakk

Project description:
An extensive derivation of backpropagation algorithm in a simple Neural Network

Project URL: git://github.com/pulakk/Backprop-Derivation.git
Created: 2018-01-17T18:34:38Z
Project community: https://github.com/pulakk/Backprop-Derivation



Understanding and Deriving the Backpropagation algorithm in a simple neural network

Let us try to understand the Backpropagation algorithm using an overly simplified Neural Network with just one hidden layer.

There is a bit of linear algebra and calculus involved, but anyone with at least a basic knowledge of matrix multiplication and the chain rule of differentiation should be able to follow most of what will be discussed.

We shall not delve too deep into proving the algorithm itself, but by the end it will be clear how the same principles can be used to intuitively understand larger networks with many more hidden layers and units.

The Model

For the sake of simplicity, we will be using a Neural Network with two inputs, two outputs and one hidden layer with 3 nodes.
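
For reference, here is a minimal NumPy sketch of the parameter shapes such a 2-3-2 network would have. The variable names (`W1`, `b1`, `W2`, `b2`, `x`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter names; shapes follow the 2-3-2 architecture above.
W1 = rng.standard_normal((3, 2))  # hidden layer weights: 3 hidden units x 2 inputs
b1 = rng.standard_normal((3, 1))  # hidden layer biases
W2 = rng.standard_normal((2, 3))  # output layer weights: 2 outputs x 3 hidden units
b2 = rng.standard_normal((2, 1))  # output layer biases

x = rng.standard_normal((2, 1))   # a single input, as a column vector
```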

Notations

Let us represent $x$ as the input, $h$ as the activation of the hidden layer, $y$ as the output, and $b^{(1)}$ and $b^{(2)}$ as the biases of the hidden layer and the output layer respectively.

And let us represent the weight matrices of the hidden layer and the output layer as $W^{(1)}$ and $W^{(2)}$ respectively.

Matrix multiplication (Dot Product) will be represented using the $\cdot$ symbol, whereas element-wise multiplication (Hadamard Product) will be represented using the $\odot$ symbol.

We shall use a general notation $\sigma$ for all the activation functions, and the symbol $\sigma'$ for its gradient.
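
In NumPy terms (used here only as an illustration, not as part of the original derivation), these two products correspond to the `@` operator and the `*` operator respectively:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([[1.0], [2.0]])
w = np.array([[10.0], [20.0]])

dot_product = A @ v   # matrix multiplication (dot product), shape (2, 1)
hadamard = v * w      # element-wise (Hadamard) product, shape (2, 1)
```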

The Forward Pass

Forward propagation in a neural network is nothing but a series of matrix calculations and activation functions applied at each layer.

Hidden Layer Values

For the hidden layer, we get the activation values using the following equations -

$$z^{(1)} = W^{(1)} \cdot x + b^{(1)}, \qquad h = \sigma\left(z^{(1)}\right)$$

or more elaborately -

$$\begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} = \sigma\left( \begin{bmatrix} W^{(1)}_{11} & W^{(1)}_{12} \\ W^{(1)}_{21} & W^{(1)}_{22} \\ W^{(1)}_{31} & W^{(1)}_{32} \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b^{(1)}_1 \\ b^{(1)}_2 \\ b^{(1)}_3 \end{bmatrix} \right)$$

Note that the activation function $\sigma$ is used both as a scalar function $\sigma(z_i)$ and as an element-wise vector function $\sigma(z)$, depending on the context, only to make it easier to see that the two usages perform essentially the same operation on a single number and on a vector of numbers.
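
A minimal NumPy sketch of this step, assuming the shapes from the model section and a sigmoid activation (the particular choice of activation function is an assumption made only for illustration):

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation (an assumed choice of sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed shapes for the 2-3-2 network: W1 is (3, 2), b1 is (3, 1), x is (2, 1).
W1 = np.random.randn(3, 2)
b1 = np.random.randn(3, 1)
x = np.array([[0.5], [-1.2]])

z1 = W1 @ x + b1   # weighted input of the hidden layer, shape (3, 1)
h = sigmoid(z1)    # hidden layer activation, shape (3, 1)
```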

Output Layer Values

The values of the output layer are calculated in a similar way, using the three activation values from the hidden layer units -

$$z^{(2)} = W^{(2)} \cdot h + b^{(2)}, \qquad y = \sigma\left(z^{(2)}\right)$$

or

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \sigma\left( \begin{bmatrix} W^{(2)}_{11} & W^{(2)}_{12} & W^{(2)}_{13} \\ W^{(2)}_{21} & W^{(2)}_{22} & W^{(2)}_{23} \end{bmatrix} \cdot \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} + \begin{bmatrix} b^{(2)}_1 \\ b^{(2)}_2 \end{bmatrix} \right)$$
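
Continuing under the same assumptions (sigmoid activation, hypothetical variable names), the full forward pass through both layers could be sketched as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass for the 2-3-2 network; the intermediate values are
    returned because they are reused by backpropagation."""
    z1 = W1 @ x + b1   # (3, 1)
    h = sigmoid(z1)    # (3, 1)
    z2 = W2 @ h + b2   # (2, 1)
    y = sigmoid(z2)    # (2, 1)
    return z1, h, z2, y

# Example usage with the assumed 2-3-2 shapes:
W1, b1 = np.zeros((3, 2)), np.zeros((3, 1))
W2, b2 = np.zeros((2, 3)), np.zeros((2, 1))
x = np.array([[0.5], [-1.2]])
z1, h, z2, y = forward(x, W1, b1, W2, b2)
```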

Backpropagation

By back-propagating across the layers, we calculate the gradient of the cost function with respect to each weight and bias variable, so that we can adjust the weights and biases to minimize the cost function.

In our case, the values that we need to find during backpropagation are -

$$\frac{\partial C}{\partial W^{(2)}}, \quad \frac{\partial C}{\partial b^{(2)}}, \quad \frac{\partial C}{\partial W^{(1)}}, \quad \frac{\partial C}{\partial b^{(1)}}$$

where $C$ is the Cost Function.

Weights of the Output Layer

Let us work on the weights of the output layer first.

Since $C$ is a scalar function, its partial derivative w.r.t. the matrix $W^{(2)}$ is given by the matrix of partial derivatives -

$$\frac{\partial C}{\partial W^{(2)}} = \begin{bmatrix} \frac{\partial C}{\partial W^{(2)}_{11}} & \frac{\partial C}{\partial W^{(2)}_{12}} & \frac{\partial C}{\partial W^{(2)}_{13}} \\ \frac{\partial C}{\partial W^{(2)}_{21}} & \frac{\partial C}{\partial W^{(2)}_{22}} & \frac{\partial C}{\partial W^{(2)}_{23}} \end{bmatrix}$$

Using the chain rule of differentiation,

$$\frac{\partial C}{\partial W^{(2)}_{ij}} = \frac{\partial C}{\partial z^{(2)}_i} \cdot \frac{\partial z^{(2)}_i}{\partial W^{(2)}_{ij}}$$

Since

$$z^{(2)}_i = \sum_{j} W^{(2)}_{ij} h_j + b^{(2)}_i$$

the partial derivative can be written after expansion as

$$\frac{\partial C}{\partial W^{(2)}_{ij}} = \frac{\partial C}{\partial z^{(2)}_i} \cdot h_j$$

Let us denote the partial derivative of the cost function with respect to the weighted input $z^{(2)}_i$ of the $i$th unit in the output layer by $\delta^{(2)}_i$, so that

$$\frac{\partial C}{\partial W^{(2)}_{ij}} = \delta^{(2)}_i \, h_j$$

which can be denoted in matrix form as -

$$\frac{\partial C}{\partial W^{(2)}} = \delta^{(2)} \cdot h^T$$

Here, $\delta^{(2)}$ is called the error vector of the output layer. It is basically the vector of partial derivatives of the cost function w.r.t. the weighted inputs of the output layer.
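
Assuming the error vector of the output layer has already been computed (how to compute it is derived further below), this gradient is just an outer product. A NumPy sketch with hypothetical values:

```python
import numpy as np

# Assumed shapes for the 2-3-2 network: delta2 is (2, 1), h is (3, 1).
delta2 = np.array([[0.1], [-0.3]])
h = np.array([[0.2], [0.7], [0.5]])

dC_dW2 = delta2 @ h.T   # outer product, shape (2, 3): dC/dW2[i, j] = delta2[i] * h[j]
```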

Biases of the Output Layer

We can similarly show that the partial derivative of the cost function w.r.t. the biases of the output layer is

$$\frac{\partial C}{\partial b^{(2)}} = \delta^{(2)}$$

This is because the biases have no multiplying factor in $z^{(2)}_i$ (that is, $\partial z^{(2)}_i / \partial b^{(2)}_i = 1$), so the derivation is almost the same as the one for the weights of the layer.

Weights of the Hidden Layer

The weights of the hidden layer are -

$$W^{(1)} = \begin{bmatrix} W^{(1)}_{11} & W^{(1)}_{12} \\ W^{(1)}_{21} & W^{(1)}_{22} \\ W^{(1)}_{31} & W^{(1)}_{32} \end{bmatrix}$$

Hence, the partial derivative of the cost function w.r.t. the weights of the hidden layer is

$$\frac{\partial C}{\partial W^{(1)}} = \begin{bmatrix} \frac{\partial C}{\partial W^{(1)}_{11}} & \frac{\partial C}{\partial W^{(1)}_{12}} \\ \frac{\partial C}{\partial W^{(1)}_{21}} & \frac{\partial C}{\partial W^{(1)}_{22}} \\ \frac{\partial C}{\partial W^{(1)}_{31}} & \frac{\partial C}{\partial W^{(1)}_{32}} \end{bmatrix}$$

Using the Chain Rule,

$$\frac{\partial C}{\partial W^{(1)}_{ij}} = \frac{\partial C}{\partial z^{(1)}_i} \cdot \frac{\partial z^{(1)}_i}{\partial W^{(1)}_{ij}}$$

Since the weights are multiplied only by the inputs in $z^{(1)}_i = \sum_j W^{(1)}_{ij} x_j + b^{(1)}_i$, as seen previously for the output layer,

$$\frac{\partial C}{\partial W^{(1)}_{ij}} = \frac{\partial C}{\partial z^{(1)}_i} \cdot x_j$$

Denoting the partial derivative of the cost function w.r.t. the weighted input $z^{(1)}_i$ of the $i$th unit in the hidden layer as $\delta^{(1)}_i$,

$$\frac{\partial C}{\partial W^{(1)}_{ij}} = \delta^{(1)}_i \, x_j$$

which can be represented in matrix form as -

$$\frac{\partial C}{\partial W^{(1)}} = \delta^{(1)} \cdot x^T$$
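
As with the output layer, this gradient is an outer product. A NumPy sketch with hypothetical values for `delta1` and `x`:

```python
import numpy as np

# Assumed shapes for the 2-3-2 network: delta1 is (3, 1), x is (2, 1).
delta1 = np.array([[0.05], [-0.02], [0.11]])
x = np.array([[0.5], [-1.2]])

dC_dW1 = delta1 @ x.T   # outer product, shape (3, 2): dC/dW1[i, j] = delta1[i] * x[j]
```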

Biases of the Hidden Layer

Similarly, for the biases of the hidden layer, the partial derivative is -

$$\frac{\partial C}{\partial b^{(1)}} = \delta^{(1)}$$

How do we calculate the error/delta vector for each layer?

One can easily notice that once we get hold of the delta/error vector for each layer, it is very easy to calculate the gradients of the weights and biases of that layer.
But is there a more systematic way to calculate these delta values? Do we really need to recalculate the partial derivatives of the cost function w.r.t. each layer or parameter from scratch?
This is where backpropagation steps in. Remember what $\delta^{(2)}$ actually is? By definition, $\delta^{(2)}_i = \partial C / \partial z^{(2)}_i$.

Using the Chain Rule,

$$\delta^{(2)}_i = \frac{\partial C}{\partial z^{(2)}_i} = \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial z^{(2)}_i}$$

Now since

$$y_i = \sigma\left(z^{(2)}_i\right) \quad \Rightarrow \quad \frac{\partial y_i}{\partial z^{(2)}_i} = \sigma'\left(z^{(2)}_i\right),$$

the delta vector for the output layer is

$$\delta^{(2)} = \nabla_y C \odot \sigma'\left(z^{(2)}\right)$$

where $\nabla_y C$ is the vector of partial derivatives $\partial C / \partial y_i$, which depends only on the choice of cost function.
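
For concreteness, here is a NumPy sketch of this formula under the assumption of a squared-error cost $C = \tfrac{1}{2}\lVert y - t \rVert^2$ (so $\nabla_y C = y - t$) and a sigmoid activation; both choices are assumptions made only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed values: z2 is the weighted input of the output layer, t the target.
z2 = np.array([[0.4], [-0.9]])
t = np.array([[1.0], [0.0]])

y = sigmoid(z2)
grad_y_C = y - t                        # nabla_y C for the squared-error cost
delta2 = grad_y_C * sigmoid_prime(z2)   # Hadamard product: error vector of the output layer
```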

Delta or Error Vectors for Subsequent Layers

The error vectors can be calculated by propagating backwards through the layers one by one, hence the name Backpropagation.
Let us try to calculate $\delta^{(1)}$ using $\delta^{(2)}$ -

$$\delta^{(1)}_i = \frac{\partial C}{\partial z^{(1)}_i} = \sum_{k} \frac{\partial C}{\partial z^{(2)}_k} \cdot \frac{\partial z^{(2)}_k}{\partial h_i} \cdot \frac{\partial h_i}{\partial z^{(1)}_i}$$

which, upon expansion, gives -

$$\delta^{(1)}_i = \sum_{k} \delta^{(2)}_k \cdot \frac{\partial z^{(2)}_k}{\partial h_i} \cdot \sigma'\left(z^{(1)}_i\right)$$

Since $z^{(2)} = W^{(2)} \cdot h + b^{(2)}$, we can rewrite the partial derivative of $z^{(2)}$ w.r.t. $h$ using the transpose of the weight matrix of the output layer, $\partial z^{(2)}_k / \partial h_i = W^{(2)}_{ki}$. Using this and the definition of $\delta^{(2)}$, we have

$$\delta^{(1)} = \left( \left(W^{(2)}\right)^T \cdot \delta^{(2)} \right) \odot \sigma'\left(z^{(1)}\right)$$
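
Putting all the pieces together, here is a minimal end-to-end sketch of one backpropagation and gradient-descent step for the 2-3-2 network, under the same assumed sigmoid activation and squared-error cost, with hypothetical variable names and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal((2, 1))

x = np.array([[0.5], [-1.2]])   # input
t = np.array([[1.0], [0.0]])    # target output

# Forward pass
z1 = W1 @ x + b1
h = sigmoid(z1)
z2 = W2 @ h + b2
y = sigmoid(z2)

# Backward pass: error vectors
delta2 = (y - t) * sigmoid_prime(z2)          # output layer error vector
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)  # hidden layer error vector

# Gradients of the cost w.r.t. all parameters
dC_dW2, dC_db2 = delta2 @ h.T, delta2
dC_dW1, dC_db1 = delta1 @ x.T, delta1

# One gradient-descent update with an assumed learning rate
lr = 0.1
W2 -= lr * dC_dW2
b2 -= lr * dC_db2
W1 -= lr * dC_dW1
b1 -= lr * dC_db1
```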