An extensive derivation of the backpropagation algorithm in a simple Neural Network
Let us try to understand the Backpropagation algorithm using an overly simplified Neural Network with just one hidden layer.
It has a bit of linear algebra and calculus involved, but anyone who has at least a basic knowledge of matrix multiplication and the chain rule of differentiation should be able to follow through most of what will be discussed.
We shall not delve too deep into proving the algorithm itself, but at the end, it will be very clear how the same principles can be used to intuitively understand even larger networks with many more hidden layers and units.
For the sake of simplicity, we will be using a Neural Network with two inputs, two outputs and one hidden layer with three nodes.
Let us represent $x$ as the input, $h$ as the activation of the hidden layer, $y$ as the output, and $b^{(1)}$ and $b^{(2)}$ as the biases in the hidden layer and the output layer respectively. And let us represent the weight matrices as $W^{(1)}$ (a $3 \times 2$ matrix, for the hidden layer) and $W^{(2)}$ (a $2 \times 3$ matrix, for the output layer).
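As a quick sanity check on the shapes implied by this notation, here is a minimal NumPy sketch of the variables for our 2-3-2 network; the random initialization is just an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

x  = np.array([0.5, -1.0])        # input x, shape (2,)
W1 = rng.standard_normal((3, 2))  # hidden-layer weights W^(1), shape (3, 2)
b1 = np.zeros(3)                  # hidden-layer biases b^(1), shape (3,)
W2 = rng.standard_normal((2, 3))  # output-layer weights W^(2), shape (2, 3)
b2 = np.zeros(2)                  # output-layer biases b^(2), shape (2,)
```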
Matrix multiplication (Dot Product) will be represented using the $\cdot$ symbol, whereas element-wise multiplication (Hadamard Product) will be represented using the $\odot$ symbol.
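In NumPy terms, the two products look like this (a tiny illustrative sketch):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([10.0, 20.0])
w = np.array([0.5, 0.25])

dot_product      = A @ v   # matrix multiplication (Dot Product): [50., 110.]
hadamard_product = v * w   # element-wise multiplication (Hadamard Product): [5., 5.]
```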
We shall use a general notation $\sigma$ for all the activation functions, and the symbol $\sigma'$ for its gradient.
Forward propagation in a neural network is nothing but a bunch of matrix calculations, and a few activation functions, applied at each layer.
For the hidden layer, we get the activation values using the following equations -

$$z^{(1)} = W^{(1)} \cdot x + b^{(1)}, \qquad h = \sigma(z^{(1)})$$

where $z^{(1)}$ is the weighted input of the hidden layer, or more elaborately -

$$\begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} = \sigma\left( \begin{bmatrix} w^{(1)}_{11} & w^{(1)}_{12} \\ w^{(1)}_{21} & w^{(1)}_{22} \\ w^{(1)}_{31} & w^{(1)}_{32} \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b^{(1)}_1 \\ b^{(1)}_2 \\ b^{(1)}_3 \end{bmatrix} \right)$$
Note that the activation function $\sigma$ is used both as an $\mathbb{R} \to \mathbb{R}$ function as well as an $\mathbb{R}^n \to \mathbb{R}^n$ function, depending on the context. This is only to make it easier to understand that the two versions perform a very similar operation, whether applied to a single number or element-wise to a vector of numbers.
The values of the output layer are calculated in a similar way using the three output values from the hidden layer units.

$$z^{(2)} = W^{(2)} \cdot h + b^{(2)}, \qquad y = \sigma(z^{(2)})$$

or

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \sigma\left( \begin{bmatrix} w^{(2)}_{11} & w^{(2)}_{12} & w^{(2)}_{13} \\ w^{(2)}_{21} & w^{(2)}_{22} & w^{(2)}_{23} \end{bmatrix} \cdot \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} + \begin{bmatrix} b^{(2)}_1 \\ b^{(2)}_2 \end{bmatrix} \right)$$
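To make the forward pass concrete, here is a minimal NumPy sketch of the 2-3-2 network described above; the sigmoid activation and the random initialization are illustrative assumptions:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))   # example activation (sigmoid)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)   # hidden layer: 3x2 weights
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)   # output layer: 2x3 weights

x  = np.array([0.5, -1.0])   # input vector

z1 = W1 @ x + b1             # weighted input of the hidden layer, z^(1)
h  = sigma(z1)               # hidden activation, h
z2 = W2 @ h + b2             # weighted input of the output layer, z^(2)
y  = sigma(z2)               # network output, y
```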
By back-propagating across the layers, we calculate the gradient of the cost function with respect to each weight and bias variable, so that we can correct the weights to minimize the cost function.
In our case, the values that we need to find during backpropagation are -

$$\frac{\partial C}{\partial W^{(1)}}, \quad \frac{\partial C}{\partial W^{(2)}}, \quad \frac{\partial C}{\partial b^{(1)}}, \quad \frac{\partial C}{\partial b^{(2)}}$$

where $C$ is the Cost Function.
Let us work on the weights of the output layer first.
Since $C$ is a scalar function, its partial derivative w.r.t. the matrix $W^{(2)}$ is given by -

$$\frac{\partial C}{\partial W^{(2)}} = \begin{bmatrix} \frac{\partial C}{\partial w^{(2)}_{11}} & \frac{\partial C}{\partial w^{(2)}_{12}} & \frac{\partial C}{\partial w^{(2)}_{13}} \\ \frac{\partial C}{\partial w^{(2)}_{21}} & \frac{\partial C}{\partial w^{(2)}_{22}} & \frac{\partial C}{\partial w^{(2)}_{23}} \end{bmatrix}$$
Using the chain rule of differentiation,

$$\frac{\partial C}{\partial w^{(2)}_{ij}} = \sum_k \frac{\partial C}{\partial z^{(2)}_k} \, \frac{\partial z^{(2)}_k}{\partial w^{(2)}_{ij}}$$

Since,

$$z^{(2)}_k = \sum_m w^{(2)}_{km} h_m + b^{(2)}_k \qquad \Rightarrow \qquad \frac{\partial z^{(2)}_k}{\partial w^{(2)}_{ij}} = \begin{cases} h_j & \text{if } k = i \\ 0 & \text{otherwise} \end{cases}$$

the partial differentiation can be written after expansion as,

$$\frac{\partial C}{\partial w^{(2)}_{ij}} = \frac{\partial C}{\partial z^{(2)}_i} \, h_j$$
Let us denote the partial differentiation of the cost function with respect to the weighted input $z^{(2)}_i$ of the $i$th unit in the output layer by $\delta^{(2)}_i$,

$$\delta^{(2)}_i = \frac{\partial C}{\partial z^{(2)}_i} \qquad \Rightarrow \qquad \frac{\partial C}{\partial w^{(2)}_{ij}} = \delta^{(2)}_i \, h_j$$

which can be denoted in matrix form as -

$$\frac{\partial C}{\partial W^{(2)}} = \delta^{(2)} \cdot h^T$$

Here, $\delta^{(2)}$ is called the error vector of the output layer. It is basically the partial differentiation of the cost function w.r.t. the weighted inputs of the output layer.
We could similarly prove that the partial differentiation of the cost function w.r.t. the biases of the output layer is,

$$\frac{\partial C}{\partial b^{(2)}} = \delta^{(2)}$$

This is because the bias has no multiplication factor ($\partial z^{(2)}_i / \partial b^{(2)}_i = 1$), and the derivation is almost the same as the one we did for the weights of the layer.
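In code, these two results reduce to an outer product and a plain copy. A small sketch, with purely illustrative values for $\delta^{(2)}$ and $h$:

```python
import numpy as np

delta2 = np.array([0.1, -0.3])      # error vector of the output layer (illustrative)
h      = np.array([0.2, 0.7, 0.5])  # hidden-layer activations (illustrative)

dW2 = np.outer(delta2, h)           # dC/dW^(2): (2, 3) matrix with entries delta2_i * h_j
db2 = delta2.copy()                 # dC/db^(2): equal to the error vector itself
```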
The weights of the hidden layer are

$$W^{(1)} = \begin{bmatrix} w^{(1)}_{11} & w^{(1)}_{12} \\ w^{(1)}_{21} & w^{(1)}_{22} \\ w^{(1)}_{31} & w^{(1)}_{32} \end{bmatrix}$$

Hence, the partial differentiation of the cost function w.r.t. the weights of the hidden layer is

$$\frac{\partial C}{\partial W^{(1)}} = \begin{bmatrix} \frac{\partial C}{\partial w^{(1)}_{11}} & \frac{\partial C}{\partial w^{(1)}_{12}} \\ \frac{\partial C}{\partial w^{(1)}_{21}} & \frac{\partial C}{\partial w^{(1)}_{22}} \\ \frac{\partial C}{\partial w^{(1)}_{31}} & \frac{\partial C}{\partial w^{(1)}_{32}} \end{bmatrix}$$
Using the Chain Rule,

$$\frac{\partial C}{\partial w^{(1)}_{ij}} = \sum_k \frac{\partial C}{\partial z^{(1)}_k} \, \frac{\partial z^{(1)}_k}{\partial w^{(1)}_{ij}}$$

Since the weights have only the inputs as their multiplication factors w.r.t. the weighted inputs $z^{(1)}$, as seen previously in the output layer,

$$\frac{\partial C}{\partial w^{(1)}_{ij}} = \frac{\partial C}{\partial z^{(1)}_i} \, x_j$$

Denoting the partial differentiation of the cost function w.r.t. the weighted input $z^{(1)}_i$ of the $i$th unit in the hidden layer as $\delta^{(1)}_i$,

$$\frac{\partial C}{\partial w^{(1)}_{ij}} = \delta^{(1)}_i \, x_j$$

which can be represented as -

$$\frac{\partial C}{\partial W^{(1)}} = \delta^{(1)} \cdot x^T$$
Similarly for the biases of the hidden layer, the partial differentiation is -

$$\frac{\partial C}{\partial b^{(1)}} = \delta^{(1)}$$
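The hidden-layer gradients follow exactly the same pattern in code, with the input $x$ taking the place of $h$ (values again purely illustrative):

```python
import numpy as np

delta1 = np.array([0.05, -0.02, 0.11])  # error vector of the hidden layer (illustrative)
x      = np.array([0.5, -1.0])          # network input (illustrative)

dW1 = np.outer(delta1, x)               # dC/dW^(1): (3, 2) matrix with entries delta1_i * x_j
db1 = delta1.copy()                     # dC/db^(1): equal to the error vector itself
```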
One can easily notice that once we get hold of the delta/error vector $\delta^{(i)}$ for each layer $i$, it is very easy to calculate the gradients of the weights and biases of that layer.
But is there a more systematic way to calculate these delta values? Do we really need to recalculate the partial differentiation of the cost function w.r.t. each layer or parameter?
This is where backpropagation steps in. Remember what $\delta^{(2)}$ actually is?

$$\delta^{(2)}_i = \frac{\partial C}{\partial z^{(2)}_i}$$

Using the Chain Rule,

$$\delta^{(2)}_i = \frac{\partial C}{\partial y_i} \, \frac{\partial y_i}{\partial z^{(2)}_i}$$

Now since,

$$y_i = \sigma(z^{(2)}_i) \qquad \Rightarrow \qquad \frac{\partial y_i}{\partial z^{(2)}_i} = \sigma'(z^{(2)}_i)$$

So, the delta vector for the output layer is,

$$\delta^{(2)} = \nabla_y C \odot \sigma'(z^{(2)})$$

where $\nabla_y C$ is the vector of partial derivatives $\partial C / \partial y_i$.
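As a concrete sketch of this formula, assume a mean squared error cost $C = \frac{1}{2}\lVert y - t \rVert^2$ (so that $\nabla_y C = y - t$) and a sigmoid activation; both choices are assumptions made purely for illustration:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

z2 = np.array([0.4, -0.2])   # weighted input of the output layer (illustrative)
y  = sigma(z2)               # network output
t  = np.array([1.0, 0.0])    # target values (illustrative)

grad_y_C = y - t                         # gradient of the MSE cost w.r.t. the output y
delta2   = grad_y_C * sigma_prime(z2)    # delta^(2) = grad_y(C) (Hadamard) sigma'(z^(2))
```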
The error vectors can be calculated by propagating backwards through the layers one by one, hence the name Backpropagation.
Let us try to calculate $\delta^{(1)}$ using $\delta^{(2)}$,

$$\delta^{(1)}_i = \frac{\partial C}{\partial z^{(1)}_i} = \sum_k \frac{\partial C}{\partial z^{(2)}_k} \, \frac{\partial z^{(2)}_k}{\partial z^{(1)}_i}$$

which, upon expansion, gives -

$$\delta^{(1)}_i = \sum_k \delta^{(2)}_k \, \frac{\partial z^{(2)}_k}{\partial h_i} \, \frac{\partial h_i}{\partial z^{(1)}_i} = \sum_k \delta^{(2)}_k \, w^{(2)}_{ki} \, \sigma'(z^{(1)}_i)$$

Since $z^{(2)} = W^{(2)} \cdot h + b^{(2)}$, we can rewrite the partial differentiation of $z^{(2)}$ w.r.t. $h$ as the transpose of the weight matrix for the output layer. Using this and the definition of $\delta^{(2)}$, we have

$$\delta^{(1)} = \left( (W^{(2)})^T \cdot \delta^{(2)} \right) \odot \sigma'(z^{(1)})$$
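Putting everything together, the whole derivation fits in a few lines of NumPy. The sketch below assumes a sigmoid activation and a mean squared error cost, purely for illustration, and finishes with a plain gradient-descent update of the weights and biases:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)

x  = np.array([0.5, -1.0])   # input
t  = np.array([1.0, 0.0])    # target (illustrative)
lr = 0.1                     # learning rate

# Forward pass.
z1 = W1 @ x + b1
h  = sigma(z1)
z2 = W2 @ h + b2
y  = sigma(z2)

# Backward pass: compute the error vectors layer by layer.
delta2 = (y - t) * sigma_prime(z2)          # delta^(2) = grad_y(C) * sigma'(z^(2)), MSE cost
delta1 = (W2.T @ delta2) * sigma_prime(z1)  # delta^(1) = (W^(2)^T . delta^(2)) * sigma'(z^(1))

# Gradients of the cost w.r.t. every parameter.
dW2, db2 = np.outer(delta2, h), delta2
dW1, db1 = np.outer(delta1, x), delta1

# Gradient-descent step to reduce the cost.
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
```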