Why the First Hidden Layer Is So Important in Building a Neural Network Model, and the Relation Between the Vanishing Gradient and the First Hidden Layer
Hi everyone, I am going to explain why the first hidden layer is very important when building a neural network model, and also how the choice of activation function can solve the vanishing gradient problem. I am going to explain this concept using Google's playground.tensorflow.org, which is an awesome tool for visualizing the internal workings of a neural network model. I recommend playing with it so that you get more intuition about neural networks.
In my previous blog, I covered why the activation function is used in a neural network and also the first hidden layer, so I am not going to explain those again here. I would recommend reading my "why activation function is used in neural network" post, which will give you more ideas about the activation function and the first hidden layer.
To show you why the first hidden layer is so important, I am going to restrict the first hidden layer to just two neurons; after the first hidden layer we can add as many hidden layers as we like, with no restriction on the number of neurons. One more thing: don't add more than two hidden layers if you are using the sigmoid function, since we may end up with vanishing gradients. If you want to experiment with more than two hidden layers, please use the ReLU activation function, which avoids the vanishing gradient problem. Anyway, I am going to use both the sigmoid and ReLU activation functions.
Case 1:
Activation: sigmoid
Hidden layers: 3
Number of neurons (first hidden layer): 2
Number of neurons (second hidden layer): 8
Number of neurons (third hidden layer): 8
Total hidden neurons: 18
Now I am going to run the model, and let's check whether it is able to separate the dataset.
As we can see, the model is unable to completely separate (classify) the dataset even though we are using three hidden layers. So we need at least three neurons in the first hidden layer. In mathematical terms, we need at least three linear equations to separate this non-linear dataset. In topological terms, we need one more dimension, lifting the data from a lower dimension to a higher one, so that the non-linear dataset becomes linearly separable in the higher dimension. In this case we have a 2D dataset, so we need one more dimension to turn the 2D non-linearly-separable dataset into a 3D linearly separable one.
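If you want to reproduce this experiment outside the playground, here is a minimal sketch in Keras. The dataset, optimizer, and epoch count are my own assumptions (scikit-learn's make_circles stands in for the playground's circular dataset); only the layer structure mirrors Case 1:

```python
# A minimal sketch of Case 1, assuming make_circles as a
# stand-in for the playground's circular dataset.
from sklearn.datasets import make_circles
from tensorflow import keras

X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

# Case 1: first hidden layer restricted to 2 neurons, then 8 and 8.
model = keras.Sequential([
    keras.layers.Input(shape=(2,)),
    keras.layers.Dense(2, activation="sigmoid"),  # the bottleneck
    keras.layers.Dense(8, activation="sigmoid"),
    keras.layers.Dense(8, activation="sigmoid"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=200, verbose=0)
print("Case 1 accuracy:", model.evaluate(X, y, verbose=0)[1])
```

No matter how long this trains, the two-neuron first layer can only hand a 2-dimensional representation to the later layers, which is not enough to wrap a closed boundary around the inner circle.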
Case 2:
Now let's try the same problem with the parameters below:
Activation: sigmoid
Hidden layers: 1
Number of neurons (first hidden layer): 3
Total hidden neurons: 3
Now we run the model, and let's check.
Voila!!! It completely separated (classified) the dataset with just 3 hidden neurons.
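Continuing from the Case 1 snippet, here is the Case 2 architecture; only the hidden layers change (the training settings are again my assumptions):

```python
# Case 2: a single hidden layer with 3 sigmoid neurons is enough.
model2 = keras.Sequential([
    keras.layers.Input(shape=(2,)),
    keras.layers.Dense(3, activation="sigmoid"),  # 3 neurons = 3 separating lines
    keras.layers.Dense(1, activation="sigmoid"),
])
model2.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model2.fit(X, y, epochs=200, verbose=0)
print("Case 2 accuracy:", model2.evaluate(X, y, verbose=0)[1])
```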
If you want to know why, please read my blog "why activation function is used in neural network".
So when you are modelling a neural network, you should give the most importance to the first hidden layer, because all the other hidden layers depend on it.
ReLU Activation
Relation Between the Vanishing Gradient and the First Hidden Layer:
Let's say we have one input layer, four hidden layers, and one output layer. When we do forward propagation, the neural network produces an output; then, using the loss function, we get the error. Using the chain rule on that error we do backpropagation, updating the weights so that the model's error decreases.
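To make the chain rule concrete, here is roughly what the gradient of the loss $L$ with respect to a first-layer weight looks like for a chain of single-neuron layers (the notation is mine: $z^{(k)}$ are pre-activations, $a^{(k)} = \sigma(z^{(k)})$, and $\sigma$ is the sigmoid):

$$
\frac{\partial L}{\partial w^{(1)}}
= \frac{\partial L}{\partial a^{(4)}}
\cdot \sigma'(z^{(4)})\,w^{(4)}
\cdot \sigma'(z^{(3)})\,w^{(3)}
\cdot \sigma'(z^{(2)})\,w^{(2)}
\cdot \sigma'(z^{(1)})\,x
$$

Since $\sigma'(z) \le 0.25$ everywhere, those four derivative factors alone can shrink the gradient to at most $0.25^4 \approx 0.004$ of the output error before it ever reaches the first hidden layer.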
There are many good resources covering the chain rule and the vanishing gradient, so I am not going to explain them in detail here.
Due to the vanishing gradient, the gradient signal shrinks layer by layer as it travels backwards, so the weights of the earliest neurons barely change. For example, suppose we have four hidden layers and the gradient magnitude at the 4th hidden layer is 0.9; at the 3rd hidden layer it might be 0.5, at the 2nd hidden layer 0.1, and at the 1st hidden layer 0.001. Because the first hidden layer is barely learning, the error rate does not decrease.
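You can see this shrinking effect numerically. The sketch below is a toy chain of four single-neuron sigmoid layers with made-up weights; it multiplies the chain-rule factors one layer at a time and prints how much gradient signal each layer receives:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A toy chain: 4 hidden "layers" of one neuron each, made-up weights.
weights = [0.8, 0.9, 0.7, 0.85]
x = 1.0

# Forward pass: remember each pre-activation z.
a, zs = x, []
for w in weights:
    z = w * a
    zs.append(z)
    a = sigmoid(z)

# Backward pass: the signal shrinks as it travels toward the
# first hidden layer, because every step multiplies by
# sigmoid'(z) <= 0.25 (and by a weight).
grad = 1.0  # pretend dL/da at the output is 1
for layer in range(len(weights), 0, -1):
    grad *= sigmoid_prime(zs[layer - 1])  # local derivative, <= 0.25
    print(f"gradient signal reaching layer {layer}: {grad:.6f}")
    grad *= weights[layer - 1]            # carry the signal back one layer
```

With ReLU the local derivative is 1 whenever the neuron is active, so those factors stop shrinking the signal; that is why ReLU largely sidesteps the vanishing gradient problem mentioned earlier.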
To get more intuition, here is an example. Let's say there are two classes, A and B.
In each of class A and class B there is one teacher and 3 students. Mapping this to a neural network, we have 4 hidden layers: the first hidden layer is the teacher, and the 2nd, 3rd, and 4th hidden layers are the students.
Let's assume that in class A the teacher has 100% knowledge of the subject being taught.
When the teacher teaches, the students get 100% of the knowledge from the teacher; the students write the exam, use the exam score to correct what they have learned, and so on, until finally the students know the subject 100%.
Now let's assume that in class B the teacher has only 70% correct knowledge of the subject, due to less experience with it.
Under this assumption, the students get only 70% of the knowledge from the teacher; they write the exam and use the exam score to correct what they have learned. But even if the students write the exam 100 times, they can only learn the subject up to 70%. So it is the teacher who has to learn more, so that the students can get 100% knowledge of the subject. Likewise, if the first hidden layer does not learn but the other hidden layers do, the neural network will not reach a high accuracy.