Why One Hot Encoding Is Important in Classification Models
A quick explanation of the one-hot encoder. While working on a classification problem, I wondered why one-hot encoding is used for categorical targets. Before going through this post, I recommend reading my logistic regression post.
To use numerical values as labels we can use the sparse_categorical_crossentropy loss, but I could not find a detailed explanation of sparse_categorical_crossentropy. If I find a research paper or the formula, I will try to explain it in a separate post.
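As far as I understand it, the sparse variant is usually described as the ordinary categorical cross-entropy in which the label is an integer class index instead of a one-hot vector. The sketch below is written under that assumption and uses a made-up probability vector; the function names are only for illustration and are not a library API.

```python
import math

def categorical_cross_entropy(probs, one_hot):
    # Sum of -y_i * log(p_i); only the true class (y_i = 1) contributes.
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y > 0)

def sparse_categorical_cross_entropy(probs, class_index):
    # Same quantity, but the label is an integer index instead of a vector.
    return -math.log(probs[class_index])

# Hypothetical model output for (Apple, Chicken, Broccoli).
probs = [0.7, 0.2, 0.1]

loss_one_hot = categorical_cross_entropy(probs, [1, 0, 0])
loss_sparse = sparse_categorical_cross_entropy(probs, 0)
print(loss_one_hot, loss_sparse)  # the two values agree
```

Under this reading, the integer label is only used to pick out one probability, so its numeric size never enters the loss.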
Suppose we have a dataset like the one shown below:
The dataset above has three classes: Apple, Chicken, and Broccoli. Suppose I feed an Apple sample to a logistic regression or neural network model and it predicts Chicken. We then use gradient descent to minimize the loss function so that the model learns to predict correctly. If we use numerical labels in a classification model, the problem arises in the loss function: the model learns that Apple (1) < Chicken (2) < Broccoli (3), that is, Apple is less than Chicken and Chicken is less than Broccoli. In other words, the model ends up treating the labels as ordinal instead of nominal. It no longer works as a nominal classifier but as an ordinal one, where the loss measures how far Apple is from Chicken and from Broccoli. As an example, I am going to use the mean squared error (MSE) loss function with numerical labels.
L = (1/m) Σ (predict - target)²
Now I feed an apple sample with 95 calories to the logistic regression or neural network model. It should predict apple (1), but it predicts chicken (2), so we use the value of the loss function to train the network to predict correctly. Let's see how the MSE loss function behaves.
predict = 2(chicken)
target = 1(apple)
(2–1)²=1
Now we feed the same apple input, but the model predicts broccoli (3) while the target is apple (1).
predict = 3
target = 1
(3–1)² = 4
As you can see, the loss varies with the label numbers: training treats apple as far away from broccoli but close to chicken. Because of that, convergence to the optimum (the global minimum) is slow, and training may end up in a local minimum. The classification problem becomes non-convex because we can't use the cross-entropy loss function, which is convex for a linear model such as logistic regression.
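The two calculations above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the `labels` dictionary and the function name are my own choices for illustration.

```python
def squared_error(predicted, target):
    # Squared error for a single sample (m = 1).
    return (predicted - target) ** 2

# Integer labels as used in the example above.
labels = {"apple": 1, "chicken": 2, "broccoli": 3}

loss_chicken = squared_error(labels["chicken"], labels["apple"])
loss_broccoli = squared_error(labels["broccoli"], labels["apple"])

print(loss_chicken)   # 1
print(loss_broccoli)  # 4
```

Both predictions are equally wrong as classifications, yet the loss is four times larger for broccoli purely because of the numbers chosen for the labels.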
One Hot Encoder
If the model predicts chicken but the target class is apple:
predict = (0,1,0)
target = (1,0,0)
L = ‖(0,1,0) - (1,0,0)‖²
Applying the squared error to each class and summing:
L = (0-1)² + (1-0)² + (0-0)² = 2
Now suppose the model predicts broccoli but the target class is apple:
predict = (0,0,1)
target = (1,0,0)
L = ‖(0,0,1) - (1,0,0)‖²
L = (0-1)² + (0-0)² + (1-0)² = 2
As we can see, with one-hot encoded labels the loss function gives the same error value for both wrong predictions, so the model treats the labels as nominal (variables with no inherent order or ranking). Numerical labels, on the other hand, make the model treat the labels as ordinal (variables with an ordered series), which is unnecessary in a classification problem.
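Here is a small sketch of the same comparison in Python. The helper names `one_hot` and `sum_squared_error` are hypothetical, chosen only for this example.

```python
def one_hot(index, num_classes):
    # Vector with a 1 at the class position and 0 everywhere else.
    return [1 if i == index else 0 for i in range(num_classes)]

def sum_squared_error(predicted, target):
    # Squared error summed over the classes.
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

classes = ["apple", "chicken", "broccoli"]
target = one_hot(classes.index("apple"), len(classes))

# Loss for each possible hard prediction when the true class is apple.
losses = {
    name: sum_squared_error(one_hot(i, len(classes)), target)
    for i, name in enumerate(classes)
}
print(losses)  # {'apple': 0, 'chicken': 2, 'broccoli': 2}
```

Every wrong class is penalized by the same amount, so no class is treated as closer to apple than another.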
Disadvantages of numerical labels:
- With a single numerical output, we can't use the sigmoid or softmax activation function, which is very important in classification problems because it squashes each output into the range 0 to 1 (softmax also normalizes the outputs so that they sum to 1).
- Without a sigmoid or softmax activation function, we can't use the cross-entropy loss function.
- Without the cross-entropy loss function, we have to fall back on the mean squared error loss function, which is mostly non-convex in this setting.
Here is a simple example of why the cross-entropy loss function is important in classification problems.
Let's assume the labels are numerical. Then we have to use the mean squared error loss function, and we can use the ReLU activation function.
ReLU is a simple non-linear activation function that sets negative values to zero and leaves positive values unchanged.
Let's assume we have a neural network whose weights have been initialized, and we do forward propagation for the apple input value 95 and get an output value of 200. Our target value is 1, but the model predicted 200. Because ReLU only constrains negative values to 0, the output can be any positive number, even 1000000, depending on the initial weights of the model.
target = 1
predicted = 200
Let's compute the MSE. We are using a single sample, so m = 1.
L = (1/m) Σ (predicted - target)²
L = (1/1) (200 - 1)²
L = 199² = 39601
The loss is 39601.
Now we apply backpropagation to the neural network with gradient descent.
learning rate = 0.01, target = 1, predicted = 200, input (x) = 95, and a single output neuron, so there is only one weight.
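The gradient of the squared error with respect to the weight is 2 * (predicted - target) * x, because the ReLU has slope 1 for a positive output.
weight = weight - (0.01) * 2 * (200 - 1) * (95)
weight = weight - (0.01) * (37810)
weight = weight - (378.1)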
As we can see, the error is so large that a single update throws the weight far off, so the chance of converging to the global minimum is low.
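The single-neuron example can be sketched in Python as follows. The initial weight is a hypothetical value picked so that the output is about 200, and the update uses the chain-rule gradient 2 * (predicted - target) * x for the squared error, so the step it prints comes from that formula.

```python
import math

def relu(z):
    # Negative values become 0, positive values pass through unchanged.
    return max(0.0, z)

x, target, learning_rate = 95.0, 1.0, 0.01
weight = 200.0 / 95.0  # hypothetical initial weight, gives an output of about 200

# Forward pass through one neuron with a ReLU activation.
z = weight * x
predicted = relu(z)
loss = (predicted - target) ** 2  # about 39601

# Backward pass: dL/dw = 2 * (predicted - target) * relu'(z) * x.
relu_grad = 1.0 if z > 0 else 0.0
grad_w = 2.0 * (predicted - target) * relu_grad * x  # about 37810

new_weight = weight - learning_rate * grad_w
print(loss, grad_w, new_weight)
```

Starting from a weight of roughly 2.1, one step moves it to roughly -376, which is what makes training on an unbounded output so unstable.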
But if we use one-hot encoding, we can use the softmax or sigmoid activation function together with the cross-entropy loss function, which makes the problem convex.
Now we feed the input to the model, and initially the model predicts (0,0,1), which is broccoli, but the target is apple (1,0,0). When we use the softmax function, the output acts as a probability distribution. If the model predicts (0.8, 0.1, 0.1), it means there is an 80% chance that the output is an apple and a 10% chance each that it is a chicken or a broccoli. So we take the class with the highest probability as the classified label.
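To make the softmax step concrete, here is a minimal sketch. The raw scores (logits) are hypothetical values that I picked so that the result comes out near (0.8, 0.1, 0.1).

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["apple", "chicken", "broccoli"]
logits = [math.log(8.0), 0.0, 0.0]  # hypothetical raw scores from the last layer

probs = softmax(logits)  # approximately [0.8, 0.1, 0.1]
predicted_class = classes[probs.index(max(probs))]
print(probs, predicted_class)
```

The outputs are all between 0 and 1 and sum to 1, and the class with the largest probability is taken as the prediction.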
predict = (0,0,1)
target = (1,0,0)
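For the multi-class log loss (mlogloss) function:
N = 1. Since log(0) is undefined, we take predict = (0.0001, 0, 0.9999) and target = (1,0,0). Terms whose target is 0 contribute nothing to the sum.
L = -((1) log(0.0001) + (0) log(0) + (0) log(0.9999))
L = -(-4 + 0 + 0)
L = 4 (using log base 10; with the natural log the value is about 9.21)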
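The log loss for this single sample can be checked with the sketch below. The `log` parameter is my own addition so that both the base-10 and the natural logarithm can be compared; terms with a target of 0 are skipped, which also avoids evaluating log(0).

```python
import math

def cross_entropy(probs, one_hot, log=math.log):
    # -sum(y_i * log(p_i)); classes with y_i = 0 are skipped entirely.
    return -sum(y * log(p) for p, y in zip(probs, one_hot) if y != 0)

predicted = [0.0001, 0.0, 0.9999]
target = [1, 0, 0]

loss_base10 = cross_entropy(predicted, target, log=math.log10)  # about 4
loss_natural = cross_entropy(predicted, target)                 # about 9.21
print(loss_base10, loss_natural)
```

Only the probability assigned to the true class matters, so the loss is just the negative log of 0.0001.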
When we compare the loss value with numerical labels and the loss value with one-hot encoded labels, the one-hot loss is much smaller.
4 < 39601
If we apply gradient descent:
weight_i = weight_i - alpha * (p_i - y_i)
In the last layer we have three output neurons, so we have three weights, whereas the numerical model has only one output neuron. (The input factor is left out of the update here to keep the example simple.)
weight1 = weight1 - 0.01 * (0.0001 - 1)
weight2 = weight2 - 0.01 * (0 - 0)
weight3 = weight3 - 0.01 * (0.9999 - 0)
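The (p_i - y_i) term is the gradient of the cross-entropy loss with respect to each softmax input. The sketch below checks this numerically with finite differences; the logits are hypothetical values chosen to favour broccoli, and the function names are mine.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Natural-log cross-entropy of softmax(logits) against a one-hot target.
    probs = softmax(logits)
    return -sum(y * math.log(p) for p, y in zip(probs, target) if y != 0)

logits = [0.0, 1.0, 3.0]  # hypothetical scores that favour broccoli
target = [1, 0, 0]        # the true class is apple

# Analytic gradient with respect to each logit: p_i - y_i.
analytic = [p - y for p, y in zip(softmax(logits), target)]

# Numerical gradient by nudging one logit at a time.
eps = 1e-6
base = cross_entropy(logits, target)
numeric = []
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric.append((cross_entropy(bumped, target) - base) / eps)

print(analytic)
print(numeric)
```

Each component stays between -1 and 1, so the update is bounded no matter how wrong the prediction is, unlike the unbounded error in the numerical-label example.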
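For the MSE loss function:
predict = (0,0,1)
target = (1,0,0)
(0-1)² + (0-0)² + (1-0)² = 2
The loss is 2.
If we use the softmax function for this three-class problem with the MSE loss function, the maximum error we can get is 2. (With independent sigmoid outputs the worst case is 3, when both wrong classes are predicted as 1 and the true class as 0.)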
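For softmax outputs, which are non-negative and sum to 1, the bound of 2 can be checked with a simple grid search over possible predictions. This is a rough sketch, not a proof; the grid resolution `steps` is an arbitrary choice.

```python
def sum_squared_error(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

target = [1, 0, 0]  # the true class is apple
steps = 20          # grid resolution, arbitrary

worst_loss, worst_pred = 0.0, None
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        k = steps - i - j
        # Candidate prediction whose three entries sum to 1.
        pred = [i / steps, j / steps, k / steps]
        loss = sum_squared_error(pred, target)
        if loss > worst_loss:
            worst_loss, worst_pred = loss, pred

print(worst_loss, worst_pred)  # 2.0 [0.0, 0.0, 1.0]
```

The worst case is reached when all of the probability is placed on a single wrong class.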
Numerical vs One Hot Encoder(MSE Loss function)
If you liked this blog post, please clap. If you want to read about any interesting topic in deep learning, please comment and I will write about it.
Other posts on my blog:
https://medium.com/@vigneshgig/why-activation-function-is-used-in-neural-network-fb024b4e4ab3
https://medium.com/@vigneshgig/how-to-scrape-the-dynamic-website-using-sitemap-731f5e4651a9
https://medium.com/@vigneshgig/web-scraping-without-being-blocked-at-faster-rate-4439545faef7