Why the One-Hot Encoder Is Important in Classification Models

vignesh amudha
Nov 24, 2018


A quick explanation of the one-hot encoder. While working on classification problems, I wondered why a one-hot encoder is used for categorical targets. Before going through this blog, I recommend reading my logistic regression blog.

To use numerical values as labels we can use the sparse_categorical_crossentropy loss (LINK), but I didn't find much detail about how sparse_categorical_crossentropy works. If I find a research paper or the formula, I will try to explain it in a separate blog.
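As a quick illustration, here is a minimal tf.keras sketch of the two label formats (the model shape and calorie values are made up for illustration): integer labels with sparse_categorical_crossentropy, which treats each integer as a class index rather than a magnitude, and one-hot labels with categorical_crossentropy.

```python
import numpy as np
import tensorflow as tf

# Toy setup: 1 feature (calories), 3 classes (apple=0, chicken=1, broccoli=2).
# The calorie values for chicken and broccoli are invented for this sketch.
x = np.array([[95.0], [250.0], [50.0]])
y_int = np.array([0, 1, 2])  # integer class indices

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Option 1: integer labels with the sparse loss -- the integers are used
# as indices into the softmax output, not as ordered magnitudes.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y_int, epochs=1, verbose=0)

# Option 2: one-hot labels with the regular categorical cross-entropy.
y_onehot = tf.keras.utils.to_categorical(y_int, num_classes=3)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x, y_onehot, epochs=1, verbose=0)
```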

Suppose we have a dataset with a calories feature and three classes, each assigned a numerical label: Apple (1), Chicken (2), and Broccoli (3).

The dataset above has three classes (Apple, Chicken, Broccoli). If I give an Apple input to a logistic regression or neural network model and it predicts Chicken, we use gradient descent to minimize the loss function so the model learns to predict correctly. If we use numerical labels in a classification model, a problem arises in the loss function: the model learns Apple(1) < Chicken(2) < Broccoli(3) (Apple is less than Chicken, and Chicken less than Broccoli), i.e. it treats the labels as ordinal instead of nominal. Instead of working as a nominal classifier, it works as an ordinal classifier, where the distance measures how close Apple is to Chicken versus Broccoli. For example, let's use the mean squared error loss function with numerical labels.

L = (1/m) (predict − target)²

Now I feed an Apple input (calories = 95) to the logistic regression or neural network model. It should predict Apple (1), but it predicts Chicken (2), so we use the loss value to train the network to predict correctly. Let's see how the MSE loss works.

predict = 2(chicken)

target = 1(apple)

(2–1)²=1

Now we feed an Apple input again; the model predicts Broccoli (3), but the target is Apple (1):

predict = 3

target = 1

(3–1)² = 4

As you can see, the loss value varies: the model is trained as if Apple is far from Broccoli but close to Chicken. Because of this, convergence to the optimum (global minimum) is slow, and training may even end up in a local minimum; the classification problem becomes non-convex, since we cannot use the cross-entropy loss function, which is convex.
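Here is the same arithmetic as a small Python sketch, showing how two equally wrong predictions receive different penalties under integer labels:

```python
# Integer labels: Apple = 1, Chicken = 2, Broccoli = 3.
target = 1  # true class: Apple

for predicted, name in [(2, "Chicken"), (3, "Broccoli")]:
    loss = (predicted - target) ** 2  # MSE with m = 1
    print(f"predicted {name}: loss = {loss}")

# predicted Chicken:  loss = 1
# predicted Broccoli: loss = 4
# Two equally wrong predictions get different penalties -> implicit ordering.
```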

One-Hot Encoder

If the model predicts Chicken but the target class is Apple:

predict = (0,1,0)

target = (1,0,0)

L = ((0,1,0)-(1,0,0))

Using the multi-class (element-wise squared error) loss:

L = (0–1)²+(1–0)²+(0–0)² = 2

Now, if the model predicts Broccoli but the target class is Apple:

predict = (0,0,1)

target = (1,0,0)

L = ((0,0,1)-(1,0,0))

L = (0–1)²+(0–0)²+(1–0)² = 2

As we can see, with one-hot encoded labels the loss function gives the same error value for every wrong class, so it treats the labels as nominal (variables without any inherent order or ranking). With numerical labels, on the other hand, the model treats the labels as ordinal (variables with an ordered series), which is unnecessary in a classification problem.
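The same comparison in code, now with one-hot labels (a minimal NumPy sketch):

```python
import numpy as np

target = np.array([1, 0, 0])  # Apple, one-hot encoded

for predict, name in [(np.array([0, 1, 0]), "Chicken"),
                      (np.array([0, 0, 1]), "Broccoli")]:
    loss = np.sum((predict - target) ** 2)
    print(f"predicted {name}: loss = {loss}")

# predicted Chicken:  loss = 2
# predicted Broccoli: loss = 2
# Every wrong class costs the same, so the labels stay nominal.
```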

Disadvantages of numerical labels:

  • We cannot use the sigmoid or softmax activation function, which is very important in classification problems because it squashes (normalizes) the outputs to the range 0 to 1.
  • Without sigmoid or softmax, we cannot use the cross-entropy loss function.
  • So we are forced to use the mean squared error loss function, which (for a neural network classifier) is mostly a non-convex loss.

Here is a simple example of why the cross-entropy loss function is important in classification problems.

Assume we treat it as a numerical problem; then we have to use the mean squared error loss function, and we can use the ReLU activation function.

ReLU is a simple non-linear activation function that sets negative values to zero and leaves positive values unchanged.
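A one-line NumPy sketch of ReLU:

```python
import numpy as np

def relu(x):
    # Negative values become 0; positive values pass through unchanged.
    return np.maximum(0, x)

print(relu(-5))   # 0
print(relu(200))  # 200 -- unbounded above, unlike sigmoid/softmax
```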

Assume we have a neural network with initialized weights. We do a forward pass for the Apple input value 95 and get an output of 200. Our target value is 1, but the model predicts 200: ReLU only constrains negative values to 0, so the output can be any positive number, even 1,000,000, depending on the initial weights.

target = 1

predicted = 200

Let's compute the MSE; we are using a single data point, so m = 1:

L = (1/m) (predicted − target)²

L = (1/1) (200 − 1)²

(199)² = 39601

The error is 39601.

Now we apply backpropagation with gradient descent (learning rate = 0.01).

learning rate = 0.01, target = 1, predicted = 200, input (x) = 95, and a single output neuron, so there is only one weight.

With the ReLU active (its derivative is 1 for positive input), the gradient of the loss with respect to the weight is 2 × (predicted − target) × x:

weight = weight − (0.01) × 2 × (199) × (95)

weight = weight − (0.01) × (37810)

weight = weight − 378.1

As we can see, the error, and hence the gradient step, is huge; with such large updates, the chance of converging to the global minimum is low.
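The same update written out as a short Python sketch (assuming predicted = relu(w × x) with the ReLU active):

```python
# Single-weight update for the numerical-label model.
learning_rate = 0.01
x, target = 95, 1
predicted = 200  # output of the assumed initial forward pass

loss = (predicted - target) ** 2      # 199**2 = 39601
grad = 2 * (predicted - target) * x   # dL/dw = 2*(pred - target)*x = 37810
step = learning_rate * grad           # 378.1
print(loss, grad, step)
```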

But if we use a one-hot encoder, we can use the softmax or sigmoid activation function together with the cross-entropy loss function, which makes the problem convex.

Now we feed the input to the model. Initially the model predicts (0, 0, 1), which is Broccoli, but the target is Apple (1, 0, 0). With the softmax function, the output behaves like a probability distribution: if the model predicts (0.8, 0.1, 0.1), there is an 80% chance the output is Apple and a 10% chance each that it is Chicken or Broccoli. We take the highest probability as the classified label.
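For reference, a minimal NumPy implementation of softmax (the logits here are made-up numbers for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.5])  # assumed raw outputs, for illustration
print(softmax(logits).round(2))     # [0.69 0.15 0.15] -- a probability vector
```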

predict = (0, 0, 1)

target = (1, 0, 0)

Using the multi-class log loss (cross-entropy) function:

N = 1. Since log(0) is undefined, we clip the prediction away from exact 0: predict = (0.0001, 0.0001, 0.9998), target = (1, 0, 0).

L = −((1) log(0.0001) + (0) log(0.0001) + (0) log(0.9998))

L = −(−4 + 0 + 0)

L ≈ 4 (using base-10 logarithms)
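The same log-loss computation in NumPy (using base-10 logs to match the arithmetic above; standard implementations use natural logs, which would give about 9.21 instead):

```python
import numpy as np

target = np.array([1, 0, 0])
predict = np.array([0.0001, 0.0001, 0.9998])  # clipped so there is no log(0)

# Cross entropy with base-10 logs, matching the arithmetic above.
loss = -np.sum(target * np.log10(predict))
print(loss)  # ~4.0
```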

When we compare the numerical-label loss value with the one-hot-encoded loss value, the one-hot loss is much smaller:

4 < 39601

If we apply gradient descent, the gradient of the cross-entropy loss (with softmax) for each output weight is

weight_i = weight_i − α (p_i − y_i) x

In the last layer we have three output neurons, so we have three weights (the numerical model had only one output neuron). With α = 0.01 and x = 95:

weight0 = weight0 − 0.01 × (0.0001 − 1) × 95 ≈ weight0 + 0.95

weight1 = weight1 − 0.01 × (0.0001 − 0) × 95 ≈ weight1 − 0.0001

weight2 = weight2 − 0.01 × (0.9998 − 0) × 95 ≈ weight2 − 0.95
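And the same three updates as a NumPy sketch:

```python
import numpy as np

# Softmax + cross-entropy: the per-output-weight gradient is (p_i - y_i) * x.
alpha, x = 0.01, 95
p = np.array([0.0001, 0.0001, 0.9998])  # clipped softmax output
y = np.array([1, 0, 0])                 # one-hot target (Apple)

steps = alpha * (p - y) * x
print(steps)  # ~[-0.95, 0.0001, 0.95] -- tiny compared with 378.1 above
```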

For the MSE loss function:

predict = (0,0,1)

target = (1,0,0)

(0–1)²+(0–0)²+(1–0)² = 2

The error is 2.

If we use a softmax (or sigmoid) output for this three-class problem with the MSE loss, the outputs are squashed into the range 0 to 1, so the maximum error we can get is 2, in contrast to the unbounded ReLU output above.

Numerical vs. One-Hot Encoder (MSE loss function)

If you like this blog, please clap. If you want to read about any interesting topics in deep learning, please comment and I will write about them.

Other posts from my blog:

https://medium.com/@vigneshgig/why-activation-function-is-used-in-neural-network-fb024b4e4ab3

https://medium.com/@vigneshgig/machine-learning-classification-using-logistic-regression-mathematical-concept-220c0103f5cc

https://medium.com/@vigneshgig/how-to-scrape-the-dynamic-website-using-sitemap-731f5e4651a9

https://medium.com/@vigneshgig/hi-guys-there-is-a-one-problem-in-scrapycontrol-while-scrapycontrol-script-pass-the-url-the-spider-b6ef94ec99c

https://medium.com/@vigneshgig/hi-iam-vignesh-and-this-is-my-first-blog-on-neural-network-on-topic-forward-propagation-and-back-982a43b26cd8

https://medium.com/@vigneshgig/web-scraping-without-being-blocked-at-faster-rate-4439545faef7
