Machine Learning - Classification using Logistic Regression (Mathematical Concepts)

vignesh amudha
9 min read · Jul 17, 2018


Hi, I am Vignesh. Today I am going to explain the mathematical concepts behind the Classification problem using Logistic Regression: why it is called Logistic Regression, the difference between Linear Regression and Logistic Regression, why we use the Logarithmic cost function instead of the popular Mean Square Error (MSE), a derivation of the Logarithmic Cost Function, and the Decision Boundary.

After Data collection, Data cleansing etc., there are three common steps involved in the machine learning process.

  • Model Representation
  • Hypothesis Representation
  • Cost Function

Model Representation

In Model Representation we describe our model by analysing the Data Set and deciding what we are going to do with it, such as supervised learning or unsupervised learning.

Data Set

From this Data Set we can say that it is a supervised learning, binary classification problem, because we have only two labels: 0's and 1's.

Hypothesis Representation

Next, we are going to choose our hypothesis. The hypotheses of Linear Regression and Logistic Regression are similar, but in Linear Regression y is a continuous value. The hypothesis of Linear Regression is hθ(x) = θ₀ + θ₁X. In a Binary Classification problem, y will be either 0 or 1, so it does not make sense for hθ(x) to take values larger than 1 or smaller than 0 when y ∈ {0,1}. To make Linear Regression work as a Classification model, we constrain the hypothesis to 0 ≤ hθ(x) ≤ 1. To achieve this, hθ(x) = θ₀ + θ₁X is passed into the logistic function. Because it uses the logistic function in a classification problem, it is called Logistic Regression.

Graph of the Sigmoid Function
Sigmoid Function: g(z) = 1 / (1 + e⁻ᶻ)

So whatever the z value is, g(z) is constrained between 0 and 1.
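As a minimal sketch (the names and sample values here are my own, just for illustration), the sigmoid can be written in a few lines of Python:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Whatever z we feed in, the output stays between 0 and 1
print(sigmoid(-10))  # ~0.0000454
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ~0.9999546
```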

Why is Linear Regression not a good model for a Classification problem?

Plot of Tumor Size

We are going to predict whether the person has Cancer or not depending upon the Tumor Size.

Linear Regression

So where the line is at or above 0.5 it predicts 1 (Yes) and below 0.5 it predicts 0 (No). From the given Data Set it looks like Linear Regression is working perfectly, but this is not going to work in every case. Now I am going to add one more Tumor Size point to the Data Set; let us see how it works.

Plot of New Data Set

In this case Linear Regression fails to classify the Data Set correctly. To fix this, we apply the Logistic Function to the hypothesis hθ(x), which makes Linear Regression work as a Classification model.
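To see this numerically, here is a small hedged sketch (the tumour-size numbers are made up for illustration, not the post's actual data): we fit a straight line to binary labels, threshold it at 0.5, and watch how one extra very large tumour shifts the threshold and misclassifies points.

```python
import numpy as np

def threshold_from_line(x, y):
    """Fit hθ(x) = θ0 + θ1·x by least squares and return the x where hθ(x) = 0.5."""
    theta1, theta0 = np.polyfit(x, y, 1)
    return (0.5 - theta0) / theta1

# Toy data: small tumours are benign (0), large tumours are malignant (1)
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(threshold_from_line(x, y))   # threshold sits neatly between the two groups

# Add one very large malignant tumour: the fitted line flattens and the
# threshold drifts right, so a malignant example now falls below 0.5
x2 = np.append(x, 40.0)
y2 = np.append(y, 1)
print(threshold_from_line(x2, y2))
```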

Cost Function

The Cost Function tells us how well Logistic Regression is working: it gives the error present in the Model, and if the Cost Function value is zero the Model fits perfectly. Here I am going to use both the Logarithmic Cost Function and the Mean Square Error (MSE) to show why the Logarithmic Cost Function is the better optimised, convex choice.

Local Minima and Global Minima

If we use the Mean Square Error Cost Function, the plot of J(θ) vs θ becomes a Non-Convex Function, which is hard to optimise: reaching the minimum error is a very slow process because of the many local minima present in a Non-Convex Function.

Non Convex VS Convex

If you want to know more about convex functions, go and study Convex Constrained Optimization; it is a nice topic, have a look at it. Reference Link here

Using the Logistic Function and the Logarithmic Cost Function, we constrain our model's output between 0 and 1, so Gradient Descent focuses on the right spot and reaches the minimum error quickly.
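To see the two cost surfaces concretely, here is a small sketch (the one-feature toy data and the θ grid are my own assumptions, not from the post) that evaluates both cost functions over a range of θ values so you can plot the curves and compare their shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_cost(theta, x, y):
    """Mean Square Error with a sigmoid hypothesis: (1/m) Σ (hθ(x) − y)²."""
    h = sigmoid(theta * x)
    return np.mean((h - y) ** 2)

def log_cost(theta, x, y):
    """Logarithmic cost: −(1/m) Σ [y·log(h) + (1 − y)·log(1 − h)]."""
    h = sigmoid(theta * x)
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Toy one-feature data; sweep θ and collect both cost curves
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])
thetas = np.linspace(-5, 5, 101)
mse_curve = [mse_cost(t, x, y) for t in thetas]
log_curve = [log_cost(t, x, y) for t in thetas]
# e.g. plot thetas vs mse_curve and log_curve to compare the two shapes
```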

Derivation of Logarithmic Cost Function

Cost Function: J(θ) = −(1/m) Σᵢ [ yᵢ·log(hθ(xᵢ)) + (1 − yᵢ)·log(1 − hθ(xᵢ)) ]

In a Classification problem y belongs to {0, 1}, so it follows a Bernoulli Distribution. According to conditional probability:

Bernoulli Distribution

The Bernoulli distribution essentially models a single trial of flipping a weighted coin. It is the probability distribution of a random variable taking on only two values, 1 ("success") and 0 ("failure"), with complementary probabilities p and (1 − p) respectively. The Bernoulli distribution therefore describes events having exactly two outcomes, which are ubiquitous in real life. Some examples of such events are as follows: a team will win a championship or not, a student will pass or fail an exam, and a rolled die will either show a 6 or any other number.
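In our notation, writing p = hθ(x) for the predicted probability that y = 1, the Bernoulli probability mass function can be written compactly as

p(y) = p^y · (1 − p)^(1 − y),  y ∈ {0, 1}

so p(1) = p and p(0) = 1 − p. This single expression is the starting point for both derivations below.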

First Method using Entropy

In information theory, Entropy H is a measure of the uncertainty associated with a random variable.
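As a short sketch of that route (using the standard cross-entropy definition, since the original figure is not shown here): treat the true label as a distribution p = (y, 1 − y) over {1, 0} and the prediction as q = (hθ(x), 1 − hθ(x)). The cross-entropy H(p, q) = −Σ p·log(q) then becomes

Cost(hθ(x), y) = −[ y·log(hθ(x)) + (1 − y)·log(1 − hθ(x)) ]

and averaging this over all m training examples gives the Logarithmic Cost Function J(θ).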

Second Method using Maximum Likelihood

Maximum likelihood, also called the maximum likelihood method, is the procedure of finding the value of one or more parameters for a given statistic which makes the known likelihood distribution a maximum.


p(yᵢ) is the likelihood of a single data point: given xᵢ, it is the probability of observing the label yᵢ, i.e. the conditional probability p(yᵢ|xᵢ).

The likelihood of the entire Data Set is the product of the individual data point likelihoods.

L(θ) is the likelihood; maximising L(θ) is equivalent to minimising −log L(θ). We take the negative so that the quantity we minimise is non-negative, because an error should be a non-negative value.
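Spelled out in the same notation (m training examples, hθ(xᵢ) the predicted probability that yᵢ = 1), the steps look roughly like this:

L(θ) = Πᵢ hθ(xᵢ)^yᵢ · (1 − hθ(xᵢ))^(1 − yᵢ)

log L(θ) = Σᵢ [ yᵢ·log(hθ(xᵢ)) + (1 − yᵢ)·log(1 − hθ(xᵢ)) ]

J(θ) = −(1/m)·log L(θ)

Maximising L(θ) (or equivalently log L(θ)) is therefore the same as minimising J(θ), which is exactly the Logarithmic Cost Function.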

Decision Boundary

In order to get a discrete 0 or 1 classification, we map the output of the hypothesis function to 1 or 0.

The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:

g(z)≥0.5

when z≥0

z = 0: e⁰ = 1 ⇒ g(z) = 1/2
z → ∞: e⁻ᶻ → 0 ⇒ g(z) = 1
z → −∞: e⁻ᶻ → ∞ ⇒ g(z) = 0

So if our hypothesis is hθ(x) = g(θᵀx), with θᵀx = θ₀ + θ₁X, then that means:

hθ(x) = g(θᵀx) ≥ 0.5 when θᵀx ≥ 0

From these statements we can now say:

θᵀx ≥ 0 ⇒ y = 1
θᵀx < 0 ⇒ y = 0

The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.

Example:

θ = [5, −1, 0]ᵀ

y = 1 if 5 + (−1)·x₁ + 0·x₂ ≥ 0
5 − x₁ ≥ 0
−x₁ ≥ −5
x₁ ≤ 5

In this case, our decision boundary is a straight vertical line placed on the graph where x₁ = 5, and everything to the left of that denotes y = 1, while everything to the right denotes y = 0.

Again, the input to the sigmoid function g(z) (e.g. θᵀx) doesn't need to be linear, and could be a function that describes a circle (e.g. z = θ₀ + θ₁x₁² + θ₂x₂²) or any shape to fit our data.
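Here is a hedged sketch of the same idea in Python (the helper names are mine): with θ = [5, −1, 0] the predictor returns 1 exactly when x₁ ≤ 5, which is the vertical decision boundary described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Return 1 when g(θᵀx) ≥ 0.5, i.e. when θᵀx ≥ 0."""
    x_with_bias = np.concatenate(([1.0], x))   # prepend x0 = 1 for the intercept θ0
    return int(sigmoid(theta @ x_with_bias) >= 0.5)

theta = np.array([5.0, -1.0, 0.0])
print(predict(theta, np.array([3.0, 7.0])))    # x1 = 3 ≤ 5 -> 1
print(predict(theta, np.array([6.0, 2.0])))    # x1 = 6 > 5 -> 0
```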

Decision boundary

If hθ(x) = 0 and y = 1, the cost is infinity (∞).

If hθ(x) = 1 and y = 1, the cost is 0.

If hθ(x) = 1 and y = 0, the cost is infinity (∞).

If hθ(x) = 0 and y = 0, the cost is 0.
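These four cases fall directly out of the per-example cost used in the Logarithmic Cost Function:

Cost(hθ(x), y) = −log(hθ(x))        if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))    if y = 0

When y = 1, the cost −log(hθ(x)) is 0 at hθ(x) = 1 and grows to ∞ as hθ(x) → 0; the y = 0 case mirrors it, so a confident wrong prediction is punished without bound.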

Derivation

We start by randomly initialising the θ₀, θ₁, θ₂ values. From the graph we can see that the decision boundary should intercept the x₁ and x₂ axes, so I am going to set the θ₀ value to be greater than zero (with a different initialisation, θ₀ could also be zero).

Training Data Set

For simplicity I am going to reduce the Data Set.

Now let us check the Cost Function for this Decision Boundary.

Now, using Gradient Descent, we are going to minimise our cost function. I have already explained Gradient Descent in the Linear Regression post. Here is the Link.
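As a rough sketch of that training loop (the toy data, learning rate and iteration count are my own choices, not the post's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Minimise the Logarithmic Cost Function J(θ) by batch gradient descent."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # add x0 = 1 column for θ0
    theta = np.zeros(n + 1)                 # initialise θ
    for _ in range(iterations):
        h = sigmoid(Xb @ theta)             # current predictions
        gradient = Xb.T @ (h - y) / m       # ∂J/∂θ for the log cost
        theta -= alpha * gradient           # step that moves the decision boundary
    return theta

# Tiny made-up data set with two features
X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 3.5], [5.0, 4.0]])
y = np.array([0, 0, 1, 1])
print(gradient_descent(X, y))
```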

We can see that the Decision Boundary is moving upward. Now let us check the Cost Function.

The process is continued until we get an optimised Decision Boundary.

Mean Square Error Cost Function

This MSE error show’s that it has got stuck in local Minima, But in Logarithmic Cost function, J(theta) value is 1.034, There is a huge difference between MSE Cost Function and Logarithmic Cost Function where MSE shows very low error.So it is better to use Logarithmic Cost Function in Logistic Regression.

My other topic links

how to scrape the dynamic website using scrapy


Deep learning Back Propagation Gradient Descent and Chain Rule Derivation

Web scraping faster without being blocked using scrapy
