Understanding the logistic regression algorithm in R

Vikrant thakur
5 min readDec 22, 2020

Performing a simple Logistic regression in Rstudio for mtcars dataset

Classification image (Source- Pixabay)

Linear regression is the most basic type of regression machine learning technique. On the other hand, the logistic regression machine learning technique is one of the most simple and basic algorithms for classification problems. Some of the examples of classification are Online transaction fraud, tumor malignin, and E-mail spam classification. In logistic regression logistic sigmoid function is used to transform the output to return a probability value. So, one can say that the logistic regression algorithm is a predictive analysis algorithm based on the concept of probability. Logistic regression is used to find the outcome of an event in binary values which can represent whether true or false. The outcome can be categorized into 0 or 1 which represents failure and success respectively.

About regression analysis

Regression analysis is an important technique to model and analyzes data. In this technique, we try to minimize the distance between different data points and a straight line or curve. There are many different regression models but linear and logistic regression is the most basic type of regression algorithm. The regression model is built upon the relationship between the dependent and independent variables in the dataset. Based on the type of dependent variable, the number of the independent variables, and the shape of the regression line which can be straight or curve regression algorithm are classified into many types.

Types of logistic regression

  1. Binary logistic regression: It is a regression involving the dependent variable as binary. (Tumour malignant problem) Ex: success or failure
  2. Multiple Logistic regression: It is a regression involving the dependent variables with multi-class. Ex: Dog, cats, or sheep

When to use Logistic regression?

In our daily life, there are many events whose result can be from any one of the categories. Some questions like Will India win over Australia in Kohli’s absence? Or Is terrorism danger to humanity more than other problems? — Can have only two possible results like yes or no. Such problems can be solved using logistic regression.

What is sigmoid Function?

The logistic regression algorithm can be called a linear regression with a more complex cost function which is known as the sigmoid function. It is also known as the logistic function. It can map any real value between 0 and 1 but never exactly at those limits.

e/(1+e^-value) (e — Natural base logarithms)

Its graph is an S-shaped curve which can be observed in the image given below –

Image for the logistic function graph

Logistic regression equation –

y = e^ (b0 + b1*x) / (1 + e^ (b0 + b1*x))

Here y is the dependent variable or one which is predicted, b0 is the intercept term and b1 is the coefficient for the single input value (x).

Avoiding overfitting and underfitting in Logistic regression :

While selecting a model for logistic regression it is very important to consider the model fit. Adding an independent variable to the logistic variables in the model will increase the variance and make the model fit. But if the variables are added in less it will result in underfitting or if more than will result in overfitting of the model. By overfitting, it means that the model will not be able to use as a generalized model for other data set beyond our training dataset and vice-versa for underfitting which cannot even be used for the training data.

Example in R programming-

Here in the example, we will simply create a logistic model for understanding the relationship between two variables of an in-built dataset in an R environment which is called mtcars.

The basic function for logistic regression in R is glm(). The syntax of the glm() function is given below-

glm(formula, data, value)

Following is the above terms in the syntax −

· formula is the symbol for the relationship between two variables

· data is the input dataset (mtcars for our example)

· family is R object to specify the details of the model

About dataset

The dataset was brought from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

It is a table of 32 observations of a total of 11 variables. Some of the important variables are mpg (miles per gallon), cyl (Number of cylinders), disp (displacement), hp (Horsepower), and wt (weight).

Below is the source code for the R program that we performed on the mtcars dataset for making a logistic model.

R program for logistic regression

Explanation of code

The first line of code declares a vector named tran_data having the input from the mtcars dataset. In the second line, we have created our.data vector which has the glm() function which tries to understand the relationship between the am and the other three variables named cyl, hp, and wt. we have provided data as tran_data here and family is classified as binomial as the outcome is expected in the same manner.

Output

The console image of the studio for output

Explanation of output

As the p-value in the last column is more than 0.05 for the variables “cyl” and “hp”, we consider them to be insignificant in contributing to the value of the variable “am”. Only weight (wt) impacts the “am” value in this regression model.

The weight variable has a p-value of less than 0.05 so we can say that it is significant in contributing to the value of am and am is dependent on weight for its values.

Conclusion

Classification is a machine learning technique from supervised machine learning. In classification, the model tries to predict the category or class to which the dependent variable falls based on its relationship with dependent variables. Logistic regression is a fundamental classification algorithm. It is a linear classifier that is similar to polynomial and linear regression. It is fast and relatively uncomplicated than other classification algorithms. Although it can be best used for binary classification one can also use logistic regression for multi-class classification.

--

--