Customer Churn – Logistic Regression with R


In the customer management lifecycle, customer churn refers to a customer's decision to end the business relationship. It is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a loyalty rate of 60%, its churn rate is 40%. Under the 80/20 customer profitability rule, 20% of customers generate 80% of revenue. It is therefore very important to predict which customers are likely to churn and to identify the factors affecting their decisions. In this blog post, we show how a logistic regression model in R can be used to identify customer churn in a telecom dataset.

Learning/Prediction Steps


Data Description

The telecom dataset holds details for more than 7,000 unique customers, with each customer represented by one row. Its variables fall into two groups.

Input Variables: These are the predictors, also called independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Total charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variables: These are the response, or dependent, variables. Because the output variable (Churn) is binary, taking the value “0” or “1”, this is a classification problem in supervised machine learning.

Data Preprocessing

    • Data cleansing and preparation are done in this step. Transforming a continuous variable into a meaningful factor variable can improve model performance and make the data easier to interpret. For example, in this dataset the tenure variable is converted to a factor variable with ranges in months, which helps show which tenure groups are more likely to churn.
    • As part of data cleansing, missing values are identified using a missing map plot. The telecom dataset has only a small number of records with missing values, and these are dropped from the analysis.
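
The missing-value step can be sketched in base R as follows. This is a minimal illustration on made-up rows, not the authors' code: the column names and values are assumptions, and the missing map plot itself is not reproduced (the comments mention the Amelia package, whose missmap() function draws such a plot, but whether it was used here is an assumption).

```r
# Toy rows standing in for the telecom dataset (all values are made up).
churn <- data.frame(
  tenure         = c(1, 34, 2, 45, 0, 8),
  MonthlyCharges = c(29.85, 56.95, 53.85, 42.30, 52.55, 99.65),
  TotalCharges   = c(29.85, 1889.50, 108.15, 1840.75, NA, 820.50)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(churn))
print(na_counts)

# Keep only the rows that have no missing value in any column.
churn_clean <- churn[complete.cases(churn), ]
print(nrow(churn_clean))
```

Dropping rows is reasonable only when few records are affected, as the text states; with many missing values, imputation would be worth considering instead.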


    • Custom logic is implemented to create a derived categorical variable from the continuous tenure variable. Since they do not affect the prediction, the customer ID and the raw tenure values are dropped from further processing.
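
One way to derive such a variable is with base R's cut(). The sketch below is an assumption about how the binning could look: the break points, labels, and the column name tenure_interval are hypothetical and are not taken from the original code.

```r
# Toy rows; customerID and tenure mimic the columns described in the text.
churn <- data.frame(
  customerID = c("A1", "B2", "C3", "D4", "E5"),
  tenure     = c(1, 12, 13, 30, 72),
  stringsAsFactors = FALSE
)

# Bin tenure (in months) into labelled ranges. Break points are illustrative.
churn$tenure_interval <- cut(
  churn$tenure,
  breaks = c(0, 12, 24, 48, 60, Inf),
  labels = c("0-12 Months", "12-24 Months", "24-48 Months",
             "48-60 Months", "> 60 Months"),
  include.lowest = TRUE
)

# Drop the identifier and the raw tenure now that the factor replaces it.
churn$customerID <- NULL
churn$tenure <- NULL

print(churn)
```

With the default right-closed intervals, a tenure of exactly 12 months falls in the first range and 13 months in the second.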


    • A new categorical feature is created as described above.


    • A few categorical variables have redundant levels that refer to the same thing. For example, the “MultipleLine” feature has the possible values “Yes”, “No”, and “No Phone Service”. Since “No” and “No Phone Service” have the same meaning here, these records are replaced with a single value.
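
Collapsing the redundant level can be sketched like this in base R. The vector below is made up, and the level spelling follows the text above; the exact spelling in the real dataset may differ.

```r
# Toy values for the feature, including the redundant level.
multiple_line <- factor(c("Yes", "No", "No Phone Service", "No", "Yes"))

# Replace "No Phone Service" with "No", then rebuild the factor so the
# unused level disappears.
recoded <- as.character(multiple_line)
recoded[recoded == "No Phone Service"] <- "No"
multiple_line <- factor(recoded)

print(levels(multiple_line))
print(table(multiple_line))
```

The same replacement would be repeated for any other feature that carries a level with the same meaning as “No”.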


Partitioning the Data & Logistic Regression

    • In predictive modeling, the data needs to be partitioned into training and test sets. Here, 70% of the data is used for training and 30% for testing.
    • In this dataset, 4K+ customer records are used for training and 2K+ records for testing.
    • Classification algorithms such as Logistic Regression, Decision Tree, and Random Forest, all available in R, Python, and Spark ML, can be used to predict churn.
    • Multiple models can be run on the telecom dataset and compared on performance and error rate to choose the best one. In this blog post, we have used a Logistic Regression model in R, fitted with the glm function. Future blogs will focus on other models and combinations of models.
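
A 70/30 split followed by a logistic regression fit might look like the sketch below. The data is simulated, so the predictors and their effect sizes are assumptions; only the glm call mirrors the one quoted in the comments (telecomModel, trainData). The original post may have partitioned the data differently, for example with a stratified split.

```r
set.seed(42)  # make the simulated data and the split reproducible

# Simulated stand-in for the cleaned telecom data.
n <- 1000
churn <- data.frame(
  Contract = factor(sample(c("Month-to-month", "One year", "Two year"),
                           n, replace = TRUE)),
  MonthlyCharges = round(runif(n, 20, 120), 2)
)
# Made-up relationship: month-to-month contracts and higher charges churn more.
p <- plogis(-1 + 1.2 * (churn$Contract == "Month-to-month") +
              0.01 * (churn$MonthlyCharges - 70))
churn$Churn <- factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))

# Random 70/30 partition into training and test sets.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
trainData <- churn[train_idx, ]
testData  <- churn[-train_idx, ]

# Logistic regression on all predictors. With a factor response, glm treats
# the first level ("No") as failure and the other level as success.
telecomModel <- glm(Churn ~ ., family = binomial(link = "logit"),
                    data = trainData)
print(coef(telecomModel))
```

Categorical predictors can stay as factors; glm expands them into indicator columns internally.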


Model Summary

From the model summary, the churn response variable is affected by the tenure interval, contract period, paper billing, senior citizen, and multiple line variables. The importance of each variable is indicated by the significance codes printed next to its coefficient (*** for high importance, * for medium importance, and a dot for the next level of importance). Rerunning the model with only these independent variables will affect its performance and accuracy.
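
The coefficient table behind those significance codes can also be read programmatically. The sketch below fits a small model on simulated data (the predictors and effect sizes are assumptions) and lists the terms whose p-value is below 0.05; the 0.05 cut-off is a common convention, not something the original post specifies.

```r
set.seed(1)

# Simulated data with two categorical predictors named after the text.
n <- 500
d <- data.frame(
  PaperlessBilling = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  SeniorCitizen    = factor(sample(c("No", "Yes"), n, replace = TRUE,
                                   prob = c(0.8, 0.2)))
)
p <- plogis(-1.5 + 1.0 * (d$PaperlessBilling == "Yes") +
              0.5 * (d$SeniorCitizen == "Yes"))
d$Churn <- as.integer(runif(n) < p)

model <- glm(Churn ~ ., family = binomial(link = "logit"), data = d)

# summary() prints the table with significance stars; coef() of the summary
# returns the same numbers as a matrix.
coefs <- coef(summary(model))
print(round(coefs, 4))

# Terms whose p-value falls below the chosen threshold.
significant <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
print(significant)
```

In R's printed summary, the codes correspond to p-value ranges: *** below 0.001, ** below 0.01, * below 0.05, and a dot below 0.1.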

Prediction Accuracy

    • Models built on the training dataset are evaluated on the test dataset. Accuracy and error rate show how a model behaves on the test data, and the best model is selected using these measures.
    • Confusion Matrix / Misclassification Table: a table that describes the performance of a classification model on test data. It cross-tabulates actual values against predicted values, giving the counts of correctly and wrongly classified customers.


    • The various measures derived from the confusion matrix are:


    • With logistic regression, the accuracy of this model is evaluated as 80% and the error rate as 20%. The accuracy may be improved by using other classification models, such as decision tree and random forest, with parameter tuning.
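
The confusion matrix and the measures derived from it can be sketched in base R as follows. The actual labels and predicted probabilities are made up for illustration and do not reproduce the figures reported above; the 0.5 cut-off is an assumption. With a fitted model, the probabilities would typically come from predict(model, newdata = testData, type = "response").

```r
# Made-up actual labels and predicted churn probabilities for ten customers.
actual <- factor(c("No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"),
                 levels = c("No", "Yes"))
prob <- c(0.10, 0.55, 0.80, 0.30, 0.20, 0.90, 0.60, 0.05, 0.70, 0.35)

# Classify as churn when the predicted probability exceeds 0.5.
predicted <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# Cross-tabulate actual against predicted classes.
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

tp <- cm["Yes", "Yes"]  # churners correctly predicted
tn <- cm["No", "No"]    # non-churners correctly predicted
fp <- cm["No", "Yes"]   # non-churners predicted as churners
fn <- cm["Yes", "No"]   # churners predicted as non-churners

accuracy    <- (tp + tn) / sum(cm)
error_rate  <- 1 - accuracy
sensitivity <- tp / (tp + fn)  # share of actual churners that were caught
specificity <- tn / (tn + fp)  # share of non-churners correctly identified

print(c(accuracy = accuracy, error_rate = error_rate,
        sensitivity = sensitivity, specificity = specificity))
```

Accuracy alone can hide poor detection of churners when they are a minority, which is why sensitivity and specificity are worth reporting alongside it.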



  • t_nmurthy

    This is a good read.

    • Treselle Systems Blog


  • Abdoulaye Diallo

    Hello, thanks for this article. I have one question: did you erase all the data in the churn column for the test data? If I understand correctly, these values (yes or no) should be predicted by the model, since it is a classification.
    Could you explain this to me?

    • Treselle Systems Blog

      Yes, we have ignored the churn column in the test data and used the trained model for prediction.

  • Alex King

    Very good article, thank you for sharing! Just a quick question – how come the date of churn was not included in the data? Surely there would be some interaction between the month and other variables that might increase the likelihood of some customers churning. For example, it might be that there is an increased churn rate for senior citizens with Fibre optic in April.

    • Treselle Systems Blog

      Thanks!!! The sample dataset used in this blog does not have churn date feature. As said, churned month might be correlated with other features and will affect the customer churn. If this feature is available, the performance of a particular customer group (cohort) over the month can be seen.

  • Hari Shaw

    Hi, Nice tutorial and thanks for this.
    when I am running glm function i.e. the below step
    telecomModel <- glm(Churn ~ .,family=binomial(link="logit"),data=trainData)
    I am getting the Error in eval(family$initialize) : y values must be 0 <= y <= 1 . Now I changed the variable "Churn" from Yes, No to 1, 0. also from character to numeric. but still I am getting error "Error in weights * y : non-numeric argument to binary operator". What I suspect is, when I glimpse the data set or the train data set I found that there are lots of categorical variables which needs to convert to numeric or double correct? How can we run glm directly without converting categorical to numeric or double variables?
    could you please correct me?

    • Treselle Systems Blog

      Are you using the same code available in our GitHub link?
      No, it’s not necessary to convert the categorical variables into numeric or double. You can run the glm function with categorical variables too. See the attached structure of the train data: apart from the “monthly charges and total charges” features, all the features are categorical variables.

  • fati

    Hi, Thank you for this amazing work. But the Amelia and dplyr packages don’t work for me. Any help, please?