
Telecom Customer Churn Prediction

An End-to-End ML Project


Overview

 

The project follows a modularized and systematic approach, emphasizing clarity, maintainability, and collaborative development. The iterative nature of the project ensures that it evolves based on insights gained during exploratory data analysis and model training.



Dataset Used

 

Unfortunately, this dataset is no longer available on Kaggle. You can still download the zip file below:




Dataset Overview:

  • Columns: 12

  • Categorical Variables: 'Gender', 'Subscription Type', 'Contract Length'

  • Continuous Variables: 'CustomerID', 'Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Last Interaction'

  • Target Variable: Churn

  • Memory Usage: 50.1+ MB


Outliers: None.


Null Values: 0.00198%, evenly distributed across all columns.


Class Imbalance: None.


Proportion of target variable:

  • 44.5% (No churn)

  • 55.5% (Churn)
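
The null-value share and class proportions above can be reproduced with a few lines of pandas. The sketch below is illustrative only: the `summarize` helper and the toy DataFrame are assumptions made for the example, not the project's code or data. Only the column names `Age`, `Support Calls`, and `Churn` come from the dataset description.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "Churn") -> dict:
    """Return the share of null cells (in %) and the class proportions (in %)."""
    # Mean of the boolean null mask over every cell, expressed as a percentage.
    null_pct = df.isna().to_numpy().mean() * 100
    # Relative frequency of each target class, expressed as a percentage.
    class_pct = (df[target].value_counts(normalize=True) * 100).round(1).to_dict()
    return {"null_pct": null_pct, "class_pct": class_pct}


# Toy stand-in data (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame(
    {
        "Age": [30, 45, None, 52],
        "Support Calls": [1, 7, 3, 9],
        "Churn": [0, 1, 0, 1],
    }
)

summary = summarize(toy)
print(summary)
```

On the real data, the same helper would be called on the DataFrame loaded from the downloaded file.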



Key Highlights

 
  • Exploratory data analysis to uncover patterns and features influencing customer churn.

  • Feature engineering and selection to create predictive indicators.

  • Model training and selection based on precision score, using RandomizedSearchCV for hyperparameter tuning with cross-validation.

  • Feature importance analysis for further iterations.
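
As an illustration of the feature engineering step, binary indicator features such as 'high_support_calls', 'low_spender', and 'high_payment_delay' (names reported among this project's top features) could be derived from the raw columns. The sketch below is a guess at one way to do this: the `add_flag_features` helper and the quantile-based thresholds are assumptions, and the project's actual cut-offs are not stated.

```python
import pandas as pd


def add_flag_features(df: pd.DataFrame, quantile: float = 0.75) -> pd.DataFrame:
    """Add binary flags using quantile thresholds (hypothetical cut-offs)."""
    out = df.copy()
    # Flag customers above the upper quantile of support calls.
    calls_cut = out["Support Calls"].quantile(quantile)
    out["high_support_calls"] = (out["Support Calls"] > calls_cut).astype(int)
    # Flag customers above the upper quantile of payment delay.
    delay_cut = out["Payment Delay"].quantile(quantile)
    out["high_payment_delay"] = (out["Payment Delay"] > delay_cut).astype(int)
    # Flag customers below the lower quantile of total spend.
    spend_cut = out["Total Spend"].quantile(1 - quantile)
    out["low_spender"] = (out["Total Spend"] < spend_cut).astype(int)
    return out


# Toy stand-in data (hypothetical values).
toy = pd.DataFrame(
    {
        "Support Calls": [0, 1, 2, 10],
        "Payment Delay": [5, 30, 10, 12],
        "Total Spend": [900, 100, 500, 700],
    }
)

flagged = add_flag_features(toy)
print(flagged[["high_support_calls", "high_payment_delay", "low_spender"]])
```

Quantile thresholds adapt to the data at hand. Fixed business thresholds would be an equally valid choice if domain knowledge suggests them.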



Model Characteristics

 
  • The model for this project is an XGBoost classifier (XGBClassifier), selected through experiments with RandomizedSearchCV, cross-validation, and hyperparameter tuning.

  • It delivers high precision and strong class separation: the area under the precision-recall curve is 0.928 and the area under the ROC curve (ROC AUC) is 0.920.





  • The top features influencing churn are 'high_support_calls', 'low_spender', and 'high_payment_delay'.
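
The selection procedure described above can be sketched as a randomized search scored on precision, followed by ROC AUC and average precision (a common summary of the precision-recall curve) on held-out data. This is a minimal sketch under stated assumptions, not the project's training script: the data is synthetic, the search space is invented, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so the example depends on scikit-learn alone. `xgboost.XGBClassifier` accepts the same four parameter names and could be swapped in.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic binary classification data (assumption: replaces the churn dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search space; the project's actual ranges are not stated.
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.3),  # samples from [0.01, 0.31]
    "subsample": uniform(0.6, 0.4),  # samples from [0.6, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled configurations
    scoring="precision",  # rank candidates by cross-validated precision
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out split.
proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)
print(search.best_params_, round(roc_auc, 3), round(pr_auc, 3))
```

Scoring the search on precision favours configurations that raise few false churn alarms. The two threshold-free metrics are then computed separately as a check on overall ranking quality.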






Flask App

 

By running the app.py script, users can interact with the trained model:
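
The contents of app.py are not shown here, so the following is only a hedged sketch of how a small Flask prediction endpoint might look. The `/predict` route, the `FEATURES` list, and the `predict_churn` placeholder rule are all hypothetical. In a real app the placeholder would be replaced by a call to the trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical subset of input fields, named after the dataset's columns.
FEATURES = ["Support Calls", "Payment Delay", "Total Spend"]


def predict_churn(features: dict) -> int:
    """Placeholder rule standing in for the trained model (hypothetical)."""
    return int(features["Support Calls"] > 5)


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body; fall back to an empty dict if it is missing or invalid.
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400
    return jsonify({"churn": predict_churn(payload)})


# To serve locally, uncomment the next line and run the script:
# app.run(debug=True)
```

Flask's built-in test client can exercise the endpoint without starting a server, for example `app.test_client().post("/predict", json={...})`.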







Issues & Limitations

 

While this project serves primarily as an educational endeavor, certain industry-standard practices remain unaddressed.


Notably, the project has no Continuous Integration/Continuous Deployment (CI/CD) pipeline, for example one built with GitHub Actions, so it does not exercise practices that are central to real-world development workflows.


Additionally, a robust deployment strategy for the predictive model is lacking, preventing the transition from development to production. These omissions limit the project's agility, hindering the implementation of real-time updates and model utilization in operational environments.


Addressing these aspects would further align the project with best practices in the data science and machine learning industry.


