Predict Employee Attrition
Introduction
Purpose: This analysis aims to uncover the factors that lead to employee attrition.
Data Source: This fictional dataset was created by IBM data scientists and released on Kaggle. It contains 1,470 employee records.
GitHub: Please click here for the complete scripts.
Brief Summary: After experimenting with Logistic Regression, Decision Tree and Random Forest models, the logistic regression model proved the best overall at predicting employee attrition. Findings and recommendations are provided at the end of the analysis.
The sections below show how I explored the dataset and developed the models.
Explore data
1. Read/load data
2. Explain features (to name just a few):
Attrition: Whether the employee left the company
NumCompaniesWorked: Number of companies the employee has worked for
PercentSalaryHike: Percentage salary increase between last year and this year
TotalWorkingYears: Total number of years the employee has worked
YearsInCurrentRole: Number of years the employee has worked in the current role
YearsSinceLastPromotion: Number of years since the employee's last promotion
YearsWithCurrManager: Number of years the employee has worked with the current manager
3. Check for missing values: none found
4. Browse through the dataset
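The loading, missing-value check and browsing steps above can be sketched roughly as follows. This is a minimal illustration with pandas, not the original script: the inline three-row CSV is hypothetical stand-in data, and with the real Kaggle file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the Kaggle CSV file;
# with the real file, pass its path to pd.read_csv instead.
csv_text = """Age,Attrition,BusinessTravel,OverTime,YearsSinceLastPromotion
41,Yes,Travel_Rarely,Yes,0
49,No,Travel_Frequently,No,1
37,Yes,Travel_Rarely,Yes,0
"""

# 1. Read / load data
df = pd.read_csv(io.StringIO(csv_text))

# 3. Count missing values per column, then in total
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

# 4. Browse through the dataset
print(df.shape)    # (rows, columns)
print(df.dtypes)   # object columns are categorical text
print(df.head())   # first rows
print(total_missing)
```

A total of zero from the missing-value count corresponds to the "none found" result reported in step 3.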


We can also break down the attrition rate by different features.
by Gender

by Business Travel: 25% of employees who travel frequently left

by OverTime: 31% of employees who work overtime left

by Department & Job Role: 21% of employees in the Sales department and 40% of sales representatives left the company.
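A breakdown like the ones above can be computed with a group-by. The sketch below assumes a DataFrame with a Yes/No "Attrition" column; the eight toy records and the helper name attrition_rate_by are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical toy records; the real dataset has 1470 rows.
df = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

def attrition_rate_by(data, feature):
    """Share of employees with Attrition == 'Yes' in each category of `feature`."""
    left = data["Attrition"].eq("Yes")          # True where the employee left
    return left.groupby(data[feature]).mean()   # mean of booleans = proportion

rates = attrition_rate_by(df, "OverTime")
print(rates)
```

The same helper can be called with "Gender", "BusinessTravel", "Department" or "JobRole" to reproduce the other breakdowns on the real data.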


Logistic Regression Model
1. Prepare data
Deal with categorical data
Deal with continuous data
Split training and test set
Standardize continuous data
2. Build model (with Cross-validation)
3. Check accuracy
Accuracy, Precision, Recall rate

ROC curve


4. Obtain coefficients
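The four steps above can be sketched end to end with scikit-learn. This is an illustration under stated assumptions rather than the original script: the data is synthetic, only four feature columns are used, and LogisticRegressionCV is one possible way to combine logistic regression with cross-validation (the original may have used a different approach).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the IBM dataset.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "OverTime": rng.choice(["Yes", "No"], size=n),
    "BusinessTravel": rng.choice(
        ["Travel_Rarely", "Travel_Frequently", "Non-Travel"], size=n),
    "YearsInCurrentRole": rng.integers(0, 15, size=n).astype(float),
    "MonthlyIncome": rng.normal(6500, 2000, size=n),
})
# Synthetic target: overtime raises, and tenure in role lowers, the odds of leaving.
logit = -1.0 + 1.5 * (df["OverTime"] == "Yes") - 0.15 * df["YearsInCurrentRole"]
df["Attrition"] = np.where(rng.random(n) < 1 / (1 + np.exp(-logit)), "Yes", "No")

categorical = ["OverTime", "BusinessTravel"]
continuous = ["YearsInCurrentRole", "MonthlyIncome"]

# 1. Prepare data: one-hot encode categoricals, split, then standardize.
X = pd.get_dummies(df[categorical + continuous], columns=categorical,
                   drop_first=True, dtype=float)
y = (df["Attrition"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler().fit(X_train[continuous])  # fit on training rows only
X_train[continuous] = scaler.transform(X_train[continuous])
X_test[continuous] = scaler.transform(X_test[continuous])

# 2. Build the model; the regularization strength is chosen by 5-fold CV.
model = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_train, y_train)

# 3. Check performance on the held-out test set.
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
fpr, tpr, thresholds = roc_curve(y_test, y_score)  # points of the ROC curve
print(metrics)

# 4. Obtain coefficients, ordered by absolute size.
coefficients = pd.Series(model.coef_[0], index=X.columns).sort_values(
    key=abs, ascending=False)
print(coefficients)
```

Fitting the scaler on the training rows only keeps information from the test set out of the model. Because the continuous features are standardized, the coefficient magnitudes are roughly comparable across features, which is what makes a ranking of "important" features meaningful.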

Decision Tree Model
1. Prepare for grid search
2. Search and build the Decision Tree model (with Cross-validation)
3. Check model performance
Accuracy, Precision, Recall rate
Applying code similar to that used for the logistic regression model, we obtain an accuracy rate of 0.78, a precision rate of 0.55, and a recall rate of 0.15.
ROC curve
Code similar to that used for the logistic regression model produces the ROC curve for this model.
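The grid search in steps 1 and 2 of this section might look like the sketch below. It assumes scikit-learn's GridSearchCV; the synthetic data and the candidate parameter values are illustrative choices, not the ones used in the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data standing in for the prepared attrition features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.84, 0.16], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 1. Prepare for grid search: candidate values are illustrative only.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

# 2. Search with 5-fold cross-validation; the best tree is refit on all training rows.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)             # mean cross-validated accuracy of the best setting
print(search.score(X_test, y_test))   # accuracy on the held-out test set
```

With an imbalanced target, accuracy alone can look good while recall stays low, as the figures reported above show, so scoring="recall" or scoring="f1" may be worth trying in the search.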

Random Forest Model
1. Prepare for grid search
2. Search and build the Random Forest model (with Cross-validation)
3. Check model performance
Accuracy, Precision, Recall rate
Applying code similar to that used for the logistic regression model, we obtain an accuracy rate of 0.89, a precision rate of 0.84, and a recall rate of 0.44.
ROC curve
Code similar to that used for the logistic regression model produces the ROC curve for this model.
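A comparable sketch for the random forest follows, again assuming scikit-learn's GridSearchCV. The synthetic data and the small parameter grid are hypothetical; the block also shows how precision, recall and feature importances can be read from the tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical imbalanced data standing in for the prepared attrition features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.84, 0.16], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 1. Prepare for grid search: a deliberately small, illustrative grid.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# 2. Search with 5-fold cross-validation and keep the refit best forest.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)
best_forest = search.best_estimator_

# 3. Check model performance on the held-out test set.
y_pred = best_forest.predict(X_test)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)
importances = best_forest.feature_importances_  # impurity-based, sums to 1

print(search.best_params_, precision, recall)
print(importances)
```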

Findings & Recommendations
The performance of the three models is listed below. The logistic regression model is clearly the best at predicting employee attrition.

The analysis is summarized below:
1. After experimenting with Logistic Regression, Decision Tree and Random Forest models, a model with an accuracy rate of 0.93 was developed to predict attrition.
2. “OverTime” and “BusinessTravel” are the two most important features affecting employee attrition. Employees who work overtime and employees who travel frequently for business tend to churn. It’s recommended to allocate more resources to address these two aspects.
3. Other factors include Job Role, Marital Status, Years Since Last Promotion and Years in Current Role.
- Job Role: Research Directors are less likely to leave the company, whereas Sales Representatives and Laboratory Technicians are more likely to leave.
- Marital Status: Single employees are more likely to churn.
- Years Since Last Promotion and Years in Current Role: Recently promoted employees are more likely to churn, while employees who have stayed in their current role for a long time are less likely to churn.
4. It’s recommended to collect information such as “whether the employee is a manager”, “benefit received” and “location” to include in the model in the future.