Predict Employee Attrition

Yenlingktw
4 min read · Mar 30, 2020

Introduction

Purpose: This analysis aims to uncover the factors that lead to employee attrition.

Data Source: This fictional dataset was created by IBM data scientists and released on Kaggle. It contains 1,470 employee records.

GitHub: The full scripts are available on GitHub.

Brief Summary: After experimenting with Logistic Regression, Decision Tree and Random Forest models, the logistic regression model showed the best overall ability to predict employee attrition. Findings and recommendations are provided at the end of the analysis.

The sections below show how I explored the dataset and developed the model.

Explore data

1. Read/ Load data
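The loading code is not reproduced in this text, so here is a minimal sketch of this step, assuming pandas. The three sample rows are made up for illustration (the real file has 1,470 rows), and `load_employees` is a hypothetical helper name.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the Kaggle CSV (values are illustrative only).
SAMPLE_CSV = """Age,Attrition,BusinessTravel,OverTime,MonthlyIncome
41,Yes,Travel_Rarely,Yes,5993
49,No,Travel_Frequently,No,5130
37,Yes,Travel_Rarely,Yes,2090
"""


def load_employees(source):
    """Read employee records from a file path or file-like object."""
    return pd.read_csv(source)


df = load_employees(io.StringIO(SAMPLE_CSV))
print(df.shape)   # (rows, columns)
print(df.dtypes)  # which columns are numeric and which are text
```

With the real data, the same helper would be called with the path to the downloaded CSV instead of the in-memory sample.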

2. Explain features: (Just to name a few)

Attrition: whether the employee left the company

NumCompaniesWorked: number of companies the employee has worked for

PercentSalaryHike: percentage salary increase from last year to this year

TotalWorkingYears: total number of years the employee has worked

YearsInCurrentRole: number of years the employee has worked in the current role

YearsSinceLastPromotion: number of years since the employee's last promotion

YearsWithCurrManager: number of years the employee has worked with the current manager

3. Check missing values: none found

No missing values found
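A check like this could be done as follows, assuming pandas. The toy frame below deliberately contains gaps so the output shows what a problem would look like; on the real dataset every count comes back as zero.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (illustrative values, not from the dataset).
df = pd.DataFrame({
    "Age": [41, 49, np.nan],
    "OverTime": ["Yes", None, "No"],
    "Attrition": ["Yes", "No", "No"],
})

# Count missing entries per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Single flag: does any column contain a missing value?
has_missing = bool(missing_per_column.any())
print("Any missing values:", has_missing)
```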

4. Browse through the dataset

Sample data
Summary of the data set

We can also discover the attrition rate by different features.

by Gender

by Business Travel — 25% of frequently traveled employees left

by OverTime — 31% of employees who work overtime left

by Department & Job Role: 21% of employees in the Sales department and 40% of sales representatives left the company.
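Breakdowns like these can be computed with a group-by. The sketch below assumes pandas and uses a made-up eight-row frame, so its rates do not match the figures above. `attrition_rate_by` is a hypothetical helper name.

```python
import pandas as pd

# Made-up sample; the rates here are illustrative, not the dataset's.
df = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})


def attrition_rate_by(frame, feature):
    """Share of employees who left, within each level of `feature`."""
    left = frame["Attrition"].eq("Yes")          # boolean: True if the employee left
    return left.groupby(frame[feature]).mean()   # mean of booleans = proportion


rates = attrition_rate_by(df, "OverTime")
print(rates)
```

The same call works for any categorical column, for example `attrition_rate_by(df, "Gender")` on the full dataset.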

Logistic Regression Model

1. Prepare data

Deal with categorical data
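One common way to handle categorical columns is one-hot encoding. The sketch below assumes that approach with pandas `get_dummies`; the original script may have encoded them differently. The rows are illustrative.

```python
import pandas as pd

# Illustrative rows using category labels that appear in the dataset.
df = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
    "OverTime": ["Yes", "No", "Yes"],
    "Attrition": ["Yes", "No", "No"],
})

# Target: 1 if the employee left, 0 otherwise.
y = df["Attrition"].map({"Yes": 1, "No": 0})

# One-hot encode the predictors; drop_first removes one level per feature
# so the dummy columns are not perfectly collinear.
X = pd.get_dummies(df.drop(columns="Attrition"), drop_first=True, dtype=int)
print(X)
```

Dropping the first level makes that level the baseline, so each coefficient in a later logistic regression is read relative to it.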

Deal with continuous data

Split training and test set

Standardize continuous data
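The split and standardization steps might look like the sketch below, assuming scikit-learn. The data is synthetic, and the 75/25 split and random seeds are arbitrary choices, not the author's settings. The scaler is fitted on the training set only, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic continuous features and an imbalanced 0/1 target (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = np.array([0] * 160 + [1] * 40)

# Stratify so both sets keep the same share of leavers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```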

2. Build model (with Cross-validation)
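A minimal sketch of this step, assuming scikit-learn's `LogisticRegression` evaluated with 5-fold cross-validation. The data is synthetic, and the fold count and `max_iter` are assumptions, not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the prepared training data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.84, 0.16], random_state=0,
)

model = LogisticRegression(max_iter=1000)

# Accuracy on each of 5 held-out folds.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores.round(3))
print("Mean CV accuracy:", cv_scores.mean().round(3))

# Refit on all training data for later prediction.
model.fit(X, y)
```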

3. Check accuracy

Accuracy, Precision, Recall rate
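To make the three metrics concrete, here is a small sketch that computes them from raw predictions using only the standard library. The labels are made up, and `classification_metrics` is a hypothetical helper; in practice scikit-learn's `accuracy_score`, `precision_score` and `recall_score` give the same numbers.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = left, 0 = stayed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # leavers caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # leavers missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers correctly kept
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up example: 4 actual leavers, 3 predicted leavers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because leavers are the minority class, accuracy alone can look good while recall stays low, which is why all three are reported.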

ROC curve
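The points on a ROC curve and the area under it can be obtained as sketched below, assuming scikit-learn. The four scores are a toy example; with a fitted model, `y_score` would come from `model.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities of leaving (illustrative only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print("AUC:", auc)
```

Plotting `fpr` against `tpr` (for example with matplotlib) gives the curve itself.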

4. Obtain coefficients
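Coefficients can be read from a fitted model and ranked by magnitude, as in this sketch assuming scikit-learn and pandas. The ten rows and the column names `OverTime_Yes` and `YearsInCurrentRole_std` are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up, already-encoded and standardized features.
X = pd.DataFrame({
    "OverTime_Yes": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "YearsInCurrentRole_std": [-1.0, -0.5, -1.2, 1.0, 0.2, 0.8, 1.1, -0.9, 0.5, 0.3],
})
y = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]  # 1 = left the company

model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each coefficient with its feature name and rank by absolute size.
coefs = pd.Series(model.coef_[0], index=X.columns)
order = coefs.abs().sort_values(ascending=False).index
coef_table = coefs.reindex(order)
print(coef_table)
```

A positive coefficient raises the predicted odds of leaving and a negative one lowers them. Magnitudes are only comparable across features when the continuous inputs have been standardized.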

Decision Tree Model

1. Prepare for grid search

2. Search and build the Decision Tree model (with Cross-validation)
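The two steps above could be sketched with scikit-learn's `GridSearchCV` as follows. The parameter grid, fold count and synthetic data are assumptions for illustration; the grid the author actually searched is not given in this text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Step 1: candidate hyperparameters (hypothetical values).
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

# Step 2: try every combination with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_tree = search.best_estimator_  # refit on all data with the best settings
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```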

3. Check model performance

Accuracy, Precision, Recall rate

Applying code similar to that used for the logistic regression model, we obtain an accuracy of 0.78, a precision of 0.55, and a recall of 0.15.

ROC curve

Applying code similar to that used for the logistic regression model, we obtain the ROC curve as follows:

Random Forest Model

1. Prepare for grid search

2. Search and build the Random Forest model (with Cross-validation)
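For the random forest, the same grid-search pattern applies with a different estimator and grid. This is a sketch assuming scikit-learn; the grid values, fold count and synthetic data are illustrative, not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Small hypothetical grid to keep the search fast.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_forest = search.best_estimator_
importances = best_forest.feature_importances_  # one value per feature, sums to 1
print("Best parameters:", search.best_params_)
print("Feature importances:", importances.round(3))
```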

3. Check model performance

Accuracy, Precision, Recall rate

Applying code similar to that used for the logistic regression model, we obtain an accuracy of 0.89, a precision of 0.84, and a recall of 0.44.

ROC curve

Applying code similar to that used for the logistic regression model, we obtain the ROC curve as follows:

Findings & Recommendations

The performance of the three models is listed below. The logistic regression model clearly predicts employee attrition better than the other two.

The analysis is summarized below:

  1. After experimenting with Logistic Regression, Decision Tree and Random Forest models, a model with an accuracy of 0.93 was developed to predict attrition.
  2. “OverTime” and “BusinessTravel” are the two most important features affecting employee attrition. Employees who work overtime and employees who travel frequently for business tend to churn. It’s recommended to allocate more resources to address these two aspects.
  3. Other factors include Job Role, Marital Status, Years Since Last Promotion and Years in Current Role.
  • Job Role: Research Directors are less likely to leave the company, whereas Sales Representatives and Laboratory Technicians are more likely to leave.
  • Marital Status: Single employees are more likely to churn.
  • Years Since Last Promotion and Years in Current Role: Recently promoted employees are more likely to churn, while employees who have stayed in their current role for a long time are less likely to churn.
  4. It’s recommended to collect additional information, such as “whether the employee is a manager”, “benefits received” and “location”, to include in the model in the future.


Yenlingktw

With five years of experience in people analytics, I am studying Business Analytics at UCLA Anderson School of Management.