
Machine Learning using Decision Trees and Random Forests in Python with Code

One of the simplest, yet most useful, Machine Learning algorithms is the decision tree. As the name implies, decision trees are trees of decisions.

Decision Trees

A Decision Tree is a tree in which the nodes represent decisions (a square box), random transitions (a circular box), or terminal nodes, and the edges or branches are binary (yes/no, true/false), representing possible paths from one node to another.

Let’s start off with a thought experiment to give some motivation for using a decision tree. Imagine that you play tennis every Saturday and you invite a friend to come with you. Sometimes the friend shows up, sometimes he doesn’t. For him, it depends on a variety of factors such as weather, temperature and humidity. You start keeping track of these features and of whether or not he showed up to play. Based on this data, you build a table of your observations.
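An illustrative version of such a table (the rows below are made up for the sake of the example; the original post shows its table as an image):

Outlook    Temperature  Humidity  Windy  Played?
sunny      hot          high      no     no
sunny      mild         normal    yes    yes
overcast   hot          high      no     yes
rain       mild         high      yes    no
rain       cool         normal    no     yes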

Based on this data, you want to predict whether your friend will turn up to play or not. An intuitive way to do this is through a decision tree.

In this tree, we have:

Nodes: split on the value of a certain attribute. Here we have the Outlook, Humidity and Windy nodes.

Edges: the outcome of a split, leading to the next node.

Root: the node that performs the first split. In our case, the Outlook node.

Leaves: the terminal nodes that predict the outcome. The colored nodes, i.e., the Yes and No nodes, are the leaves.
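To make this concrete, the tree described above reads as nothing more than nested conditionals. A rough Python sketch (the branch structure follows the description; the exact splits a tree learned from real data could differ):

def will_friend_play(outlook, humidity, windy):
    # Root node: split on Outlook first
    if outlook == 'overcast':
        return 'yes'  # leaf
    elif outlook == 'sunny':
        # Humidity node
        return 'no' if humidity == 'high' else 'yes'
    else:  # rain
        # Windy node
        return 'no' if windy else 'yes'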

Example

Now that we have seen what a decision tree is, let us code one. Today we will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people who show a profile with a high probability of paying you back. We will try to create a model that helps predict this.

Lending Club had a very interesting year in 2016, so let’s check out some of their data and keep that context in mind. This data is from before they even went public.

We will use lending data from 2007–2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here.

Import Libraries

Import the usual libraries for pandas and plotting. You can import sklearn later on.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Get the Data

Use pandas to read loan_data.csv as a dataframe called loans.

loans = pd.read_csv('loan_data.csv')

Check out the info() method on loans.

loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB

Here is what the columns represent:

  • credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
  • purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
  • int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
  • installment: The monthly installments owed by the borrower if the loan is funded.
  • log.annual.inc: The natural log of the self-reported annual income of the borrower.
  • dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  • fico: The FICO credit score of the borrower.
  • days.with.cr.line: The number of days the borrower has had a credit line.
  • revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
  • revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
  • inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
  • delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  • pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
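A note on log.annual.inc: since it is a natural log, the raw income scale can be recovered with np.exp if you ever need it, e.g.:

# back-transform the log income to dollars
np.exp(loans['log.annual.inc']).describe()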
loans.head()
The dataframe is wide, so head() only displays a subset of the columns at once.

Exploratory Data Analysis

Let’s do some data visualization! We’ll use seaborn and pandas’ built-in plotting capabilities, but feel free to use whatever library you want. Don’t worry about the colors matching; just focus on getting the main idea of each plot.

Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.

plt.figure(figsize=(10,6))
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5, color='blue',
                                              bins=30, label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5, color='red',
                                              bins=30, label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')

Create a similar figure, except this time select by the not.fully.paid column.

plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5, color='blue',
                                               bins=30, label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5, color='red',
                                               bins=30, label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')

Setting up the Data

Let’s get ready to set up our data for our Random Forest Classification Model!

Check loans.info() again.

loans.info()
The output is the same as above: every column is numeric except purpose, which is stored as an object.

Categorical Features

Notice that the purpose column is categorical.

That means we need to transform it using dummy variables so sklearn will be able to understand it. Let’s do this in one clean step using pd.get_dummies.

Here is a way of dealing with such columns that can be expanded to multiple categorical features if necessary.

Create a list of 1 element containing the string ‘purpose’. Call this list cat_feats.

cat_feats = ['purpose']

Now use pd.get_dummies(loans,columns=cat_feats,drop_first=True) to create a new, larger dataframe that has dummy-variable columns in place of purpose. Set this dataframe as final_data.

final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)
final_data.head()
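It’s worth confirming what get_dummies did. With drop_first=True, the single purpose column is replaced by one 0/1 indicator column per category, minus the first category (alphabetically), which becomes the implicit baseline:

final_data.columns
# Expect indicator columns like 'purpose_credit_card',
# 'purpose_debt_consolidation', ..., 'purpose_small_business';
# 'all_other' is the dropped baseline category.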

Train Test Split

Now it’s time to split our data into a training set and a testing set!

Use sklearn to split your data into a training set and a testing set as we’ve done in the past.

from sklearn.model_selection import train_test_split
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
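A quick shape check confirms the 70/30 split; with 9,578 rows total we expect roughly 6,704 training rows and 2,874 test rows (the latter matches the support totals in the classification reports below):

print(X_train.shape, X_test.shape)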

Training a Decision Tree Model

Let’s start by training a single decision tree first!

Import DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

Create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
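Since no max_depth was set, the tree keeps splitting until its leaves are (almost) pure, which usually means heavy overfitting. In scikit-learn 0.21+ you can check how deep the fitted tree actually grew:

print(dtree.get_depth())     # depth of the fitted tree
print(dtree.get_n_leaves())  # number of leaf nodes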

Predictions and Evaluation of Decision Tree

Create predictions from the test set and create a classification report and a confusion matrix.

predictions = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.85      0.82      0.84      2431
          1       0.18      0.22      0.20       443

avg / total       0.75      0.73      0.74      2874
print(confusion_matrix(y_test,predictions))
[[1991  440]
 [ 345   98]]

Training the Random Forest model

Now it’s time to train our model!

Create an instance of the RandomForestClassifier class and fit it to our training data from the previous step.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=600, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Predictions and Evaluation

Let’s predict on the X_test data and evaluate the results against the y_test values.

Predict the class of not.fully.paid for the X_test data.

predictions = rfc.predict(X_test)

Now create a classification report from the results.

from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.85      1.00      0.92      2431
          1       0.61      0.02      0.05       443

avg / total       0.81      0.85      0.78      2874

Show the Confusion Matrix for the predictions.

print(confusion_matrix(y_test,predictions))
[[2424    7]
 [ 432   11]]

Conclusion

Which performed better: the random forest or the decision tree?

This is the most important question, and the answer, as always, depends on which metric you are trying to optimize for. Compare the recall for each class across the two models: the random forest scores higher on overall accuracy and precision, but its recall on the minority class (not.fully.paid = 1) is only 0.02, whereas the single decision tree at least recovers 0.22 of those cases. Neither model did very well here; more feature engineering is needed to improve the results.
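One place to start that feature engineering is the forest’s impurity-based feature importances, which give a rough ranking of the columns the trees split on most. A minimal sketch using the rfc model fitted above:

importances = pd.Series(rfc.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))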

Pros

  • Easy to understand and interpret. At each node, we can see exactly what decision the model is making. In practice, this lets us understand where our accuracies and errors come from, what type of data the model would do well with, and how the output is influenced by the values of the features. Scikit-learn’s visualisation tool is a fantastic option for visualising and understanding decision trees (see the sketch after this list).
  • Require very little data preparation. Many ML models may require heavy data pre-processing such as normalization and may require complex regularisation schemes. Decision trees on the other hand work quite well out of the box after tweaking a few of the parameters.
  • The cost of using the tree for inference is logarithmic in the number of data points used to train the tree. That’s a huge plus since it means that having more data won’t necessarily make a huge dent in our inference speed.
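As mentioned in the first point above, scikit-learn can draw a fitted tree directly. A minimal sketch, assuming scikit-learn 0.21+ (which added plot_tree) and a deliberately shallow tree so the plot stays readable:

from sklearn.tree import plot_tree

# Fit a shallow tree purely for visualisation
small_tree = DecisionTreeClassifier(max_depth=3)
small_tree.fit(X_train, y_train)
plt.figure(figsize=(16,8))
plot_tree(small_tree, feature_names=list(X.columns),
          class_names=['fully paid', 'not fully paid'], filled=True)
plt.show()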

Cons

  • Overfitting is quite common with decision trees simply due to the nature of their training. It’s often recommended to perform some type of dimensionality reduction, such as PCA, so that the tree doesn’t have to learn splits on so many features.
  • For similar reasons, decision trees are also vulnerable to becoming biased towards the classes that have a majority in the dataset. It’s always a good idea to do some kind of class balancing, such as class weights, sampling, or a specialised loss function (a minimal example follows this list).
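As a cheap version of that balancing, both classifiers used in this post accept a class_weight parameter that up-weights the minority class during training. A sketch, not a guaranteed fix; recall on the minority class typically improves at the cost of precision:

# Re-train the forest with balanced class weights
rfc_balanced = RandomForestClassifier(n_estimators=600, class_weight='balanced')
rfc_balanced.fit(X_train, y_train)
print(classification_report(y_test, rfc_balanced.predict(X_test)))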
