The Great Decider.
The Agenda
Today, let’s focus specifically on the classification portion of the CART family of algorithms, using scikit-learn’s DecisionTreeClassifier for machine learning solutions.
But first, let’s take a broader look at the concept space and relevant data structures.
Classification and Regression Trees (CART) Algorithms
CART algorithms are supervised learning models used for problems involving classification and regression.
Supervised Learning
Supervised learning is an approach for engineering predictive models from known labeled data, meaning the dataset already contains the targets appropriately classed. Our goal is to allow the algorithm to build a model from this known data, to predict future labels (outputs), based on our features (inputs) when introduced to a novel dataset.
Classification Example Problems
1) Identifying fake profiles.
2) Classifying a species.
3) Predicting what sport someone plays.
Classification Tree - Overview
- The objective is to infer class labels for previously unseen data.
- Algorithmically, it is a recursive (divide and conquer) and greedy solution: at each node it picks the split that looks best locally.
- It is a sequence of if-else questions about features.
- It captures non-linear relationships between features and labels.
- It is non-parametric: it is built from the observed data and does not assume a normal distribution.
- No feature scaling is required.
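To make the “recursive and greedy” point concrete, here is a minimal sketch of a single greedy step: trying every candidate threshold on one numeric feature and keeping the one with the lowest weighted Gini impurity. The helper names gini_of and best_threshold and the toy data are assumptions made for illustration; this is not scikit-learn’s implementation.

```python
from collections import Counter


def gini_of(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_threshold(values, labels):
    """One greedy step: return (threshold, weighted_gini) of the best split.

    Each candidate sends rows with value <= threshold left and the rest right.
    """
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        if not left or not right:
            continue  # a split must leave data on both sides
        weighted = (len(left) * gini_of(left) + len(right) * gini_of(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy feature: small values belong to class 0, large ones to class 1.
toy_values = [1, 2, 3, 10, 11, 12]
toy_labels = [0, 0, 0, 1, 1, 1]
split = best_threshold(toy_values, toy_labels)
print(split)
```

A full tree builder would apply the same step recursively to the left and right partitions until a stopping rule is met.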
Classification Model - Approach
Let’s briefly set a mental framework for approaching the creation of a classification model.
Like all things data, we’ll begin with a dataset.
Next, we’ll need to break it into training and test sets.
We can then use the training dataset with a learning algorithm (in our case, scikit-learn’s DecisionTreeClassifier) to create a model via induction, which is then applied to make predictions on the test set through deduction.
Here’s a general schematic view of the concept.
Source: https://www-users.cs.umn.edu/~kumar001/dmbook/dmslides/chap4_basic_classification.pdf
Model Concerns
As powerful as the technique can be, it needs a strong foundation and human-level quality control.
Here are some points to consider as you prepare for your ML task:
A) Are you selecting the right problem to test?
B) Are you capable of supplying sufficient data?
C) Are you providing your model clean data?
D) Are you able to prevent algorithmic biases and confounding factors?
Performance Metrics
It should be noted at the outset that simple decision trees are highly prone to overfitting, which leads to models that generalize poorly. One way to mitigate this risk is to prune the tree, i.e., remove the parts of the tree that contribute little or no predictive power to the model. We will discuss methods of pruning shortly.
A cautious interpretation of seemingly powerful results is encouraged.
Goal: Achieve the highest possible accuracy, while retaining the lowest error rate.
Accuracy Score
accuracy_score(y_test, y_pred)
The accuracy score is the number of correctly predicted data points divided by the total number of predicted data points.
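As a minimal sketch of that ratio, the same number can be computed by hand. The helper name manual_accuracy and the toy labels are assumptions for illustration and are not part of scikit-learn.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions match the truth.
acc_true = [1, 0, 1, 1, 0]
acc_pred = [1, 0, 0, 1, 0]
acc_demo = manual_accuracy(acc_true, acc_pred)
print(acc_demo)
```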
Mean Squared Error
mean_squared_error(y_test, y_pred)
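The mean squared error is the average squared difference between the predicted values and the actual values. It is primarily a regression metric; when applied to class labels, its value depends on how the labels are encoded as numbers.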
Mean Absolute Error
mean_absolute_error(y_test, y_pred)
The mean absolute error reflects the average magnitude of the difference between the predicted and actual values.
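The two error metrics can be sketched by hand as follows. The function names and the toy labels are assumptions for illustration. Note that with class labels the result depends on the numeric encoding: here, confusing class 0 with class 2 costs more than confusing class 1 with class 2, which may or may not be meaningful for a given problem.

```python
def manual_mae(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def manual_mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical integer-encoded class labels.
err_true = [0, 1, 2, 2]
err_pred = [0, 2, 2, 0]
mae_demo = manual_mae(err_true, err_pred)  # differences: 0, 1, 0, 2
mse_demo = manual_mse(err_true, err_pred)  # squared differences: 0, 1, 0, 4
print(mae_demo, mse_demo)
```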
Score
score(features, target)
Returns the mean accuracy on the given test data and labels. It is called as a method on the fitted classifier.
Confusion Matrix
confusion_matrix(y_test, y_pred)
Summarizes the error rate in terms of true/false positives/negatives.
While the other metrics outlined above return a single number to interpret, the confusion matrix needs a primer on interpretation.
Source: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-S4-S2
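Calling confusion_matrix() on binary labels 0 and 1 will yield a result in the following form, where rows are the true labels and columns are the predicted labels, both in sorted order:
[TN, FP]
[FN, TP]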
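As a small sketch of how the four cells can be read back out, scikit-learn’s confusion_matrix orders rows by true label and columns by predicted label, so for binary labels 0 and 1 the flattened matrix unpacks as tn, fp, fn, tp. The toy labels below are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 0 is the negative class, 1 is the positive class.
cm_true = [0, 0, 0, 1, 1, 1, 1]
cm_pred = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(cm_true, cm_pred)
# Rows are true labels, columns are predicted labels, both in sorted order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```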
Tree Data Structure - Fundamentals
Reference Image
Source: Data Structure Tree Diagram - artbattlesu.com
Unlike common linear data structures, such as lists and arrays, a tree is a non-linear, hierarchical method of storing and modeling data. Visually, you can picture an evolutionary tree, the document object model (DOM) of an HTML page, or even a chart of a company hierarchy.
In contrast to a biological tree from the kingdom Plantae, the data structure tree has a simple anatomy:
A tree consists of nodes and edges.
There are ‘specialized’ node types with unique names that represent their place in the hierarchy of the structure.
The root node is the single point of origin for the rest of the tree. Branches extend from nodes and link to other nodes, and each link is referred to as an edge.
The node that receives an edge is said to be a child node, and the originating node is its parent.
In a binary tree, a single node may have one, two, or no children. If a node has no children, but does have a parent, it is called a leaf. Some will also refer to internal nodes, which have at least one child.
Finally, sibling nodes are nodes that share a parent.
Beyond the core anatomy, the tree has unique metrics to be explored: depth and height.
Depth describes the position of an individual node relative to the root, meaning how many edges lie between that node and the root node. You can also think of it as the level of the node, from 0 at the root to m at the deepest leaf.
The height refers to the number of edges on the longest path from the root down to a leaf, similar to finding the longest carbon chain back in organic chemistry to determine the IUPAC name of a compound.
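The anatomy and the two metrics can be sketched with a small hand-built tree. The Node class, the helper functions, and the node names are all assumptions made for illustration.

```python
class Node:
    """A tree node that knows only its name and its children."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def height(node):
    """Number of edges on the longest path from this node down to a leaf."""
    if not node.children:
        return 0  # a leaf has height 0
    return 1 + max(height(child) for child in node.children)


def depth(root, target_name, current=0):
    """Number of edges from the root to the named node, or None if absent."""
    if root.name == target_name:
        return current
    for child in root.children:
        found = depth(child, target_name, current + 1)
        if found is not None:
            return found
    return None


# root -> a -> d -> e is the longest path (3 edges); a and b are siblings.
demo_tree = Node("root", [
    Node("a", [Node("c"), Node("d", [Node("e")])]),
    Node("b"),
])

print(height(demo_tree))      # height of the whole tree
print(depth(demo_tree, "e"))  # depth of the deepest leaf
print(depth(demo_tree, "b"))  # depth of a direct child of the root
```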
Decision Tree
Source: Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan
A decision tree allows us to run a series of if/elif/else tests on a data point (a record or observation) that has many attributes. Each node of the tree represents a condition on an attribute, and the edges are the outcomes of that test, constrained to a binary decision. An observation travels through each stage, being assayed and partitioned, until it reaches a leaf node. The leaf contains the final proposed classification.
For example, if we had a dataset with rich features about people, we could ask questions about gender (M/F/O), weight (above/below a value), height (above/below a value), and activities (sets of choices) to make a prediction about a person and their behavior.
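Written out by hand, such a tree is just nested conditionals. The sketch below is purely illustrative: the feature names, thresholds, and sport labels are made up and were not learned from any data.

```python
def predict_sport(person):
    """Walk a hand-written decision tree and return the label at the leaf."""
    if person["height_cm"] >= 190:   # root node test
        return "basketball"          # leaf
    if person["weight_kg"] >= 90:    # internal node test
        return "rugby"               # leaf
    if person["weekly_runs"] >= 3:   # internal node test
        return "running"             # leaf
    return "other"                   # leaf reached when every test fails


tall = {"height_cm": 195, "weight_kg": 80, "weekly_runs": 0}
runner = {"height_cm": 172, "weight_kg": 64, "weekly_runs": 5}
print(predict_sport(tall))
print(predict_sport(runner))
```

A learning algorithm such as CART chooses the features and thresholds for these tests from the training data instead of relying on hand-picked values.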
Classification Plot - Simple
Decision Region | The space in the plot where instances are assigned to a particular class (blue or red in the diagram). |
Decision Boundary | The point of transition from one decision region to another, i.e. from one class/label to another (the diagonal black line in the diagram). |
Source: Vipin Kumar CSci 8980 Fall 2002 13 Oblique Decision Trees
Feature Selection
Feature selection identifies which features/columns carry the most predictive power.
In our case, the CART algorithm will do the feature selection for us through the Gini index or entropy, which measure how purely your data is partitioned on its journey to a final leaf node classification.
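Both purity measures are short formulas over the class proportions in a node. The sketch below computes them by hand; the function names and toy labels are assumptions for illustration. In scikit-learn the measure is selected through the criterion parameter of DecisionTreeClassifier ("gini" or "entropy").

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """1 - sum(p_k^2): 0.0 for a pure node, 0.5 for a 50/50 binary node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def entropy_bits(labels):
    """-sum(p_k * log2(p_k)): 0 for a pure node, 1.0 for a 50/50 binary node."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


pure_node = ["live", "live", "live", "live"]
mixed_node = ["live", "live", "die", "die"]

print(gini_impurity(pure_node), gini_impurity(mixed_node))
print(entropy_bits(mixed_node))
```

A split is considered good when the child nodes it produces are purer (lower impurity) than the parent node.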
Pruning
As we discussed earlier, decision trees are prone to overfitting, and pruning is one way to mitigate it. Since each node is a test and each branch is a result of that test, we can prune unproductive branches that contribute to overfitting. Removing them helps the model generalize.
Pre-Pruning
One strategy for pruning is known as pre-pruning. This method ends the series of tests early by stopping the partitioning process. When partitioning stops, the node that would otherwise have been split becomes a leaf node, and a class is declared.
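In scikit-learn, pre-pruning can be expressed through stopping parameters on the classifier. The sketch below uses max_depth and min_samples_leaf on the bundled iris dataset; the dataset and the specific parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# No stopping rules: the tree keeps splitting until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_iris, y_iris)

# Pre-pruned: stop at depth 2 and require at least 5 samples in every leaf.
pre_pruned = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=5, random_state=1
).fit(X_iris, y_iris)

print("full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```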
Post-Pruning
Post-pruning is a different approach. Where pre-pruning occurs during creation of the model, post-pruning begins after the tree is fully grown and works by removing branches. Sets of node removals are tested throughout the branches to examine the effect on the error rate. If removing particular nodes increases the error rate, pruning does not occur at those positions. The final tree is the version with the lowest expected error rate.
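scikit-learn does not implement the error-rate procedure described above directly; the post-pruning it offers is minimal cost-complexity pruning, controlled by the ccp_alpha parameter. The sketch below, which assumes the bundled iris dataset and a 70/30 split purely for illustration, fits one tree per candidate alpha and records how the tree shrinks.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_cc, y_cc = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cc, y_cc, test_size=0.3, stratify=y_cc, random_state=1
)

# Candidate alphas: each one corresponds to pruning away one more weak subtree.
pruning_path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)

results = []
for alpha in pruning_path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_tr, y_tr)
    results.append((alpha, clf.get_n_leaves(), clf.score(X_te, y_te)))

for alpha, leaves, acc in results:
    print("alpha={:.4f}  leaves={}  held-out accuracy={:.3f}".format(alpha, leaves, acc))
```

In practice the alpha would usually be chosen with a separate validation set or cross-validation, so that the test set remains untouched for the final evaluation.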
Decision Tree Classification: Steps to Build and Run
1 Imports
2 Load Data
3 Test and Train Data
4 Instantiate a Decision Tree Classifier
5 Fit data
6 Predict
7 Check Performance Metrics
1 - Import Modules/Libraries [scikit-learn]
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
2 - Load Data
First, we’re going to want to load a dataset, and create two sets, X and y, which represent our features and our desired label.
# Load the iris dataset; X contains predictors, y holds the classifications
from sklearn.datasets import load_iris
dataset = load_iris()
X, y = dataset.data, dataset.target
features = dataset.feature_names
3 - Split Dataset into Test and Train sets
Now we can partition our data into train and test sets. A typical balance is 80/20 or 70/30, train vs. test.
The results of the split will be stored in X_train, X_test, y_train, and y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
4 - Instantiate a DecisionTreeClassifier
We can instantiate our DecisionTreeClassifier object with a max_depth and a random_state.
random_state gives us a way of ensuring reproducibility, while max_depth is a hyperparameter that controls the complexity of our tree and must be used with cautious awareness. If max_depth is set too high, we risk overfitting the data, while if it’s too low, we risk underfitting.
dt = DecisionTreeClassifier(max_depth=6, random_state=1)
5 - Fit The Model
We fit our model by calling .fit() and passing it the X_train and y_train sets we created previously.
dt.fit(X_train, y_train)
6 - Predict Test Set Labels
We can now test our model by applying it to our X_test variable, and we’ll store the predictions as y_pred.
y_pred = dt.predict(X_test)
7 - Check Performance Metrics
We want to check the accuracy of our model, so let’s run through some performance metrics.
Accuracy Score
def AccuracyCheck(model, X_test, y_pred):
    # y_test is read from the surrounding script; model and X_test are kept for the call signature
    acc = accuracy_score(y_test, y_pred)
    print('Default Accuracy: {}'.format(round(acc, 3)))
Confusion Matrix
def ConfusionMatx(y_test, y_pred):
    # Requires: from sklearn.metrics import confusion_matrix
    print('Confusion Matrix: \n{}'.format(confusion_matrix(y_test, y_pred)))
Mean Absolute Error
def MeanAbsErr(y_test, y_pred):
    # Requires: from sklearn import metrics
    mean_err = metrics.mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error: {}'.format(round(mean_err, 3)))
Mean Squared Error
def MeanSqErr(y_test, y_pred):
    SqErr = metrics.mean_squared_error(y_test, y_pred)
    print('Mean Squared Error: {}'.format(round(SqErr, 3)))
Score
def DTCScore(X, y, dtc):
    # Mean accuracy of the fitted classifier on the given features and labels
    score = dtc.score(X, y, sample_weight=None)
    print('Score: {}'.format(round(score, 3)))
Summary Report
Thanks to the power of Python, we can run all of the tests in one go via scripting:
def MetricReport(X, y, y_test, y_pred, dtc):
    # X and y are the features and labels to score on (typically X_test and y_test)
    print("Metric Summaries")
    print("-" * 16)
    AccuracyCheck(dtc, X, y_pred)
    MeanAbsErr(y_test, y_pred)
    MeanSqErr(y_test, y_pred)
    DTCScore(X, y, dtc)
    ConfusionMatx(y_test, y_pred)
    print("-" * 16)
Hepatitis: A Case Study
We’ll follow the procedures above, with a few twists. We’re going to add a way to visualize our decision tree graph, as well as apply a real dataset using the tools and approaches outlined.
Imports
import pandas as pd
import graphviz
from sklearn import metrics, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
Dataset
https://archive.ics.uci.edu/ml/datasets/Hepatitis
path = '...hepatitis.csv'
col_names = ['Class', 'Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise','Anorexia', 'Liver_Big', 'Liver_Firm',
'Spleen_Palp', 'Spiders', 'Ascites', 'Varices', 'Bilirubin', 'Alk_Phosph', 'SGOT', 'Albumin', 'Protime',
'Histology' ]
csv = pd.read_csv(path, na_values=["?"], names=col_names)
df = pd.DataFrame(csv)
Survey The Data
def minorEDA(df):
    """
    Generates a preliminary EDA of our file
    args: df - DataFrame of our CSV file
    returns: None
    """
    lineBreak = '------------------'
    # Check shape
    print(lineBreak * 3)
    print("Shape:")
    print(df.shape)
    print(lineBreak * 3)
    # Check feature names
    print("Column Names")
    print(df.columns)
    print(lineBreak * 3)
    # Check types, missing, memory
    print("Data Types, Missing Data, Memory")
    df.info()  # info() prints its own report and returns None
    print(lineBreak * 3)
Check Integrity
def check_integrity(input_df):
    """ Check if values are missing, generate list of cols with NaN values
    Args:
        input_df - DataFrame
    Returns:
        List of column names containing missing data (empty if none)
    """
    if input_df.isnull().values.any():
        print("\nDetected Missing Data\nAffected Columns:")
        # Columns with at least one missing value
        missing_list = [col for col in input_df.columns if input_df[col].isnull().any()]
        # Number of missing values per column
        missing_counts = input_df.isnull().sum()
        print(missing_list)
        print("\nCounts")
        print(missing_counts)
        print("\n")
        return missing_list
    print("\nNo Missing Data Was Detected.")
    return []
Unfortunately, this set contains missing data points, and if we don’t clean them up, scikit-learn may raise an error when we fit the model. Calling df.dropna(inplace=True) in the __main__ portion of our program will allow us to continue, but our model will ultimately be weaker because every row with a missing input is discarded.
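The cost of dropping rows can be seen on a tiny frame. The sketch below uses a few column names from the dataset, but the values are invented for illustration and are not taken from the hepatitis file.

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN marks a missing measurement.
toy = pd.DataFrame({
    "Age": [30, 50, 78, 31],
    "Bilirubin": [1.0, np.nan, 0.7, 0.9],
    "Protime": [np.nan, np.nan, 85.0, 46.0],
})

missing_per_column = toy.isnull().sum()  # count of NaN values in each column
cleaned = toy.dropna()                   # keeps only rows with no NaN at all

print(missing_per_column)
print("rows before:", len(toy), "rows after:", len(cleaned))
```

Here two of the four rows are lost even though only three individual values were missing, which is why the model ends up with less data to learn from.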
Set Label Target
def set_target(dataframe, target):
    """
    :param dataframe: Full dataset
    :param target: Name of classification column
    :return x: Predictors dataset
    :return y: Classification dataset
    """
    x = dataframe.drop(target, axis=1)
    y = dataframe[target]
    return x, y
Decision Tree
def DecisionTree():
    # Build Decision Tree Classifier
    dtc = DecisionTreeClassifier(max_depth=6, random_state=2)
    return dtc
Test, Train
def TestTrain(X, y):
    """
    :param X: Predictors
    :param y: Classification
    :return: X & y train/test data
    """
    # Split into training and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
    return X_train, X_test, y_train, y_test
Fit
def FitData(DTC, X_train, y_train):
    # Fit training data
    return DTC.fit(X_train, y_train)
Predict
def Predict(dtc, test_x):
    y_pred = dtc.predict(test_x)
    return y_pred
Check Accuracy, Metrics, Generate Report
def AccuracyCheck(model, X_test, y_pred):
    # Check accuracy score; y_test is read from the surrounding script
    acc = accuracy_score(y_test, y_pred)
    print('Default Accuracy: {}'.format(round(acc, 3)))

def ConfusionMatx(y_test, y_pred):
    # Requires: from sklearn.metrics import confusion_matrix
    print('Confusion Matrix: \n{}'.format(confusion_matrix(y_test, y_pred)))

def MeanAbsErr(y_test, y_pred):
    # Requires: from sklearn import metrics
    mean_err = metrics.mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error: {}'.format(round(mean_err, 3)))

def MeanSqErr(y_test, y_pred):
    SqErr = metrics.mean_squared_error(y_test, y_pred)
    print('Mean Squared Error: {}'.format(round(SqErr, 3)))

def DTCScore(X, y, dtc):
    score = dtc.score(X, y, sample_weight=None)
    print('Score: {}'.format(round(score, 3)))

def MetricReport(X, y, y_test, y_pred, dtc):
    print("Metric Summaries")
    print("-" * 16)
    ConfusionMatx(y_test, y_pred)
    MeanAbsErr(y_test, y_pred)
    MeanSqErr(y_test, y_pred)
    DTCScore(X, y, dtc)
    print("-" * 16)
Visualize Tree Graph
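def tree_viz(dtc, df, col_names):
    # Use only the columns the model was trained on (everything except the target)
    feature_n = [name for name in col_names if name != 'Class']
    # class_names must be a list with one label per class, in ascending order
    class_n = [str(label) for label in dtc.classes_]
    dot = tree.export_graphviz(dtc, out_file=None, feature_names=feature_n, class_names=class_n,
                               filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot)
    graph.format = 'png'
    graph.render('Hep', view=True)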
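If Graphviz is not available, a fitted tree can also be inspected as plain text with scikit-learn’s export_text. The sketch below is an aside that assumes the bundled iris dataset and a shallow tree, chosen only to keep the output short; it is not part of the hepatitis pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
shallow_tree.fit(iris.data, iris.target)

# Each indented line is a test on a feature; "class:" lines are leaf nodes.
rules = export_text(shallow_tree, feature_names=list(iris.feature_names))
print(rules)
```

The printed rules read as the same sequence of if-else questions that the rendered graph shows visually.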
Run
path = 'C:\\Users\\ajh20\\Desktop\\hepatitis.csv'
col_names = ['Class', 'Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise','Anorexia', 'Liver_Big', 'Liver_Firm',
'Spleen_Palp', 'Spiders', 'Ascites', 'Varices', 'Bilirubin', 'Alk_Phosph', 'SGOT', 'Albumin', 'Protime',
'Histology' ]
csv = pd.read_csv(path, na_values=["?"], names=col_names)
df = pd.DataFrame(csv)
minorEDA(df)
check_integrity(df)
df.dropna(inplace=True)
X, y = set_target(df, 'Class')
dtc = DecisionTree()
X_train, X_test, y_train, y_test = TestTrain(X, y)
model_test = FitData(dtc, X_train, y_train)
y_pred = Predict(dtc, X_test)
AccuracyCheck(model_test, X_test, y_pred)
tree_viz(dtc, df, col_names)
All-In-One
import pandas as pd
import graphviz
from sklearn import metrics, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
def set_target(dataframe, target):
    """
    :param dataframe: Full dataset
    :param target: Name of classification column
    :return x: Predictors dataset
    :return y: Classification dataset
    """
    x = dataframe.drop(target, axis=1)
    y = dataframe[target]
    return x, y

def TestTrain(X, y):
    """
    :param X: Predictors
    :param y: Classification
    :return: X & y train/test data
    """
    # Split into training and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
    return X_train, X_test, y_train, y_test

def DecisionTree():
    # Build Decision Tree Classifier
    dtc = DecisionTreeClassifier(max_depth=6, random_state=2)
    return dtc

def FitData(DTC, X_train, y_train):
    # Fit training data
    return DTC.fit(X_train, y_train)

def Predict(dtc, test_x):
    y_pred = dtc.predict(test_x)
    return y_pred

def AccuracyCheck(model, X_test, y_pred):
    # Check accuracy score; y_test is read from the surrounding script
    acc = accuracy_score(y_test, y_pred)
    print('Default Accuracy: {}'.format(round(acc, 3)))

def ConfusionMatx(y_test, y_pred):
    # Requires: from sklearn.metrics import confusion_matrix
    print('Confusion Matrix: \n{}'.format(confusion_matrix(y_test, y_pred)))

def MeanAbsErr(y_test, y_pred):
    # Requires: from sklearn import metrics
    mean_err = metrics.mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error: {}'.format(round(mean_err, 3)))

def MeanSqErr(y_test, y_pred):
    SqErr = metrics.mean_squared_error(y_test, y_pred)
    print('Mean Squared Error: {}'.format(round(SqErr, 3)))

def DTCScore(X, y, dtc):
    score = dtc.score(X, y, sample_weight=None)
    print('Score: {}'.format(round(score, 3)))

def MetricReport(X, y, y_test, y_pred, dtc):
    print("Metric Summaries")
    print("-" * 16)
    ConfusionMatx(y_test, y_pred)
    MeanAbsErr(y_test, y_pred)
    MeanSqErr(y_test, y_pred)
    DTCScore(X, y, dtc)
    print("-" * 16)

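def tree_viz(dtc, df, col_names):
    # Use only the columns the model was trained on (everything except the target)
    feature_n = [name for name in col_names if name != 'Class']
    # class_names must be a list with one label per class, in ascending order
    class_n = [str(label) for label in dtc.classes_]
    dot = tree.export_graphviz(dtc, out_file=None, feature_names=feature_n, class_names=class_n,
                               filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot)
    graph.format = 'png'
    graph.render('Hep', view=True)
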
def minorEDA(df):
    """
    Generates a preliminary EDA of our file
    args: df - DataFrame of our CSV file
    returns: None
    """
    lineBreak = '------------------'
    # Check shape
    print(lineBreak * 3)
    print("Shape:")
    print(df.shape)
    print(lineBreak * 3)
    # Check feature names
    print("Column Names")
    print(df.columns)
    print(lineBreak * 3)
    # Check types, missing, memory
    print("Data Types, Missing Data, Memory")
    df.info()  # info() prints its own report and returns None
    print(lineBreak * 3)

def check_integrity(input_df):
    """ Check if values are missing, generate list of cols with NaN values
    Args:
        input_df - DataFrame
    Returns:
        List of column names containing missing data (empty if none)
    """
    if input_df.isnull().values.any():
        print("\nDetected Missing Data\nAffected Columns:")
        # Columns with at least one missing value
        missing_list = [col for col in input_df.columns if input_df[col].isnull().any()]
        # Number of missing values per column
        missing_counts = input_df.isnull().sum()
        print(missing_list)
        print("\nCounts")
        print(missing_counts)
        print("\n")
        return missing_list
    print("\nNo Missing Data Was Detected.")
    return []

path = 'C:\\Users\\ajh20\\Desktop\\hepatitis.csv'
col_names = ['Class', 'Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise','Anorexia', 'Liver_Big', 'Liver_Firm',
'Spleen_Palp', 'Spiders', 'Ascites', 'Varices', 'Bilirubin', 'Alk_Phosph', 'SGOT', 'Albumin', 'Protime',
'Histology' ]
csv = pd.read_csv(path, na_values=["?"], names=col_names)
df = pd.DataFrame(csv)
minorEDA(df)
check_integrity(df)
df.dropna(inplace=True)
X, y = set_target(df, 'Class')
dtc = DecisionTree()
X_train, X_test, y_train, y_test = TestTrain(X, y)
model_test = FitData(dtc, X_train, y_train)
y_pred = Predict(dtc, X_test)
AccuracyCheck(model_test, X_test, y_pred)
tree_viz(dtc, df, col_names)