Noli Angeles
The goal of this project is to build several machine learning models on one dataset and see which performs best on out-of-sample data. We will use the Mushroom dataset (which covers mushrooms from the Agaricus and Lepiota families only) from the UC Irvine Machine Learning Repository, available here: https://archive.ics.uci.edu/dataset/73/mushroom. It has 8124 observations and 23 columns (22 features), with the target variable being poisonous. All of the columns are categorical or binary. The models we will build are Naive Bayes, Logistic Regression with Ridge Regularization, and a Classification Tree. To compare them we will use accuracy scores, false negative rates, confusion matrices, and cross-validation scores.
To make it harder for the models to predict correctly, I want to limit which features are included. I decided to keep only predictor variables that describe a color and can be seen immediately. Only 6 features in the dataset describe the color of some part of the mushroom: cap-color, gill-color, stalk-color-above-ring, stalk-color-below-ring, veil-color, and spore-print-color. Since spore-print-color isn't something we can see at first glance, we will drop it as well, leaving 5 color features to predict the target. Why only colors? Maybe in the future we could build a camera that predicts whether a mushroom in the Agaricus and Lepiota families is poisonous by looking only at the colors on its surface.
To use multiple models on a single dataset, I will follow these steps:
1) Import Libraries and Load Data
2) Data Cleaning and Preprocessing
3) Exploratory Data Analysis
4) Split Dataset into Train/Test Sets
5) Create Models
6) Compare Models
First, we need to import all the libraries needed for processing the data, creating the models, and comparing model performance.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split #splitting dataset
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder #encoding variables
from sklearn.naive_bayes import CategoricalNB #naive bayes model
from sklearn.linear_model import LogisticRegression #logistic regression model
from sklearn.tree import DecisionTreeClassifier #classification tree model
from sklearn.neighbors import KNeighborsClassifier #knn model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay #metrics to compare models
from sklearn.model_selection import cross_val_score #cross-validation
import matplotlib.pyplot as plt #plotting graphs
import seaborn as sns #creating plots/graphs
### unzip file
import zipfile
with zipfile.ZipFile('mushroom.zip', 'r') as zip_ref:
    zip_ref.extractall() #unzip into same directory
### write the column names for each column
column_names = ['poisonous','cap-shape','cap-surface','cap-color','bruises','odor','gill-attatchment',
'gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring',
'stalk-surface-below-ring', 'stalk-color-above-ring','stalk-color-below-ring','veil-type',
'veil-color','ring-number','ring-type','spore-print-color','population','habitat']
### load data file with specified column names
df = pd.read_csv("agaricus-lepiota.data", sep=",", header=None, names=column_names)
print(df.head(5)) # print to check if everything looks correct
First we clean the data by checking for any missing or duplicate rows. If there are any I will delete/omit them. Spoiler... There aren't any!
### check for missing data
print(df.isnull().sum())
print()
### check for duplicated rows
dupes = df.duplicated().sum()
print(f'There are {dupes} duplicated rows in this dataset.')
### check the raw values so we can encode later
for col in df.columns:
    print(f"{col}: {df[col].unique()}")
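One caveat: isnull() only detects values that pandas recognises as missing. The UCI documentation for this dataset records missing values with the character ? (in the stalk-root attribute), so they are read in as ordinary strings. The sketch below uses a small made-up frame, not the real data, to show how such placeholders could be counted; the name toy is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    'cap-color':  ['n', 'y', 'w', 'n'],
    'stalk-root': ['e', '?', 'b', '?'],
})

# isnull() reports no missing values because '?' is an ordinary string
print(toy.isnull().sum())

# Count the '?' placeholders per column instead
placeholder_counts = (toy == '?').sum()
print(placeholder_counts)
```

Since stalk-root is not one of the color features kept for this project, this does not affect the models built here.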
### dropping every column that isn't a color visible at first glance
non_color_cols = ['bruises', 'habitat', 'ring-type', 'odor', 'gill-size', 'population',
                  'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-root',
                  'gill-spacing', 'cap-shape', 'ring-number', 'cap-surface', 'gill-attatchment',
                  'veil-type', 'stalk-shape', 'spore-print-color']
df = df.drop(columns=non_color_cols)
### dropping all color features
# df.drop('stalk-color-below-ring', axis=1, inplace=True)
# df.drop('stalk-color-above-ring', axis=1, inplace=True)
# df.drop('gill-color', axis=1, inplace=True)
# df.drop('cap-color', axis=1, inplace=True)
# df.drop('veil-color', axis=1, inplace=True)
for col in df.columns:
    print(f"{col}: {df[col].unique()}")
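The same selection can also be derived from the column names instead of listing every column to drop. This is only a sketch of an alternative, assuming the column names assigned when the data was loaded; color_features and visible_color_features are illustrative names that are not used elsewhere in this notebook.

```python
# Column names as assigned to the data file in this project
column_names = ['poisonous', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
                'gill-attatchment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
                'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
                'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
                'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']

# Every feature whose name mentions a color
color_features = [c for c in column_names if '-color' in c]

# Leave out the one color that cannot be seen at first glance
visible_color_features = [c for c in color_features if c != 'spore-print-color']

print(color_features)
print(visible_color_features)
```

Applied to the full DataFrame, df[['poisonous'] + visible_color_features] would keep the target plus the 5 visible color features.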
Before we go and create our models, I want to get a better understanding of the dataset. We can explore the data a little bit by creating graphs to see if we can uncover any stories or insights.
To make it easier to create multiple count plots, I wrote a helper function along with manual color mappings, so the labels can be changed without renaming values in the original dataset.
### define a function to easily create count plots for desired variables
def plot_countplot(data, feature, target_count, label_mapping=None):
    categories = data[feature].unique()  # fix the category order once so bars and tick labels agree
    plt.figure(figsize=(10, 6))
    # use the DataFrame passed in (the original always plotted the global df)
    sns.countplot(data=data, x=feature, hue=target_count, order=categories)
    plt.title(f"Distribution of Poisonous vs. Edible by {feature.capitalize()}")
    plt.xlabel(f"{feature.capitalize()}")
    plt.ylabel("Count")
    plt.legend(title="Poisonous")
    if label_mapping:
        plt.xticks(ticks=range(len(categories)),
                   labels=[label_mapping[label] for label in categories])
    plt.show()
# manual mapping for color features
cap_color_mapping = {
'n': 'Brown', 'b': 'Buff', 'c': 'Cinnamon', 'g': 'Gray', 'r': 'Green',
'p': 'Pink', 'u': 'Purple', 'e': 'Red', 'w': 'White', 'y': 'Yellow'
}
gill_color_mapping = {
'k': 'Black', 'n': 'Brown', 'b': 'Buff', 'h': 'Chocolate',
'g': 'Gray', 'r': 'Green', 'o': 'Orange', 'p': 'Pink',
'u': 'Purple','e': 'Red', 'w': 'White', 'y': 'Yellow'
}
stalk_color_above_ring_mapping = {
'n': 'Brown', 'b': 'Buff', 'c': 'Cinnamon',
'g': 'Gray', 'o': 'Orange', 'p': 'Pink',
'e': 'Red', 'w': 'White', 'y': 'Yellow'
}
stalk_color_below_ring_mapping = {
'n': 'Brown', 'b': 'Buff', 'c': 'Cinnamon',
'g': 'Gray', 'o': 'Orange', 'p': 'Pink',
'e': 'Red', 'w': 'White', 'y': 'Yellow'
}
veil_color_mapping = {
'n': 'Brown', 'o': 'Orange', 'w': 'White', 'y': 'Yellow'
}
spore_print_color_mapping = {
'k': 'Black', 'n': 'Brown', 'b': 'Buff',
'h': 'Chocolate', 'r': 'Green', 'o': 'Orange',
'u': 'Purple', 'w': 'White', 'y': 'Yellow'
}
### countplot by cap-color
plot_countplot(df,'cap-color','poisonous', cap_color_mapping)
### countplot by gill-color
plot_countplot(df,'gill-color','poisonous', gill_color_mapping)
### countplot by stalk-color-above-ring
plot_countplot(df,'stalk-color-above-ring','poisonous', stalk_color_above_ring_mapping)
### count plot by stalk-color-below-ring
plot_countplot(df,'stalk-color-below-ring','poisonous', stalk_color_below_ring_mapping)
### countplot by veil-color
plot_countplot(df,'veil-color','poisonous', veil_color_mapping)
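Count plots show raw counts; when some colors are rare, the share of poisonous mushrooms within each color can be easier to compare. A minimal sketch using pd.crosstab on a small made-up frame (the values are illustrative, not results from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'cap-color': ['n', 'n', 'n', 'y', 'y', 'w'],
    'poisonous': ['e', 'p', 'e', 'p', 'p', 'e'],
})

# Row-normalised table: each row (color) sums to 1
share = pd.crosstab(toy['cap-color'], toy['poisonous'], normalize='index')
print(share)
```

The same call on df, with any of the five color columns, would give the corresponding proportions for the real data.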
In order to train our models and test their performance, we have to split the dataset. The Naive Bayes classifier needs its predictor variables encoded with OrdinalEncoder, while the other models use OneHotEncoder, so we first split the dataset 80/20 and then encode that same train/test split in two different ways, one for each kind of model.
X = df.drop(['poisonous'], axis=1) #drop poisonous
y = df['poisonous'] #poisonous as the target
# continue with the same steps for train/test split and encoding
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=456)
# check the dimension of the train and test sets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Let's double check that X (train and test) contains only the predictor variables and y (train and test) contains only the poisonous column. Then we can also check whether the proportions of poisonous to edible mushrooms are the same in the train/test split as in the original dataset.
X_train.columns
# check X training data
X_train.head(5)
# check y training data
y_train.head(5)
# check x test data
X_test.head(5)
# check y test data
y_test.head(5)
# proportion of edible:poisonous labels in training set
y_train.value_counts(normalize=True)
# proportion of edible:poisonous labels in test set
y_test.value_counts(normalize=True)
# proportion of original data set before train/test split
df['poisonous'].value_counts(normalize=True)
Since the proportions for the training set and test set are very similar to the original dataset, we can now move on to creating our models!
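Had the proportions differed noticeably, one option would be a stratified split. The sketch below, on made-up imbalanced data, shows the stratify argument of train_test_split; it is an illustration of that option, not a step used in this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 30 edible ('e') and 10 poisonous ('p')
X_toy = pd.DataFrame({'cap-color': ['n', 'w', 'y', 'g'] * 10})
y_toy = pd.Series(['e'] * 30 + ['p'] * 10)

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```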
Before we create our models, we have to make sure that our target variable poisonous is encoded using LabelEncoder.
le = LabelEncoder() # rename for simpler coding
y_train_encoded = le.fit_transform(y_train) # encode target variable for train set
y_test_encoded = le.transform(y_test) # encode target variable for test set
# check to see if target variables for both train/test set are encoded
print(y_train_encoded[:5])
print(y_test_encoded[:5])
Now that we've encoded the target variable, we have to use OrdinalEncoder to encode the predictor variables.
enc = OrdinalEncoder() # rename for simpler coding
X_train_nb = enc.fit_transform(X_train) # encoding predictor variables in train set for Naive Bayes model
X_test_nb = enc.transform(X_test) # encoding predictor variables in test set for Naive Bayes model
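CategoricalNB expects each feature to be coded as integers 0, 1, 2, ..., which is what OrdinalEncoder produces. One thing to watch for, sketched below on made-up data: if a category never appears in the rows the model is fitted on (for example within a single cross-validation fold), the model has no slot for it unless min_categories is set. Sizing min_categories from the encoder's categories_ is one possible safeguard, and it is an assumption of this sketch rather than something the analysis here does.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Made-up single-feature data; 'y' is rare
X_raw = np.array([['n'], ['n'], ['w'], ['w'], ['y']])
y = np.array([0, 1, 0, 1, 1])

enc = OrdinalEncoder()
X_enc = enc.fit_transform(X_raw)  # n -> 0, w -> 1, y -> 2

# Number of categories per feature, as seen by the encoder
n_categories = [len(c) for c in enc.categories_]

# Fit on the first four rows only, where 'y' (code 2) never occurs
model = CategoricalNB(min_categories=n_categories)
model.fit(X_enc[:4], y[:4])

# Predicting the unseen code works because a slot was reserved for it
pred = model.predict(X_enc[4:])
print(n_categories, pred)
```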
We can now train our Naive Bayes model by fitting it to the training data. We'll look at the cross-validation scores on the training data, then evaluate the model's performance on out-of-sample data using accuracy and a confusion matrix.
nb_model = CategoricalNB() # create model object
nb_model.fit(X_train_nb, y_train_encoded) # fit model on training data
nb_cv_score = cross_val_score(nb_model, X_train_nb, y_train_encoded, cv=10, scoring='accuracy')
print("Cross-Validation Accuracy Scores:", nb_cv_score)
print("Mean Cross-Validation Accuracy:", nb_cv_score.mean())
# generate predictions
y_pred_nb = nb_model.predict(X_test_nb)
y_pred_nb[:9]
nb_score = nb_model.score(X_test_nb, y_test_encoded) # get accuracy score
print(f'Naive Bayes Model Accuracy on out-of-sample data: {nb_score}')
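nb_cm_test = confusion_matrix(y_test_encoded, y_pred_nb) # true labels first: rows = true, columns = predicted
disp = ConfusionMatrixDisplay(confusion_matrix=nb_cm_test)
disp.plot(cmap="Blues")
plt.title("Naive Bayes Performance")
plt.show()
# false negative rate: poisonous (1) mushrooms predicted as edible (0), out of all truly poisonous
tn, fp, fn, tp = nb_cm_test.ravel()
nb_fnr = fn / (fn + tp)
nb_fnr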
Since we already split the dataset and encoded our target variable, we only need to encode the predictor variables using OneHotEncoder for the rest of our models.
# one hot encoding for logistic regression and classification tree
oh = OneHotEncoder(sparse_output=False) # scikit-learn >= 1.2; older versions use sparse=False
X_train_oh = oh.fit_transform(X_train)
X_test_oh = oh.transform(X_test)
lr_model = LogisticRegression(penalty='l2') # Ridge is L2 regularization
lr_model.fit(X_train_oh, y_train_encoded) # fit the logistic regression model on training data
# generate predictions
y_pred_lr = lr_model.predict(X_test_oh)
y_pred_lr[:9]
lr_cv_score = cross_val_score(lr_model, X_train_oh, y_train_encoded, cv=10, scoring='accuracy')
print(f"Cross-Validation Accuracy Scores: {lr_cv_score}")
print(f"Mean Cross-Validation Accuracy: {lr_cv_score.mean()}")
lr_score = lr_model.score(X_test_oh, y_test_encoded)
lr_score
# evaluate the training accuracy
y_train_pred = lr_model.predict(X_train_oh)
train_accuracy = accuracy_score(y_train_encoded, y_train_pred)
# evaluate the test accuracy
y_test_pred = lr_model.predict(X_test_oh)
test_accuracy = accuracy_score(y_test_encoded, y_test_pred)
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")
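In scikit-learn's LogisticRegression, the strength of the L2 (ridge) penalty is set through C, the inverse of the regularization strength; the model above leaves it at the default C=1.0. The rough sketch below uses synthetic data unrelated to the mushroom features and relies on the default L2 penalty to show that a smaller C shrinks the coefficients. The names strong and weak are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, unrelated to the mushroom dataset
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C = strong penalty, large C = weak penalty (L2 is the default penalty)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X_syn, y_syn)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X_syn, y_syn)

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(norm_strong, norm_weak)
```

Trying a few values of C with cross_val_score would be one way to check whether the default is a reasonable choice for this data.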
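lr_cm_test = confusion_matrix(y_test_encoded, y_pred_lr) # true labels first: rows = true, columns = predicted
disp = ConfusionMatrixDisplay(confusion_matrix=lr_cm_test)
disp.plot(cmap="Blues")
plt.title("Logistic Regression (Ridge) Performance")
plt.show()
# false negative rate: poisonous (1) mushrooms predicted as edible (0), out of all truly poisonous
tn, fp, fn, tp = lr_cm_test.ravel()
lr_fnr = fn / (fn + tp)
lr_fnr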
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train_oh, y_train_encoded)
# generate predictions
y_pred_tree = tree_model.predict(X_test_oh)
y_pred_tree[:9]
tree_cv_score = cross_val_score(tree_model, X_train_oh, y_train_encoded, cv=10, scoring='accuracy')
print(f"Cross-Validation Accuracy Scores: {tree_cv_score}")
print(f"Mean Cross-Validation Accuracy: {tree_cv_score.mean()}")
tree_train_score = tree_model.score(X_train_oh, y_train_encoded)
tree_train_score
tree_score = tree_model.score(X_test_oh, y_test_encoded)
tree_score
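tree_cm_test = confusion_matrix(y_test_encoded, y_pred_tree) # true labels first: rows = true, columns = predicted
disp = ConfusionMatrixDisplay(confusion_matrix=tree_cm_test)
disp.plot(cmap="Blues")
plt.title("Classification Tree Performance")
plt.show()
# false negative rate: poisonous (1) mushrooms predicted as edible (0), out of all truly poisonous
tn, fp, fn, tp = tree_cm_test.ravel()
tree_fnr = fn / (fn + tp)
tree_fnr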
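One advantage of a classification tree is that its decision rules can be read directly. The sketch below fits a tree on a tiny made-up one-hot matrix and prints the rules with export_text; the feature names cap_white and gill_buff are hypothetical, and random_state is fixed only to make the toy result repeatable.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up one-hot features: column 0 = "cap is white", column 1 = "gill is buff"
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
y_toy = np.array([0, 0, 1, 1, 1, 0])  # 1 = poisonous in this toy example

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Text rendering of the fitted splits
rules = export_text(tree, feature_names=['cap_white', 'gill_buff'])
print(rules)
```

For the real model, the feature names could be taken from the fitted one-hot encoder's get_feature_names_out().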
Now that we have created our models and collected their metrics and scores, let's see how they compare head-to-head! Here we'll compare the mean CV score for each model with its accuracy on the test data. If a model's CV score is vastly different from its test accuracy, the model may be overfitting. We can also look at the test accuracy and false negative rates side by side.
results = {
'Model': ['Naive Bayes', 'Logistic Regression (Ridge)', 'Classification Tree'],
'CV Score': [nb_cv_score.mean(), lr_cv_score.mean(), tree_cv_score.mean()],
'Test Accuracy': [nb_score, lr_score, tree_score],
'False Negative Rate': [nb_fnr, lr_fnr, tree_fnr]
}
contingency_table = pd.DataFrame(results)
print(contingency_table)
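For reference, the false negative rate is FN / (FN + TP): the share of truly poisonous mushrooms that a model labels edible, which equals 1 minus recall for the poisonous class. A minimal sketch on made-up labels follows; false_negative_rate is a hypothetical helper, and it assumes poisonous is encoded as 1, which is what LabelEncoder produces for the labels e and p.

```python
import numpy as np
from sklearn.metrics import recall_score

def false_negative_rate(y_true, y_pred, positive=1):
    """FN / (FN + TP), treating `positive` as the poisonous class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    actual_pos = y_true == positive
    return np.sum(actual_pos & (y_pred != positive)) / np.sum(actual_pos)

# Made-up labels: 4 poisonous (1) and 4 edible (0)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 1, 0]

fnr = false_negative_rate(y_true_toy, y_pred_toy)
print(fnr)
print(1 - recall_score(y_true_toy, y_pred_toy))  # same quantity via recall
```

With the variables from this notebook, false_negative_rate(y_test_encoded, y_pred_nb) would give the rate for the Naive Bayes model, and likewise for the other two.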
# Create subplots for each confusion matrix
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
disp_nb = ConfusionMatrixDisplay(confusion_matrix=nb_cm_test)
disp_nb.plot(cmap='Blues', ax=axes[0])
axes[0].set_title('Confusion Matrix - Naive Bayes')
disp_lr = ConfusionMatrixDisplay(confusion_matrix=lr_cm_test)
disp_lr.plot(cmap='Blues', ax=axes[1])
axes[1].set_title('Confusion Matrix - Logistic Regression')
disp_tree = ConfusionMatrixDisplay(confusion_matrix=tree_cm_test)
disp_tree.plot(cmap='Blues', ax=axes[2])
axes[2].set_title('Confusion Matrix - Classification Tree')
# Display the plot
plt.tight_layout()
plt.show()