Which model is better at predicting whether a mushroom is poisonous or edible (with a twist)?¶

Noli Angeles

Introduction¶

The goal of this project is to build several machine learning models on one dataset and see which performs best on out-of-sample data. We will use the Mushroom dataset (which covers only mushrooms from the Agaricus and Lepiota family) from the UC Irvine Machine Learning Repository, available at https://archive.ics.uci.edu/dataset/73/mushroom. It has 8124 observations and 23 columns: 22 features plus the target variable, poisonous. Every column is categorical or binary. The models we will build are Naive Bayes, Logistic Regression with Ridge Regularization, and a Classification Tree. To compare them we will use accuracy scores, false negative rates, confusion matrices, and cross-validation scores.

The "twist"¶

To make it harder for the models to predict correctly, I want to limit which features are included in the dataset. I decided to keep ONLY predictor variables that describe a color and can be seen immediately. Six features in the dataset describe the color of some part of the mushroom: cap-color, gill-color, stalk-color-above-ring, stalk-color-below-ring, veil-color, and spore-print-color. Since spore-print-color isn't something we can see at first glance, we will drop it as well, leaving 5 color features to predict the target. Why only colors? Maybe in the future we can build a camera that predicts whether a mushroom in the Agaricus and Lepiota family is poisonous just by looking at the different colors on its surface.
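The selection rule described above can be sketched in plain Python. The short column list below is an illustrative subset of the 23 column names, and the variable names `columns`, `not_visible`, and `color_features` are assumptions made for this example only.

```python
# Illustrative subset of the dataset's column names (target first).
columns = [
    "poisonous", "cap-color", "odor", "gill-color",
    "stalk-color-above-ring", "stalk-color-below-ring",
    "veil-color", "spore-print-color", "habitat",
]

# Color features that cannot be seen at first glance.
not_visible = {"spore-print-color"}

# Keep every column that mentions a color, minus the non-visible ones.
color_features = [c for c in columns if "color" in c and c not in not_visible]
print(color_features)
```

Filtering by name keeps the rule explicit, so it is easy to check that exactly five color features survive.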

Steps¶

To use multiple models on a single dataset, I will follow these steps.

1) Import Libraries and Load Data

2) Data Cleaning and Preprocessing

3) Exploratory Data Analysis

4) Split Dataset into Train/Test Sets

5) Create Models

6) Compare Models

Import Libraries and Load Data¶

First, we need to import all the libraries required for processing the data, creating the models, and comparing model performance.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split #splitting dataset
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder #encoding variables
from sklearn.naive_bayes import CategoricalNB #naive bayes model
from sklearn.linear_model import LogisticRegression #logistic regression model
from sklearn.tree import DecisionTreeClassifier #classification tree model
from sklearn.neighbors import KNeighborsClassifier #knn model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay #metrics to compare models
from sklearn.model_selection import cross_val_score #cross-validation
import matplotlib.pyplot as plt #plotting graphs
import seaborn as sns #creating plots/graphs

### unzip file
import zipfile

with zipfile.ZipFile('mushroom.zip', 'r') as zip_ref:
    zip_ref.extractall() #unzip into same directory

### write the column names for each column
column_names = ['poisonous','cap-shape','cap-surface','cap-color','bruises','odor','gill-attatchment',
               'gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring',
                'stalk-surface-below-ring', 'stalk-color-above-ring','stalk-color-below-ring','veil-type',
               'veil-color','ring-number','ring-type','spore-print-color','population','habitat']

### load data file with specified column names
df = pd.read_csv("agaricus-lepiota.data", sep=",", header=None, names=column_names)
print(df.head(5)) # print to check if everything looks correct

  poisonous cap-shape cap-surface cap-color bruises odor gill-attatchment  \
0         p         x           s         n       t    p                f   
1         e         x           s         y       t    a                f   
2         e         b           s         w       t    l                f   
3         p         x           y         w       t    p                f   
4         e         x           s         g       f    n                f   

  gill-spacing gill-size gill-color  ... stalk-surface-below-ring  \
0            c         n          k  ...                        s   
1            c         b          k  ...                        s   
2            c         b          n  ...                        s   
3            c         n          n  ...                        s   
4            w         b          k  ...                        s   

  stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
0                      w                      w         p          w   
1                      w                      w         p          w   
2                      w                      w         p          w   
3                      w                      w         p          w   
4                      w                      w         p          w   

  ring-number ring-type spore-print-color population habitat  
0           o         p                 k          s       u  
1           o         p                 n          n       g  
2           o         p                 n          n       m  
3           o         p                 k          s       u  
4           o         e                 n          a       g  

[5 rows x 23 columns]

Data Cleaning¶

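First we clean the data by checking for missing values and duplicate rows. If there are any, I will delete/omit them. Spoiler... isnull() and duplicated() find none! One caveat: the raw values printed further down show a '?' placeholder in stalk-root, which isnull() does not count as missing. That column is dropped later, so it does not affect the models.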
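Because isnull() only detects NaN/None, placeholder strings such as '?' have to be counted separately. Here is a minimal sketch on a made-up frame; the rows and the names `toy` and `placeholder_counts` are hypothetical and only illustrate the check.

```python
import pandas as pd

# Toy frame: '?' stands in for a missing value (hypothetical rows).
toy = pd.DataFrame({
    "stalk-root": ["e", "?", "b", "?"],
    "cap-color": ["n", "y", "w", "g"],
})

# isnull() sees ordinary strings, so it reports no missing values.
print(toy.isnull().sum())

# Comparing against the placeholder counts the '?' entries per column.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

Running the same comparison on the full dataframe would show which columns rely on a placeholder before deciding whether to drop or impute them.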

### check for missing data
print(df.isnull().sum())
print()

### check for duplicated rows
dupes = df.duplicated().sum()
print(f'There are {dupes} duplicated rows in this dataset.')

poisonous                   0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attatchment            0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

There are 0 duplicated rows in this dataset.

### check the raw values so we can encode later
for col in df.columns:
    print(f"{col}: {df[col].unique()}")

poisonous: ['p' 'e']
cap-shape: ['x' 'b' 's' 'f' 'k' 'c']
cap-surface: ['s' 'y' 'f' 'g']
cap-color: ['n' 'y' 'w' 'g' 'e' 'p' 'b' 'u' 'c' 'r']
bruises: ['t' 'f']
odor: ['p' 'a' 'l' 'n' 'f' 'c' 'y' 's' 'm']
gill-attatchment: ['f' 'a']
gill-spacing: ['c' 'w']
gill-size: ['n' 'b']
gill-color: ['k' 'n' 'g' 'p' 'w' 'h' 'u' 'e' 'b' 'r' 'y' 'o']
stalk-shape: ['e' 't']
stalk-root: ['e' 'c' 'b' 'r' '?']
stalk-surface-above-ring: ['s' 'f' 'k' 'y']
stalk-surface-below-ring: ['s' 'f' 'y' 'k']
stalk-color-above-ring: ['w' 'g' 'p' 'n' 'b' 'e' 'o' 'c' 'y']
stalk-color-below-ring: ['w' 'p' 'g' 'b' 'n' 'e' 'y' 'o' 'c']
veil-type: ['p']
veil-color: ['w' 'n' 'o' 'y']
ring-number: ['o' 't' 'n']
ring-type: ['p' 'e' 'l' 'f' 'n']
spore-print-color: ['k' 'n' 'u' 'h' 'w' 'r' 'o' 'y' 'b']
population: ['s' 'n' 'a' 'v' 'y' 'c']
habitat: ['u' 'g' 'm' 'd' 'p' 'w' 'l']

### dropping anything that doesn't describe a visible color
non_color_features = [
    'bruises', 'habitat', 'ring-type', 'odor', 'gill-size', 'population',
    'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-root',
    'gill-spacing', 'cap-shape', 'ring-number', 'cap-surface',
    'gill-attatchment', 'veil-type', 'stalk-shape',
    'spore-print-color',  # a color, but not visible at first glance
]
df.drop(columns=non_color_features, inplace=True)

### dropping all color features
# df.drop('stalk-color-below-ring', axis=1, inplace=True)
# df.drop('stalk-color-above-ring', axis=1, inplace=True)
# df.drop('gill-color', axis=1, inplace=True)
# df.drop('cap-color', axis=1, inplace=True)
# df.drop('veil-color', axis=1, inplace=True)

for col in df.columns:
    print(f"{col}: {df[col].unique()}")

poisonous: ['p' 'e']
cap-color: ['n' 'y' 'w' 'g' 'e' 'p' 'b' 'u' 'c' 'r']
gill-color: ['k' 'n' 'g' 'p' 'w' 'h' 'u' 'e' 'b' 'r' 'y' 'o']
stalk-color-above-ring: ['w' 'g' 'p' 'n' 'b' 'e' 'o' 'c' 'y']
stalk-color-below-ring: ['w' 'p' 'g' 'b' 'n' 'e' 'y' 'o' 'c']
veil-color: ['w' 'n' 'o' 'y']

Exploratory Data Analysis¶

Before we go and create our models, I want to get a better understanding of the dataset. We can explore the data a little bit by creating graphs to see if we can uncover any stories or insights.

Countplots¶

To make it easier to draw multiple countplots, I wrote a function along with manual label mappings that turn the single-letter codes into readable color names on the x-axis (instead of changing the values in the original dataset).

### define a function to easily create count plots for desired variables
def plot_countplot(data, feature, target_count, label_mapping=None):
    order = list(data[feature].unique())  # fix the bar order so tick labels line up
    plt.figure(figsize=(10, 6))
    # plot the dataframe passed in, not the global df
    sns.countplot(data=data, x=feature, hue=target_count, order=order)
    plt.title(f"Distribution of Poisonous vs. Edible by {feature.capitalize()}")
    plt.xlabel(f"{feature.capitalize()}")
    plt.ylabel("Count")
    plt.legend(title="Poisonous")

    if label_mapping:
        # relabel the ticks in the same order the bars were drawn
        plt.xticks(ticks=range(len(order)),
                   labels=[label_mapping[label] for label in order])

    plt.show()

# manual mapping for color features
cap_color_mapping = {
    'n': 'Brown', 'b': 'Buff',  'c': 'Cinnamon',  'g': 'Gray', 'r': 'Green', 
    'p': 'Pink',  'u': 'Purple',  'e': 'Red',  'w': 'White',  'y': 'Yellow'
}

gill_color_mapping = {
    'k': 'Black', 'n': 'Brown', 'b': 'Buff', 'h': 'Chocolate', 
    'g': 'Gray',  'r': 'Green', 'o': 'Orange', 'p': 'Pink', 
    'u': 'Purple','e': 'Red', 'w': 'White', 'y': 'Yellow'
}

stalk_color_above_ring_mapping = {
    'n': 'Brown', 'b': 'Buff', 'c': 'Cinnamon', 
    'g': 'Gray', 'o': 'Orange', 'p': 'Pink', 
    'e': 'Red', 'w': 'White', 'y': 'Yellow'
}

stalk_color_below_ring_mapping = {
    'n': 'Brown', 'b': 'Buff', 'c': 'Cinnamon', 
    'g': 'Gray', 'o': 'Orange', 'p': 'Pink', 
    'e': 'Red', 'w': 'White', 'y': 'Yellow'
}

veil_color_mapping = {
    'n': 'Brown', 'o': 'Orange', 'w': 'White', 'y': 'Yellow'
}

spore_print_color_mapping = {
    'k': 'Black', 'n': 'Brown', 'b': 'Buff', 
    'h': 'Chocolate', 'r': 'Green', 'o': 'Orange', 
    'u': 'Purple', 'w': 'White', 'y': 'Yellow'
}

### countplot by cap-color
plot_countplot(df,'cap-color','poisonous', cap_color_mapping)

### countplot by gill-color
plot_countplot(df,'gill-color','poisonous', gill_color_mapping)

### countplot by stalk-color-above-ring
plot_countplot(df,'stalk-color-above-ring','poisonous', stalk_color_above_ring_mapping)

### count plot by stalk-color-below-ring
plot_countplot(df,'stalk-color-below-ring','poisonous', stalk_color_below_ring_mapping)

### countplot by veil-color
plot_countplot(df,'veil-color','poisonous', veil_color_mapping)
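The countplots show raw counts. A numeric companion is the share of poisonous mushrooms within each color, which pandas can compute with a normalized cross-tabulation. The sketch below uses made-up rows; `toy` and `share` are hypothetical names, and the numbers are not taken from the mushroom data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "cap-color": ["n", "n", "w", "w", "w", "y"],
    "poisonous": ["p", "e", "e", "e", "p", "p"],
})

# normalize="index" makes each row (one color) sum to 1.
share = pd.crosstab(toy["cap-color"], toy["poisonous"], normalize="index")
print(share.round(2))
```

Applied to the real dataframe, a row close to 0 or 1 would flag a color that is almost always edible or almost always poisonous.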

Split Dataset into Train/Test Set¶

In order to train our models and test their performance, we have to split the dataset accordingly. The Naive Bayes classifier needs OrdinalEncoder for the predictor variables, while the other models need OneHotEncoder, so we first split the raw dataset 80/20. Afterwards, we encode that same train/test split twice, once per encoder, so that each model gets its inputs in the right format.

Splitting the data¶

X = df.drop(['poisonous'], axis=1)  #drop poisonous
y = df['poisonous']  #poisonous as the target

# 80/20 train/test split; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=456)

# check the dimension of the train and test sets
print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

(6499, 5)
(1625, 5)
(6499,)
(1625,)

Sanity Check¶

Let's double check to see if X (train and test) only contains the predictor variables and y (train and test) only has the poisonous column. Then we can also check to see if the proportions of poisonous to edible mushrooms are the same in the train/test split and also in the original data set.

Checking training data¶

X_train.columns

Index(['cap-color', 'gill-color', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-color'],
      dtype='object')

# check X training data
X_train.head(5)

# check y training data
y_train.head(5)

2312    e
6071    p
7957    e
8029    e
3179    p
Name: poisonous, dtype: object

Checking test data¶

# check x test data
X_test.head(5)

# check y test data
y_test.head(5)

970     e
4969    p
4758    p
3052    p
6009    p
Name: poisonous, dtype: object

Double Checking Proportions¶

# proportion of edible:poisonous labels in training set
y_train.value_counts(normalize=True)

e    0.517156
p    0.482844
Name: poisonous, dtype: float64

# proportion of edible:poisonous labels in test set
y_test.value_counts(normalize=True)

e    0.521231
p    0.478769
Name: poisonous, dtype: float64

# proportion of original data set before train/test split
df['poisonous'].value_counts(normalize=True)

e    0.517971
p    0.482029
Name: poisonous, dtype: float64

Since the proportions for the training set and test set are very similar to the original dataset, we can now move on to creating our models!
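Here the random split happened to preserve the class balance. If we wanted to guarantee it, train_test_split accepts a stratify argument. The sketch below uses a made-up 60/40 dataset; the names `X_toy`, `y_toy`, and the split variables are hypothetical and this is not the split used in this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 60 edible and 40 poisonous rows (hypothetical).
X_toy = pd.DataFrame({"cap-color": ["n", "w"] * 50})
y_toy = pd.Series(["e"] * 60 + ["p"] * 40, name="poisonous")

# stratify=y_toy keeps the 60/40 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=456, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

With stratification, both subsets keep the 60/40 ratio regardless of the random seed.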

Creating our models¶

Before we create our models, we have to encode our target variable poisonous using LabelEncoder. It assigns codes in alphabetical order, so e (edible) becomes 0 and p (poisonous) becomes 1.

le = LabelEncoder() # rename for simpler coding

y_train_encoded = le.fit_transform(y_train) # encode target variable for train set

y_test_encoded = le.transform(y_test) # encode target variable for test set

# check to see if target variables for both train/test set are encoded
print(y_train_encoded[:5])
print(y_test_encoded[:5])

[0 1 0 0 1]
[0 1 1 1 1]
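To confirm which class each code stands for, we can inspect the encoder's classes_ attribute. This is a small standalone sketch; `le_demo`, `codes`, and `mapping` are illustrative names rather than variables from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["e", "p", "e", "e", "p"])

# classes_ is sorted, and a label's position in it is its integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
print(le_demo.inverse_transform([0, 1]))
```

Knowing that 1 means poisonous matters later, because false negatives are poisonous mushrooms predicted as 0 (edible).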

Naive Bayes Classifier Model¶

Now that we've encoded the target variable, we use OrdinalEncoder to encode the predictor variables.

enc = OrdinalEncoder() # rename for simpler coding

X_train_nb = enc.fit_transform(X_train) # encoding predictor variables in train set for Naive Bayes model

X_test_nb = enc.transform(X_test) # encoding predictor variables in test set for Naive Bayes model
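CategoricalNB expects each feature as an integer category index, which is what OrdinalEncoder produces. A toy sketch of the encoding, where the two-column frame and the names `toy`, `enc_demo`, and `codes` are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Two made-up color columns.
toy = pd.DataFrame({
    "cap-color": ["n", "w", "y", "w"],
    "veil-color": ["w", "w", "o", "n"],
})

enc_demo = OrdinalEncoder()
codes = enc_demo.fit_transform(toy)

# categories_ lists the sorted categories per column;
# each value is replaced by its index in that list.
print(enc_demo.categories_)
print(codes)
```

The integers are only category indices here; CategoricalNB does not treat them as ordered quantities.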

Train and Evaluate Naive Bayes Model¶

We can now train our Naive Bayes model by fitting it to the training data. We'll check out the cross-validation scores on the training data, then evaluate the model's performance on out-of-sample data using accuracy and a confusion matrix.

nb_model = CategoricalNB() # create model object
nb_model.fit(X_train_nb, y_train_encoded) # fit model on training data

CategoricalNB()


Cross-Validation Score (Naive Bayes)¶

nb_cv_score = cross_val_score(nb_model, X_train_nb, y_train_encoded, cv=10, scoring='accuracy')
print("Cross-Validation Accuracy Scores:", nb_cv_score)
print("Mean Cross-Validation Accuracy:", nb_cv_score.mean())

Cross-Validation Accuracy Scores: [0.87076923 0.86923077 0.86153846 0.86461538 0.85230769 0.88
 0.85538462 0.87384615 0.83692308 0.85362096]
Mean Cross-Validation Accuracy: 0.8618236339931254

# generate predictions
y_pred_nb = nb_model.predict(X_test_nb)
y_pred_nb[:9]

array([0, 1, 1, 0, 1, 1, 1, 0, 1])

Accuracy Score on out-of-sample data (Naive Bayes)¶

nb_score = nb_model.score(X_test_nb, y_test_encoded) # get accuracy score
print(f'Naive Bayes Model Accuracy on out-of-sample data: {nb_score}')

Naive Bayes Model Accuracy on out-of-sample data: 0.8701538461538462

Confusion Matrix (Naive Bayes)¶

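nb_cm_test = confusion_matrix(y_test_encoded, y_pred_nb)  # scikit-learn expects (y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=nb_cm_test)
disp.plot(cmap="Blues")
plt.title("Naive Bayes Performance")
plt.show()

False Negative Rate (Naive Bayes)¶

# rows are true labels, columns are predictions; 1 = poisonous
tn, fp, fn, tp = nb_cm_test.ravel()
nb_fnr = fn / (fn + tp)  # 125 / (125 + 653): poisonous mushrooms predicted edible
round(nb_fnr, 4)

0.1607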
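For reference, the false negative rate is FN / (FN + TP) with poisonous (encoded as 1) as the positive class: the share of truly poisonous mushrooms that a model labels edible. Below is a minimal pure-Python sketch on made-up labels; the helper `false_negative_rate` and the demo lists are illustrative and not part of the notebook.

```python
def false_negative_rate(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted as negative."""
    pairs = list(zip(y_true, y_pred))
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    return fn / (fn + tp)

# Made-up labels: four poisonous (1) and four edible (0) mushrooms.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0]

fnr_demo = false_negative_rate(y_true_demo, y_pred_demo)
print(fnr_demo)  # one of four poisonous mushrooms is missed -> 0.25
```

The same quantity equals 1 minus the recall of the poisonous class, which gives a quick cross-check against any hand-read confusion matrix.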

Logistic Regression with Ridge Regularization Model¶

Since we already split the dataset and also encoded our target variable, we only need to encode the predictor variables using OneHotEncoder for the rest of our models.

# one hot encoding for logistic regression and classification tree
oh = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
X_train_oh = oh.fit_transform(X_train)
X_test_oh = oh.transform(X_test)

Train and Evaluate Model¶

lr_model = LogisticRegression(penalty='l2')  # Ridge is L2 regularization
lr_model.fit(X_train_oh, y_train_encoded) # fit the logistic regression model on training data

LogisticRegression()


# generate predictions
y_pred_lr = lr_model.predict(X_test_oh)
y_pred_lr[:9]

array([0, 1, 1, 0, 1, 1, 1, 0, 1])

Cross-Validation Score (Ridge Regression)¶

lr_cv_score = cross_val_score(lr_model, X_train_oh, y_train_encoded, cv=10, scoring='accuracy')
print(f"Cross-Validation Accuracy Scores: {lr_cv_score}")
print(f"Mean Cross-Validation Accuracy: {lr_cv_score.mean()}")

/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Cross-Validation Accuracy Scores: [0.91538462 0.90461538 0.90923077 0.91538462 0.90153846 0.91384615
 0.90461538 0.92769231 0.90153846 0.9183359 ]
Mean Cross-Validation Accuracy: 0.9112182055232901
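One possible variation, sketched here under assumptions rather than taken from this notebook, is to wrap the encoder and the model in a pipeline so that the encoder is refit inside every cross-validation fold, and to raise max_iter so lbfgs has more iterations to converge. The toy data and the names `X_toy`, `y_toy`, `pipe`, and `scores` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): white caps edible (0), brown caps poisonous (1).
X_toy = pd.DataFrame({"cap-color": ["w", "n"] * 20, "veil-color": ["w"] * 40})
y_toy = [0, 1] * 20

# The encoder is fit on each training fold only;
# handle_unknown="ignore" tolerates categories unseen in that fold.
pipe = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(penalty="l2", max_iter=1000),
)

scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="accuracy")
print(scores.mean())
```

On the real data, the value of max_iter that silences the warning is not known in advance; 1000 is only a starting guess.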

Accuracy Score on out-of-sample data (Ridge Regression)¶

lr_score = lr_model.score(X_test_oh, y_test_encoded)
lr_score

0.9267692307692308

# evaluate the training accuracy
y_train_pred = lr_model.predict(X_train_oh)
train_accuracy = accuracy_score(y_train_encoded, y_train_pred)

# evaluate the test accuracy
y_test_pred = lr_model.predict(X_test_oh)
test_accuracy = accuracy_score(y_test_encoded, y_test_pred)

print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.9130635482381905
Test Accuracy: 0.9267692307692308

Confusion Matrix (Ridge Regression)¶

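lr_cm_test = confusion_matrix(y_test_encoded, y_pred_lr)  # scikit-learn expects (y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=lr_cm_test)
disp.plot(cmap="Blues")
plt.title("Ridge Regression Performance")
plt.show()

False Negative Rate (Ridge Regression)¶

# rows are true labels, columns are predictions; 1 = poisonous
tn, fp, fn, tp = lr_cm_test.ravel()
lr_fnr = fn / (fn + tp)  # 113 / (113 + 665): poisonous mushrooms predicted edible
round(lr_fnr, 4)

0.1452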

Classification Tree Model¶

Train and Evaluate Model¶

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train_oh, y_train_encoded)

DecisionTreeClassifier()


# generate predictions
y_pred_tree = tree_model.predict(X_test_oh)
y_pred_tree[:9]

array([0, 1, 1, 0, 1, 1, 1, 0, 1])

Cross-Validation Score (Classification Tree)¶

tree_cv_score = cross_val_score(tree_model, X_train_oh, y_train_encoded, cv=10, scoring='accuracy')
print(f"Cross-Validation Accuracy Scores: {tree_cv_score}")
print(f"Mean Cross-Validation Accuracy: {tree_cv_score.mean()}")

Cross-Validation Accuracy Scores: [0.91692308 0.91692308 0.91846154 0.93538462 0.91384615 0.92615385
 0.91076923 0.93076923 0.91230769 0.92141757]
Mean Cross-Validation Accuracy: 0.9202956027023823

tree_train_score = tree_model.score(X_train_oh, y_train_encoded)
tree_train_score

0.9212186490229266

Accuracy Score on out-of-sample data (Classification Tree)¶

tree_score = tree_model.score(X_test_oh, y_test_encoded)
tree_score

0.9298461538461539

Confusion Matrix (Classification Tree)¶

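tree_cm_test = confusion_matrix(y_test_encoded, y_pred_tree)  # scikit-learn expects (y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=tree_cm_test)
disp.plot(cmap="Blues")
plt.title("Classification Tree Performance")
plt.show()

False Negative Rate (Classification Tree)¶

# rows are true labels, columns are predictions; 1 = poisonous
tn, fp, fn, tp = tree_cm_test.ravel()
tree_fnr = fn / (fn + tp)  # 104 / (104 + 674): poisonous mushrooms predicted edible
round(tree_fnr, 4)

0.1337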

Comparing the Three Models¶

Now that we have created our models and collected their metrics and scores, let's see how they compare head-to-head! Here we'll compare each model's mean CV score with its accuracy on the test data. If a CV score is vastly different from the test accuracy, it could mean the model is overfitting. We can also compare the test accuracy and false negative rates.

results = {
    'Model': ['Naive Bayes', 'Ridge Regression', 'Classification Tree'],
    'CV Score': [nb_cv_score.mean(), lr_cv_score.mean(), tree_cv_score.mean()],
    'Test Accuracy': [nb_score, lr_score, tree_score],
    'False Negative Rate': [nb_fnr, lr_fnr, tree_fnr]
}

contingency_table = pd.DataFrame(results)
print(contingency_table)

                 Model  CV Score  Test Accuracy  False Negative Rate
0          Naive Bayes  0.861824       0.870154             0.160668
1     Ridge Regression  0.911218       0.926769             0.145244
2  Classification Tree  0.920296       0.929846             0.133676
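To make the CV-versus-test comparison concrete, we can compute the gap between the two accuracy columns. This sketch copies the CV Score and Test Accuracy values printed in the table; `results_demo` and `gaps` are illustrative names.

```python
# (mean CV accuracy, test accuracy) as printed in the table.
results_demo = {
    "Naive Bayes": (0.861824, 0.870154),
    "Ridge Regression": (0.911218, 0.926769),
    "Classification Tree": (0.920296, 0.929846),
}

# Positive gap: the model scored higher on the test set than in CV.
gaps = {name: round(test - cv, 6) for name, (cv, test) in results_demo.items()}

for name, gap in gaps.items():
    print(f"{name}: test - CV = {gap:+.4f}")
```

All three gaps are under two percentage points and positive, so none of the models scores worse on the test set than in cross-validation.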

# Create subplots for each confusion matrix
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

disp_nb = ConfusionMatrixDisplay(confusion_matrix=nb_cm_test)
disp_nb.plot(cmap='Blues', ax=axes[0])
axes[0].set_title('Confusion Matrix - Naive Bayes')

disp_lr = ConfusionMatrixDisplay(confusion_matrix=lr_cm_test)
disp_lr.plot(cmap='Blues', ax=axes[1])
axes[1].set_title('Confusion Matrix - Logistic Regression')

disp_tree = ConfusionMatrixDisplay(confusion_matrix=tree_cm_test)
disp_tree.plot(cmap='Blues', ax=axes[2])
axes[2].set_title('Confusion Matrix - Classification Tree')

# Display the plot
plt.tight_layout()
plt.show()