Dummy Classifier Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Aug, 2024

Setting the bar in machine learning with simple baseline models

All illustrations in this article were created by the author, incorporating licensed design elements from Canva Pro.

Have you ever wondered how data scientists measure the performance of their machine learning models? Enter the dummy classifier: a simple yet powerful tool in the world of data science. Think of it as the baseline player in a game, setting the minimum standard that other, more sophisticated models have to beat.

A dummy classifier is a simple machine learning model that makes predictions using basic rules, without actually learning from the input data. It serves as a baseline for evaluating the performance of more complex models. The dummy classifier helps us understand whether our sophisticated models are actually learning useful patterns or just guessing.

The dummy classifier is one of the basic key algorithms in machine learning.

Throughout this article, we'll use this simple artificial golf dataset (inspired by [1]) as an example. This dataset predicts whether a person will play golf based on weather conditions. It includes features like outlook, temperature, humidity, and wind, with the target variable being whether or not to play golf.

# Import libraries
from sklearn.model_selection import train_test_split
import pandas as pd

# Make a dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# One-hot Encode 'Outlook' Column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' (bool) and 'Play' (binary) columns to 0 and 1
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Set feature matrix X and target vector y
X, y = df.drop(columns='Play'), df['Play']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

The dummy classifier relies on simple strategies to make predictions. These strategies don't involve any actual learning from the data. Instead, they use basic rules like:

  1. Always predicting the most frequent class
  2. Randomly predicting a class based on the training set's class distribution
  3. Always predicting a specific class

For our golf dataset, a dummy classifier might always predict "Yes" for playing golf if that's the most common outcome in the training data.
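Conceptually, the "most frequent" strategy boils down to a single value count over the training labels. A minimal sketch of that idea (hard-coding the first 14 Play labels, which match the training half of the unshuffled 50/50 split used in this article):

```python
import pandas as pd

# First 14 Play labels (the training half of the unshuffled 50/50 split)
y_train = pd.Series(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                     'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'])

# The 'most_frequent' strategy memorizes this single value...
most_frequent = y_train.value_counts().idxmax()

# ...and predicts it for every test sample, regardless of the features
print(most_frequent)  # Yes
```

Here "Yes" appears 9 times out of 14, so that one remembered value is the model's entire knowledge.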

The "training" process for a dummy classifier is quite simple and doesn't involve the usual learning algorithms. Here's a general outline:

1. Select Strategy

Choose one of the following strategies:

  • Stratified: Makes random guesses based on the original class distribution.
  • Most Frequent: Always picks the most common class.
  • Uniform: Randomly picks any class.

Depending on the strategy, the dummy classifier makes different predictions.
from sklearn.dummy import DummyClassifier

# Choose a strategy for your DummyClassifier (e.g., 'most_frequent', 'stratified', etc.)
strategy = 'most_frequent'

2. Collect Training Labels

Collect the class labels from the training dataset to determine the strategy parameters.

The algorithm is simply recording the most frequent class in the training dataset, in this case "Yes".
# Initialize the DummyClassifier
dummy_clf = DummyClassifier(strategy=strategy)

# "Train" the DummyClassifier (although no real training happens)
dummy_clf.fit(X_train, y_train)

3. Apply Strategy to Test Data

Use the chosen strategy to generate a list of predicted labels for your test data.

If we choose the "most frequent" strategy and find that "Yes" (play golf) appears more often in our training data, the dummy classifier will simply remember to always predict "Yes".
# Use the DummyClassifier to make predictions
y_pred = dummy_clf.predict(X_test)
print("Label     :", list(y_test))
print("Prediction:", list(y_pred))

Evaluate the Model

The dummy classifier gives 64% accuracy as the baseline for future models.
# Evaluate the DummyClassifier's accuracy
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Dummy Classifier Accuracy: {accuracy:.2%}")

While dummy classifiers are simple, they do have a few important parameters:

  1. Strategy: This determines how the classifier makes predictions. Common options include:
    – 'most_frequent': Always predicts the most common class in the training set.
    – 'stratified': Generates predictions based on the training set's class distribution.
    – 'uniform': Generates predictions uniformly at random.
    – 'constant': Always predicts a specified class.
  2. Random State: If using a strategy that involves randomness (like 'stratified' or 'uniform'), this parameter ensures reproducibility of results.
  3. Constant: When using the 'constant' strategy, this parameter specifies which class to always predict.

For our golf dataset, we might choose the 'most_frequent' strategy, which doesn't require additional parameters.
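As a quick sketch of those parameters (on toy labels rather than the golf data): 'constant' requires the constant argument, and the random strategies become reproducible once random_state is fixed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((6, 1))              # dummy strategies ignore the features entirely
y = np.array([1, 1, 1, 1, 0, 0])  # small imbalanced set of toy labels

# 'stratified' involves randomness, so random_state makes it reproducible:
# two classifiers with the same seed produce identical predictions
strat_clf = DummyClassifier(strategy='stratified', random_state=42).fit(X, y)
repeat_clf = DummyClassifier(strategy='stratified', random_state=42).fit(X, y)
assert (strat_clf.predict(X) == repeat_clf.predict(X)).all()

# 'constant' always predicts the class given by the `constant` parameter
const_clf = DummyClassifier(strategy='constant', constant=0).fit(X, y)
print(const_clf.predict(X))  # [0 0 0 0 0 0]
```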

Like any tool in machine learning, dummy classifiers have their strengths and limitations.

Pros:

  1. Simplicity: Easy to understand and implement.
  2. Baseline Performance: Provides a minimum performance benchmark for other models.
  3. Overfitting Check: Helps identify when complex models are overfitting by comparing their performance to the dummy classifier.
  4. Quick to Train and Predict: Requires minimal computational resources.
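The baseline and overfitting-check points can be sketched like this (using a synthetic imbalanced dataset rather than the golf data): a complex model is only earning its keep if it clearly beats the dummy score.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset where roughly 90% of samples belong to one class
X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The dummy baseline scores around 90% here just by predicting the majority class,
# so a real model reporting "90% accuracy" has learned essentially nothing
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print(f"Baseline accuracy: {baseline.score(X_test, y_test):.2f}")
print(f"Tree accuracy:     {model.score(X_test, y_test):.2f}")
```

This is especially valuable on imbalanced data, where raw accuracy alone can look deceptively good.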

Cons:

  1. Limited Predictive Power: By design, it doesn't learn from the data, so its predictions are often inaccurate.
  2. No Feature Importance: It doesn't provide insights into which features are most important for predictions.
  3. Not Suitable for Complex Problems: In real-world scenarios with intricate patterns, dummy classifiers are too simplistic to be useful on their own.

Understanding dummy classifiers is crucial for any data scientist or machine learning enthusiast. They serve as a reality check, helping us ensure that our more complex models are actually learning useful patterns from the data. As you continue your journey in machine learning, always remember to compare your models against these simple baselines: you might be surprised by what you learn!

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

# Make dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Perform one-hot encoding on 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' and 'Play' columns to binary indicators
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data into features (X) and target (y), then into training and test sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Initialize and train the dummy classifier model
dummy_clf = DummyClassifier(strategy='uniform')
dummy_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = dummy_clf.predict(X_test)

# Calculate and print the model's accuracy on the test data
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")