Unlike the baseline approach of dummy classifiers or the similarity-based reasoning of KNN, Naive Bayes leverages probability theory. It combines the individual probabilities of each “clue” (or feature) to make a final prediction. This simple yet powerful method has proven invaluable in various machine learning applications.
Naive Bayes is a machine learning algorithm that uses probability to classify data. It’s based on Bayes’ Theorem, a formula for calculating conditional probabilities. The “naive” part refers to its key assumption: it treats all features as independent of each other given the class, even when they might not be in reality. This simplification, while often unrealistic, greatly reduces computational complexity and works well in many practical scenarios.
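To make the idea concrete, here is a minimal sketch of Bayes’ Theorem combined with the naive independence assumption. The priors and per-feature likelihoods are made-up numbers chosen for illustration, not values from any dataset, and `naive_posterior` is a hypothetical helper name.

```python
from math import prod

def naive_posterior(priors, likelihoods):
    """Return P(class | features) for each class under the naive assumption.

    priors:      {class: P(class)}
    likelihoods: {class: [P(feature_i value | class), ...]}
    """
    # Unnormalized score: P(class) times the product of per-feature likelihoods
    scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    # Dividing by the total plays the role of P(features) in Bayes' Theorem
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical numbers for two classes and two observed feature values
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {"yes": [0.5, 0.8], "no": [0.25, 0.5]}

posterior = naive_posterior(priors, likelihoods)
print(posterior)
```

Because each feature contributes one independent factor, adding a feature only adds one more number to multiply, which is where the computational savings come from.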
There are three main types of Naive Bayes classifiers. The key difference between them lies in the assumption they make about the distribution of features:
- Bernoulli Naive Bayes: Suited for binary/boolean features. It assumes each feature is a binary-valued (0/1) variable.
- Multinomial Naive Bayes: Typically used for discrete counts. It’s often used in text classification, where features might be word counts.
- Gaussian Naive Bayes: Assumes that continuous features follow a normal distribution.
A good place to start is the simplest one, Bernoulli NB. The “Bernoulli” in its name comes from the assumption that each feature is binary-valued.
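The distributional assumption shows up as the per-feature likelihood that each variant plugs into the same Bayes formula. Here is a rough sketch of the three likelihoods; the function names and parameter values are hypothetical, chosen only for illustration.

```python
from math import exp, pi, sqrt

def bernoulli_likelihood(x, p):
    """P(x | class) for a binary feature x in {0, 1}, where p = P(x = 1 | class)."""
    return p if x == 1 else 1 - p

def multinomial_likelihood(counts, probs):
    """Unnormalized P(counts | class): product of probs[i] ** counts[i].

    The multinomial coefficient is omitted because it is the same for every class.
    """
    result = 1.0
    for n, p in zip(counts, probs):
        result *= p ** n
    return result

def gaussian_likelihood(x, mean, std):
    """Normal density of a continuous feature x given the class mean and std."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

print(bernoulli_likelihood(1, 0.25))
print(multinomial_likelihood([2, 0, 1], [0.5, 0.25, 0.25]))
print(gaussian_likelihood(70.0, 70.0, 5.0))
```

Note that the Bernoulli version also uses the absence of a feature (x = 0) as evidence, while the multinomial version ignores features whose count is zero.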
Throughout this article, we’ll use this artificial golf dataset (inspired by [1]) as an example. This dataset predicts whether a person will play golf based on weather conditions.
# IMPORTING DATASET #
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
# ONE-HOT ENCODE 'Outlook' COLUMN
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# CONVERT 'Wind' (bool) AND 'Play' (binary) COLUMNS TO BINARY INDICATORS
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
# Set feature matrix X and target vector y
X, y = df.drop(columns='Play'), df['Play']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
print(pd.concat([X_train, y_train], axis=1), end='\n\n')
print(pd.concat([X_test, y_test], axis=1))
We’ll adapt it slightly for Bernoulli Naive Bayes by converting our features to binary.
# Bin the numeric columns into categories, one-hot encode them, and drop the originals; do this separately for the training and test sets
# Define categories for 'Temperature' and 'Humidity' for the training set
X_train['Temperature'] = pd.cut(X_train['Temperature'], bins=[0, 80, 100], labels=['Warm', 'Hot'])
X_train['Humidity'] = pd.cut(X_train['Humidity'], bins=[0, 75, 100], labels=['Dry', 'Humid'])
# Similarly, define for the test set
X_test['Temperature'] = pd.cut(X_test['Temperature'], bins=[0, 80, 100], labels=['Warm', 'Hot'])
X_test['Humidity'] = pd.cut(X_test['Humidity'], bins=[0, 75, 100], labels=['Dry', 'Humid'])
# One-hot encode the categorized columns
one_hot_columns_train = pd.get_dummies(X_train[['Temperature', 'Humidity']], drop_first=True, dtype=int)
one_hot_columns_test = pd.get_dummies(X_test[['Temperature', 'Humidity']], drop_first=True, dtype=int)
# Drop the categorized columns from training and test sets
X_train = X_train.drop(['Temperature', 'Humidity'], axis=1)
X_test = X_test.drop(['Temperature', 'Humidity'], axis=1)
# Concatenate the one-hot encoded columns with the original DataFrames
X_train = pd.concat([one_hot_columns_train, X_train], axis=1)
X_test = pd.concat([one_hot_columns_test, X_test], axis=1)
print(pd.concat([X_train, y_train], axis=1), '\n')
print(pd.concat([X_test, y_test], axis=1))
Bernoulli Naive Bayes operates on data where each feature is either 0 or 1.
- Calculate the probability of each class in the training data.
- For each feature and class, calculate the probability of the feature being 1 and 0 given the class.
- For a new instance: for each class, multiply its probability by the probability of each feature value (0 or 1) for that class.
- Predict the class with the highest resulting probability.
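The four steps above can be sketched from scratch in a few lines of plain Python. This is only an illustration under simplifying assumptions: the toy data and the function names `fit_bernoulli_nb` and `predict_bernoulli_nb` are hypothetical, and no smoothing is applied, so a probability of exactly zero can appear.

```python
from collections import Counter

def fit_bernoulli_nb(X, y):
    """Estimate class priors and P(feature = 1 | class) from binary rows."""
    class_counts = Counter(y)
    n_features = len(X[0])
    priors = {c: n / len(y) for c, n in class_counts.items()}
    p_one = {}
    for c in class_counts:
        rows = [row for row, label in zip(X, y) if label == c]
        # Fraction of this class's rows in which feature j equals 1
        p_one[c] = [sum(row[j] for row in rows) / len(rows) for j in range(n_features)]
    return priors, p_one

def predict_bernoulli_nb(priors, p_one, x):
    """Return the class with the highest prior * product of feature probabilities."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for value, p in zip(x, p_one[c]):
            # Use p when the feature is 1 and (1 - p) when it is 0
            score *= p if value == 1 else 1 - p
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: columns are [sunny, windy], label 1 = play
X_toy = [[1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [1, 1]]
y_toy = [1, 0, 1, 1, 1, 0]

priors, p_one = fit_bernoulli_nb(X_toy, y_toy)
label, scores = predict_bernoulli_nb(priors, p_one, [0, 0])
print(label, scores)
```

In this toy data every "no play" row has both features set to 1, so the unsmoothed score for class 0 collapses to zero for the query `[0, 0]`.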
The training process for Bernoulli Naive Bayes involves calculating probabilities from the training data:
1. Class Probability Calculation: For each class, calculate its probability: (Number of instances in this class) / (Total number of instances)
from fractions import Fraction

def calc_target_prob(attr):
    # Share of each class among all training instances, as an exact fraction
    total_counts = attr.value_counts().sum()
    prob_series = attr.value_counts().apply(lambda x: Fraction(x, total_counts).limit_denominator())
    return prob_series

print(calc_target_prob(y_train))
2. Feature Probability Calculation: For each feature and each class, calculate:
- (Number of instances where feature is 0 in this class) / (Number of instances in this class)
- (Number of instances where feature is 1 in this class) / (Number of instances in this class)
from fractions import Fraction

def sort_attr_label(attr, lbl):
    # Show the two columns side by side, sorted, keeping the original row IDs
    return (pd.concat([attr, lbl], axis=1)
            .sort_values([attr.name, lbl.name])
            .reset_index()
            .rename(columns={'index': 'ID'})
            .set_index('ID'))

def calc_feature_prob(attr, lbl):
    # Count feature values per class, then divide each count by its class total
    total_classes = lbl.value_counts()
    counts = pd.crosstab(attr, lbl)
    prob_df = counts.apply(lambda x: [Fraction(c, total_classes[x.name]).limit_denominator() for c in x])
    return prob_df

print(sort_attr_label(y_train, X_train['sunny']))
print(calc_feature_prob(X_train['sunny'], y_train))
for col in X_train.columns:
    print(calc_feature_prob(X_train[col], y_train), "\n")
3. Smoothing (Optional): Add a small value α (usually 1) to the numerator, and 2α to the denominator (one α for each of the two possible feature values), of each probability calculation to avoid zero probabilities.
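As a small illustration of the smoothing step, the sketch below applies a common form of additive (Laplace) smoothing for a binary feature. The counts are hypothetical and `smoothed_prob` is an illustrative helper, not part of any library.

```python
from fractions import Fraction

def smoothed_prob(count_feature_one, count_class, alpha=1):
    """P(feature = 1 | class) with additive smoothing for a binary feature.

    alpha is added to the count, and 2 * alpha to the class total,
    one alpha for each of the two possible feature values.
    """
    return Fraction(count_feature_one + alpha, count_class + 2 * alpha)

# Hypothetical counts: 5 instances of a class
print(Fraction(0, 5))        # unsmoothed estimate when the feature is never 1
print(smoothed_prob(0, 5))   # smoothed estimate stays above zero
print(smoothed_prob(5, 5))   # smoothed estimate stays below one
```

Keeping every estimate strictly between 0 and 1 means a single unseen feature value can no longer wipe out a class's entire score.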
# In sklearn, all the steps above are summarized in this 'fit' method:
from sklearn.naive_bayes import BernoulliNB
nb_clf = BernoulliNB(alpha=1)
nb_clf.fit(X_train, y_train)
4. Store Results: Save all calculated probabilities for use during classification.
Given a new instance with features that are either 0 or 1:
1. Probability Collection: For each possible class:
- Start with the probability of this class occurring (class probability).
- For each feature in the new instance, collect the probability of this feature being 0/1 for this class.
2. Score Calculation & Prediction: For each class:
- Multiply all the collected probabilities together.
- The result is the score for this class.
- The class with the highest score is the prediction.
y_pred = nb_clf.predict(X_test)
print(y_pred)
# Evaluate the classifier
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
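Multiplying many probabilities, as in the score calculation described above, can underflow to zero when there are many features, so implementations commonly add logarithms instead. The sketch below shows that idea with hypothetical, already-smoothed probabilities; the names `log_scores` and `normalize` are illustrative only.

```python
from math import exp, log

def log_scores(log_priors, p_one, x):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    scores = {}
    for c, log_prior in log_priors.items():
        total = log_prior
        for value, p in zip(x, p_one[c]):
            total += log(p if value == 1 else 1 - p)
        scores[c] = total
    return scores

def normalize(scores):
    """Turn log scores into posterior probabilities that sum to 1."""
    top = max(scores.values())
    # Subtracting the largest score before exponentiating avoids underflow
    unnorm = {c: exp(s - top) for c, s in scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

# Hypothetical smoothed probabilities P(feature = 1 | class) for 3 features
log_priors = {"yes": log(0.6), "no": log(0.4)}
p_one = {"yes": [0.2, 0.7, 0.5], "no": [0.6, 0.3, 0.5]}

scores = log_scores(log_priors, p_one, [0, 1, 1])
posterior = normalize(scores)
print(max(scores, key=scores.get), posterior)
```

The class ranking is the same as with plain multiplication because the logarithm is monotonic; only the numerical stability changes.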
Bernoulli Naive Bayes has a few important parameters:
1. Alpha (α): This is the smoothing parameter. It adds a small count to each feature to prevent zero probabilities. The default is 1.0 (Laplace smoothing), as shown earlier.
2. Binarize: If your features aren’t already binary, this threshold converts them. Any value above this threshold becomes 1, and any value at or below it becomes 0.
3. Fit Prior: Whether to learn class prior probabilities from the data or assume uniform priors (50/50 for two classes).
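To see what the binarize threshold and the fit-prior switch do, here is a plain-Python stand-in. It is not scikit-learn's implementation: `binarize` and `class_priors` are hypothetical helpers named after the parameters, and the data is made up.

```python
def binarize(values, threshold=0.0):
    """Map each value to 1 if it is strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def class_priors(labels, fit_prior=True):
    """Learn priors from label frequencies, or fall back to uniform priors."""
    classes = sorted(set(labels))
    if fit_prior:
        return {c: labels.count(c) / len(labels) for c in classes}
    return {c: 1 / len(classes) for c in classes}

# Hypothetical standardized temperatures and labels
temps = [-1.2, 0.0, 0.4, 1.5]
labels = [1, 1, 1, 0]

print(binarize(temps))                        # a value equal to the threshold maps to 0
print(class_priors(labels, fit_prior=True))   # priors follow the class frequencies
print(class_priors(labels, fit_prior=False))  # every class gets the same prior
```

Uniform priors can be a reasonable choice when the training set's class balance is not expected to match the data seen at prediction time.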
Like any algorithm in machine learning, Bernoulli Naive Bayes has its strengths and limitations.
Strengths:
- Simplicity: Easy to implement and understand.
- Efficiency: Fast to train and predict; works well with large feature spaces.
- Performance with Small Datasets: Can perform well even with limited training data.
- Handles High-Dimensional Data: Works well with many features, especially in text classification.
Limitations:
- Independence Assumption: Assumes all features are independent, which is often not true in real-world data.
- Limited to Binary Features: In its pure form, only works with binary data.
- Sensitivity to Input Data: Can be sensitive to how the features are binarized.
- Zero Frequency Problem: Without smoothing, zero probabilities can strongly affect predictions.
The Bernoulli Naive Bayes classifier is a simple yet powerful machine learning algorithm for binary classification. It excels in text analysis and spam detection, where features are often binary. Known for its speed and efficiency, this probabilistic model performs well with small datasets and high-dimensional spaces.
Despite its naive assumption of feature independence, it often rivals more complex models in accuracy. Bernoulli Naive Bayes serves as an excellent baseline and real-time classification tool.
# Import needed libraries
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Load the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
# Prepare data for model
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Scale numerical features (for automatic binarization)
# BernoulliNB binarizes at 0.0 by default, so scaled values above the training mean become 1
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])
# Train the model
nb_clf = BernoulliNB()
nb_clf.fit(X_train, y_train)
# Make predictions
y_pred = nb_clf.predict(X_test)
# Check accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")