Logistic Regression, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Sep, 2024

CLASSIFICATION ALGORITHM

Finding the right weights to fit the data

While some probability-based machine learning models (like Naive Bayes) make bold assumptions about feature independence, logistic regression takes a more measured approach. Think of it as drawing a line (or plane) that separates two outcomes, allowing us to predict probabilities with a bit more flexibility.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Logistic regression is a statistical method used for predicting binary outcomes. Despite its name, it is used for classification rather than regression. It estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (otherwise, it predicts the other class).

Throughout this article, we'll use this artificial golf dataset (inspired by [1]) as an example. The dataset predicts whether a person will play golf based on weather conditions.

Just like in KNN, logistic regression requires the data to be scaled first. Convert categorical columns into 0 & 1 and also scale the numerical features so that no single feature dominates during training.

Columns: ‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Wind’ and ‘Play’ (target feature). The categorical columns (Outlook & Wind) are encoded using one-hot encoding while the numerical columns are scaled using standard scaling (z-normalization).
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Create dataset from dictionary
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]

# Split data into features and target
X, y = df.drop(columns='Play'), df['Play']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])

# Print results
print("Training set:")
print(pd.concat([X_train, y_train], axis=1), '\n')
print("Test set:")
print(pd.concat([X_test, y_test], axis=1))

Logistic regression works by applying the logistic function to a linear combination of the input features. Here is how it operates:

  1. Calculate a weighted sum of the input features (similar to linear regression).
  2. Apply the logistic function (also called the sigmoid function) to this sum, which maps any real number to a value between 0 and 1.
  3. Interpret this value as the probability of belonging to the positive class.
  4. Use a threshold (typically 0.5) to make the final classification decision.
For our golf dataset, logistic regression might combine the weather factors into a single score, then transform this score into a probability of playing golf, as the short sketch below illustrates.
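
Here is a minimal sketch of those four steps for a single instance. The weights and the encoded feature values below are purely hypothetical (not learned from the dataset); they only show the mechanics:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical (not learned) weights for: bias, sunny, overcast, rainy, Temperature, Humidity, Wind
weights = np.array([0.1, -0.4, 0.8, -0.2, -0.3, -0.5, -0.6])

# One already-encoded and scaled instance: bias term, one-hot outlook, scaled temperature & humidity, wind
x = np.array([1.0, 0.0, 1.0, 0.0, -0.5, 0.3, 0.0])

score = np.dot(weights, x)            # Step 1: weighted sum
probability = sigmoid(score)          # Steps 2-3: squash the score into a probability between 0 and 1
prediction = int(probability >= 0.5)  # Step 4: apply the 0.5 threshold

print(f"Score: {score:.3f}, Probability: {probability:.3f}, Prediction: {prediction}")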

The training process for logistic regression involves finding the best weights for the input features. Here is the general outline:

  1. Initialize the weights (typically to small random values).
# Initialize weights (including the bias) to 0.1
initial_weights = np.full(X_train_np.shape[1], 0.1)

# Display the initial weights
print(f"Initial Weights: {initial_weights}")

2. For each training example:
a. Calculate the predicted probability using the current weights.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def calculate_probabilities(X, weights):
    z = np.dot(X, weights)
    return sigmoid(z)

def calculate_log_loss(probabilities, y):
    return -y * np.log(probabilities) - (1 - y) * np.log(1 - probabilities)

def create_output_dataframe(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)

    df = pd.DataFrame({
        'Probability': probabilities,
        'Label': y,
        'Log Loss': log_losses
    })

    return df

def calculate_average_log_loss(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)
    return np.mean(log_losses)

# Convert X_train and y_train to numpy arrays for easier computation
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()

# Add a column of 1s to X_train_np for the bias term
X_train_np = np.column_stack((np.ones(X_train_np.shape[0]), X_train_np))

# Create and display DataFrame for the initial weights
initial_df = create_output_dataframe(X_train_np, y_train_np, initial_weights)
print(initial_df.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print(f"\nAverage Log Loss: {calculate_average_log_loss(X_train_np, y_train_np, initial_weights):.6f}")

b. Compare this probability to the actual class label by calculating its log loss.
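For example, if the model assigns probability 0.7 to an example whose true label is 1, the log loss is −(1·log 0.7 + 0·log 0.3) ≈ 0.357; a confidently wrong prediction (say, 0.7 for a true label of 0) costs much more, about 1.204.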

3. Update the weights to minimize the loss (usually with an optimization algorithm such as gradient descent; this means repeating Step 2 until the log loss can no longer get smaller).

def gradient_descent_step(X, y, weights, learning_rate):
    m = len(y)
    probabilities = calculate_probabilities(X, weights)
    gradient = np.dot(X.T, (probabilities - y)) / m
    new_weights = weights - learning_rate * gradient  # Create a new array for the updated weights
    return new_weights

# Perform one step of gradient descent (one of the simplest optimization algorithms)
learning_rate = 0.1
updated_weights = gradient_descent_step(X_train_np, y_train_np, initial_weights, learning_rate)

# Print initial and updated weights
print("\nInitial weights:")
for feature, weight in zip(['Bias'] + list(X_train.columns), initial_weights):
    print(f"{feature:11}: {weight:.2f}")

print("\nUpdated weights after one iteration:")
for feature, weight in zip(['Bias'] + list(X_train.columns), updated_weights):
    print(f"{feature:11}: {weight:.2f}")

# With sklearn, you can get the final weights (coefficients)
# and the final bias (intercept) easily.
# The result is almost the same as doing it manually above.

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(penalty=None, solver='saga')
lr_clf.fit(X_train, y_train)

coefficients = lr_clf.coef_
intercept = lr_clf.intercept_

y_train_prob = lr_clf.predict_proba(X_train)[:, 1]
loss = -np.mean(y_train * np.log(y_train_prob) + (1 - y_train) * np.log(1 - y_train_prob))

print(f"Final Weights & Bias: {coefficients[0].round(2)}, {round(intercept[0], 2)}")
print("Final Loss:", round(loss, 3))

Once the model is trained:
1. For a new instance, calculate the probability with the final weights (also called coefficients), just like during the training step.

2. Interpret the output by looking at the probability: if p ≥ 0.5, predict class 1; otherwise, predict class 0.

# Calculate prediction probabilities
predicted_probs = lr_clf.predict_proba(X_test)[:, 1]

# Recover the linear scores (log-odds) from the probabilities
z_values = np.log(predicted_probs / (1 - predicted_probs))

result_df = pd.DataFrame({
    'ID': X_test.index,
    'Z-Values': z_values.round(3),
    'Probabilities': predicted_probs.round(3)
}).set_index('ID')

print(result_df)

# Make predictions
y_pred = lr_clf.predict(X_test)
print(y_pred)

Evaluation Step

result_df = pd.DataFrame({
    'ID': X_test.index,
    'Label': y_test,
    'Probabilities': predicted_probs.round(2),
    'Prediction': y_pred,
}).set_index('ID')

print(result_df)
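
To summarize how well the predictions match the true labels, we can also report the accuracy on the test set (a minimal sketch using the accuracy_score imported earlier):

# Fraction of test instances classified correctly
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")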

Logistic regression has several important parameters that control its behavior:

1. Penalty: The type of regularization to use ('l1', 'l2', 'elasticnet', or None). Regularization in logistic regression prevents overfitting by adding a penalty term to the model's loss function, which encourages simpler models.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

regs = [None, 'l1', 'l2']
coeff_dict = {}

for reg in regs:
    lr_clf = LogisticRegression(penalty=reg, solver='saga')
    lr_clf.fit(X_train, y_train)
    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_
    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[reg] = {
        'Coefficients': coefficients,
        'Intercept': intercept,
        'Loss': loss,
        'Accuracy': accuracy
    }

for reg, vals in coeff_dict.items():
    print(f"{reg}: Coeff: {vals['Coefficients'][0].round(2)}, Intercept: {vals['Intercept'].round(2)}, Loss: {round(vals['Loss'], 3)}, Accuracy: {round(vals['Accuracy'], 3)}")

2. Regularization Strength (C): Controls the trade-off between fitting the training data and keeping the model simple. A smaller C means stronger regularization.

# List of regularization strengths to try for L1
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l1', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)

    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L1_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0], 2),
        'Loss': round(loss, 3),
        'Accuracy': round(accuracy * 100, 2)
    }

print(pd.DataFrame(coeff_dict).T)

# List of regularization strengths to try for L2
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l2', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L2_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0], 2),
        'Loss': round(loss, 3),
        'Accuracy': round(accuracy * 100, 2)
    }

print(pd.DataFrame(coeff_dict).T)

3. Solver: The algorithm to use for optimization ('liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'). Some types of regularization might require a particular solver.

4. Max Iterations: The maximum number of iterations for the solver to converge.

For our golf dataset, we might start with an 'l2' penalty, the 'liblinear' solver, and C=1.0 as a baseline, as in the sketch below.
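
A rough sketch of passing these parameters (the values come from the baseline above; the max_iter cap is an arbitrary illustrative choice, and note that not every solver supports every penalty, e.g. 'l1' needs 'liblinear' or 'saga'):

from sklearn.linear_model import LogisticRegression

# Baseline configuration: l2 penalty, liblinear solver, C=1.0, with an explicit iteration cap
lr_clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=1000)
lr_clf.fit(X_train, y_train)

print(f"Converged after {lr_clf.n_iter_[0]} iterations")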

Like any algorithm in machine learning, logistic regression has its strengths and limitations.

Pros:

  1. Simplicity: Easy to implement and understand.
  2. Interpretability: The weights directly show the importance of each feature.
  3. Efficiency: Doesn't require much computational power.
  4. Probabilistic Output: Provides probabilities rather than just classifications.

Cons:

  1. Linearity Assumption: Assumes a linear relationship between the features and the log-odds of the outcome.
  2. Feature Independence: Assumes features are not highly correlated.
  3. Limited Complexity: May underfit in cases where the decision boundary is highly non-linear.
  4. Requires More Data: Needs a relatively large sample size for stable results.

In our golf example, logistic regression might provide a clear, interpretable model of how each weather factor influences the decision to play golf. However, it might struggle if the decision involves complex interactions between weather conditions that can't be captured by a linear model.

Logistic regression shines as a powerful yet straightforward classification tool. It stands out for its ability to handle complex data while remaining easy to interpret. Unlike some other basic models, it provides smooth probability estimates and works well with many features. In the real world, from predicting customer behavior to medical diagnoses, logistic regression often performs surprisingly well. It's not just a stepping stone; it's a reliable model that can match more complex models in many situations.

# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train the model
lr_clf = LogisticRegression(penalty='l2', C=1, solver='saga')
lr_clf.fit(X_train, y_train)

# Make predictions
y_pred = lr_clf.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")