Explainable Generic ML Pipeline with MLflow

Many machine learning algorithms, such as linear models (e.g., linear regression, SVM), distance-based models (e.g., KNN, PCA), and gradient-based models (e.g., gradient boosting methods or gradient descent optimization), tend to perform better with scaled input features, because scaling prevents features with larger ranges from dominating the learning process. Moreover, real-world data often contains missing values. Therefore, in this first iteration, we will build a pre-processor that can be trained to scale new data and impute missing values, preparing it for model consumption.

Once this pre-processor is built, I'll then demo how to easily plug it into the pyfunc ML pipeline. Sounds good? Let's go. 🤠

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class PreProcessor(BaseEstimator, TransformerMixin):
    """
    Custom preprocessor for numeric features.

    - Handles scaling of numeric data
    - Performs imputation of missing values

    Attributes:
        transformer (Pipeline): Pipeline for numeric preprocessing
        features (List[str]): Names of input features
    """

    def __init__(self):
        """
        Initialize preprocessor.

        - Creates placeholder for transformer pipeline
        """
        self.transformer = None

    def fit(self, X, y=None):
        """
        Fits the transformer on the provided dataset.

        - Configures scaling for numeric features
        - Sets up imputation for missing values
        - Stores feature names for later use

        Parameters:
            X (pd.DataFrame): The input features to fit the transformer.
            y (pd.Series, optional): Target variable, not used in this method.

        Returns:
            PreProcessor: The fitted transformer instance.
        """
        self.features = X.columns.tolist()

        if self.features:
            self.transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())
            ])
            self.transformer.fit(X[self.features])

        return self

    def transform(self, X):
        """
        Transform input data using the fitted pipeline.

        - Applies scaling to numeric features
        - Handles missing values through imputation

        Parameters:
            X (pd.DataFrame): Input features to transform

        Returns:
            pd.DataFrame: Transformed data with scaled and imputed features
        """
        X_transformed = pd.DataFrame()

        if self.features:
            transformed_data = self.transformer.transform(X[self.features])
            X_transformed[self.features] = transformed_data

        X_transformed.index = X.index

        return X_transformed

    def fit_transform(self, X, y=None):
        """
        Fits the transformer on the input data and then transforms it.

        Parameters:
            X (pd.DataFrame): The input features to fit and transform.
            y (pd.Series, optional): Target variable, not used in this method.

        Returns:
            pd.DataFrame: The transformed data.
        """
        self.fit(X, y)
        return self.transform(X)

This pre-processor can be fitted on train data and then used to process any new data. It will become an element in the ML pipeline below, but of course, we can use or test it independently. Let's create a synthetic dataset and use the pre-processor to transform it.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Set parameters for synthetic data
n_feature = 10
n_inform = 4
n_redundant = 0
n_samples = 1000

# Generate synthetic classification data
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_feature,
    n_informative=n_inform,
    n_redundant=n_redundant,
    shuffle=False,
    random_state=12
)

# Create feature names
feat_names = [f'inf_{i+1}' for i in range(n_inform)] + \
             [f'rand_{i+1}' for i in range(n_feature - n_inform)]

# Convert to DataFrame with named features
X = pd.DataFrame(X, columns=feat_names)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=22
)
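
Before wiring it into a pipeline, here is a minimal sketch of exercising the pre-processor on its own (using the class and data defined above):

# fit the pre-processor on train data, then transform both splits
preprocessor = PreProcessor()
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

# scaled features should now be centered near 0 with unit variance
print(X_train_scaled.describe().round(2))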

Below are screenshots from {sweetViz} reports before vs. after scaling; you can see that scaling didn't change the underlying shape of each feature's distribution but merely rescaled and shifted it. BTW, it takes two lines to generate a pretty comprehensive EDA report with {sweetViz}; the code is available in the GitHub repo linked above. 🥂

Screenshots from SweetViz reports before vs. after preprocessing
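
For reference, a sketch of what those two lines can look like (my own guess rather than the repo's exact code; it assumes the `X_train` / `X_train_scaled` objects from the sketch above):

import sweetviz as sv

# compare the raw vs. scaled training data and render an HTML report
report = sv.compare([X_train, "Before scaling"], [X_train_scaled, "After scaling"])
report.show_html("eda_report.html")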

Now, let's create an ML pipeline in the mlflow.pyfunc flavor that can encapsulate this preprocessor.

from typing import Any

import mlflow

class ML_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    Custom ML pipeline for classification and regression.

    - Works with any scikit-learn compatible model
    - Combines preprocessing and model training
    - Handles model predictions
    - Compatible with MLflow tracking
    - Supports MLflow deployment

    Attributes:
        model (BaseEstimator or None): A scikit-learn compatible model instance
        preprocessor (Any or None): Data preprocessing pipeline
        config (Any or None): Optional config for model settings
        task (str): Type of ML task ('classification' or 'regression')
    """

    def __init__(self, model=None, preprocessor=None, config=None):
        """
        Initialize the ML_PIPELINE.

        Parameters:
            model (BaseEstimator, optional):
                - Scikit-learn compatible model
                - Defaults to None

            preprocessor (Any, optional):
                - Transformer or pipeline for data preprocessing
                - Defaults to None

            config (Any, optional):
                - Additional model settings
                - Defaults to None
        """
        self.model = model
        self.preprocessor = preprocessor
        self.config = config
        self.task = "classification" if hasattr(self.model, "predict_proba") else "regression"

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        """
        Train the model on the provided data.

        - Applies preprocessing to features
        - Fits the model on transformed data

        Parameters:
            X_train (pd.DataFrame): Training features
            y_train (pd.Series): Target values
        """
        X_train_preprocessed = self.preprocessor.fit_transform(X_train.copy())
        self.model.fit(X_train_preprocessed, y_train)

    def predict(
        self, context: Any, model_input: pd.DataFrame
    ) -> np.ndarray:
        """
        Generate predictions using the trained model.

        - Applies preprocessing to new data
        - Uses the model to make predictions

        Parameters:
            context (Any): Optional context information provided
                by MLflow during the prediction phase
            model_input (pd.DataFrame): Input features

        Returns:
            Any: Model predictions or probabilities
        """
        processed_model_input = self.preprocessor.transform(model_input.copy())
        if self.task == "classification":
            prediction = self.model.predict_proba(processed_model_input)[:, 1]
        elif self.task == "regression":
            prediction = self.model.predict(processed_model_input)
        return prediction

The ML pipeline defined above takes the preprocessor and ML algorithm as parameters. A usage example is below.

import lightgbm as lgb

# define the ML pipeline instance with a LightGBM classifier
ml_pipeline = ML_PIPELINE(model=lgb.LGBMClassifier(),
                          preprocessor=PreProcessor())

It's as simple as that! 🎉 If you want to experiment with another algorithm, just swap it in as shown below. As a wrapper, it can encapsulate both regression and classification algorithms. For the latter, predicted probabilities are returned, as shown in the example above.

from sklearn.ensemble import RandomForestRegressor

# define the ML pipeline instance with a random forest regressor
ml_pipeline = ML_PIPELINE(model=RandomForestRegressor(),
                          preprocessor=PreProcessor())

As you can see from the code chunk below, passing hyperparameters to the algorithms is straightforward, making this ML pipeline a great tool for hyperparameter tuning. I'll elaborate on this topic in the following articles.

import xgboost as xgb

params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1
}
model = xgb.XGBClassifier(**params)
ml_pipeline = ML_PIPELINE(model=model,
                          preprocessor=PreProcessor())

Because this ML pipeline is built in the mlflow.pyfunc flavor, we can log it with rich metadata saved automatically by MLflow for downstream use. When deployed, we can feed the metadata as context for the model in the predict function, as shown below. More info and demos are available in my previous article, which is linked at the beginning.

from sklearn.metrics import roc_auc_score

# train the ML pipeline
ml_pipeline.fit(X_train, y_train)

# use the trained pipeline for prediction
y_prob = ml_pipeline.predict(
    context=None,  # provides metadata for the model in production
    model_input=X_test
)
auc = roc_auc_score(y_test, y_prob)
print(f"auc: {auc:.3f}")

The above pre-processor has worked well so far, but let's improve it in two ways below and then demonstrate how to switch between pre-processors easily.

  1. Allow users to customize the pre-processing process, for instance, to specify the imputation strategy.
  2. Expand the pre-processor's capacity to handle categorical features.

from sklearn.preprocessing import OneHotEncoder

class PreProcessor_v2(BaseEstimator, TransformerMixin):
    """
    Custom transformer for data preprocessing.

    - Scales numeric features
    - Encodes categorical features
    - Handles missing values via imputation
    - Compatible with scikit-learn pipeline

    Attributes:
        num_impute_strategy (str): Numeric imputation strategy
        cat_impute_strategy (str): Categorical imputation strategy
        num_transformer (Pipeline): Numeric preprocessing pipeline
        cat_transformer (Pipeline): Categorical preprocessing pipeline
        transformed_cat_cols (List[str]): One-hot encoded column names
        num_features (List[str]): Numeric feature names
        cat_features (List[str]): Categorical feature names
    """

    def __init__(self, num_impute_strategy='median',
                 cat_impute_strategy='most_frequent'):
        """
        Initialize the transformer.

        - Sets up the numeric data transformer
        - Sets up the categorical data transformer
        - Configures imputation strategies

        Parameters:
            num_impute_strategy (str): Strategy for numeric missing values
            cat_impute_strategy (str): Strategy for categorical missing values
        """
        self.num_impute_strategy = num_impute_strategy
        self.cat_impute_strategy = cat_impute_strategy

    def fit(self, X, y=None):
        """
        Fit the transformer on input data.

        - Identifies feature types
        - Configures feature scaling
        - Sets up encoding
        - Fits imputation strategies

        Parameters:
            X (pd.DataFrame): Input features
            y (pd.Series, optional): Target variable, not used

        Returns:
            PreProcessor_v2: Fitted transformer
        """
        self.num_features = X.select_dtypes(include=np.number).columns.tolist()
        self.cat_features = X.select_dtypes(exclude=np.number).columns.tolist()

        if self.num_features:
            self.num_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy=self.num_impute_strategy)),
                ('scaler', StandardScaler())
            ])
            self.num_transformer.fit(X[self.num_features])

        if self.cat_features:
            self.cat_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy=self.cat_impute_strategy)),
                ('encoder', OneHotEncoder(handle_unknown='ignore'))
            ])
            self.cat_transformer.fit(X[self.cat_features])

        return self

    def get_transformed_cat_cols(self):
        """
        Get transformed categorical column names.

        - Creates names after one-hot encoding
        - Combines category with encoded values

        Returns:
            List[str]: One-hot encoded column names
        """
        cat_cols = []
        cats = self.cat_features
        cat_values = self.cat_transformer['encoder'].categories_
        for cat, values in zip(cats, cat_values):
            cat_cols += [f'{cat}_{value}' for value in values]

        return cat_cols

    def transform(self, X):
        """
        Transform input data.

        - Applies fitted scaling
        - Applies fitted encoding
        - Handles numeric and categorical features

        Parameters:
            X (pd.DataFrame): Input features

        Returns:
            pd.DataFrame: Transformed data
        """
        X_transformed = pd.DataFrame()

        if self.num_features:
            transformed_num_data = self.num_transformer.transform(X[self.num_features])
            X_transformed[self.num_features] = transformed_num_data

        if self.cat_features:
            transformed_cat_data = self.cat_transformer.transform(X[self.cat_features]).toarray()
            self.transformed_cat_cols = self.get_transformed_cat_cols()
            transformed_cat_df = pd.DataFrame(transformed_cat_data, columns=self.transformed_cat_cols)
            X_transformed = pd.concat([X_transformed, transformed_cat_df], axis=1)

        X_transformed.index = X.index

        return X_transformed

    def fit_transform(self, X, y=None):
        """
        Fit and transform input data.

        - Fits the transformer to the data
        - Applies the transformation
        - Combines both operations

        Parameters:
            X (pd.DataFrame): Input features
            y (pd.Series, optional): Target variable, not used

        Returns:
            pd.DataFrame: Transformed data
        """
        self.fit(X, y)
        return self.transform(X)

There you have it: a new preprocessor that is 1) more customizable and 2) handles both numerical and categorical features. Let's define an ML pipeline instance with it.

# Define a PreProcessor (V2) instance while specifying the imputation strategy
preprocessor = PreProcessor_v2(
    num_impute_strategy='mean'
)
# Define an ML Pipeline instance with this preprocessor
ml_pipeline = ML_PIPELINE(
    model=xgb.XGBClassifier(),  # swap in a different ML algorithm
    preprocessor=preprocessor   # swap in a different pre-processor
)

Let's test this new ML pipeline instance with another synthetic dataset containing both numerical and categorical features.

# add missings
np.random.seed(42)
missing_rate = 0.20
n_missing = int(np.floor(missing_rate * X.size))
rows = np.random.randint(0, X.shape[0], n_missing)
cols = np.random.randint(0, X.shape[1], n_missing)
X.values[rows, cols] = np.nan
actual_missing_rate = X.isna().sum().sum() / X.size
print(f"Target missing rate: {missing_rate:.2%}")
print(f"Actual missing rate: {actual_missing_rate:.2%}")

# change X['inf_1'] to categorical
percentiles = [0, 0.1, 0.5, 0.9, 1]
labels = ['bottom', 'lower-mid', 'upper-mid', 'top']
X['inf_1'] = pd.qcut(X['inf_1'], q=percentiles, labels=labels)
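
With the dataset updated, here is a quick sketch of the end-to-end run (my reconstruction, reusing the split settings from earlier):

# re-split the updated dataset, now with missing values and a categorical feature
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22
)

# fit the v2 pipeline defined above and evaluate it
ml_pipeline.fit(X_train, y_train)
y_prob = ml_pipeline.predict(context=None, model_input=X_test)
print(f"auc: {roc_auc_score(y_test, y_prob):.3f}")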

There you have it: the ML pipeline runs smoothly with the new data. As expected, however, if we define the ML pipeline with the previous preprocessor and then run it on this dataset, we will encounter errors, because the previous preprocessor was not designed to handle categorical features.

# create an ML pipeline instance with PreProcessor v1
ml_pipeline = ML_PIPELINE(
    model=lgb.LGBMClassifier(verbose=-1),
    preprocessor=PreProcessor()
)

try:
    ml_pipeline.fit(X_train, y_train)
except Exception as e:
    print(f"Error: {e}")

Error: Cannot use median strategy with non-numeric data:
could not convert string to float: 'lower-mid'

Adding an explainer to an ML pipeline can be super helpful in several ways:

  1. Model selection: It helps us choose the best model by evaluating the soundness of its reasoning. Two algorithms may perform similarly on metrics like AUC or precision, but the key features they rely on may differ. Reviewing model reasoning with domain experts to discuss which model makes more sense in such scenarios is a good idea.
  2. Troubleshooting: One helpful strategy for model improvement is to analyze the reasoning behind errors. For example, in classification problems, we can identify the false positives where the model was most confident (i.e., produced the highest predicted probabilities) and investigate what went wrong in the reasoning and which key features contributed to the errors.
  3. Model monitoring: Besides the usual monitoring components such as data drift and performance metrics, it is informative to monitor model reasoning as well. If there is a significant shift in the key features that drive the decisions made by a model in production, I want to be alerted.
  4. Model implementation: In some scenarios, supplying model reasoning along with model predictions can be highly beneficial to our end users. For example, to help a customer service agent best retain a churning customer, we can provide the churn score alongside the customer features that contributed to it.

Because our ML pipeline is algorithm-agnostic, it is imperative that the explainer also works across algorithms.

SHAP (SHapley Additive exPlanations) values are an excellent choice for our purpose because they provide theoretically robust explanations based on game theory. They are designed to work consistently across algorithms, including both tree-based and non-tree-based models, with some approximations for the latter. Additionally, SHAP offers rich visualization capabilities and is widely regarded as an industry standard.

In the notebooks below, I have dug into the similarities and differences between SHAP implementations for various ML algorithms.

To create a generic explainer for our ML pipeline, the key differences to handle are:

1. Whether the model is directly supported by shap.Explainer

The model-specific SHAP explainers are significantly more efficient than the model-agnostic ones. Therefore, the approach we take here

  • first attempts to use the direct SHAP explainer for the model type,
  • and, if that fails, falls back to a model-agnostic explainer that uses the predict function.

2. The shape of SHAP values

For binary classification problems, SHAP values can come in two formats/shapes.

  • Format 1: only shows the impact on the positive class
shape = (n_samples, n_features) # 2-D array
  • Format 2: shows the impact on both classes
shape = (n_samples, n_features, n_classes) # 3-D array

The explainer implementation below always shows the impact on the positive class. When the impact on both classes is available in the SHAP values, it selects the ones for the positive class.
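
As a quick standalone illustration of that shape check (`shap_values` here stands for any `shap.Explanation` object returned by an explainer):

# per-class attributions come back as a 3-D array: (n_samples, n_features, n_classes)
if len(shap_values.values.shape) == 3:
    shap_values_pos = shap_values[:, :, 1]  # keep the positive class only
else:
    shap_values_pos = shap_values           # already positive-class only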

Please see the code below for the implementation of the approach discussed above.

import shap

class ML_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    Custom ML pipeline for classification and regression.

    - Works with scikit-learn compatible models
    - Handles data preprocessing
    - Manages model training and predictions
    - Provides global and local model explanations
    - Compatible with MLflow tracking
    - Supports MLflow deployment

    Attributes:
        model (BaseEstimator or None): A scikit-learn compatible model instance
        preprocessor (Any or None): Data preprocessing pipeline
        config (Any or None): Optional config for model settings
        task (str): Type of ML task ('classification' or 'regression')
        both_class (bool): Whether SHAP values include both classes
        shap_values (shap.Explanation): SHAP values for model explanation
        X_explain (pd.DataFrame): Processed features for SHAP explanation
    """

    # ------- same code as above ---------

    def explain_model(self, X):
        """
        Generate SHAP values and plots for model interpretation.

        This method:
        1. Transforms the input data using the fitted preprocessor
        2. Creates a SHAP explainer appropriate for the model type
        3. Calculates SHAP values for feature importance
        4. Generates a summary plot of feature importance

        Parameters:
            X : pd.DataFrame
                Input features to generate explanations for.

        Returns: None
            The method stores the following attributes on the class:
            - self.X_explain : pd.DataFrame
                Transformed data with original numeric values for interpretation
            - self.shap_values : shap.Explanation
                SHAP values for each prediction
            - self.both_class : bool
                Whether the model outputs probabilities for both classes
        """
        X_transformed = self.preprocessor.transform(X.copy())
        self.X_explain = X_transformed.copy()
        # keep pre-transformed values for numeric features for interpretability
        self.X_explain[self.preprocessor.num_features] = X[self.preprocessor.num_features]
        self.X_explain = self.X_explain.reset_index(drop=True)
        try:
            # Attempt to create an explainer that directly supports the model
            explainer = shap.Explainer(self.model)
        except Exception:
            # Fallback for models or shap versions where direct support may be limited
            explainer = shap.Explainer(self.model.predict, X_transformed)
        self.shap_values = explainer(X_transformed)

        # get the shape of the shap values and extract accordingly
        self.both_class = len(self.shap_values.values.shape) == 3
        if self.both_class:
            shap.summary_plot(self.shap_values[:, :, 1])
        else:
            shap.summary_plot(self.shap_values)

    def explain_case(self, n):
        """
        Generate a SHAP waterfall plot for one specific case.

        - Shows feature contributions
        - Starts from the base value
        - Ends at the final prediction
        - Shows original feature values for better interpretability

        Parameters:
            n (int): Case index (1-based),
                e.g., n=1 explains the first case.

        Returns:
            None: Displays SHAP waterfall plot

        Notes:
            - Requires explain_model() to be run first
            - Shows the positive class for binary tasks
        """
        if self.shap_values is None:
            print("""
            Please explain the model first by running
            `explain_model()` on a particular dataset
            """)
        else:
            self.shap_values.data = self.X_explain
            if self.both_class:
                shap.plots.waterfall(self.shap_values[:, :, 1][n-1])
            else:
                shap.plots.waterfall(self.shap_values[n-1])

Now, the updated ML pipeline instance can create explanatory graphs for you in just one line of code. 😎
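
For example, assuming the updated pipeline has been fitted and `X_test` is at hand:

# global explanation: SHAP summary plot across the test set
ml_pipeline.explain_model(X_test)

# local explanation: waterfall plot for the first test case
ml_pipeline.explain_case(n=1)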

SHAP plot for global explanation of the model

SHAP plot for local explanation of any specific case

Of course, you can log a trained ML pipeline with MLflow and enjoy all the metadata for model deployment and reproducibility. In the screenshot below, you can see that, in addition to the pickled pyfunc model itself, the Python environment, metrics, and hyperparameters have all been logged with just the few lines of code below. To learn more, please refer to my previous article on mlflow.pyfunc, which is linked at the beginning.

# Log the model with MLflow
with mlflow.start_run() as run:

    # Log the custom model with an auto-captured conda environment
    model_info = mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ml_pipeline,
        conda_env=mlflow.sklearn.get_default_conda_env()
    )
    # Log model parameters
    mlflow.log_params(ml_pipeline.model.get_params())

    # Log metrics (auc was computed in the evaluation step above)
    mlflow.log_metric("auc", auc)

    # Get the run ID
    run_id = run.info.run_id

Rich model metadata and artifacts logged with MLflow
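
To consume the logged pipeline later, a minimal sketch (assuming the `run_id` captured above):

# load the pipeline back as a generic pyfunc model and predict
loaded_pipeline = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
y_prob = loaded_pipeline.predict(X_test)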

That's it: a generic and explainable ML pipeline that works for both classification and regression algorithms. Take the code and extend it to suit your use case. 🤗 If you find this helpful, please give me a clap 👏🥰

To further our journey in the mlflow.pyfunc series, below are some topics I'm considering. Feel free to leave a comment and let me know what you would like to see. 🥰

  • Feature selection
  • Hyperparameter tuning
  • If, instead of choosing among off-the-shelf algorithms, one decides to ensemble multiple algorithms or build highly customized solutions, they can still enjoy a generic model representation and seamless migration via mlflow.pyfunc.

Stay tuned and follow me on Medium. 😁

💼LinkedIn | 😺GitHub | 🕊️Twitter/X