Algorithm-Agnostic Model Building with MLflow | by Mena Wang, PhD | Aug, 2024

Sounds interesting? If so, this article is here to get you started with mlflow.pyfunc. 🥂

  • First, let’s go through a simple toy example of creating an mlflow.pyfunc class.
  • Then, we will define an mlflow.pyfunc class that encapsulates a machine learning pipeline (for example, an estimator plus some preprocessing logic). We will also train, log and load this ML pipeline for inference.
  • Finally, let’s take a deep dive into the encapsulated mlflow.pyfunc object, explore the rich metadata and artifacts automatically tracked for us by mlflow, and get a better grasp of the full power that mlflow.pyfunc offers.

🔗 All code and config are available on GitHub. 🧰

First, let’s create a simple toy mlflow.pyfunc model and then use it with the mlflow workflow.

  • Step 1: Create the model
  • Step 2: Log the model
  • Step 3: Load the logged model to perform inference

# Step 1: Create an mlflow.pyfunc model
import mlflow.pyfunc


class ToyModel(mlflow.pyfunc.PythonModel):
    """
    ToyModel is a simple example implementation of an MLflow Python model.
    """

    def predict(self, context, model_input):
        """
        A basic predict function that takes a model_input list and returns a new list
        where each element is increased by one.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (list of int or float): A list of numerical values that the model will use for prediction.

        Returns:
        - list of int or float: A list in which each element of model_input is increased by one.
        """
        return [x + 1 for x in model_input]

As you can see from the example above, you can create an mlflow.pyfunc model to implement any custom Python function you see fit for your ML solution; it doesn’t have to be an off-the-shelf machine learning algorithm.

You can then log this model and load it later to perform inference.

# Step 2: log this model as an mlflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ToyModel()
    )
    run_id = mlflow.active_run().info.run_id

# Step 3: load the logged model to perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
# dummy new data
x_new = [1, 2, 3]
# model inference for the new data
print(model.predict(x_new))
[2, 3, 4]

Now, let’s create an ML pipeline encapsulating an estimator with additional custom logic.

In the example below, the XGB_PIPELINE class is a wrapper that integrates the estimator with preprocessing steps, which can be desirable for some MLOps implementations. Leveraging mlflow.pyfunc, this wrapper is estimator-agnostic and offers a uniform model representation. Specifically,

  • fit(): Instead of exposing XGBoost’s native API (xgboost.train()) directly, this class exposes .fit(), which adheres to sklearn conventions, enabling easy integration into sklearn pipelines and ensuring consistency across different estimators.
  • DMatrix(): DMatrix is a core data structure in XGBoost that optimizes data for training and prediction. In this class, the step that transforms a pandas DataFrame into a DMatrix is wrapped inside the class, so it accepts pandas DataFrames just like other sklearn estimators.
  • predict(): This is the mlflow.pyfunc model’s universal inference API. It is the same for this ML pipeline, for the toy model above, and for any machine learning algorithm or custom logic we wrap in an mlflow.pyfunc model.

import xgboost as xgb
import mlflow.pyfunc
from typing import Any, Dict, Union
import pandas as pd


class XGB_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    XGB_PIPELINE is an example implementation of an MLflow Python model with XGBoost.
    """

    def __init__(self, params: Dict[str, Union[str, int, float]]):
        """
        Initialize the model with given parameters.

        Parameters:
        - params (Dict[str, Union[str, int, float]]): Parameters for the XGBoost model.
        """
        self.params = params
        self.xgb_model = None
        self.config = None

    def preprocess_input(self, model_input: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocess the input data.

        Parameters:
        - model_input (pd.DataFrame): The input data to preprocess.

        Returns:
        - pd.DataFrame: The preprocessed input data.
        """
        processed_input = model_input.copy()
        # put any desired preprocessing logic here
        # (for illustration only: drop the first column)
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)

        return processed_input

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        """
        Train the XGBoost model.

        Parameters:
        - X_train (pd.DataFrame): The training input data.
        - y_train (pd.Series): The target values.
        """
        processed_model_input = self.preprocess_input(X_train.copy())
        # wrap the DataFrame in XGBoost's DMatrix before calling the native API
        dtrain = xgb.DMatrix(processed_model_input, label=y_train)
        self.xgb_model = xgb.train(self.params, dtrain)

    def predict(self, context: Any, model_input: pd.DataFrame) -> Any:
        """
        Predict using the trained XGBoost model.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (pd.DataFrame): The input data for making predictions.

        Returns:
        - Any: The prediction results.
        """
        processed_model_input = self.preprocess_input(model_input.copy())
        dmatrix = xgb.DMatrix(processed_model_input)
        return self.xgb_model.predict(dmatrix)

Now, let’s train and log the XGB_PIPELINE model.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd

# Generate synthetic datasets for demo
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train and log the model
with mlflow.start_run(run_name='xgb_demo') as run:

    # Create an instance of XGB_PIPELINE
    params = {
        'objective': 'reg:squarederror',
        'max_depth': 3,
        'learning_rate': 0.1,
    }
    model = XGB_PIPELINE(params)

    # Fit the model
    model.fit(X_train=pd.DataFrame(X_train), y_train=y_train)

    # Log the model
    model_info = mlflow.pyfunc.log_model(
        artifact_path='model',
        python_model=model,
    )

    run_id = mlflow.active_run().info.run_id

The model has been logged successfully. ✌️ Now, let’s load it for inference.

loaded_model = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)
loaded_model.predict(pd.DataFrame(X_test))
array([ 4.11692047e+00,  7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
...

The above process is pretty smooth, isn’t it? This represents the basic functionality of the mlflow.pyfunc object. Now, let’s dive deeper to explore the full power that mlflow.pyfunc has to offer.

1. model_info

In the example above, the model_info object returned by mlflow.pyfunc.log_model() is an instance of the mlflow.models.model.ModelInfo class. It contains metadata and information about the logged model. For example:

[Screenshot: some attributes of the model_info object]

Feel free to run dir(model_info) to explore further, or check out the source code for all the attributes defined. The attribute I use the most is model_uri, which indicates where the logged model can be found within the mlflow tracking system.
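
As a small sketch of inspecting these attributes without scrolling through dir(), the hypothetical helper below collects a few of them into a dict. It assumes the attribute names model_uri, run_id, artifact_path and mlflow_version; any attribute missing on the object is reported as None. A stand-in object with made-up values is used so the snippet runs on its own.

```python
from types import SimpleNamespace


def summarize_model_info(
    model_info,
    fields=("model_uri", "run_id", "artifact_path", "mlflow_version"),
):
    """Collect a few attributes into a dict; missing attributes map to None."""
    return {name: getattr(model_info, name, None) for name in fields}


# Stand-in object with made-up values so the snippet runs without a tracking store;
# in practice, pass the object returned by mlflow.pyfunc.log_model().
fake_info = SimpleNamespace(
    model_uri="runs:/abc123/model", run_id="abc123", artifact_path="model"
)
summary = summarize_model_info(fake_info)
print(summary)
```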

2. loaded_model

It’s worth clarifying that the loaded_model is not an instance of the XGB_PIPELINE class, but rather a wrapper object provided by mlflow.pyfunc for algorithm-agnostic inference. As shown below, an error is raised if you attempt to retrieve attributes of the XGB_PIPELINE class from the loaded_model.

print(loaded_model.params)
AttributeError: 'PyFuncModel' object has no attribute 'params'

3. unwrapped_model

All right, you may ask, then where is the trained instance of XGB_PIPELINE? Is it logged and retrievable through mlflow, too?

Don’t worry; it’s kept safe for you to unwrap easily, as shown below.

unwrapped_model = loaded_model.unwrap_python_model()
print(unwrapped_model.params)
{'objective': 'reg:squarederror', 'max_depth': 3, 'learning_rate': 0.1}

That’s how it’s done. 😎 With the unwrapped_model, you can access any properties or methods of your custom ML pipeline just like this! I sometimes add handy methods such as explain_model or post_processing to the custom pipeline, or include helpful attributes to trace the model training process and provide diagnostics 🤩… Well, I’d better stop here and leave those for the following articles. Suffice it to say, you can feel free to customize your ML pipeline for your use case and know that

  1. You will have access to all these tailor-made methods and attributes for downstream use, and
  2. This tailor-made custom model will be wrapped within the uniform mlflow.pyfunc inference API and hence enjoy a smooth migration to other estimators if necessary.

4. Context

You may have noticed that there is a context parameter in the predict methods of every mlflow.pyfunc class defined above. But interestingly, this parameter is not required when we make predictions with the loaded model. Why❓

loaded_model = mlflow.pyfunc.load_model(model_uri)
# the context parameter is not needed when calling `predict`
loaded_model.predict(model_input)

This is because the loaded_model above is a wrapper object provided by mlflow. If we use the unwrapped model instead, we need to pass the context explicitly, as shown below; otherwise, the code will raise an error.

unwrapped_model = loaded_model.unwrap_python_model()
# need to provide context manually
unwrapped_model.predict(context=None, model_input=model_input)
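
The difference between the two calls can be illustrated with plain Python. The sketch below is a deliberately simplified stand-in, not MLflow's actual implementation: CustomModel and WrapperSketch are hypothetical names. The wrapper stores the context and passes it on, which is why callers of the loaded model supply only the data.

```python
class CustomModel:
    """Stand-in for a user-defined PythonModel subclass."""

    def __init__(self, params):
        self.params = params

    def predict(self, context, model_input):
        return [x + 1 for x in model_input]


class WrapperSketch:
    """Simplified stand-in for the object load_model() returns (not MLflow's code)."""

    def __init__(self, python_model, context):
        self._python_model = python_model
        self._context = context

    def predict(self, model_input):
        # The wrapper fills in the context, so callers pass only the data
        return self._python_model.predict(self._context, model_input)

    def unwrap_python_model(self):
        return self._python_model


wrapped = WrapperSketch(CustomModel({"max_depth": 3}), context=None)
print(wrapped.predict([1, 2, 3]))            # context supplied by the wrapper
print(hasattr(wrapped, "params"))            # custom attributes are not exposed
print(wrapped.unwrap_python_model().params)  # the original object still has them
```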

So, what is this context? And what role does it play in the predict method?

The context is a PythonModelContext object that contains artifacts the pyfunc model can use when performing inference. It is created implicitly and automatically by mlflow when the model is loaded, from what the log_model() method saved.

Navigate to the mlruns subfolder in your project repo, which is automatically created by mlflow when you log an mlflow model. Find the folder named after the model’s run_id. Inside, you’ll find the model artifacts automatically logged for you, as shown below.

# get run_id of a loaded model
print(loaded_model.metadata.run_id)
38a617d0f30645e8ae95eea4642a03c2
[Screenshot: artifacts folder in a logged `mlflow.pyfunc` model]
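
If you prefer to list these files from code, here is a minimal sketch using only pathlib. The helper name list_run_artifacts is hypothetical, and it assumes the local layout mlruns/<experiment_id>/<run_id>/artifacts described above. Other tracking backends or MLflow versions may store the files elsewhere.

```python
from pathlib import Path


def list_run_artifacts(mlruns_root, run_id):
    """List files under <mlruns_root>/<experiment_id>/<run_id>/artifacts as relative paths."""
    matches = [
        p for p in Path(mlruns_root).glob(f"*/{run_id}/artifacts") if p.is_dir()
    ]
    if not matches:
        return []  # run not found under this root
    artifacts_dir = matches[0]
    return sorted(
        p.relative_to(artifacts_dir).as_posix()
        for p in artifacts_dir.rglob("*")
        if p.is_file()
    )
```

A call such as list_run_artifacts("mlruns", loaded_model.metadata.run_id) would then return the relative paths of the logged files. When the tracking store is not a local folder, mlflow.tracking.MlflowClient().list_artifacts(run_id) is the tracking-API route to the same information.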

Pretty neat, isn’t it? 😁 Feel free to explore these artifacts at your leisure; below are screenshots of the requirements.txt and MLmodel files from the folder, for your reference.

The requirements file below specifies the versions of the dependencies required to recreate the environment for running the model.

[Screenshot: the `requirements.txt` file in the artifacts folder]

The MLmodel document below defines, in YAML format, the metadata and configuration necessary to load and serve the model.

[Screenshot: the `MLmodel` file in the artifacts folder]