KernelSHAP can be misleading with correlated predictors | by Shuyang Xiang | Aug, 2024

A concrete case study

“Like many other permutation-based interpretation methods, the Shapley value method suffers from inclusion of unrealistic data instances when features are correlated. To simulate that a feature value is missing from a coalition, we marginalize the feature. … When features are dependent, then we might sample feature values that do not make sense for this instance.” — Interpretable ML Book

SHAP (SHapley Additive exPlanations) values are designed to fairly allocate the contribution of each feature to the prediction made by a machine learning model, based on the concept of Shapley values from cooperative game theory. The Shapley value framework has several desirable theoretical properties and can, in principle, handle any predictive model. However, SHAP values can be misleading, especially when the KernelSHAP method is used for approximation. When predictors are correlated, these approximations can be imprecise and even have the opposite sign.

In this blog post, I will demonstrate how SHAP values computed from their original definition can differ significantly from the approximations produced by the SHAP framework, specifically KernelSHAP, and discuss the reasons behind these discrepancies.

Consider a scenario where we aim to predict the churn rate of rental leases in an office building based on two key factors: the occupancy rate and the rate of reported problems.

The occupancy rate significantly impacts the churn rate. For instance, if the occupancy rate is too low, tenants may leave because the office is underutilized. Conversely, if the occupancy rate is too high, tenants might leave because of overcrowding and seek better options elsewhere.

Moreover, let's assume that the rate of reported problems is highly correlated with the occupancy rate; specifically, the reported problem rate is the square of the occupancy rate.

We define the churn rate function as follows:

Image by author: churn rate function
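In symbols, the function has the form used by the prediction code later in this post (the constants C_base, C_churn, C_occ and C_problem are left symbolic here, since their numerical values are not shown):

\text{churn}(x_1, x_2) = C_{\text{base}} + C_{\text{churn}}\left(C_{\text{occ}}\, x_1 - x_2 - 0.6\right)^2 + C_{\text{problem}}\, x_2

where x₁ is the occupancy rate and x₂ is the reported problem rate.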

This function of the two variables can be represented by the following illustrations:

Image by author: churn rate as a function of the two variables

SHAP Values Computed Using Kernel SHAP
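The post does not show how the constants and the synthetic data are created. A minimal, purely illustrative setup that is consistent with the dependency x₂ = x₁² might look like the following (all constant values and the sample size are assumptions, not the article's actual choices):

import numpy as np

# Illustrative constants only; the article's actual values are not shown
C_base, C_churn, C_occ, C_problem = 0.2, 0.5, 1.0, 0.3

rng = np.random.default_rng(42)
occupancy_rates = rng.uniform(0.1, 1.0, size=1_000)
reported_problem_rates = occupancy_rates ** 2  # enforce the assumed dependency x2 = x1**2
churn_rates = (
    C_base
    + C_churn * (C_occ * occupancy_rates - reported_problem_rates - 0.6) ** 2
    + C_problem * reported_problem_rates
)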

We will now use the following code to compute the SHAP values of the predictors:

import pandas as pd
import shap
from sklearn.model_selection import train_test_split

# Define the dataframe
# (occupancy_rates, reported_problem_rates and churn_rates come from the data setup above)
churn_df = pd.DataFrame(
    {
        "occupancy_rate": occupancy_rates,
        "reported_problem_rate": reported_problem_rates,
        "churn_rate": churn_rates,
    }
)
X = churn_df.drop(["churn_rate"], axis=1)
y = churn_df["churn_rate"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Append one special point
X_test = pd.concat(objs=[X_test, pd.DataFrame({"occupancy_rate": [0.8], "reported_problem_rate": [0.64]})])

# Define the prediction function
def predict_fn(data):
    occupancy_rates = data[:, 0]
    reported_problem_rates = data[:, 1]
    churn_rate = C_base + C_churn * (C_occ * occupancy_rates - reported_problem_rates - 0.6) ** 2 + C_problem * reported_problem_rates
    return churn_rate

# Create the SHAP KernelExplainer using the prediction function
background_data = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(predict_fn, background_data)
shap_values = explainer(X_test)

The code above performs the following tasks:

  1. Data Preparation: A DataFrame named churn_df is created with the columns occupancy_rate, reported_problem_rate, and churn_rate. The features X and the target y (churn_rate) are then derived from churn_df, and the data is split into training and test sets, with 80% for training and 20% for testing. Note that a special data point with occupancy_rate = 0.8 and reported_problem_rate = 0.64 is appended to the test set X_test.
  2. Prediction Function Definition: A function predict_fn is defined to calculate the churn rate using a specific formula involving predefined constants.
  3. SHAP Analysis: A SHAP KernelExplainer is initialized with the prediction function and 100 background samples drawn from X_train. SHAP values for X_test are then computed with this explainer.

Below is a summary SHAP bar plot, which represents the average SHAP values over X_test:

Image by author: average SHAP values

In particular, we see that at the data point (0.8, 0.64), the SHAP values of the two features are 0.10 and -0.03, as illustrated by the following force plot:

Image by author: force plot of one data point

SHAP Values by the Original Definition

Let's take a step back and compute the exact SHAP values step by step, following their original definition. The general formula for SHAP values is given by:
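\phi_i = \sum_{S \subseteq \{1, \dots, M\} \setminus \{i\}} \frac{|S|!\,\left(M - |S| - 1\right)!}{M!} \left[ f\left(x_{S \cup \{i\}}\right) - f\left(x_S\right) \right]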

where S is a subset of all feature indices excluding i, |S| is the size of the subset S, M is the total number of features, f(x_{S∪{i}}) is the function evaluated with the features in S together with x_i present, and f(x_S) is the function evaluated with the features in S with x_i absent.

Now, let's calculate the SHAP values for the two features, occupancy rate (denoted x₁) and reported problem rate (denoted x₂), at the data point (0.8, 0.64). Recall that x₁ and x₂ are related by x₂ = x₁².

Using the conditional-expectation form of the value function with M = 2, the SHAP value for the occupancy rate at this data point is:
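\phi_1 = \frac{1}{2}\left[\mathbb{E}\left(f(X) \mid X_1 = 0.8\right) - \mathbb{E}\left(f(X)\right)\right] + \frac{1}{2}\left[f(0.8, 0.64) - \mathbb{E}\left(f(X) \mid X_2 = 0.64\right)\right]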

and, similarly, for the reported problem rate:
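\phi_2 = \frac{1}{2}\left[\mathbb{E}\left(f(X) \mid X_2 = 0.64\right) - \mathbb{E}\left(f(X)\right)\right] + \frac{1}{2}\left[f(0.8, 0.64) - \mathbb{E}\left(f(X) \mid X_1 = 0.8\right)\right]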

First, let's compute the SHAP value for the occupancy rate at this data point, term by term (a small numerical sketch follows the list below):

  1. The first term is the expectation of the model's output when X₁ is fixed at 0.8 and X₂ is averaged over its conditional distribution. Given the relationship x₂ = x₁², this expectation reduces to the model's output at the specific point (0.8, 0.64).
  2. The second term is the unconditional expectation of the model's output, where both X₁ and X₂ are averaged over their distributions. It can be computed by averaging the outputs over all data points in the background dataset.
  3. The third term is the model's output at the specific point (0.8, 0.64).
  4. The final term is the expectation of the model's output when X₁ is averaged over its distribution, given that X₂ is fixed at 0.64. Again, because of the relationship x₂ = x₁², this expectation equals the model's output at (0.8, 0.64), just as in the first step.
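Put together, a minimal sketch of this computation might look like the following; it reuses the constants from predict_fn and the background_data sample from the earlier snippet, and exploits the fact that the dependency x₂ = x₁² collapses both conditional expectations to point evaluations:

import numpy as np

def f(occ, prob):
    # Same formula as predict_fn, written for two separate arguments
    return C_base + C_churn * (C_occ * occ - prob - 0.6) ** 2 + C_problem * prob

x1, x2 = 0.8, 0.64
occ_bg = background_data["occupancy_rate"].to_numpy()
prob_bg = background_data["reported_problem_rate"].to_numpy()

e_f = f(occ_bg, prob_bg).mean()      # unconditional expectation E[f(X)]
f_point = f(x1, x2)                  # f(0.8, 0.64)
e_f_given_x1 = f(x1, x1 ** 2)        # E[f(X) | X1 = 0.8]: X2 is fully determined by X1
e_f_given_x2 = f(np.sqrt(x2), x2)    # E[f(X) | X2 = 0.64]: X1 is fully determined by X2

phi_1 = 0.5 * (e_f_given_x1 - e_f) + 0.5 * (f_point - e_f_given_x2)
phi_2 = 0.5 * (e_f_given_x2 - e_f) + 0.5 * (f_point - e_f_given_x1)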

Thus, the SHAP values computed from the original definition for the two features, occupancy rate and reported problem rate, at the data point (0.8, 0.64) are -0.0375 and -0.0375 respectively, which is quite different from the values given by Kernel SHAP.

Where do the discrepancies come from?

As you may have noticed, the discrepancy between the two methods arises mainly from the first and fourth terms, where we need to compute conditional expectations such as the expectation of the model's output when X₁ is fixed at 0.8.

  • Exact SHAP: When computing exact SHAP values, the dependencies between features (such as x₂ = x₁² in our example) are explicitly accounted for. This ensures accurate attributions by considering how the dependent features jointly influence the model's output.
  • Kernel SHAP: By default, Kernel SHAP assumes feature independence, which can lead to inaccurate SHAP values when features are actually dependent. According to the paper A Unified Approach to Interpreting Model Predictions, this assumption is a simplification. In practice, features are often correlated, which makes it hard to obtain accurate approximations with Kernel SHAP. The sketch below illustrates the kind of unrealistic samples this independence assumption produces.
Screenshot from the paper
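As a rough illustration (not the library's internal code), when Kernel SHAP marginalizes reported_problem_rate independently of occupancy_rate, it evaluates the model on hybrid points that mix the explained instance with background values, many of which violate x₂ = x₁²:

import numpy as np

occ_fixed = 0.8  # feature value kept from the explained point
prob_bg = background_data["reported_problem_rate"].to_numpy()  # drawn independently from the background set

# Hybrid samples: occupancy fixed at 0.8, reported problem rate taken from unrelated rows
hybrid_points = np.column_stack([np.full_like(prob_bg, occ_fixed), prob_bg])

# Marginal (interventional) estimate of the coalition value with X1 present and X2 absent;
# many of these pairs, e.g. (0.8, 0.05), never occur in the real data
marginal_estimate = predict_fn(hybrid_points).mean()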

Unfortunately, computing SHAP values directly from their original definition can be computationally expensive. Here are some alternative approaches to consider:

TreeSHAP

  • Designed specifically for tree-based models such as random forests and gradient boosting machines, TreeSHAP computes SHAP values efficiently while managing feature dependencies.
  • The method is optimized for tree ensembles, making it faster and more scalable than traditional SHAP computations.
  • When using TreeSHAP within the SHAP framework, set the parameter feature_perturbation = "interventional" to account for feature dependencies accurately (a brief usage sketch follows below).
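A minimal sketch of what that might look like on this example, assuming a tree model is first fit to the same training data (the model choice and hyperparameters are illustrative):

import shap
from sklearn.ensemble import RandomForestRegressor

# Fit a tree ensemble to the same training data
tree_model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)

# TreeSHAP with an explicit background set and interventional feature perturbation
tree_explainer = shap.TreeExplainer(
    tree_model,
    data=background_data,
    feature_perturbation="interventional",
)
tree_shap_values = tree_explainer(X_test)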

Extending Kernel SHAP for Dependent Features

  • To handle feature dependencies, this paper proposes extending Kernel SHAP. One approach is to assume that the feature vector follows a multivariate Gaussian distribution. In this approach (a two-feature sketch follows this list):
  • Conditional distributions are modeled as multivariate Gaussian distributions.
  • Samples are generated from these conditional Gaussian distributions, using estimates from the training data.
  • The integral in the approximation is computed based on these samples.
  • The method assumes a multivariate Gaussian distribution for the features, which may not always be applicable in real-world scenarios where features exhibit different dependency structures.
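For our two features, the conditional sampling step could be sketched as follows (a pure illustration of the idea, not the paper's implementation): estimate a mean and covariance from the training data, sample X₂ given X₁ = 0.8 from the implied conditional Gaussian, and use those samples to estimate the conditional expectation.

import numpy as np

# Estimate the Gaussian parameters from the training data
mu = X_train.to_numpy().mean(axis=0)
cov = np.cov(X_train.to_numpy(), rowvar=False)

# Conditional distribution of X2 given X1 = 0.8 under the Gaussian assumption
x1 = 0.8
cond_mean = mu[1] + cov[1, 0] / cov[0, 0] * (x1 - mu[0])
cond_var = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]

# Sample from the conditional distribution and estimate E[f(X) | X1 = 0.8]
rng = np.random.default_rng(0)
x2_samples = rng.normal(cond_mean, np.sqrt(cond_var), size=1_000)
cond_expectation = predict_fn(np.column_stack([np.full(1_000, x1), x2_samples])).mean()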

Enhancing Kernel SHAP Accuracy

  • Description: Improve the accuracy of Kernel SHAP by ensuring that the background dataset used for the approximation is representative of the actual data distribution and, ideally, one in which the features are close to independent.

By employing these methods, you can address the computational challenges of calculating SHAP values and improve their accuracy in practical applications. However, it is important to note that no single solution is universally optimal for all scenarios.

In this blog post, we have explored how SHAP values, despite their strong theoretical foundation and flexibility across predictive models, can suffer from accuracy issues when predictors are correlated, particularly when approximations such as KernelSHAP are used. Understanding these limitations is crucial for interpreting SHAP values effectively. By recognizing the potential discrepancies and choosing the most suitable approximation method, we can obtain more accurate and reliable feature attributions from our models.