Powering Experiments with CUPED and Double Machine Learning | by Ryan O’Sullivan | Aug, 2024

Causal AI, exploring the integration of causal reasoning into machine learning

Photo by Karsten Würth on Unsplash

Welcome to my series on Causal AI, where we’ll explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.

In the last article we covered safeguarding demand forecasting with causal graphs. Today, we turn our attention to powering experiments using CUPED and double machine learning.

If you missed the last article on safeguarding demand forecasting, check it out here:

In this article, we evaluate whether CUPED and double machine learning can enhance the effectiveness of your experiments. We will use a case study to explore the following areas:

  • The building blocks of experimentation: Hypothesis testing, power analysis, bootstrapping.
  • What is CUPED and how can it help power experiments?
  • What are the conceptual similarities between CUPED and double machine learning?
  • When should we use double machine learning rather than CUPED?

The full notebook can be found here:

Background

You’ve recently joined the experimentation team at a leading online retailer known for its vast product catalog and dynamic user base. The data science team has deployed an advanced recommender system designed to enhance user experience and drive sales. This system integrates in real-time with the retailer’s platform and involves significant infrastructure and engineering costs.

The finance team is eager to understand the system’s financial impact, specifically how much additional revenue it generates compared to a baseline scenario without recommendations. To evaluate the recommender system’s effectiveness, you plan to conduct a randomized controlled experiment.

Data-generating process: Pre-experiment

We start by creating some pre-experiment data. The data-generating process we use has the following characteristics:

  • 3 observed covariates related to the recency (x_recency), frequency (x_frequency) and value (x_value) of previous sales.
  • 1 unobserved covariate, the user’s monthly income (u_income).
User generated image
  • A complex relationship between covariates is used to estimate our target metric, sales value:
User generated image
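The figure itself is not reproduced here, but the relationship can be read off the data-generating code that follows. As a sketch (the symbol names mirror the column names in the code; b and ε are labels introduced here for illustration):

```latex
\begin{aligned}
b &= 1.5\,u_{\text{income}} + 2.5\,x_{\text{recency}} + x_{\text{frequency}}^{3} + x_{\text{value}}^{2} + x_{\text{recency}}\,x_{\text{frequency}} \\
y_{\text{value}} &= \max(b + \varepsilon,\ 0), \qquad \varepsilon \sim \mathcal{N}(0, 1)
\end{aligned}
```

Each covariate is drawn from a uniform distribution on [0, 1], and every column is multiplied by 1000 afterwards purely to make the values easier to interpret.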

The python code below is used to create the pre-experiment data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(123)

n = 10000 # Set number of observations
p = 4 # Set number of pre-experiment covariates

# Create pre-experiment covariates
X = np.random.uniform(size=n * p).reshape((n, -1))

# Nuisance parameters
b = (
    1.5 * X[:, 0] +
    2.5 * X[:, 1] +
    X[:, 2] ** 3 +
    X[:, 3] ** 2 +
    X[:, 1] * X[:, 2]
)

# Create some noise
noise = np.random.normal(size=n)

# Calculate outcome (floored at zero, as sales value cannot be negative)
y = np.maximum(b + noise, 0)

# Scale variables for interpretation
df_pre = pd.DataFrame({"noise": noise * 1000,
                       "u_income": X[:, 0] * 1000,
                       "x_recency": X[:, 1] * 1000,
                       "x_frequency": X[:, 2] * 1000,
                       "x_value": X[:, 3] * 1000,
                       "y_value": y * 1000
                       })

# Visualise target metric
sns.histplot(df_pre['y_value'], bins=30, kde=False)
plt.xlabel('Sales Value')
plt.ylabel('Frequency')
plt.title('Sales Value')
plt.show()

User generated image

Before we get onto CUPED, I thought it would be worthwhile covering some foundational knowledge on experimentation.

Hypothesis testing

Hypothesis testing helps determine if observed differences in an experiment are statistically significant or just random noise. In our experiment, we divide users into two groups:

  • Control Group: Receives no recommendations.
  • Treatment Group: Receives personalised recommendations from the system.

We define our hypotheses as follows:

  • Null Hypothesis (H₀): The recommender system does not affect revenue. Any observed differences are due to chance.
  • Alternative Hypothesis (Hₐ): The recommender system increases revenue. Users receiving recommendations generate significantly more revenue compared to those who do not.

To assess the hypotheses you will be comparing the mean revenue in the control and treatment group. However, there are a few things to be aware of:

  • Type I error (False positive): The experiment concludes that the recommender system significantly increases revenue when in reality it has no effect.
  • Type II error (Beta, False negative): The experiment finds no significant increase in revenue from the recommender system when in reality it does lead to a meaningful increase.
  • Significance Level (Alpha): If you set the significance level to 0.05, you are accepting a 5% chance of incorrectly concluding that the recommender system improves revenue when it does not (false positive).
  • Power (1 - Beta): Achieving a power of 0.80 means you have an 80% chance of detecting a significant increase in revenue due to the recommender system if it truly has an effect. A higher power reduces the risk of false negatives.
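To make these four terms concrete, here is a minimal simulation sketch. It is not part of the case study: the data are synthetic normal draws, and the effect size, group size and function name are hypothetical choices made purely for illustration. It also uses a two-sided t-test for simplicity. With no true effect, the share of "significant" results estimates the Type I error rate (alpha); with a true effect, it estimates power.

```python
import numpy as np
from scipy import stats

def rejection_rate(effect, n_per_group=200, n_sims=2000, alpha=0.05, seed=0):
    """Share of simulated experiments in which a two-sample t-test rejects H0."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment (hypothetical, standard-normal metric)
    control = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_per_group))
    treatment = rng.normal(loc=effect, scale=1.0, size=(n_sims, n_per_group))
    _, p_values = stats.ttest_ind(treatment, control, axis=1)
    return float(np.mean(p_values < alpha))

# No true effect: every rejection is a false positive, so this estimates alpha
false_positive_rate = rejection_rate(effect=0.0)

# True effect of 0.3 standard deviations: the rejection share estimates power
power = rejection_rate(effect=0.3)

print(f"Estimated Type I error rate: {false_positive_rate:.3f}")
print(f"Estimated power: {power:.3f}")
print(f"Estimated Type II error rate: {1 - power:.3f}")
```

The first number should sit close to the chosen alpha of 0.05, while the second depends on the effect size and the number of users per group, which is exactly what power analysis trades off.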

As you start to think about designing the experiment, you set some initial goals:

  1. You want to reliably detect the effect: Making sure you balance the risk of detecting a non-existent effect against the risk of not detecting a real effect.
  2. As quickly as possible: Finance are on your case!
  3. Keeping the sample size as cost efficient as possible: The business case from the data science team suggests the system is going to drive a large increase in revenue, so they don’t want the control group being too big.

But how can you meet these goals? Let’s delve into power analysis next!

Power analysis

When we talk about powering experiments, we are usually referring to the process of determining the minimum sample size needed to detect an effect of a certain size with a given confidence. There are 3 components to power analysis:

  • Effect size: The difference between the mean value of H₀ and Hₐ. We generally need to make sensible assumptions around this based on understanding what matters to the business/industry we are operating within.
  • Significance level: The probability of incorrectly concluding there is an effect when there isn’t, typically set at 0.05.
  • Power: The probability of correctly detecting an effect when there is one, typically set at 0.80.

I found the intuition behind these quite hard to grasp at first, but visualising it can really help. So let’s give it a try! The key areas are where H₀ and Hₐ cross over. See if it helps you tie together the components discussed above…

User generated image

A larger sample size leads to a smaller standard error. With a smaller standard error, the sampling distributions of H₀ and Hₐ become narrower and overlap less. This reduced overlap makes it easier to detect a difference, leading to higher power.
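A quick numerical sketch of that relationship, assuming two equal-sized groups that share the same standard deviation (the value of sigma and the function name are hypothetical, chosen only for illustration):

```python
import numpy as np

def standard_error_of_difference(sigma: float, n_per_group: int) -> float:
    """Standard error of the difference in means for two equal-sized groups
    that share the same standard deviation sigma."""
    return float(sigma * np.sqrt(2.0 / n_per_group))

sigma = 1000.0  # hypothetical standard deviation of the metric
for n_per_group in (100, 400, 1600):
    se = standard_error_of_difference(sigma, n_per_group)
    print(f"n per group = {n_per_group:>4}: standard error = {se:.1f}")
```

Because the standard error shrinks with the square root of the sample size, quadrupling the number of users per group only halves it.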

The function below shows how we can use the statsmodels python package to carry out a power analysis:

from typing import Union
import pandas as pd
import numpy as np
import statsmodels.stats.power as smp

def power_analysis(metric: Union[np.ndarray, pd.Series], exp_perc_change: float, alpha: float = 0.05, power: float = 0.80) -> int:
    '''
    Perform a power analysis to determine the minimum sample size required for a given metric.

    Args:
        metric (np.ndarray or pd.Series): Array or Series containing the metric values for the control group.
        exp_perc_change (float): The expected percentage change in the metric for the test group.
        alpha (float, optional): The significance level for the test. Defaults to 0.05.
        power (float, optional): The desired power of the test. Defaults to 0.80.

    Returns:
        int: The minimum sample size required for each group to detect the expected percentage change with the specified power and significance level.

    Raises:
        ValueError: If `metric` is not a NumPy array or pandas Series.
    '''

    # Validate input types
    if not isinstance(metric, (np.ndarray, pd.Series)):
        raise ValueError("metric should be a NumPy array or pandas Series.")

    # Calculate statistics
    control_mean = metric.mean()
    control_std = np.std(metric, ddof=1) # Use ddof=1 for sample standard deviation
    test_mean = control_mean * (1 + exp_perc_change)
    test_std = control_std # Assume the test group has the same standard deviation as the control group

    # Calculate (Cohen's d) effect size
    mean_diff = control_mean - test_mean
    pooled_std = np.sqrt((control_std**2 + test_std**2) / 2)
    effect_size = abs(mean_diff / pooled_std) # Cohen's d should be positive

    # Run power analysis (local name differs from the function name to avoid shadowing it)
    analysis = smp.TTestIndPower()
    sample_size = round(analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power))

    print(f"Control mean: {round(control_mean, 3)}")
    print(f"Control std: {round(control_std, 3)}")
    print(f"Min sample size: {sample_size}")

    return sample_size

So let’s test it out with our pre-experiment data!

exp_perc_change = 0.05 # Set the expected percentage change in the chosen metric caused by the treatment

min_sample_size = power_analysis(df_pre["y_value"], exp_perc_change)

User generated image

We can see that given the distribution of our target metric, we would need a sample size of 1,645 to detect an increase of 5%.
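For intuition about where a number like this comes from, the per-group sample size for a two-sided, two-sample test can be approximated in closed form as n ≈ 2(z₁₋α/₂ + z_power)² / d², where d is Cohen's d. The sketch below uses that normal approximation with only the Python standard library. It is not what statsmodels does internally (statsmodels solves the equivalent problem with the t distribution, so its answer is typically slightly larger), and the effect sizes and function name are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def approx_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

n_small_effect = approx_sample_size(0.1)   # hypothetical Cohen's d of 0.1
n_half_effect = approx_sample_size(0.05)   # half the effect size

print(f"d = 0.10 -> n per group = {n_small_effect}")
print(f"d = 0.05 -> n per group = {n_half_effect}")
```

The effect size enters squared in the denominator, so halving the effect you want to detect roughly quadruples the sample you need. Equivalently, anything that shrinks the standard deviation of the metric increases d and cuts the required sample size.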

Data-generating process: Experimental data

Rather than rush into setting up the experiment, you decide to take the pre-experiment data and simulate the experiment.

The following function randomly selects users to be treated and applies a treatment effect. At the end of the function we record the mean difference before and after the treatment was applied as well as the true ATE (average treatment effect):

def exp_data_generator(t_perc_change, t_samples):

    # Create copy of pre-experiment data ready to manipulate into experiment data
    df_exp = df_pre.reset_index(drop=True)

    # Calculate the initial treatment effect
    treatment_effect = round((df_exp["y_value"] * (t_perc_change)).mean(), 2)

    # Create treatment column
    treated_indices = np.random.choice(df_exp.index, size=t_samples, replace=False)
    df_exp["treatment"] = 0
    df_exp.loc[treated_indices, "treatment"] = 1

    # Treatment effect (float column, so the effect can be assigned without a dtype change)
    df_exp["treatment_effect"] = 0.0
    df_exp.loc[df_exp["treatment"] == 1, "treatment_effect"] = treatment_effect

    # Apply treatment effect
    df_exp["y_value_exp"] = df_exp["y_value"]
    df_exp.loc[df_exp["treatment"] == 1, "y_value_exp"] = df_exp["y_value"] + df_exp["treatment_effect"]

    # Calculate mean diff before treatment
    mean_t0_pre = df_exp[df_exp["treatment"] == 0]["y_value"].mean()
    mean_t1_pre = df_exp[df_exp["treatment"] == 1]["y_value"].mean()
    mean_diff_pre = round(mean_t1_pre - mean_t0_pre)

    # Calculate mean diff after treatment
    mean_t0_post = df_exp[df_exp["treatment"] == 0]["y_value_exp"].mean()
    mean_t1_post = df_exp[df_exp["treatment"] == 1]["y_value_exp"].mean()
    mean_diff_post = round(mean_t1_post - mean_t0_post)

    # Calculate ATE
    treatment_effect = round(df_exp[df_exp["treatment"]==1]["treatment_effect"].mean())

    print(f"Diff-in-means before treatment: {mean_diff_pre}")
    print(f"Diff-in-means after treatment: {mean_diff_post}")
    print(f"ATE: {treatment_effect}")

    return df_exp

We can feed through the minimum sample size we previously calculated:

np.random.seed(123)
df_exp_1 = exp_data_generator(exp_perc_change, min_sample_size)

Let’s start by inspecting the data we created for treated users to help you understand what the function is doing:

User generated image

Next let’s take a look at the results which the function prints:

User generated image

Interesting, we see that after we select users to be treated, but before we treat them, there is already a difference in means. This difference is due to chance. This means that when we look at the difference after users are treated we don’t correctly estimate the ATE (average treatment effect). We will come back to this point when we cover CUPED.

User generated image

Next let’s explore a more sophisticated way of making an inference than just taking the difference in means…

Bootstrapping

Bootstrapping is a powerful statistical technique that involves resampling data with replacement. These resampled datasets, called bootstrap samples, help us estimate the variability of a statistic (like the mean or median) from our original data. This is particularly attractive when it comes to experimentation as it enables us to calculate confidence intervals. Let’s walk through it step by step using a simple example…

You have run an experiment with a control and treatment group each made up of 1k users.

  1. Create bootstrap samples: Randomly select (with replacement) 1k users from the control and then the treatment group. This gives us 1 bootstrap sample for control and one for treatment.
  2. Repeat this process n times (e.g. 10k times).
  3. For each pair of bootstrap samples calculate the mean difference between control and treatment.
  4. We now have a distribution (made up of the mean difference between 10k bootstrap samples) which we can use to calculate confidence intervals.
User generated image
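The four steps above can be sketched by hand in a few lines of NumPy. This is an illustrative version on synthetic data (the means, standard deviation, seed and function name are all hypothetical), using fewer resamples than the 10k in the example to keep it quick:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical experiment: 1k users per group, synthetic sales values
control = rng.normal(loc=1000, scale=300, size=1000)
treatment = rng.normal(loc=1050, scale=300, size=1000)

def bootstrap_mean_diff(control, treatment, n_resamples=2000, rng=None):
    """Bootstrap distribution of the treatment-minus-control difference in means."""
    if rng is None:
        rng = np.random.default_rng()
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        # Step 1: resample each group with replacement, keeping its original size
        c_sample = rng.choice(control, size=len(control), replace=True)
        t_sample = rng.choice(treatment, size=len(treatment), replace=True)
        # Step 3: mean difference for this pair of bootstrap samples
        diffs[i] = t_sample.mean() - c_sample.mean()
    return diffs  # Step 2 is the loop itself

# Step 4: percentile confidence interval from the bootstrap distribution
diffs = bootstrap_mean_diff(control, treatment, rng=rng)
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
observed_diff = treatment.mean() - control.mean()

print(f"Observed diff: {observed_diff:.1f}")
print(f"95% bootstrap CI: ({ci_low:.1f}, {ci_high:.1f})")
```

The percentile interval simply takes the 2.5th and 97.5th percentiles of the resampled differences. Other bootstrap interval methods exist, but this is the easiest to reason about.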

Applying it to our case study

Let’s use our case study to illustrate how it works. Below we use the SciPy stats python package to help calculate bootstrap confidence intervals:

from typing import Union
import pandas as pd
import numpy as np
from scipy import stats

def mean_diff(group_a: Union[np.ndarray, pd.Series], group_b: Union[np.ndarray, pd.Series]) -> float:
    '''
    Calculate the difference in means between two groups.

    Args:
        group_a (Union[np.ndarray, pd.Series]): The first group of data points.
        group_b (Union[np.ndarray, pd.Series]): The second group of data points.

    Returns:
        float: The difference between the mean of group_a and the mean of group_b.
    '''
    return np.mean(group_a) - np.mean(group_b)

def bootstrapping(df: pd.DataFrame, adjusted_metric: str, n_resamples: int = 10000) -> np.ndarray:
    '''
    Perform bootstrap resampling on the adjusted metric of two groups in the dataframe to estimate the mean difference and confidence intervals.

    Args:
        df (pd.DataFrame): The dataframe containing the data. Must include a 'treatment' column indicating group membership.
        adjusted_metric (str): The name of the column in the dataframe representing the metric to be resampled.
        n_resamples (int, optional): The number of bootstrap resamples to perform. Defaults to 10000.

    Returns:
        np.ndarray: The array of bootstrap resampled mean differences.
    '''

    # Separate the data into two groups based on the 'treatment' column
    group_a = df[df["treatment"] == 1][adjusted_metric]
    group_b = df[df["treatment"] == 0][adjusted_metric]

    # Perform bootstrap resampling
    res = stats.bootstrap((group_a, group_b), statistic=mean_diff, n_resamples=n_resamples, method='percentile')
    ci = res.confidence_interval

    # Extract the bootstrap distribution and confidence intervals
    bootstrap_means = res.bootstrap_distribution
    bootstrap_ci_lb = round(ci.low)
    bootstrap_ci_ub = round(ci.high)
    bootstrap_mean = round(np.mean(bootstrap_means))

    print(f"Bootstrap confidence interval lower bound: {bootstrap_ci_lb}")
    print(f"Bootstrap confidence interval upper bound: {bootstrap_ci_ub}")
    print(f"Bootstrap mean diff: {bootstrap_mean}")

    return bootstrap_means

When we run it for our case study data we can see that we now have some confidence intervals:

bootstrap_og_1 = bootstrapping(df_exp_1, "y_value_exp")
User generated image

Our ground truth ATE is 143 (the actual treatment effect from our experiment data generator function), which falls within our confidence intervals. However, it’s worth noting that the mean difference hasn’t changed (it’s still 93, as before when we simply calculated the mean difference of control and treatment), and the pre-treatment difference is still there.

So what if we wanted to come up with narrower confidence intervals? And is there any way we can deal with the pre-treatment differences? This leads us nicely into CUPED…

Background

CUPED (controlled experiments using pre-experiment data) is a powerful technique, developed by researchers at Microsoft, for improving the accuracy of experiments. The original paper is an insightful read for anyone interested in experimentation:

https://ai.stanford.edu/~ronnyk/2009controlledExperimentsOnTheWebSurvey.pdf

The core idea of CUPED is to use data collected before your experiment begins to reduce the variance in your target metric. By doing so, you can make your experiment more sensitive, which has two major benefits:

  1. You can detect smaller effects with the same sample size.
  2. You can detect the same effect with a smaller sample size.

Think of it like removing the “background noise” so you can see the “signal” more clearly.

Variance, standard deviation, standard error

When you read about CUPED you may hear people talk about it reducing the variance, standard deviation or standard error. If you are anything like me, you might find yourself forgetting how these are related, so before we go any further let’s recap on this!

  • Variance: Variance measures the average squared deviation of each data point from the mean, reflecting the overall spread or dispersion within a dataset.
  • Standard Deviation: Standard deviation is the square root of variance, representing the typical distance of each data point from the mean, and providing a more interpretable measure of spread because it is in the same units as the data.
  • Standard Error: Standard error quantifies the precision of the sample mean as an estimate of the population mean, calculated as the standard deviation divided by the square root of the sample size.
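A tiny worked example ties the three together. The numbers are made up purely for illustration:

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical metric values

variance = values.var(ddof=1)                  # average squared deviation (sample version)
std_dev = np.sqrt(variance)                    # square root of the variance
std_error = std_dev / np.sqrt(len(values))     # precision of the sample mean

print(f"Variance: {variance:.3f}")
print(f"Standard deviation: {std_dev:.3f}")
print(f"Standard error: {std_error:.3f}")
```

Reducing the variance therefore reduces the standard deviation, which in turn reduces the standard error for any fixed sample size. That chain is why all three get mentioned when people describe what CUPED does.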

How does CUPED work?

To understand how CUPED works, let’s break it down…

Pre-experiment covariate: In the lightest implementation of CUPED, the pre-experiment covariate would be the target metric measured in a time period before the experiment. So if your target metric was sales value, your covariate could be each customer’s sales value 4 weeks prior to the experiment.

It’s important that your covariate is correlated with your target metric and that it is unaffected by the treatment. This is why we would typically use pre-treatment data from the control group.

Regression adjustment: Linear regression is used to model the relationship between the covariate (measured before the experiment) and the target metric (measured during the experiment period). We can then calculate the CUPED adjusted target metric by removing the influence of the covariate:

User generated image

It’s worth noting that taking away the mean of the covariate is done to centre the outcome variable around the mean to make it interpretable when compared to the original target metric.
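For reference, the single-covariate adjustment is usually written in the form below. This is the standard textbook form, given as a sketch, and the notation may differ from the original figure: Y is the target metric, X is the pre-experiment covariate and X̄ is its mean.

```latex
Y_i^{\text{cuped}} = Y_i - \theta\,(X_i - \bar{X}),
\qquad
\theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
```

Here θ is the slope from regressing Y on X. With several covariates, θ becomes the vector of regression coefficients and the same subtraction is applied for each covariate.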

Variance reduction: After the regression adjustment the variance in our target metric has reduced. Lower variance means that the differences between the control and treatment group are easier to detect, thus increasing the statistical power of the experiment.
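How much variance is removed depends on how strongly the covariate correlates with the target metric: for a single covariate, the adjusted variance is the original variance multiplied by (1 − ρ²), where ρ is their correlation. A minimal sketch on synthetic data (the data-generating numbers and variable names are hypothetical, not the case study data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical pre-experiment covariate and a correlated target metric
x_pre = rng.normal(loc=100, scale=20, size=n)
y = 0.8 * x_pre + rng.normal(loc=0, scale=10, size=n)

# theta is the regression slope of y on x_pre
theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)

# CUPED adjustment: remove the part of y explained by the covariate,
# centred so that the mean of the metric is unchanged
y_cuped = y - theta * (x_pre - x_pre.mean())

rho = np.corrcoef(x_pre, y)[0, 1]
print(f"Correlation: {rho:.3f}")
print(f"Original variance: {y.var(ddof=1):.1f}")
print(f"CUPED variance: {y_cuped.var(ddof=1):.1f}")
print(f"Original variance x (1 - rho^2): {y.var(ddof=1) * (1 - rho**2):.1f}")
```

The last two printed values match, and the mean of the adjusted metric equals the mean of the original. A weakly correlated covariate buys very little, which is why the choice of pre-experiment covariate matters so much.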

Applying it to our case study

Let’s use our case study to illustrate how it works. Below we code CUPED up in a function:

from typing import Union
import pandas as pd
import numpy as np
import statsmodels.api as sm

def cuped(df: pd.DataFrame, pre_covariates: Union[str, list], target_metric: str) -> pd.Series:
    '''
    Implements the CUPED (Controlled Experiments Using Pre-Experiment Data) technique to adjust the target metric
    by removing predictable variation using pre-experiment covariates. This reduces the variance of the metric and
    increases the statistical power of the experiment.

    Args:
        df (pd.DataFrame): The input DataFrame containing both the pre-experiment covariates and the target metric.
        pre_covariates (Union[str, list]): The column name(s) in the DataFrame corresponding to the pre-experiment covariates used for the adjustment.
        target_metric (str): The column name in the DataFrame representing the metric to be adjusted.

    Returns:
        pd.Series: A pandas Series containing the CUPED-adjusted target metric.
    '''

    # Allow a single covariate name to be passed as a string
    if isinstance(pre_covariates, str):
        pre_covariates = [pre_covariates]

    # Fit control model using pre-experiment covariates
    control_group = df[df['treatment'] == 0]
    X_control = control_group[pre_covariates]
    X_control = sm.add_constant(X_control)
    y_control = control_group[target_metric]
    model_control = sm.OLS(y_control, X_control).fit()

    # Compute residuals and adjust target metric
    X_all = df[pre_covariates]
    X_all = sm.add_constant(X_all)
    residuals = df[target_metric].to_numpy().flatten() - model_control.predict(X_all)
    adjustment_term = model_control.params['const'] + sum(model_control.params[covariate] * df[pre_covariates].mean()[covariate] for covariate in pre_covariates)
    adjusted_target = residuals + adjustment_term

    return adjusted_target

When we apply it to our case study data and compare the adjusted target metric to the original target metric, we see that the variance has reduced:

# Apply CUPED
pre_covariates = ["x_recency", "x_frequency", "x_value"]
target_metric = ["y_value_exp"]
df_exp_1["adjusted_target"] = cuped(df_exp_1, pre_covariates, target_metric)

# Plot results
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="adjusted_target", hue="treatment", fill=True, palette="Set1", label="Adjusted Value")
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="y_value_exp", hue="treatment", fill=True, palette="Set2", label="Original Value")
plt.title("Distribution of Value by Original vs CUPED")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend(title="Distribution")

User generated image

Does it reduce the standard error?

Now that we have applied CUPED and reduced the variance, let’s run our bootstrapping function to see what impact it has:

bootstrap_cuped_1 = bootstrapping(df_exp_1, "adjusted_target")
User generated image

If you compare this to our previous result using the original target metric you see that the confidence intervals are narrower:

bootstrap_1 = pd.DataFrame({
    'original': bootstrap_og_1,
    'cuped': bootstrap_cuped_1
})

# Plot the KDE plots
plt.figure(figsize=(10, 6))
sns.kdeplot(bootstrap_1['original'], fill=True, label='Original', color='blue')
sns.kdeplot(bootstrap_1['cuped'], fill=True, label='CUPED', color='orange')

# Add mean lines
plt.axvline(bootstrap_1['original'].mean(), color='blue', linestyle='--', linewidth=1)
plt.axvline(bootstrap_1['cuped'].mean(), color='orange', linestyle='--', linewidth=1)
plt.axvline(round(df_exp_1[df_exp_1["treatment"]==1]["treatment_effect"].mean(), 3), color='green', linestyle='--', linewidth=1, label='Treatment effect')

# Customise the plot
plt.title('Distribution of Value by Original vs CUPED')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()

# Show the plot
plt.show()

User generated image

The bootstrap difference in means also moves closer to the ground truth treatment effect. This is because CUPED is also very effective at dealing with pre-existing differences between the control and treatment group.

Does it reduce the minimum sample size?

The next question is whether it reduces the minimum sample size we need. Well, let’s find out!

treatment_effect_1 = round(df_exp_1[df_exp_1["treatment"]==1]["treatment_effect"].mean(), 2)
cuped_sample_size = power_analysis(df_exp_1[df_exp_1['treatment'] == 0]['adjusted_target'], treatment_effect_1 / df_exp_1[df_exp_1['treatment'] == 0]['adjusted_target'].mean())
User generated image

The minimum sample size needed has reduced from 1,645 to 901. Both Finance and the Data Science team are going to be pleased as we can run the experiment for a shorter time period with a smaller control sample!

Background

When I first read about CUPED, I thought of double machine learning and the similarities. If you aren’t familiar with double machine learning, check out my article from earlier in the series:

Pay attention to the first stage outcome model in double machine learning:

  • Outcome model (de-noising): Machine learning model used to estimate the outcome using just the control features. The outcome model residuals are then calculated.

This is conceptually very similar to what we are doing with CUPED!
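To see the parallel, here is a stripped-down sketch of the residual-on-residual idea using plain least squares on synthetic data. It is only an illustration of the partialling-out step: a real double machine learning implementation would use flexible models with cross-fitting, and the data, coefficients and helper name below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# Hypothetical covariates, randomly assigned treatment and outcome
x = rng.normal(size=(n, 3))
t = rng.integers(0, 2, size=n).astype(float)
true_effect = 2.0
y = x @ np.array([1.0, -2.0, 0.5]) + true_effect * t + rng.normal(size=n)

def residualise(target, features):
    """Remove the part of target that a linear model on features can explain."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

y_res = residualise(y, x)  # outcome model residuals (the de-noising step)
t_res = residualise(t, x)  # treatment model residuals

# Final stage: regress outcome residuals on treatment residuals
effect_hat = (t_res @ y_res) / (t_res @ t_res)

print(f"Outcome variance before: {y.var():.2f}, after residualising: {y_res.var():.2f}")
print(f"Estimated effect: {effect_hat:.2f} (true effect: {true_effect})")
```

The outcome residuals play the same role as the CUPED-adjusted metric: the variation explained by the covariates has been stripped out, so the treatment effect stands out against less noise.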

How does it compare to CUPED?

Let’s feed through our case study data and see if we get a similar result:

from econml.dml import LinearDML

# Train DML model
dml = LinearDML(discrete_treatment=False)
dml.fit(df_exp_1[target_metric].to_numpy().ravel(), T=df_exp_1['treatment'].to_numpy().ravel(), X=df_exp_1[pre_covariates], W=None)
ate_dml = round(dml.ate(df_exp_1[pre_covariates]))
ate_dml_lb = round(dml.ate_interval(df_exp_1[pre_covariates])[0])
ate_dml_ub = round(dml.ate_interval(df_exp_1[pre_covariates])[1])

print(f'DML confidence interval lower bound: {ate_dml_lb}')
print(f'DML confidence interval upper bound: {ate_dml_ub}')
print(f'DML ate: {ate_dml}')

User generated image

We get an almost identical result!

When we plot the residuals we can see that the variance is reduced like in CUPED (although we don’t add the mean to scale for interpretation):

# Fit outcome model using pre-experiment covariates
X_all = df_exp_1[pre_covariates]
X_all = sm.add_constant(X_all)
y_all = df_exp_1[target_metric]
outcome_model = sm.OLS(y_all, X_all).fit()

# Compute residuals
df_exp_1['outcome_residuals'] = df_exp_1[target_metric].to_numpy().flatten() - outcome_model.predict(X_all)

# Plot results
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="outcome_residuals", hue="treatment", fill=True, palette="Set1", label="Adjusted Target")
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="y_value_exp", hue="treatment", fill=True, palette="Set2", label="Original Value")
plt.title("Distribution of Value by Original vs DML")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend(title="Distribution")

plt.show()

User generated image

“So what?” I hear you ask!

Firstly, I think it’s an interesting observation for anyone using double machine learning: The first stage outcome model helps reduce the variance, and therefore we should get similar benefits to CUPED.

Secondly, it raises the question of when each method is appropriate. Let’s close things off by covering this question…

There are several reasons why it may make sense to tend towards CUPED:

  • It’s easier to understand.
  • It’s simpler to implement.
  • It’s one model rather than three, meaning you have fewer challenges with overfitting.

However, there are a couple of exceptions where double machine learning outperforms CUPED:

  • Biased treatment assignment: When the treatment assignment is biased, for example when you are using observational data, double machine learning can deal with this. My article from earlier in the series builds on this:
  • Heterogeneous treatment effects: When you want to understand effects at an individual level, for example finding out who it is worth sending discounts to, double machine learning can help with this. There is a good case study which illustrates this in my previous article on optimising treatment strategies:

Today we did a whistle-stop tour of experimentation, covering hypothesis testing, power analysis and bootstrapping. We then explored how CUPED can reduce the standard error and increase the power of our experiments. Finally, we touched on its similarities to double machine learning and discussed when each method should be used. There are a few more key points which are worth mentioning in terms of CUPED:

  • We don’t have to use linear regression: If we have multiple covariates, maybe some with non-linear relationships, we could use a machine learning technique like boosting.
  • If we do go down the route of using a machine learning technique, we need to make sure not to overfit the data.
  • Some careful thought should go into when to run CUPED: Are you going to run it before you start your experiment and then run a power analysis to determine your reduced sample size? Or are you just going to run it after your experiment to reduce the standard error?
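One common way to guard against overfitting when a flexible model produces the adjustment is to predict each user’s outcome with a model that never saw that user (out-of-fold predictions). The sketch below shows the mechanics with a simple quadratic least-squares model standing in for the machine learning model. The data, the choice of model, the number of folds and the function name are all assumptions made for illustration; a boosted model could be swapped into the same loop.

```python
import numpy as np

def out_of_fold_predictions(x, y, n_folds=5, seed=0):
    """Predict each observation using a model fitted on the other folds only."""
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(y)) % n_folds  # random, equal-sized folds
    preds = np.empty(len(y))
    for fold in range(n_folds):
        test = fold_ids == fold
        # Stand-in "flexible" model: linear and squared terms of each covariate
        design_train = np.column_stack([np.ones((~test).sum()), x[~test], x[~test] ** 2])
        design_test = np.column_stack([np.ones(test.sum()), x[test], x[test] ** 2])
        coef, *_ = np.linalg.lstsq(design_train, y[~test], rcond=None)
        preds[test] = design_test @ coef
    return preds

# Hypothetical data with a non-linear relationship
rng = np.random.default_rng(3)
x = rng.uniform(size=(3000, 3))
y = 2.0 * x[:, 0] + x[:, 1] ** 2 + rng.normal(scale=0.5, size=3000)

# Adjusted metric: out-of-fold residuals, re-centred on the original mean
preds = out_of_fold_predictions(x, y)
y_adjusted = y - preds + y.mean()

print(f"Original variance: {y.var():.3f}")
print(f"Adjusted variance: {y_adjusted.var():.3f}")
```

Because every prediction comes from a model trained on other users, any variance reduction you see reflects genuine predictive signal rather than the model memorising noise in the outcome.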