I’d like to lay out the foundational components needed to fully grasp how selection bias impacts our linear model estimation process. We have a dependent random variable, Y, which we assume has a linear relationship (subject to some error terms) with another variable, X, known as the independent variable.
Identification Assumptions
Given a subsample Y’, X’ of the population variables Y, X –
- The error terms (of the original model!!!) and X’ are not correlated.
- The mean of the error terms is zero.
- Y and X are truly related in a linear manner —
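Spelled out, the model I have in mind looks like this (my reconstruction; α, β, and the error scale correspond to INTERCEPT, BETA, and SIGMA in the simulation code below):

y_i = α + β·x_i + ε_i,   with E[ε_i] = 0 and Cov(x_i, ε_i) = 0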
It’s important to note that in empirical research we observe X and Y (or a subsample of them), but we do not observe the error terms, making assumption (1) impossible to test or validate directly. At this point, we usually rely on a theoretical explanation to justify this assumption. A common justification is through randomized controlled trials (RCTs), where the subsample, X, is collected completely at random, ensuring that it is uncorrelated with any other variable, particularly with the error terms.
Conditional Expectation
Given the assumptions mentioned earlier, we can precisely determine the form of the conditional expectation of Y given X —
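My reconstruction of that expression, using the model above:

E[Y | X = x] = α + β·x + E[ε | X = x] = α + β·x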
The last transition follows from the identification assumptions. It’s important to note that this is a function of x, meaning it represents the average of all observed values of y given that x equals a particular value (or the local average of the y’s over a small range of x values; more information can be found here).
OLS
Given a sample of X that meets the identification assumptions, it is well established that the ordinary least squares (OLS) method provides a closed-form solution for consistent and unbiased estimators of the linear model parameters, alpha and beta, and thus for the conditional expectation function of Y given X.
At its core, OLS is a technique for fitting a line (or a hyperplane in the multivariate case) to a set of (y_i, x_i) pairs. What is particularly interesting about OLS is that —
- If Y and X have a linear relationship (accounting for standard error terms), we’ve seen that the conditional expectation of Y given X is perfectly linear. In this scenario, OLS effectively recovers this function with strong statistical accuracy.
- OLS achieves this with any subsample of X that meets the identification assumptions previously discussed, given a large enough sample.
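For reference, the closed-form solution mentioned above takes, in the single-regressor case, the familiar form:

β̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,   α̂ = ȳ − β̂·x̄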
Let’s begin with a straightforward example using simulated data. We’ll simulate the linear model from above.
A significant advantage of working with simulated data is that it allows us to examine relationships between variables that are not observable in real-world scenarios, such as the error terms of the model.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

N = 5000
BETA = 0.7
INTERCEPT = -2
SIGMA = 5

df = pd.DataFrame({
    "x": np.random.uniform(0, 50, N),
    "e": np.random.normal(0, SIGMA, N)
})
df["y"] = INTERCEPT + BETA * df["x"] + df["e"]
and running OLS on the full sample –
r1 = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()

plt.scatter(df["x"], df["y"], label="population")
plt.plot(df["x"], r1.fittedvalues, label="OLS Population", color="r")
Now, let’s generate a random subsample of our population, X, and apply OLS to this subsample. I’ll randomly select 100 x’s from the 5,000 samples I previously generated, and then run OLS on this subset.
sample1 = df.sample(100)
r2 = sm.OLS(sample1["y"], sm.add_constant(sample1["x"])).fit()

plt.scatter(sample1["x"], sample1["y"], label="sample")
plt.plot(sample1["x"], r2.fittedvalues, label="OLS Random sample")
and plot –
It appears we obtain consistent estimators for the random subsample, as both OLS runs produce quite similar conditional expectation lines. Additionally, you can observe the correlation between X and the error terms —
print(f"corr {np.corrcoef(df['x'], df['e'])}")
print(f"E(e|x) {np.imply(df['e'])}")# corr [[ 1. -0.02164744]
# [-0.02164744 1. ]]
# E(e|x) 0.004016713100777963
This indicates that the identification assumptions are met. In practice, however, we cannot directly compute these quantities, since the errors are not observable. Now, let’s create a new subsample — I’ll select all (y, x) pairs where y ≤ 10:
sample2 = df[df["y"] <= 10]
r3 = sm.OLS(sample2["y"], sm.add_constant(sample2["x"])).match()plt.scatter(sample2["x"], sample2["y"], label="Chosen pattern")
plt.plot(pattern["x"], r3.fittedvalues, label="OLS Chosen Pattern")
and we get –
Now, OLS has produced a completely different line. Let’s check the correlation between the subsample’s X values and the errors.
print(f"corr {np.corrcoef(df['x'], df['e'])}")
print(f"E(e|x) {np.imply(df['e'])}")# corr [[ 1. -0.48634973]
# [-0.48634973 1. ]]
# E(e|x) -2.0289245650303616
It seems the identification assumptions are violated. Let’s also plot the sub-sample error terms as a function of X –
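The figure itself is not shown here; a minimal sketch of how such a plot could be produced from the sample2 subsample defined above:

plt.scatter(sample2["x"], sample2["e"], alpha=0.3)  # simulated errors of the censored subsample
plt.xlabel("x")
plt.ylabel("e")
plt.show()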
As you can see, as X increases there are fewer large errors, indicating a clear correlation that results in biased and inconsistent OLS estimators.
Let’s explore this further.
So, what’s going on here?
I’ll reference the model introduced by James Heckman, who, along with Daniel McFadden, received the Nobel Memorial Prize in Economic Sciences in 2000. Heckman is renowned for his pioneering work in econometrics and microeconomics, particularly for his contributions to addressing selection bias and self-selection in quantitative analysis. His well-known Heckman correction will be discussed later in this context.
In his 1979 paper, “Sample Selection Bias as a Specification Error,” Heckman illustrates how selection bias arises from censoring the dependent variable — a specific case of selection that can be extended to more general non-random sample selection processes.
Censoring the dependent variable is exactly what we did when creating the last subsample in the previous section. Let’s examine Heckman’s framework.
We start with a full sample (or population) of (y_i, x_i) pairs. In this scenario, given x_i, ε_i can vary: it can be positive, negative, small, or large, depending only on the error distribution. We refer to this complete sample of the dependent variable as y*. We then define y as the censored dependent variable, which includes only the values we actually observe.
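In symbols (my sketch, writing c for the censoring threshold, which is 10 in our simulation):

y*_i = α + β·x_i + ε_i
y_i = y*_i, observed only when y*_i ≤ c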
Now, let’s calculate the conditional expectation of the censored variable, y:
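A reconstruction of that calculation, under the setup just described:

E[y_i | x_i, y*_i ≤ c] = α + β·x_i + E[ε_i | ε_i ≤ c − α − β·x_i]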
As you can see, this function resembles the one we saw earlier, but it includes an additional term. This last term cannot be ignored, which means the conditional expectation function is no longer purely linear in x (plus noise). Consequently, running OLS on the censored sample will produce biased estimators for alpha and beta.
Moreover, this equation illustrates how the selection bias problem can be viewed as an omitted variable problem: since the last term depends on X, it shares a significant amount of variance with the dependent variable.
Inverse Mills ratio
Heckman’s correction method is based on the following principle: given a random variable Z that follows a normal distribution with mean μ and standard deviation σ, the following equations hold:
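Stated explicitly (these are the standard truncated-normal expectations, written in my notation):

E[Z | Z > α] = μ + σ·φ((α − μ)/σ) / (1 − Φ((α − μ)/σ))
E[Z | Z < α] = μ − σ·φ((α − μ)/σ) / Φ((α − μ)/σ)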
Here α is any constant, Φ (capital phi) is the standard normal distribution’s CDF, and φ (lowercase phi) is the standard normal distribution’s PDF. These ratios are known as the inverse Mills ratio.
So, how does this help us? Let’s revisit the last term of the conditional expectation equation above —
Combined with the fact that our error terms follow a normal distribution, we can use the inverse Mills ratio to characterize their behavior.
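Applying the second formula above to our zero-mean, normally distributed errors gives (my sketch of this step):

E[ε_i | ε_i ≤ c − α − β·x_i] = −σ·φ(z_i) / Φ(z_i),   where z_i = (c − α − β·x_i) / σ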
Back to our model
The advantage of the inverse Mills ratio is that it turns the previous conditional expectation function into the following form —
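My reconstruction of that form, consistent with the mills term computed in the code below:

E[y_i | x_i, y*_i ≤ c] = α + β·x_i − σ·φ(z_i) / Φ(z_i),   z_i = (c − α − β·x_i) / σ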
This is a linear function with one additional covariate, the inverse Mills ratio. Therefore, to estimate the model parameters, we can apply OLS to this revised formula.
Let’s first calculate the inverse Mills ratio –
and in code:
from scipy.stats import norm

sample2["z"] = (CENSOR - INTERCEPT - BETA * sample2["x"]) / SIGMA
sample2["mills"] = -SIGMA * (norm.pdf(sample2["z"]) / norm.cdf(sample2["z"]))
and run OLS —
corrected_ols = sm.OLS(sample2["y"], sm.add_constant(sample2[["x", "mills"]])).fit(cov_type="HC1")
print(corrected_ols.summary())
And the output —
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.313
Model: OLS Adj. R-squared: 0.313
Method: Least Squares F-statistic: 443.7
Date: Mon, 12 Aug 2024 Prob (F-statistic): 3.49e-156
Time: 16:47:01 Log-Likelihood: -4840.1
No. Observations: 1727 AIC: 9686.
Df Residuals: 1724 BIC: 9702.
Df Model: 2
Covariance Type: HC1
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.8966 0.268 -7.088 0.000 -2.421 -1.372
x 0.7113 0.047 14.982 0.000 0.618 0.804
mills 1.0679 0.130 8.185 0.000 0.812 1.324
==============================================================================
Omnibus: 96.991 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 115.676
Skew: -0.571 Prob(JB): 7.61e-26
Kurtosis: 3.550 Cond. No. 34.7
==============================================================================
In Reality
α and β are the unobserved parameters of the model that we aim to estimate, so in practice we cannot directly calculate the inverse Mills ratio as we did above. Heckman introduces a preliminary step in his correction method to help estimate the inverse Mills ratio, which is why the Heckman correction is known as a two-stage estimator.
To recap, our issue is that we do not observe all values of the dependent variable. For instance, if we are analyzing how education (Z) influences wages (Y) but only observe wages above a certain threshold, we need a theoretical explanation for the education levels of individuals with wages below this threshold. Once we have that, we can estimate a probit model of the following form:
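My sketch of that first-stage model, with s_i indicating whether observation i is selected into the sample and γ_0, γ_1 as the probit coefficients:

P(s_i = 1 | x_i) = Φ(γ_0 + γ_1·x_i)

In our censored simulation, s_i = 1 exactly when y_i ≤ c, so the fitted index γ_0 + γ_1·x_i estimates (c − α − β·x_i) / σ.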
and use the estimated parameters of this probit model to compute an estimator of the inverse Mills ratio. In our case (notice I don’t use α and β) —
from statsmodels.discrete.discrete_model import Probit

pbit = Probit(df["y"] <= CENSOR, sm.add_constant(df["x"])).fit()
sample2["z_pbit"] = pbit.params.const + sample2["x"] * pbit.params.x
sample2["mills_pbit"] = -SIGMA * (norm.pdf(sample2["z_pbit"]) / norm.cdf(sample2["z_pbit"]))
corrected_ols = sm.OLS(sample2["y"], sm.add_constant(sample2[["x", "mills_pbit"]])).fit(cov_type="HC1")
and again, OLS for the second stage gives us consistent estimators —
OLS Regression Results
...
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.8854 0.267 -7.068 0.000 -2.408 -1.363
x 0.7230 0.049 14.767 0.000 0.627 0.819
mills_pbit 1.1005 0.135 8.165 0.000 0.836 1.365
==============================================================================
We used simulated data to demonstrate a sample selection bias problem resulting from censoring the dependent variable. We explored how this issue relates to the OLS causal identification assumptions by examining the simulated errors of our model and the biased subsample. Finally, we introduced Heckman’s method for correcting the bias, allowing us to obtain consistent and unbiased estimators even when working with a biased sample.