Omitted Variable Bias. An intro to a particularly sneaky bias… | by Sachin Date | Aug, 2024

Rothstein, A., photographer. (1939) Farm family at dinner. Fairfield Bench Farms, Montana. Montana Fairfield Bench Farms United States Teton County, 1939. May. [Photograph] Retrieved from the Library of Congress, https://www.loc.gov/item/2017777606/.

An intro to a particularly sneaky bias that invades many regression models

From 2000 to 2013, a flood of research showed a striking correlation between the rate of risky behavior among adolescents and how often they ate meals with their family.

Study after study seemed to reach the same conclusion:

The greater the number of meals per week that adolescents had with their family, the lower their odds of indulging in substance abuse, violence, delinquency, vandalism, and many other problem behaviors.

A higher frequency of family meals also correlated with reduced stress, reduced incidence of childhood depression, and reduced frequency of suicidal thoughts. Eating together correlated with increased self-esteem and generally increased emotional well-being among adolescents.

Soon, the media got wind of these results, and they were packaged and distributed as easy-to-consume sound bites, such as this one:

“Studies show that the more often families eat together, the less likely kids are to smoke, drink, do drugs, get depressed, develop eating disorders and consider suicide, and the more likely they are to do well in school, delay having sex, eat their vegetables, learn big words and know which fork to use.” (TIME Magazine, “The magic of the family meal”, June 4, 2006)

One of the largest studies on the topic was conducted in 2012 by the National Center on Addiction and Substance Abuse (CASA) at Columbia University. CASA surveyed 1003 American teenagers aged 12 to 17 about various aspects of their lives.

CASA discovered the same, and in some cases startlingly clear, correlations between the number of meals adolescents had with their family and a broad range of behavioral and emotional parameters.

There was no escaping the conclusion.

Family meals make well-adjusted teens.

Until you read what is literally the last sentence in CASA’s 2012 white paper:

“Because this is a cross-sectional survey, the data cannot be used to establish causality or measure the direction of the relationships that are observed between pairs of variables in the White Paper.”

And so here we come to a few salient points.

Frequency of family meals may not be the only driver of the reduction in risky behaviors among adolescents. It may not even be the primary driver.

Families who eat together more frequently may do so simply because they already share a comfortable relationship and have good communication with one another.

Eating together may even be the effect of a healthy, well-functioning family.

And children from such families may simply be less likely to indulge in risky behaviors and more likely to enjoy better mental health.

Several other factors are also at play. Factors such as demography, the child’s personality, and the presence of the right role models at home, school, or elsewhere might make children less susceptible to risky behaviors and poor mental health.

Clearly, the truth, as is often the case, is murky and multivariate.

Although, make no mistake, ‘Eat together’ is not bad advice, as advice goes. The trouble with it is the following:

Most of the studies on this topic, including the CASA study, as well as a particularly thorough meta-analysis of 14 other studies published by Goldfarb et al in 2013, did in fact carefully measure and tease out the partial effects of all of these factors on adolescent risky behavior.

So what did the researchers find?

They found that the partial effect of the frequency of family meals on the observed rate of risky behaviors in adolescents was considerably diluted when other factors such as demography, personality, and nature of relationship with the family were included in the regression models. The researchers also found that in some cases, the partial effect of frequency of family meals completely disappeared.

Here, for example, is a finding from Goldfarb et al (2013) (FFM=Frequency of Family Meals):

“The associations between FFM and the outcome in question were most likely to be statistically significant with unadjusted models or univariate analyses. Associations were less likely to be significant in models that controlled for demographic and family characteristics or family/parental connectedness. When methods like propensity score matching were used, no significant associations were found between FFM and alcohol or tobacco use. When methods to control for time-invariant individual characteristics were used, the associations were significant about half the time for substance use, five of 16 times for violence/delinquency, and two of two times for depression/suicide ideation.”

Wait, but what does all this have to do with bias?

The relevance to bias comes from two unfortunately co-existing properties of the frequency of family meals variable:

  1. On one hand, most studies on the topic found that the frequency of family meals does have an intrinsic partial effect on the susceptibility to risky behavior. But the effect is weak when you factor in other variables.
  2. At the same time, the frequency of family meals is also heavily correlated with several other variables, such as the nature of inter-personal relationships with other family members, the nature of communication within the family, the presence of role models, the personality of the child, and demographics such as household income. All of these variables, it was found, have a strong joint correlation with the rate of indulgence in risky behaviors.

The way the math works is that if you unwittingly omit even a single one of these other variables from your regression model, the coefficient of the frequency of family meals gets biased in the negative direction. In the next two sections, I’ll show exactly why that happens.

This negative bias on the coefficient of frequency of family meals will make it appear that simply increasing the number of times families sit together to eat ought to, by itself, considerably reduce the incidence of, oh, say, alcohol abuse among adolescents.

The above phenomenon is called Omitted Variable Bias. It’s one of the most frequently occurring, and easily missed, biases in regression studies. If not spotted and accounted for, it can lead to unfortunate real-world consequences.

For example, any social policy that disproportionately stresses the need for increasing the number of times families eat together as a major means to reduce childhood substance abuse will inevitably miss its design goal.

Now, you might ask, isn’t much of this problem caused by selecting explanatory variables that correlate with each other so strongly? Isn’t it just an example of a sloppily conducted variable-selection exercise? Why not select variables that are correlated only with the response variable?

After all, shouldn’t a skilled statistician be able to employ their ample training and imagination to identify a set of factors that have no more than a passing correlation with one another and that are likely to be strong determinants of the response variable?

Sadly, in any real-world setting, finding a set of explanatory variables that are only slightly (or not at all) correlated is the stuff of dreams, if even that.

But to paraphrase G. B. Shaw, if your imagination is full of ‘fairy princesses and noble natures and fearless cavalry charges’, you might just come across a complete set of perfectly orthogonal explanatory variables, as statisticians like to so evocatively call them. But again, I’ll bet you the Brooklyn Bridge that even in your sweetest statistical dreamscapes, you will not find them. You are more likely to stumble into the non-conforming Loukas and the reality-embracing Captain Bluntschlis instead of greeting the quixotic Rainas and the Major Saranoffs.

An idealized depiction of family life by Norman Rockwell. “Freedom from Want”, Published: March 6, 1943 (Public domain artwork)

And so, we must learn to live in a world where explanatory variables freely correlate with one another, while at the same time influencing the response of the model to varying degrees.

In our world, omitting one of these variables, whether by accident, or by the innocent ignorance of its existence, or by the lack of means to measure it, or through sheer carelessness, causes the model to be biased. We might as well develop a better appreciation of this bias.

In the rest of this article, I’ll explore Omitted Variable Bias in great detail. Specifically, I’ll cover the following:

  • Definition and properties of omitted variable bias.
  • Formula for estimating the omitted variable bias.
  • An analysis of the omitted variable bias in a model of adolescent risky behavior.
  • A demo and calculation of omitted variable bias in a regression model trained on a real-world dataset.

From a statistical perspective, omitted variable bias is defined as follows:

When an important explanatory variable is omitted from a regression model and the truncated model is fitted on a dataset, the expected values of the estimated coefficients of the non-omitted variables in the fitted model shift away from their true population values. This shift is called omitted variable bias.

Even when a single important variable is omitted, the expected values of the coefficients of all the non-omitted explanatory variables in the model become biased. No variable is spared from the bias.

Magnitude of the bias

In linear models, the magnitude of the bias depends on the following three quantities:

  1. Covariance of the non-omitted variable with the omitted variable: The bias on a non-omitted variable’s estimated coefficient is directly proportional to the covariance of the non-omitted variable with the omitted variable, conditioned upon the rest of the variables in the model. In other words, the more tightly correlated the omitted variable is with the variables that are left behind, the heavier the price you pay for omitting it.
  2. Coefficient of the omitted variable: The bias on a non-omitted variable’s estimated coefficient is directly proportional to the population value of the coefficient of the omitted variable in the full model. The greater the influence of the omitted variable on the model’s response, the bigger the hole you dig for yourself by omitting it.
  3. Variance of the non-omitted variable: The bias on a non-omitted variable’s estimated coefficient is inversely proportional to the variance of the non-omitted variable, conditioned upon the rest of the variables in the model. The more scattered the non-omitted variable’s values are around its mean, the less affected it is by the bias. This is yet another place in which the well-known effect of bias-variance tradeoff makes its presence felt.
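The three dependencies listed above can be seen in a small simulation. What follows is a minimal sketch on synthetic data, not the dataset used later in the article; the coefficient values, the sample size, and the helper name `ols` are all assumptions made for illustration. It fits a full and a truncated model with NumPy and compares the shift in the coefficient of x1 with the product of the omitted variable's coefficient and the ratio Cov(x1, x2) / Var(x1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
gamma0, gamma1, gamma2 = 1.0, 2.0, -3.0  # hypothetical population coefficients

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # x2 is correlated with x1
y = gamma0 + gamma1 * x1 + gamma2 * x2 + rng.normal(size=n)

def ols(columns, y):
    """Least-squares fit of y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

full = ols([x1, x2], y)   # [intercept, coef of x1, coef of x2]
truncated = ols([x1], y)  # [intercept, coef of x1], with x2 omitted

# Sample versions of the covariance and variance that drive the bias.
cov_x1_x2 = np.cov(x1, x2)[0, 1]
var_x1 = np.var(x1, ddof=1)
predicted_bias = full[2] * cov_x1_x2 / var_x1

print("coefficient of x1, full model:     ", full[1])
print("coefficient of x1, truncated model:", truncated[1])
print("observed shift:                    ", truncated[1] - full[1])
print("shift predicted by the ratio:      ", predicted_bias)
```

With only one retained regressor there are no other variables to condition on, so the plain sample covariance and variance are enough, and the observed shift agrees with the predicted one up to floating-point error.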

Direction of the bias

In most cases, the direction of omitted variable bias on the estimated coefficient of a non-omitted variable is unfortunately hard to judge. Whether the bias will boost or attenuate the estimate is hard to tell without actually knowing the omitted variable’s coefficient in the full model, and working out the conditional covariance and conditional variance of the non-omitted variable.

In this section, I’ll present the formula for Omitted Variable Bias that is applicable to coefficients of only linear models. But the general concepts and principles of how the bias works, and the factors it depends on, carry over smoothly to various other kinds of models.

Consider the following linear model which regresses y on x_1 through x_m and a constant:

A linear model that regresses y on x_1 through x_m and a constant (Image by Author)
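Written out in LaTeX, and assuming the notation defined in the surrounding text (γ for population coefficients, ϵ for the error, and 1 for a column of ones), the model would read roughly as follows. This is a reconstruction from those definitions rather than a copy of the figure:

```latex
y = \gamma_0 \mathbf{1} + \gamma_1 x_1 + \gamma_2 x_2 + \dots + \gamma_m x_m + \epsilon
```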

In this model, γ_1 through γ_m are the population values of the coefficients of x_1 through x_m respectively, and γ_0 is the intercept (a.k.a. the regression constant). ϵ is the regression error. It captures the variance in y that x_1 through x_m and γ_0 are jointly unable to explain.

As a side note, y, x_1 through x_m, 1, and ϵ are all column vectors of size n x 1, meaning they each contain n rows and 1 column, with ‘n’ being the number of samples in the dataset on which the model operates.

Lest you get ready to take flight and flee, let me assure you that beyond mentioning the above fact, I will not go any further into matrix algebra in this article. But you have to let me say the following: if it helps, I find it useful to think of an n x 1 column vector as a vertical cabinet with (n - 1) internal shelves and a number sitting on each shelf.

Anyway.

Now, let’s omit the variable x_m from this model. After omitting x_m, the truncated model looks like this:

A truncated linear model that regresses y on x_1 through x_(m-1) and a constant (Image by Author)
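In LaTeX, and again as a reconstruction from the definitions in the text rather than a copy of the figure, the truncated model can be sketched as:

```latex
y = \beta_0 \mathbf{1} + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{m-1} x_{m-1} + \epsilon
```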

In the above truncated model, I’ve replaced all the gammas with betas to remind us that after dropping x_m, the coefficients of the truncated model will be decidedly different from those in the full model.

The question is, how different are the betas from the gammas? Let’s find out.

If you fit (train) the truncated model on the training data, you will get a fitted model. Let’s represent the fitted model as follows:

The fitted truncated model (Image by Author)

In the fitted model, β_0_cap through β_(m-1)_cap are the fitted (estimated) values of the coefficients β_0 through β_(m-1). ‘e’ is the residual error, which captures the variance in the observed values of y that the fitted model is unable to explain.

The theory says that the omission of x_m has biased the expected value of every single coefficient from β_0_cap through β_(m-1)_cap away from their true population values γ_0 through γ_(m-1).

Let’s examine the bias on the estimated coefficient β_k_cap of the kth regression variable, x_k.

The amount by which the expected value of β_k_cap in the truncated fitted model is biased is given by the following equation:

The expected value of the coefficient β_k_cap in the truncated fitted model is biased by an amount equal to the scaled ratio of the conditional covariance of x_k and x_m, and the conditional variance of x_k (Image by Author)
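Based on the caption and the term-by-term description in the text, the equation can be reconstructed in LaTeX roughly as follows. The shorthand "rest" is an assumption introduced here to stand for the remaining variables in the model, that is, every x other than x_k and x_m:

```latex
E\left(\hat{\beta}_k \mid x_1, \dots, x_m\right)
  = \gamma_k
  + \gamma_m \,
    \frac{\operatorname{Cov}\left(x_k, x_m \mid \text{rest}\right)}
         {\operatorname{Var}\left(x_k \mid \text{rest}\right)}
```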

Let’s note all of the following things about the above equation:

  • β_k_cap is the estimated coefficient of the non-omitted variable x_k in the truncated model. You get this estimate of β_k from fitting the truncated model on the data.
  • E(β_k_cap | x_1 through x_m) is the expected value of the above mentioned estimate, conditioned on all the observed values of x_1 through x_m. Note that x_m is actually not observed. We’ve omitted it, remember? Anyway, the expectation operator E() has the following meaning: if you train the truncated model on thousands of randomly drawn datasets, you will get thousands of different estimates of β_k_cap. E(β_k_cap) is the mean of all these estimates.
  • γ_k is the true population value of the coefficient of x_k in the full model.
  • γ_m is the true population value of the coefficient of the variable x_m that was omitted from the full model.
  • The covariance term in the above equation represents the covariance of x_k with x_m, conditioned on the rest of the variables in the full model.
  • Similarly, the variance term represents the variance of x_k conditioned on all the other variables in the full model.

The above equation tells us the following:

  • First and foremost, had x_m not been omitted, the expected value of β_k_cap in the fitted model would have been γ_k. This is a property of correctly specified linear models fitted using the OLS technique: the expected value of each estimated coefficient in the fitted model is the population value of the respective coefficient, which is to say the estimate is unbiased.
  • However, due to the missing x_m in the truncated model, the expected value of β_k_cap has become biased away from its population value, γ_k.
  • The amount of bias is the ratio of the conditional covariance of x_k with x_m to the conditional variance of x_k, scaled by γ_m.
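The "thousands of datasets" reading of the expectation operator can be tried out directly. The sketch below is an illustration under stated assumptions: the data are synthetic, the coefficient values are made up, and the regressors are held fixed while only the error term is redrawn, so that the average is conditional on the x values as in the formula. It averages the truncated-model estimate over many simulated datasets and compares the average with γ_1 + γ_2 × Cov(x1, x2) / Var(x1):

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_datasets = 200, 2_000
gamma0, gamma1, gamma2 = 0.5, 1.5, 2.0  # hypothetical population coefficients

# Hold the regressors fixed so the expectation is conditional on them.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)

x1_centered = x1 - x1.mean()
estimates = np.empty(n_datasets)
for i in range(n_datasets):
    eps = rng.normal(size=n)  # a fresh draw of the regression error
    y = gamma0 + gamma1 * x1 + gamma2 * x2 + eps
    # OLS slope of y on x1 in a model with an intercept and x2 omitted.
    estimates[i] = x1_centered @ y / (x1_centered @ x1_centered)

mean_estimate = estimates.mean()
formula_value = gamma1 + gamma2 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print("mean of truncated-model estimates:", mean_estimate)
print("gamma1 + gamma2 * Cov / Var:      ", formula_value)
print("true gamma1:                      ", gamma1)
```

The mean of the estimates settles near the value given by the formula, not near the true γ_1, which is what it means for the estimator to be biased.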

The above formula for the omitted variable bias should give you a first glimpse of the appalling carnage wreaked on your regression model, should you unwittingly omit even a single explanatory variable that happens to be not only highly influential but also heavily correlated with one or more non-omitted variables in the model.

As we’ll see in the following section, that is, regrettably, just what happens in a particular kind of flawed model for estimating the rate of risky behavior in adolescents.

Let’s apply the formula for the omitted variable bias to a model that tries to explain the rate of risky behavior in adolescents. We’ll examine a scenario in which one of the regression variables is omitted.

But first, we’ll look at the full (non-omitted) version of the model. Specifically, let’s consider a linear model in which the rate of risky behavior is regressed on suitably quantified versions of the following four factors:

  1. frequency of family meals,
  2. how well-informed a child thinks their parents are about what’s going on in their life,
  3. the quality of the relationship between parent and child, and
  4. the child’s intrinsic personality.

For simplicity, we’ll use the variables x_1, x_2, x_3 and x_4 to represent the above four regression variables.

Let y represent the response variable, namely, the rate of risky behaviors.

The linear model is as follows:

A linear model of y regressed on x_1, x_2, x_3 and x_4 and a constant (Image by Author)

We’ll study the biasing effect of omitting x_2 (=how well-informed a child thinks their parents are about what’s going on in their life) on the coefficient of x_1 (=frequency of family meals).

If x_2 is omitted from the above linear model, and the truncated model is fitted, the fitted model looks like this:

The truncated and fitted model (Image by Author)

In the fitted model, β_1_cap is the estimated coefficient of the frequency of family meals. Thus, β_1_cap quantifies the partial effect of frequency of family meals on the rate of risky behavior in adolescents.

Using the formula for the omitted variable bias, we can state the expected value of the partial effect of x_1 as follows:

Formula for the expected value of β_1_cap (Image by Author)
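Applying the general formula with x_1 as the retained variable and x_2 as the omitted one, the expression would read roughly as follows in LaTeX. This is a reconstruction from the surrounding text, with the conditioning set taken to be the two remaining variables, x_3 and x_4:

```latex
E\left(\hat{\beta}_1 \mid x_1, x_2, x_3, x_4\right)
  = \gamma_1
  + \gamma_2 \,
    \frac{\operatorname{Cov}\left(x_1, x_2 \mid x_3, x_4\right)}
         {\operatorname{Var}\left(x_1 \mid x_3, x_4\right)}
```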

Studies have shown that frequency of family meals (x_1) happens to be heavily correlated with how well-informed a child thinks their parents are about what’s going on in their life (x_2). Now look at the covariance in the numerator of the bias term. Since x_1 is highly (and positively) correlated with x_2, the large positive covariance makes the numerator large.

If that weren’t enough, the same studies have shown that x_2 (=how well-informed a child thinks their parents are about what’s going on in their life) is itself heavily correlated (inversely) with the rate of risky behavior that the child indulges in (y). Therefore, we’d expect the coefficient γ_2 in the full model to be large and negative.

The large covariance and the large negative γ_2 join forces to make the bias term large and negative. It’s easy to see how such a large negative bias will drive down the expected value of β_1_cap deep into negative territory.

It’s this large negative bias that will make it seem like the frequency of family meals has an outsized partial effect on explaining the rate of risky behavior in adolescents.

All of this bias occurs by the inadvertent omission of a single highly influential variable.
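The sign logic in this argument is easy to check numerically. The helper below, `omitted_variable_bias`, is a hypothetical function written for this illustration, and the numbers passed to it are made up solely to show how the signs combine; they are not estimates from any of the studies mentioned:

```python
def omitted_variable_bias(gamma_omitted, cov_kept_omitted, var_kept):
    """Bias on the kept variable's coefficient: gamma * Cov / Var."""
    if var_kept <= 0:
        raise ValueError("variance must be positive")
    return gamma_omitted * cov_kept_omitted / var_kept

# Hypothetical magnitudes: a strongly negative coefficient on the omitted
# variable (x_2) and a strongly positive covariance between x_1 and x_2.
negative_case = omitted_variable_bias(
    gamma_omitted=-2.0, cov_kept_omitted=0.9, var_kept=1.0
)

# Flipping the sign of the omitted variable's coefficient flips the bias.
positive_case = omitted_variable_bias(
    gamma_omitted=2.0, cov_kept_omitted=0.9, var_kept=1.0
)

print(negative_case, positive_case)
```

Because the variance in the denominator is always positive, the sign of the bias is the sign of the product of the omitted variable's coefficient and the covariance.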

Until now, I’ve relied on equations and formulae to provide a descriptive demonstration of how omitting an important variable biases a regression model.

In this section, I’ll show you the bias in action on real-world data.

For illustration, I’ll use the following automobiles dataset published by UC Irvine.

The automobiles dataset (License: CC BY 4.0) (Image by Author)

Each row in the dataset contains 26 different features of a unique vehicle. The characteristics include make, number of doors, engine features such as fuel type, number of cylinders, and engine aspiration, physical dimensions of the vehicle such as length, breadth, height, and wheel base, and the vehicle’s fuel efficiency on city and highway roads.

There are 205 unique vehicles in this dataset.

Our goal is to build a linear model for estimating the fuel efficiency of a vehicle in the city.

Out of the 26 variables covered by the data, two variables, curb weight and horsepower, happen to be the most potent determiners of fuel efficiency. Why these two in particular? Because, out of the 25 potential regression variables in the dataset, only curb weight and horsepower have statistically significant partial correlations with fuel efficiency. If you are curious how I went about the process of identifying these variables, take a look at my article on the partial correlation coefficient.
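For readers who want a feel for what a partial correlation measures, here is a minimal sketch of one common way to compute it: regress both variables on the controls and correlate the residuals. The function name `partial_corr` and the synthetic data are assumptions made for illustration. This is not the procedure from the article linked above, and it does not include the significance test:

```python
import numpy as np

def partial_corr(x, y, controls):
    """Correlation between x and y after regressing both on the controls."""
    Z = np.column_stack([np.ones(len(x)), controls])
    resid_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    resid_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(resid_x, resid_y)[0, 1]

# Synthetic example: z drives both x and y, and x has no direct effect on y.
rng = np.random.default_rng(1)
z = rng.normal(size=2_000)
x = z + 0.5 * rng.normal(size=2_000)
y = -2.0 * z + 0.5 * rng.normal(size=2_000)

raw = np.corrcoef(x, y)[0, 1]
partial = partial_corr(x, y, z)

print("raw correlation of x and y:        ", raw)
print("partial correlation, controlling z:", partial)
```

In this synthetic setup the raw correlation is strongly negative while the partial correlation is close to zero, which is the kind of contrast that helps separate a variable with a genuine partial effect from one that merely travels with another variable.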

A linear model of fuel efficiency (in the city) regressed on curb weight and horsepower is as follows:

A linear model of fuel efficiency (Image by Author)

Notice that the above model has no intercept. That is so because when either one of curb weight and horsepower is zero, the other one has to be zero. And you will agree that it will be quite unusual to come across a vehicle with zero weight and horsepower but somehow sporting a positive mileage.

So next, we’ll filter out the rows in the dataset containing missing data. And from the remaining data, we’ll carve out two randomly selected datasets for training and testing the model in an 80:20 ratio. After doing this, the training data happens to contain 127 vehicles.

If you were to train the model in equation (1) on the training data using Ordinary Least Squares, you would get the estimates γ_1_cap and γ_2_cap for the coefficients γ_1 and γ_2.
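The steps just described (drop incomplete rows, split 80:20, fit by OLS without an intercept) can be sketched with NumPy alone. Note the assumptions: the arrays below are synthetic stand-ins whose names merely mirror the article's variables, the generating coefficients are made up, and this is not the code linked at the end of the article:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 205

# Synthetic stand-in columns; these are NOT the UCI automobile data.
curb_weight = rng.uniform(1500, 4000, size=n)
horsepower = 0.05 * curb_weight + rng.normal(scale=15, size=n)
city_mpg = 0.02 * curb_weight - 0.25 * horsepower + rng.normal(scale=2, size=n)
horsepower[rng.choice(n, size=5, replace=False)] = np.nan  # simulate gaps

# Step 1: filter out rows containing missing data.
data = np.column_stack([curb_weight, horsepower, city_mpg])
data = data[~np.isnan(data).any(axis=1)]

# Step 2: random 80:20 split into training and test sets.
shuffled = rng.permutation(len(data))
n_train = int(0.8 * len(data))
train, test = data[shuffled[:n_train]], data[shuffled[n_train:]]

# Step 3: OLS fit. No column of ones is added, so there is no intercept.
X_train, y_train = train[:, :2], train[:, 2]
gamma_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

print("rows kept:", len(data), "train:", len(train), "test:", len(test))
print("estimated coefficients (curb_weight, horsepower):", gamma_hat)
```

Leaving out the column of ones is what makes this a regression through the origin, matching the no-intercept form of the model described in the text.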

At the end of this article, you’ll find the link to the Python code for doing this training plus all other code used in this article.

Meanwhile, following is the equation of the trained model:

The fitted linear model of automobile fuel efficiency (Image by Author)

Now suppose you were to omit the variable horsepower from the model. The truncated model looks like this:

The truncated linear model of automobile fuel efficiency (Image by Author)

If you were to train the model in equation (3) on the training data using OLS, you would get the following estimate for β_1:

The fitted truncated model (Image by Author)

Thus, β_1_cap is 0.01. This is different from the 0.0193 in the full model.

Because of the omitted variable, the expected value of β_1_cap has gotten biased as follows:

Formula for the expected value of β_1_cap (Image by Author)
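For this two-variable model, the formula referred to in the text as equation (5) can be reconstructed in LaTeX roughly as follows. The conditioning set and the three quantities are taken from the surrounding text; since curb weight is the only retained variable, the covariance and variance need no further conditioning:

```latex
E\left(\hat{\beta}_1 \mid \text{curb\_weight}, \text{horsepower}\right)
  = \gamma_1
  + \gamma_2 \,
    \frac{\operatorname{Cov}\left(\text{curb\_weight}, \text{horsepower}\right)}
         {\operatorname{Var}\left(\text{curb\_weight}\right)}
  \tag{5}
```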

As mentioned earlier, in a non-biased linear model fitted using OLS, the expected value of β_1_cap will be the true population value of the coefficient, which is γ_1. Thus, in a non-biased model:

E(β_1_cap) = γ_1

But the omission of horsepower has biased this expectation as shown in equation (5).

To calculate the bias, you need to know three quantities:

  1. γ_2: This is the population value of the coefficient of horsepower in the full model shown in equation (1).
  2. Covariance(curb_weight, horsepower): This is the population value of the covariance.
  3. Variance(curb_weight): This is the population value of the variance.

Unfortunately, none of the three values are computable because the overall population of all vehicles is inaccessible to you. All you have is a sample of 127 vehicles.

In practice though, you can estimate this bias by substituting sample values for the population values.

Thus, in place of γ_2, you can use γ_2_cap = -0.2398 from equation (2).

Similarly, using the training data of 127 vehicles as the data sample, you can calculate the sample covariance of curb_weight and horsepower, and the sample variance of curb_weight.

The sample covariance comes out to be 11392.85. The sample variance of curb_weight comes out to be 232638.78.

With these values, the bias term in equation (5) can be estimated as follows:

Estimated bias on E(β_1_cap) (Image by Author)
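The arithmetic behind this estimate is short enough to reproduce. The following sketch simply plugs the three sample values quoted in the text into the bias term; the variable names are chosen here for readability:

```python
gamma_2_hat = -0.2398    # estimated coefficient of horsepower, full model
sample_cov = 11392.85    # sample covariance of curb_weight and horsepower
sample_var = 232638.78   # sample variance of curb_weight

estimated_bias = gamma_2_hat * sample_cov / sample_var
print(round(estimated_bias, 5))  # -0.01174
```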

Getting a feel for the impact of the omitted variable bias

To get a sense of how strong this bias is, let’s return to the fitted full model:

The fitted linear model of automobile fuel efficiency (Image by Author)

In the above model, γ_1_cap = 0.0193. Our calculation shows that the bias on the estimated value of γ_1 is 0.01174 in the negative direction. The magnitude of this bias (0.01174) is 0.01174/0.0193*100 = 60.83, in other words, an alarming 60.83% of the estimated value of γ_1.

There is no gentle way to say this: Omitting the highly influential variable horsepower has wreaked havoc on your simple linear regression model.

Omitting horsepower has precipitously attenuated the expected value of the estimated coefficient of the non-omitted variable curb_weight. Using equation (5), you will be able to approximate the attenuated value of this coefficient as follows:

E(β_1_cap | curb_weight, horsepower)
= γ_1_cap + bias = 0.0193 - 0.01174 = 0.00756
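Both figures can be checked in a couple of lines. This is only a restatement of the arithmetic in the text, using the rounded values quoted there:

```python
gamma_1_hat = 0.0193        # coefficient of curb_weight, fitted full model
estimated_bias = -0.01174   # estimated omitted variable bias from the text

# Size of the bias relative to the full-model estimate, in percent.
bias_share_pct = abs(estimated_bias) / gamma_1_hat * 100

# Approximate expected value of the coefficient in the truncated model.
attenuated = gamma_1_hat + estimated_bias

print(round(bias_share_pct, 2))  # 60.83
print(round(attenuated, 5))      # 0.00756
```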

Remember once again that you are working with estimates instead of the actual values of γ_1 and the bias.

Nevertheless, the estimated attenuated value of γ_1_cap (0.00756) is reasonably close to the estimate of 0.01 returned by fitting the truncated model of city_mpg (equation 4) on the training data. I’ve reproduced it below.

The fitted truncated model (Image by Author)

Here are the links to the Python code and the data used for building and training the full and the truncated models and for calculating the Omitted Variable Bias on E(β_1_cap).

Link to the automobile dataset.

By the way, each time you run the code, it will pull a randomly selected set of training data from the overall automobiles dataset. Training the full and truncated models on this training data will lead to slightly different estimated coefficient values. Therefore, each time you run the code, the bias on E(β_1_cap) will also be slightly different. In fact, this illustrates rather nicely why the estimated coefficients are themselves random variables and why they have their own expected values.
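The run-to-run variation described above can be mimicked without the real dataset. In the sketch below, everything is an assumption made for illustration: the data are synthetic, the generating coefficients are made up, and `truncated_slope` is a hypothetical helper. It repeatedly draws a random 80% training subset and refits the truncated, no-intercept model each time:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 160

# Synthetic stand-in data (an assumption), not the automobile dataset.
x1 = rng.uniform(1500, 4000, size=n)             # plays the role of curb_weight
x2 = 0.05 * x1 + rng.normal(scale=15, size=n)    # plays the role of horsepower
y = 0.02 * x1 - 0.25 * x2 + rng.normal(scale=2, size=n)

def truncated_slope(x, y):
    """OLS slope of y on x alone, with no intercept."""
    return (x @ y) / (x @ x)

slopes = []
for _ in range(500):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)  # random 80% subset
    slopes.append(truncated_slope(x1[idx], y[idx]))
slopes = np.array(slopes)

print("mean of the estimates:", slopes.mean())
print("spread (std) of the estimates:", slopes.std())
```

Each resample yields a slightly different coefficient, and the collection of estimates has its own mean and spread, which is the sense in which an estimated coefficient is a random variable.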

Let’s summarize what we learned.

  • Omitted variable bias is caused when one or more important variables are omitted from a regression model.
  • The bias affects the expected values of the estimated coefficients of all non-omitted variables. The bias causes the expected values to become either bigger or smaller than their true population values.
  • Omitted variable bias will make the non-omitted variables look either more important or less important than they actually are in terms of their influence on the response variable of the regression model.
  • The magnitude of the bias on each non-omitted variable is directly proportional to how correlated the non-omitted variable is with the omitted variable(s), and also to how influential the omitted variable(s) are on the response variable of the model. The bias is inversely proportional to how dispersed the non-omitted variable is.
  • In most real-world cases, the direction of the bias is hard to judge without computing it.