Omitted Variable Bias: A Comprehensive Econometrics Review

Introduction

Suppose you want to find out what determines the price of homes in your area. You decide to run a multiple regression of house prices on the factors you think matter: the number of rooms in the house, the number of bathrooms, whether the house is furnished or not, and how old the house is. However, you forget to include a very important variable: the size of the house in square feet. Your regression is then likely to give you biased results, and the reason is simple. Two houses with identical values of the included variables can have drastically different prices if their sizes differ. By leaving out this important variable, your regression suffers from omitted variable bias.

The problem of omitted variables arises from misspecification of a linear regression model, either because the effect of the omitted variable on the dependent variable is unknown or because data on it are not available. Leaving the variable out of your regression results in over-estimating (upward bias) or under-estimating (downward bias) the effect of one or more of the other explanatory variables.

Two conditions must hold for omitted variable bias to exist.

a) The omitted variable must be a determinant of the dependent variable, so that the two are correlated.

b) The omitted variable must be correlated with one or more of the other explanatory/independent variables.

In the example above, the size of the house in square feet is correlated with the price of the house as well as the number of rooms. Hence, omitting the size of house variable results in omitted variable bias.

Understanding Omitted Variable Bias Through Venn Diagrams

Let the dependent variable be Y and the independent variables (the factors that affect Y) be A and B. You may think of Y as your score in the exam, A as your level of presence and attentiveness during class lectures, and B as the number of hours you study. Here, both A and B are important factors that impact Y.

[Figure: Omitted Variable Bias, Grades in Exam Diagram]

Area 1 is the impact of variable A on Y. Area 3 is the impact of variable B on Y. Area 2 is the overlap, the part of the variation in Y that A and B share. Suppose you included A in the regression and omitted variable B. By doing so, you are now estimating the impact of variable A on Y by areas 1 and 2 and not just area 1.

[Figure: Omitted Variable Bias, Grades in Exam Diagram, with variable B omitted]

What can you say about the estimate of the coefficient of variable A in the regression?

a) It is biased, since area 2 is shared by variables A and B but is now attributed entirely to A.

b) Also, since the coefficient of variable A is estimated using both areas 1 and 2, its variance is reduced (for given values of the regressors).

c) Moreover, the unexplained variance of Y (the dependent variable) increases because you have omitted an important variable.

Explaining the Bias

In the above example, suppose you want to see the effect of A (level of presence/attentiveness in class) on Y (scores in the exam).

The population regression equation, where u is the error term, is:

Y={ \beta }_{ 0 }+{ \beta }_{ 1 }A+{ \beta }_{ 2 }B+u

You omitted variable B (hours you study) and instead estimated:

\hat { Y } =\hat { { \beta }_{ 0 } } +\hat { { \beta }_{ 1 } } A

What sign do you expect for \hat { { \beta }_{ 1 } }?

You would typically expect that if you are present and more attentive in class, your scores in the exam would be higher. Hence, \hat { { \beta }_{ 1 } } >0

Pause and think!

You also expect that the more hours you study, the higher your exam scores will be. In addition, you expect that students who are present and attentive in class are, in general, sincere students who spend more hours studying.

So,

a) Y and B are positively correlated.

b) A and B are positively correlated.

Think intuitively!

A higher level of presence/attentiveness improves scores in the exam. But if you are more attentive, you also tend to study more hours. Thus, the coefficient on presence/attentiveness (variable A) may be picking up the effect of studying more hours as well as the effect of presence/attentiveness itself.

Thus, \hat { { \beta }_{ 1 } } suffers from an upward bias.

That is, on average the true { \beta }_{ 1 }<\hat { { \beta }_{ 1 } }. This shows that if \hat { { \beta }_{ 1 } } >0 then it is not necessarily true that { \beta }_{ 1 } >0.
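The upward bias can be illustrated with a small simulation. This is only a sketch under assumed numbers: the true coefficients (2 for A, 3 for B), the strength of the link between A and B (0.5), and the normally distributed noise are all made up for illustration and do not come from real exam data. The helper names `cov`, `slope_short`, and `slopes_long` are likewise hypothetical.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def slope_short(x, y):
    """OLS slope from regressing y on x alone (intercept included)."""
    return cov(x, y) / cov(x, x)

def slopes_long(x1, x2, y):
    """OLS slopes from regressing y on x1 and x2 (intercept included)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Assumed data-generating process (illustrative values only).
A = [rng.gauss(0, 1) for _ in range(n)]          # attentiveness
B = [0.5 * a + rng.gauss(0, 1) for a in A]       # study hours, positively related to A
Y = [1 + 2 * a + 3 * b + rng.gauss(0, 1)         # true beta1 = 2, true beta2 = 3
     for a, b in zip(A, B)]

beta1_short = slope_short(A, Y)                  # B omitted
beta1_long, beta2_long = slopes_long(A, B, Y)    # B included

print(f"B omitted : beta1_hat = {beta1_short:.2f}")
print(f"B included: beta1_hat = {beta1_long:.2f}, beta2_hat = {beta2_long:.2f}")
```

Under these assumptions, the estimate with B omitted should land near 2 + 3 × 0.5 = 3.5 rather than the true value of 2, while the regression that includes B should recover a value close to 2.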

Another Example to Explain the Bias

Look at another example to understand the concept of omitted variable bias better.

You are interested in finding the factors that affect crime.

Assume that Y is the crime rate, W is the education level, and X is the amount of drugs used/consumed.

The population regression equation, where u is the error term, is:

Y={ \beta }_{ 0 }+{ \beta }_{ 1 }W+{ \beta }_{ 2 }X+u

You omitted variable X (amount of drugs) and instead estimated:

\hat { Y } =\hat { { \beta }_{ 0 } } +\hat { { \beta }_{ 1 } } W

What sign do you expect for \hat { { \beta }_{ 1 } }?

You would typically expect that education teaches you about the ills of crime, so the more educated you are, the less likely you are to commit a crime.

Therefore, \hat { { \beta }_{ 1 } } <0

You also expect that the more drugs you consume, the higher your propensity to commit a crime. In addition, you expect that the more education you receive, the fewer drugs you would consume, because education creates awareness.

So,

a) Y and X are positively correlated.

b) W and X are negatively correlated.

Think intuitively!

A higher level of education reduces crime rates or the likelihood of committing a crime. But if you have reached a higher level of education, you are also less likely to consume drugs. Thus, the coefficient on the level of education (variable W) may be picking up the effect of lower drug consumption as well as the effect of education itself.

Thus, \hat { { \beta }_{ 1 } } suffers from a downward bias.

That is, on average the true { \beta }_{ 1 }>\hat { { \beta }_{ 1 } }. This shows that if \hat { { \beta }_{ 1 } } <0 then it is not necessarily true that { \beta }_{ 1 }<0.
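To see why a negative estimate does not guarantee a negative true coefficient, consider a deliberately extreme sketch in which education has no direct effect on crime at all. The true coefficient of zero on W, the effect of 2 for X, and the link of -0.8 between W and X are assumptions chosen purely for illustration, and the variables are in arbitrary standardized units.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20000

# Assumed data-generating process (illustrative values only).
W = [rng.gauss(0, 1) for _ in range(n)]           # education level
X = [-0.8 * w + rng.gauss(0, 1) for w in W]       # drug use, negatively related to W
Y = [5 + 0.0 * w + 2 * x + rng.gauss(0, 1)        # true beta1 = 0, true beta2 = 2
     for w, x in zip(W, X)]

# Regression with X omitted: slope of Y on W alone.
beta1_short = cov(W, Y) / cov(W, W)

# Regression with X included: solve the two-regressor normal equations.
sww, sxx, swx = cov(W, W), cov(X, X), cov(W, X)
swy, sxy = cov(W, Y), cov(X, Y)
beta1_long = (sxx * swy - swx * sxy) / (sww * sxx - swx ** 2)

print(f"X omitted : beta1_hat = {beta1_short:.2f}")
print(f"X included: beta1_hat = {beta1_long:.2f}")
```

In this setup the regression that omits X should report a clearly negative coefficient on education (around 2 × -0.8 = -1.6), even though the assumed true coefficient is zero. The negative sign comes entirely from the omitted variable.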

Go back to the example of estimating prices of houses in your area discussed in the Introduction. Let Y be the prices of houses. Let the price of houses be determined by only two variables in the population – the number of rooms and size of the house. You accidentally omitted the variable size of the house. Can you work out the omitted variable bias intuitively as discussed in the previous examples?

The direction of the omitted variable bias can be summarized in the table below:

Let Y be the dependent variable, A the included independent variable, and B the omitted independent variable.

| | A and B are positively correlated | A and B are negatively correlated |
|---|---|---|
| B has a positive effect on Y | Positive bias | Negative bias |
| B has a negative effect on Y | Negative bias | Positive bias |
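The four cells of the table can be checked with a short simulation. The sketch below is only illustrative: the function `simulated_bias`, the true coefficient of 1 on A, the effect size of 2 for B, and the link of 0.5 between A and B are all assumptions, not values from the article.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def simulated_bias(corr_sign, effect_sign, n=5000, seed=0):
    """Estimated coefficient on A minus its true value when B is omitted.

    corr_sign   : +1 if A and B are positively correlated, -1 if negatively
    effect_sign : +1 if B raises Y, -1 if B lowers Y
    """
    rng = random.Random(seed)
    beta1 = 1.0  # assumed true effect of A on Y
    A = [rng.gauss(0, 1) for _ in range(n)]
    B = [corr_sign * 0.5 * a + rng.gauss(0, 1) for a in A]
    Y = [beta1 * a + effect_sign * 2.0 * b + rng.gauss(0, 1)
         for a, b in zip(A, B)]
    beta1_short = cov(A, Y) / cov(A, A)  # slope of Y on A alone
    return beta1_short - beta1

for effect_sign in (+1, -1):
    for corr_sign in (+1, -1):
        bias = simulated_bias(corr_sign, effect_sign)
        direction = "positive" if bias > 0 else "negative"
        print(f"effect of B: {effect_sign:+d}, corr(A, B): {corr_sign:+d} "
              f"-> {direction} bias ({bias:+.2f})")
```

The sign of the bias is the product of the two signs, which is the pattern the table summarizes.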

Effects of Omitted Variable Bias on Ordinary Least Squares (OLS) Estimation

In studying linear regression models, you will often come across a set of assumptions associated with the Gauss-Markov theorem. This theorem states that if your regression model fulfills all the assumptions of the CLRM (classical linear regression model), then the ordinary least squares (OLS) estimators are BLUE: the best linear unbiased estimators. One of the important assumptions of the CLRM is that the error term must be uncorrelated with your explanatory/independent variables.

What happens to this assumption when you omit an important variable? The omitted variable goes into the error term of the regression equation. For omitted variable bias to exist, you know that one of the conditions is that the omitted variable is correlated with at least one other explanatory variable. So, clearly, your error term and independent variables are no longer uncorrelated. Violation of this assumption of the CLRM causes the OLS estimator to be biased and inconsistent. In the discussion using Venn diagrams, you will also have noted that omitting an important variable causes the unexplained variance of Y (the dependent variable) to increase and the variance of the estimated coefficient to decrease.
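The size of the bias can be sketched algebraically using the exam-score notation (Y, A, and the omitted variable B). The symbols u, \delta_0, \delta_1 and r are introduced here as assumptions for illustration: u is the error term of the full model, assumed uncorrelated with A and B, and the second line is the linear projection of B on A, so r is uncorrelated with A by construction.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 A + \beta_2 B + u \\
B &= \delta_0 + \delta_1 A + r, \qquad \delta_1 = \frac{\operatorname{Cov}(A, B)}{\operatorname{Var}(A)} \\
Y &= (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) A + (\beta_2 r + u) \\
\operatorname{plim} \hat{\beta}_1 &= \beta_1 + \beta_2 \delta_1
\end{aligned}
```

Under these assumptions, the regression of Y on A alone converges to \beta_1 + \beta_2 \delta_1 rather than \beta_1. The bias term \beta_2 \delta_1 vanishes only if B has no effect on Y (\beta_2 = 0) or if A and B are uncorrelated (\delta_1 = 0), which mirrors the two conditions for omitted variable bias, and its sign is the product of the signs of \beta_2 and \delta_1.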

How Serious is the Problem and What Can Be Done?

The problem of omitted variable bias is quite serious because if your estimates are biased and inconsistent, they are not reliable. You have also seen that even the signs of the coefficients can become unreliable, so conclusions drawn from the model can be wrong.

To deal with this problem, if data is available, you can try to include as many relevant variables as you can in the regression. Of course, this has two possible consequences. First, you need to have a sufficient number of data points or else you won’t be able to estimate the regression equation. Second, the problems of including unnecessary variables, such as less precise estimates, will start to arise.

If you think that a variable is important and you can’t omit it because it will cause omitted variable bias, but at the same time you do not have data for it, you can look for proxies or find instrumental variables. For example, in the exam-score example, the omitted variable was the number of hours you study. Suppose you don’t have this data. Then the number of hours one is seen in the playground can be taken as a (negatively related) proxy for the number of hours one studies. However, using proxies and instrumental variables comes with its own set of assumptions and problems, which are quite complicated and not easily met.
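The playground example can be sketched in a simulation. Everything numeric below is an assumption made for illustration: the true coefficients (2 for A, 3 for B), the idea that playground hours P equal minus study hours plus noise, and the noise levels. The helper names are hypothetical.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def slope_short(x, y):
    """OLS slope from regressing y on x alone (intercept included)."""
    return cov(x, y) / cov(x, x)

def slopes_long(x1, x2, y):
    """OLS slopes from regressing y on x1 and x2 (intercept included)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 20000

# Assumed data-generating process (illustrative values only).
A = [rng.gauss(0, 1) for _ in range(n)]        # attentiveness (observed)
B = [0.5 * a + rng.gauss(0, 1) for a in A]     # study hours (treated as unobserved)
P = [-b + rng.gauss(0, 0.5) for b in B]        # playground hours: noisy, negatively related proxy
Y = [1 + 2 * a + 3 * b + rng.gauss(0, 1)       # true beta1 = 2
     for a, b in zip(A, B)]

beta1_no_proxy = slope_short(A, Y)             # B omitted, no proxy
beta1_with_proxy, _ = slopes_long(A, P, Y)     # proxy P used in place of B

print(f"no proxy  : beta1_hat = {beta1_no_proxy:.2f}")
print(f"with proxy: beta1_hat = {beta1_with_proxy:.2f}")
```

In this setup the proxy should pull the estimate much closer to the assumed true value of 2, but not all the way, because P measures study hours with noise. That remaining gap is one example of the assumptions a proxy has to meet.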

Conclusion

To conclude, omitted variable bias is a serious problem. Whenever you are doing research or estimating linear regression models, you should pay close attention to it. In particular, you should ask yourself: what are the possible variables that could impact the dependent variable but are not included in the model? Are those variables likely to be correlated with the independent variables that you have included? What is going to be the sign of such correlations, positive or negative? What bias, upward or downward, can the estimates suffer from? You should then ask what the magnitude of the bias is. Is it strong enough to undermine your regression entirely?

In the long run, as you work with models, you will become clearer on which variables are important and relevant. You saw that omitting a variable introduces bias but can reduce the variance of the estimate. Sometimes you may want to weigh the two and make a trade-off. If you want to decrease variance, the trade-off is increased bias, and if you want to decrease bias, the trade-off is increased variance. When, in the short run, you are not sure of the important factors, always remember that no model is correct. It is all about building a model that is useful and making suitable decisions, suitable assumptions, and suitable trade-offs.

Having read the article above, can you think of a real-life example of a regression equation, list the factors that may impact the dependent variable, and comment on the consequences of any omitted variable bias that the equation may suffer from?

Article written by The Albert.io Team
