How to break regression

(Photo illustration by Pew Research Center and iStock.com/Eshma)

Regression models are a cornerstone of modern social science. They’re at the heart of efforts to estimate causal relationships between variables in a multivariate environment and are the basic building blocks of many machine learning models. Yet social scientists can run into a lot of situations where regression models break.

Famed social psychologist Richard Nisbett recently argued that regression analysis is so misused and misunderstood that analyses based on multiple regression “are often somewhere between meaningless and quite damaging.” (He was mainly talking about cases in which researchers publish correlational results that are covered in the media as causal statements about the world.)

Below, I’ll walk through some of the potential pitfalls you might encounter when you fire up your favorite statistical software package and run regressions. Specifically, I’ll be using simulation in R as an educational tool to help you better understand the ways in which regressions can break.

Using simulations to unpack regression

The idea of using R simulations to help understand regression models was inspired by Ben Ogorek’s post on regression confounders and collider bias.

The great thing about using simulation in this way is that you control the world that generates your data. The code I’ll introduce below represents the true data-generating process, since I’m using R’s random number generators to simulate the data. In real life, of course, we only have the data we observe, and we don’t really know how the data-generating process works unless we have a solid theory (like Newtonian physics or evolution) where the system of relevant variables and causal relationships is well understood and to which there is really no analogous phenomenon in social science.

What I’ll do here is create a dataset based on two random standard normal variables by simulating them using the rnorm() function, which draws random values from a normal distribution with mean 0 and standard deviation 1, unless you specify otherwise. I’ll create a functional relationship between y and x such that a 1 unit increase in x will be associated with a .4 unit increase in y.

1 # make the code reproducible by setting a random number seed
2 set.seed(100)
3
4 # When everything works:
5 N <- 1000
6 x <- rnorm(N)
7 y <- .4 * x + rnorm(N)
8 hist(x)
9 hist(y)
10
11 # Now estimate our model:
12 summary(lm(y ~ x))
13
14 Call:
15 lm(formula = y ~ x)
16 Residuals:
17    Min      1Q  Median      3Q     Max 
18 -3.0348 -0.7013  0.0085  0.6212  3.1688 
19 Coefficients:
20            		Estimate Std. Error t value Pr(>|t|)    
21 (Intercept) 	0.003921   0.031039   0.126    0.899    
22 x           	0.413415   0.030129  13.722   <2e-16 ***
23 ---
24 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
25 Residual standard error: 0.9814 on 998 degrees of freedom
26 Multiple R-squared:  0.1587,	Adjusted R-squared:  0.1579 
27 F-statistic: 188.3 on 1 and 998 DF,  p-value: < 2.2e-16
28
29 # Plot it
30 library(ggplot2)
31 qplot(x, y) +
32  geom_smooth(method='lm') +
33  theme_bw() +
34  ggtitle("The Perfect Regression")

Notice that the model estimates the functional relationship between x and y that I simulated quite well. The plot looks like this:

What about omitted variables? Our machinery actually still works if there is another factor causing y, as long as it is uncorrelated with x.

The dreaded omitted variable bias

Omitted variable bias (OVB) is much feared, and judging by the top internet search results, not well understood. Some top sources say it occurs when “an important” variable is missing or when a variable that “is correlated” with both x and y is missing. I even found a university econometrics course that defined OVB this way.

But neither of those definitions are quite right. OVB occurs when a variable that causes y is missing from the model (and is correlated with x). Let’s call that variable w. Because w is in play when we consider the causal relationship between x and y, it’s often referred to as “endogenous” or a “confounding variable.”

The example below first demonstrates that w, our confounding variable, will bias our results if we fail to include it in our model. The next two examples are essentially a re-telling of the post I mentioned above on collider bias, but emphasizing slightly different points.

1 w <- rnorm(N)
2 x <- .5 * w + rnorm(N)
3 y <- .4 * x + .3 * w + rnorm(N)
4
5 m1 <- lm(y ~ x)
6 summary (m1) # Omitted variable bias
7
8 Call:
9 lm(formula = y ~ x)
10
11 Residuals:
12    Min      1Q  Median      3Q     Max 
13 -3.2190 -0.7025  0.0314  0.7120  3.1158 
14
15 Coefficients:
16            Estimate Std. Error t value Pr(>|t|)    
17 (Intercept)  0.01126    0.03310    0.34    0.734    
18 x            0.50179    0.03049   16.46   <2e-16 ***
19 ---
20 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
21
22 Residual standard error: 1.046 on 998 degrees of freedom
23 Multiple R-squared:  0.2135,	Adjusted R-squared:  0.2127 
24 F-statistic: 270.9 on 1 and 998 DF,  p-value: < 2.2e-16

There it is: classic omitted variable bias. We only observed x, and the influence of the omitted variable w was attributed to x in our model. If you re-rerun the regression with w in the model, you no longer get biased estimates.

1 m2 <- lm(y ~ x + w)
2 summary (m2) # No omitted variable bias after conditioning on w
3
4 Call:
5 lm(formula = y ~ x + w)
6
7 Residuals:
8    Min      1Q  Median      3Q     Max 
9 -3.2748 -0.6632 -0.0001  0.6933  2.9664 
10
11 Coefficients:
12            Estimate Std. Error t value Pr(>|t|)    
13 (Intercept)  0.02841    0.03141   0.905    0.366    
14 x            0.40627    0.03132  12.973   <2e-16 ***
15 w            0.32344    0.03439   9.405   <2e-16 ***
16 ---
17 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
18
19 Residual standard error: 0.9927 on 997 degrees of freedom
20 Multiple R-squared:  0.3024,	Adjusted R-squared:  0.301 
21 F-statistic: 216.1 on 2 and 997 DF,  p-value: < 2.2e-16
22

Note that the regression errors, also known as residuals, are correlated with w:

1 cor(w,m1$residuals)
2 [1] 0.2597859

Now, recall above that I wrote that it’s wrong to say that OVB occurs when our omitted variable is correlated with both x and y. And yet w, x and w and y are all correlated in this first example:

1 cormatrix <- cor(as.matrix(data.frame(x,y,w)))
2 round(matrix, 2)
3
4     x    y    w
5 x 1.00 0.49 0.41
6 y 0.49 1.00 0.43
7 w 0.41 0.43 1.00

So why can’t we just say that OVB occurs when our omitted variable is correlated with both x and y? As the next example will show, correlation isn’t enough — w needs to cause both x and y. We can easily imagine a case in which we don’t have causality but we still see this kind of correlation — when x and y both cause w.

Let’s make this a little more concrete. Suppose we care about the effect of news media consumption (x) on voter turnout (y). One factor that some researchers think may cause both news media consumption and turnout is political interest (w). If we only measure media consumption and voter turnout, political interest is likely to confound our estimates.

But another school of thought from social psychology — along the lines of self-perception theory and cognitive dissonance — suggests that the causality could be reversed: Voting behavior might be mostly determined by other factors, and casting a ballot might prompt us to be more interested in political developments in the future. Similarly, watching the news might prompt us to become moreinterested in politics. Let’s suppose that second school of thought is right. If so, our simulated data will look like this:

1 media_consumption_x <- rnorm(N)
2 voter_turnout_y <- .1 * media_consumption_x + rnorm(N)
3
4 # Political interest increases after consuming media and participating, and, 
5 # in this hypothetical world, does *not* increase media consuption or participation
6 political_interest_w <- 1.2 * media_consumption_x + .6 * voter_turnout_y + rnorm(N)
7
8 cormat <- cor(as.matrix(data.frame(media_consumption_x, voter_turnout_y, political_interest_w)))
9 round(cormat, 2)
10
11                     media_consumption_x voter_turnout_y 
12 political_interest_w
media_consumption_x                 1.00            0.11                 0.70
13 voter_turnout_y                     0.11            1.00                 0.46
14 political_interest_w                0.70            0.46                 1.00

As you can see, all factors are again correlated with each other. But this time, if we only include x (media consumption) and y (turnout) in the equation, we get the correct estimate:

1 summary(lm(voter_turnout_y ~ media_consumption_x))
2
3 Call:
4 lm(formula = voter_turnout_y ~ media_consumption_x)
5
6 Residuals:
7    Min      1Q  Median      3Q     Max 
8 -2.8460 -0.6972 -0.0076  0.6702  3.3925 
9
10 Coefficients:
11                    Estimate Std. Error t value Pr(>|t|)    
12 (Intercept)         -0.01202    0.03217  -0.374 0.708839    
13 media_consumption_x  0.11719    0.03321   3.529 0.000436 ***
14 ---
15 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
16
17 Residual standard error: 1.014 on 998 degrees of freedom
18 Multiple R-squared:  0.01233,	Adjusted R-squared:  0.01134 
19 F-statistic: 12.46 on 1 and 998 DF,  p-value: 0.0004359

What makes defining omitted variable bias based on correlation so dangerous is that if we now include w (political interest), we will get a different kind of bias — what’s called collider bias or endogenous selection bias.

1 summary(lm(voter_turnout_y ~ media_consumption_x +  political_interest_w))
2
3 Call:
4 lm(formula = voter_turnout_y ~ media_consumption_x + political_interest_w)
5
6 Residuals:
7    Min      1Q  Median      3Q     Max 
8 -2.1569 -0.5981 -0.0129  0.5701  2.8356 
9
10 Coefficients:
11                      Estimate Std. Error t value Pr(>|t|)    
12 (Intercept)           0.003155   0.027098   0.116    0.907    
13 media_consumption_x  -0.437084   0.039102 -11.178   <2e-16 ***
14 political_interest_w  0.444571   0.021928  20.274   <2e-16 ***
15 ---
16 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
17 
18 Residual standard error: 0.854 on 997 degrees of freedom
19 Multiple R-squared:  0.3007,	Adjusted R-squared:  0.2993 
20 F-statistic: 214.3 on 2 and 997 DF,  p-value: < 2.2e-16

Simpson’s paradox

Simpson’s paradox often occurs in social science (and medicine, too) when you pool data instead of conditioning it on group membership (i.e., adding it as a factor in your regression model).

Suppose that, all other things being equal, consuming media causes a slight shift in policy preferences toward the left. But, on average, Republicans consume more news than non-Republicans. And we know that generally Republicans have much more right-leaning preferences.

If we just measure media consumption and policy preferences without including Republicans in the model, we’ll actually estimate that the effect goes in the direction opposite of the true causal effect.

1 N <- 1000
2
3 # Let's say that 40% of people in this population are Republicans
4 republican <- rbinom(N, 1, .4)
5
6 # And they consume more media
7 media_consumption <- .75 * republican + rnorm(N)
8
9 # Consuming more media causes a slight leftward shift in policy
10 # preferences, and Republicans have more right-leaning preferences
11 policy_prefs <- -.2 * media_consumption + 2 * republican + rnorm(N)
12
13 # for easier plotting later
14 df <- data.frame(media_consumption, policy_prefs, republican)
15 df$republican = factor(c("non-republican", "republican")[df$republican + 1])
16
17 # If we don't condition on being Republican, we'll actually estimate
18 # that the effect goes in the *opposite* direction
19 summary(lm(policy_prefs ~ media_consumption))
20
21
22 Call:
23 lm(formula = policy_prefs ~ media_consumption)
24
25 Residuals:
26    Min      1Q  Median      3Q     Max 
27 -3.6108 -0.9559 -0.0198  0.9257  3.9537 
28
29 Coefficients:
30                  Estimate Std. Error t value Pr(>|t|)    
31 (Intercept)        0.68923    0.04323   15.94  < 2e-16 ***
32 media_consumption  0.15269    0.03966    3.85 0.000126 ***
33 ---
34 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
35
36 Residual standard error: 1.317 on 998 degrees of freedom
37 Multiple R-squared:  0.01463,	Adjusted R-squared:  0.01365 
38 F-statistic: 14.82 on 1 and 998 DF,  p-value: 0.0001257
39
40 # Naive plot
41 qplot(media_consumption, policy_prefs) +
42   geom_smooth(method='lm') +
43   theme_bw() +
44   ggtitle("Naive estimate (Simpson's Paradox)")

The estimate goes in the opposite direction of the true effect! Here’s what the plot looks like:

To resolve this paradox, we need to add a factor in the model that indicates whether or not a respondent is a Republican. Adding that factor lets us estimate separate slopes for Republicans and non-Republicans. Note that this is not like estimating an interaction term, where two explanatory variables are multiplied together. It’s not that the slopes are different, we just need to estimate separate ones for Republicans and non-Republicans.

1 # Condition on being a Republican to get the right estimates
2 summary(lm(policy_prefs ~ media_consumption + republican))
3
4 Call:
5 lm(formula = policy_prefs ~ media_consumption + republican)
6
7 Residuals:
8    Min      1Q  Median      3Q     Max 
9 -3.5518 -0.6678 -0.0186  0.6562  3.3009 
10
11 Coefficients:
12                  Estimate Std. Error t value Pr(>|t|)    
13 (Intercept)        0.05335    0.03904   1.366    0.172    
14 media_consumption -0.13615    0.03111  -4.376 1.34e-05 ***
15 republican         1.93049    0.06758  28.565  < 2e-16 ***
16 ---
17 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 
18
19 Residual standard error: 0.9774 on 997 degrees of freedom
20 Multiple R-squared:  0.4581,	Adjusted R-squared:  0.457 
21 F-statistic: 421.4 on 2 and 997 DF,  p-value: < 2.2e-16
22 
23 # Conditioning on being Republican
24 qplot(media_consumption, policy_prefs, data=df, colour = republican) +
25   scale_color_manual(values = c("blue","red")) +
26   geom_smooth(method='lm') +
27  theme_bw() +
28  ggtitle("Conditioning on being a Republican (Simpson's Paradox)")

Here’s what the plot looks like:

Correlated errors

Another cardinal sin — and one that we should worry a lot about because it often arises from social desirability bias in survey responses — is the phenomenon of correlated errors. This example is inspired by Vavreck (2007).

Here, self-reported turnout and media consumption are caused by a combination of social desirability bias and true turnout and true consumption, respectively:

1 N <- 1000
2
3 # The "Truth"
4 true_media_consumption <- rnorm(N)
5 true_vote <- .1 * media_consumption + rnorm(N)
6
7 # social desirability bias
8 social_desirability <- rnorm(N)
9 #what we actually observe from self reports:
10 self_report_media_consumption <- true_media_consumption + social_desirability
11 self_report_vote <- true_vote + social_desirability

Let’s compare the estimated effect sizes of the self-reported data and the “true” data:

1 # Self reports
2 summary(lm(self_report_vote ~ self_report_media_consumption))
3
4 Call:
5 lm(formula = self_report_vote ~ self_report_media_consumption)
6
7 Residuals:
8    Min      1Q  Median      3Q     Max 
9 -3.9604 -0.7766  0.0142  0.8465  4.1811 
10
11 Coefficients:
12                              Estimate Std. Error t value 
   Pr(>|t|)    
13 (Intercept)                    0.02020    0.03951   0.511    0.609    
14 self_report_media_consumption  0.54605    0.02716  20.102   <2e-16 ***
15 ---
16 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
17
18 Residual standard error: 1.248 on 998 degrees of freedom
19 Multiple R-squared:  0.2882,	Adjusted R-squared:  0.2875 
20 F-statistic: 404.1 on 1 and 998 DF,  p-value: < 2.2e-16
21
22 # "Truth"
23 summary(lm(true_vote ~ true_media_consumption))
24
25 Call:
26 lm(formula = true_vote ~ true_media_consumption)
27
28 Residuals:
29     Min      1Q  Median      3Q     Max 
30 -3.5814 -0.6677 -0.0077  0.6829  3.4799 
31
32 Coefficients:
33                       Estimate Std. Error t value Pr(>|t|)
34 (Intercept)             0.01372    0.03217   0.426    0.670
35 true_media_consumption  0.01313    0.03245   0.404    0.686
36
37 Residual standard error: 1.017 on 998 degrees of freedom
38 Multiple R-squared:  0.0001639,	Adjusted R-squared:  -0.000838 
39 F-statistic: 0.1636 on 1 and 998 DF,  p-value: 0.686

The self-reported data is biased toward over-estimating the effect size, a very dangerous problem. How could we fix this? Well, one way is to actually measure social desirability and include it in the model:

1 summary(lm(self_report_vote ~ self_report_media_consumption + social_desirability))
2
3 Call:
4 lm(formula = self_report_vote ~ self_report_media_consumption + 
5    social_desirability)
6
7 Residuals:
8     Min      1Q  Median      3Q     Max 
9 -3.6042 -0.6774 -0.0127  0.6899  3.4470 
10
11 Coefficients:
12                               Estimate Std. Error t value Pr(>|t|)    
13 (Intercept)                    0.01208    0.03220   0.375    0.708    
14 self_report_media_consumption  0.01220    0.03246   0.376    0.707    
15 social_desirability            1.02245    0.04547  22.487   <2e-16 ***
16 ---
17 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
18
19 Residual standard error: 1.017 on 997 degrees of freedom
20 Multiple R-squared:  0.5277,	Adjusted R-squared:  0.5268 
21 F-statistic:   557 on 2 and 997 DF,  p-value: < 2.2e-16

Note that this while most people think about social desirability as being a problem related to measurement error, it is essentially the same problem as omitted variable bias, as described above.

It’s important to remember that omitted variable bias and correlated errors are just two potential problems with regression analysis. Regression models are also not immune to issues associated with low levels of statistical power, the failure to account for the influence of extreme values, and heteroskedasticity, among others. But by simulating the data-generating process, researchers can get a good sense of some of the more common ways in which statistical models might depart from reality.

How to break regression

Using simulations to unpack regression

The dreaded omitted variable bias

Simpson’s paradox

Correlated errors

More from Decoded

Charting Congress on Social Media in the 2016 and 2020 Elections

Confronting 2016 and 2020 Polling Limitations

Q&A: After misses in 2016 and 2020, does polling need to be fixed again? What our survey experts say

A voter data resource: Detailed demographic tables about verified voters in 2016, 2018

A majority of Americans continue to favor replacing Electoral College with a nationwide popular vote