1.3 Inference with Linear Regression
To solidify the concepts introduced in the previous sections, we're going to take a long, hard look at how linear regression might be used to estimate the effect of a treatment in an interventional study (either observational or randomized). First we'll motivate the definition of a treatment effect as an estimand. We'll then describe the usual maximum-likelihood version of linear regression as an estimator, which behaves well under the assumptions of the linear model. Next we'll take this estimator outside of its "natural habitat" and show how it can be badly biased in generic observational studies. Lastly, we will show that this estimator is actually asymptotically unbiased in randomized trials, despite being motivated by a generally incorrect parametric model!
The point of this long exercise is to give you an idea of the danger of using "simple" methods without thinking critically about the relevant statistical model and estimand. It should also motivate our search for estimators that behave predictably and optimally across larger classes of problems.
Interventional Data and Average Treatment Effect
Throughout this section (and throughout most of this book) the estimand we'll be interested in is the (statistical) average treatment effect (ATE). Given the joint distribution of the variables $Y, A, X$ with $A \in \{0,1\}$, the average treatment effect is defined as $\psi = E[E[Y|X,A=1] - E[Y|X,A=0]]$. Note that the outer expectation here is over the full marginal distribution of $X$, not $X|A$.
We usually call $Y$ the outcome, $A$ the treatment, and $X$ the covariates. You can imagine, for example, a situation where $X_i$ represents a person's age, sex, and family history of disease prior to intervention, $A_i$ indicates whether that person received a new experimental treatment or a placebo, and $Y_i$ indicates their condition after the treatment. Throughout this book I'll refer to this kind of data structure (where we have covariates, treatment, and an outcome) as interventional data.
The conditional expectation function $\mu_a(x) = E[Y|X=x,A=a]$ can thus be interpreted as the average outcome among people with covariates $X=x$ who were treated with $A=a$. The difference $\mu_1(x) - \mu_0(x)$ describes how much the average outcome increases for people with $X=x$ if they are given the treatment $A=1$ instead of $A=0$. The ATE is just the average of this difference across the entire population (i.e. the average over the distribution of $X$).
One useful way to think of the conditional distribution $P(Y|X=x,A=a)$ is as a set of distributions of $Y$ indexed by $(x,a)$. The conditional mean $\mu_a(x) = E[Y|X=x,A=a]$ is what you get when you connect together the means of each of these distributions with a curve, one curve for each value of $a$.
The ATE averages the difference between these two conditional mean curves over the population, i.e. weighting each value $x$ by the density of $X=x$.
You can see why this quantity is of immense interest. If we can assure ourselves that the treatment can be changed independently of any other unobserved variables that would also affect the outcome, then the ATE tells us exactly how much better off we'd be prescribing the new treatment relative to the placebo in this population. It directly answers the question: "how well does this treatment work?" in terms of the population and outcome in question.
Strictly speaking, what we're talking about here is called the statistical ATE. That distinguishes it from something called the causal ATE which will be part of our investigations in the next chapter. We'll see that the two are the same as long as we can satisfy certain assumptions.
Remember that, as with all estimands, the ATE is a function of the joint distribution of the data. You can see that if you write things more explicitly as $\psi(P) = E_{P_X}\left[ E_{P_{Y|A,X}}[Y|X,A=1] - E_{P_{Y|A,X}}[Y|X,A=0] \right]$. Different distribution, possibly different ATE.
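To make that concrete, here's a minimal sketch of how you could compute $\psi$ directly from a fully-known discrete joint distribution. The toy pmf at the bottom is made up purely for illustration.

```python
# Compute the ATE psi = E_X[ E[Y|X,A=1] - E[Y|X,A=0] ] from an explicit pmf.
from collections import defaultdict

def ate(pmf):
    """pmf maps (x, a, y) -> probability; returns the statistical ATE."""
    p_x = defaultdict(float)    # marginal P(X = x)
    num = defaultdict(float)    # sum_y y * P(X = x, A = a, Y = y)
    den = defaultdict(float)    # P(X = x, A = a)
    for (x, a, y), p in pmf.items():
        p_x[x] += p
        num[(x, a)] += y * p
        den[(x, a)] += p
    mu = lambda x, a: num[(x, a)] / den[(x, a)]   # E[Y | X = x, A = a]
    return sum(p * (mu(x, 1) - mu(x, 0)) for x, p in p_x.items())

# Hypothetical toy distribution, just to have something to evaluate:
toy = {(0, 0, 0): 0.2, (0, 1, 1): 0.2, (1, 0, 1): 0.3, (1, 1, 1): 0.3}
print(ate(toy))  # 0.4: the effect is +1 when X=0 (prob 0.4) and 0 when X=1 (prob 0.6)
```

Different distribution in, (possibly) different number out. That's all an estimand is.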
A Linear Regression Estimator
Now we've defined our general data structure and the estimand, all without having collected any real data at all. But let's say now we go and collect $n$ observations of $(Y_i, A_i, X_i)$. What do we do with that data in order to get an estimate of the ATE?
One very popular approach (but often unjustified, as we will see) is to use a linear regression estimator. The linear regression estimator is defined as the solution
$$\hat\gamma, \hat\beta, \hat\alpha = \arg\min_{\gamma, \beta, \alpha} \sum_{i=1}^n (Y_i - A_i\gamma - X_i^\top\beta - \alpha)^2$$
where we'll take $\hat\psi = \hat\gamma$ to be our estimate of the true, unknown ATE $\psi$. If we abbreviate $Z = [1, X^\top, A]^\top$ and $\theta = [\alpha, \beta^\top, \gamma]^\top$ we can write this compactly as $\hat\theta = \arg\min_\theta \sum_i (Y_i - Z_i^\top\theta)^2$. Taking a derivative with respect to $\theta$ and setting it equal to zero gives the well-known analytical solution $\hat\theta = \left[\sum_i Z_iZ_i^\top\right]^{-1}\sum_i Z_iY_i$. The law of large numbers and limit arithmetic show that $\hat\theta \overset{p}{\to} \theta^* = E[ZZ^\top]^{-1}E[ZY]$, so as the number of samples we collect gets large, our estimate converges to a point $\theta^*$ that depends on the data-generating distribution, whatever it may be.
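In code, this estimator is just a least-squares solve. Here's a quick sketch on simulated data (the data-generating process below is invented purely so there's something to run):

```python
# The linear regression estimator of the ATE: build Z_i = [1, X_i, A_i],
# solve the normal equations, and read off the coefficient on A.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=(n, 2))                      # two covariates
A = rng.binomial(1, 0.5, size=n)                 # binary treatment
Y = 1.0 + 2.0 * A + X @ np.array([0.5, -1.0]) + rng.normal(size=n)

Z = np.column_stack([np.ones(n), X, A])          # rows are Z_i^T
theta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)    # [sum Z_i Z_i^T]^{-1} sum Z_i Y_i
psi_hat = theta_hat[-1]                          # our ATE estimate
print(psi_hat)                                   # close to 2.0 for this simulation
```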
Unless you're already familiar with linear regression, this may look completely arbitrary. So why is this a common approach? Well, let's imagine that our statistical model consists only of parametric distributions that have the following form:
$$\begin{aligned} X &\sim U_X \\ A &\sim U_A \\ Y &= A\gamma + X^\top\beta + \alpha + \mathcal N(0,\sigma^2) \end{aligned}$$
We'll call this statistical model $\mathcal M_{\text{lin-norm}}$ (linear-normal). In other words, we assume that $(A,X)$ has some unspecified joint distribution, but that $Y$ is linearly related to $A$ and $X$ plus some independent, normally distributed "error". Assuming linearity or Gaussian errors is generally ridiculous in the real world, but let's go with it for now so we can motivate our estimator.
Since we have a parametric model, each distribution $P(Y,A,X) \in \mathcal M_{\text{lin-norm}}$ can be represented by the parameters $(\gamma, \beta, \alpha, \sigma)$ of $P(Y|A,X)$ together with the nonparametrically unspecified $P(A,X)$. We can therefore construct the (log-)likelihood function of the parameters and maximize it to find the parameters representing the conditional distribution most likely to have generated the data (leaving $U_X, U_A$ nonparametrically unspecified). When you work out the algebra, the optimization problem comes out looking exactly like what we have above. Computationally, it can be solved analytically or by gradient descent, but that's not our concern here. This is where the linear regression estimator comes from. Moreover, the theory of maximum likelihood estimation ensures that $\hat\gamma$ has a sampling distribution that converges to a normal centered at the true value $\gamma$.
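In case that algebra isn't familiar, here's the gist. Under $\mathcal M_{\text{lin-norm}}$ the conditional log-likelihood of the parameters is

$$\ell(\gamma,\beta,\alpha,\sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^n \left(Y_i - A_i\gamma - X_i^\top\beta - \alpha\right)^2 - n\log\sigma + \text{const},$$

so for any fixed $\sigma$, maximizing over $(\gamma, \beta, \alpha)$ is the same as minimizing the sum of squared residuals above.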
However, this argument only justifies why $\hat\gamma$ is a reasonable estimator of the parameter $\gamma$ that exists in $\mathcal M_{\text{lin-norm}}$, not why it is a reasonable estimator of the ATE $\psi$! The coefficient $\gamma$ is just some number that indexes the distributions within the linear-normal model. Without further justification, it has no inherent meaning.
Within the linear-normal model, however, it turns out that $\gamma$ is in fact the same as the ATE. The proof is straightforward: taking the conditional expectation of $Y$ shows $\mu_a(x) = E[Y|A=a,X=x] = a\gamma + x^\top\beta + \alpha$, so $\mu_1(X) - \mu_0(X) = \gamma$. Taking the expectation over $X$ changes nothing. Thus, for the linear-normal model, we have $\psi = \gamma$.
That justifies linear regression when we know that the true probability distribution is in $\mathcal M_{\text{lin-norm}}$. In reality, however, that is never the case. If $P \notin \mathcal M_{\text{lin-norm}}$, then $\gamma$ doesn't even exist! The estimator $\hat\gamma$ can still be computed from arbitrary interventional data, but what does it mean? What can it mean? These are precisely the questions we now turn to.
Linear Regression in an Observational Study
Let's keep the linear regression estimator, but change our statistical model to represent a generic observational study. In an observational study we follow a group of subjects who are assigned to their treatments by an unknown process and then we observe the outcomes later on. Based on that information alone, we can make no assumptions at all about the structure of the data-generating distribution:
$$\begin{aligned} X &\sim U_X \\ A &\sim U_A \\ Y &= \mu_A(X) + U_Y \end{aligned}$$
The variables $U_X, U_A, U_Y$ could be anything with the right domains (e.g. $U_A \in \{0,1\}$) and could have any arbitrary joint distribution. The only exception is that here we've written $Y$ as the sum of its conditional mean $\mu_A(X) = E[Y|A,X]$ and the exogenous term $U_Y$, which forces $U_Y$ to be mean-zero. This is just a notational convenience, though: we could have just as easily started from an arbitrary joint distribution of $(Y,A,X)$ and then defined $U_Y = Y - E[Y|A,X]$. We call this statistical model $\mathcal M_{\text{obs}}$ (observational). Note that $\mathcal M_{\text{obs}}$ is much "bigger" than $\mathcal M_{\text{lin-norm}}$ because it places fewer assumptions on the joint distribution of $Y, A, X$.
Our question now is: does the linear regression estimator still work to estimate the ATE in this larger statistical model? To answer this we need to look at what $\hat\theta$ converges to as we gather near-infinite amounts of data. We already claimed above that $\hat\theta \overset{p}{\to} \theta^* = E[ZZ^\top]^{-1}E[ZY]$. This result was arrived at without any mention of a specific statistical model, so it holds generically. Therefore the question comes down to establishing whether
$$\theta^* = E[ZZ^\top]^{-1}E[ZY] \overset{?}{=} \left[\begin{array}{c} \vdots \\ \vdots \\ E[\mu_1(X) - \mu_0(X)] \end{array}\right] = \left[\begin{array}{c} \vdots \\ \vdots \\ \psi \end{array}\right]$$
where the dots are entries we don't care about (in the linear model they correspond to the limits of $\hat\alpha$ and $\hat\beta$). If the central equality does not hold, then the estimate we get out of our linear regression estimator will end up near some value that isn't the ATE. If that were the case, it would make no sense to use linear regression to try to estimate the ATE.
Unfortunately, that equality does not hold for $\mathcal M_{\text{obs}}$ in general. It obviously holds inside $\mathcal M_{\text{lin-norm}}$, but the problem is that there are distributions outside of $\mathcal M_{\text{lin-norm}}$ where things go wrong. We can give one such example. Presume the true distribution is specified by this probability mass function:
$$P(X,A,Y) = \begin{cases} 1/12 & (X,A,Y) = (-1,1,1) \\ 1/12 & (X,A,Y) = (-1,0,0) \\ 1/6 & (X,A,Y) = (1,1,1) \\ 1/6 & (X,A,Y) = (1,0,0) \\ 1/6 & (X,A,Y) = (0,1,0) \\ 1/3 & (X,A,Y) = (0,0,1) \\ 0 & \text{otherwise} \end{cases}$$
Here $X$ has three categories and $A$ is binary, with a joint distribution such that they are more likely to be equal in absolute value (i.e. $(|X|, A) = (1,1)$ or $(0,0)$) than not (e.g. $(X,A) = (1,0)$). The outcome $Y$ is 1 if and only if $X$ and $A$ are equal in absolute value. You might imagine this is a scenario where $X_i$ is some kind of U-shaped indicator of the type of treatment someone would benefit from. If $X_i = 1$ or $-1$ then the best treatment is $A_i = 1$, which results in the good outcome $Y_i = 1$, while $A_i = 0$ results in the bad outcome $Y_i = 0$. On the other hand, if $X_i = 0$ then it is $A_i = 0$ that gives the good outcome.
You can crank through the calculations (outlined below) to find that in this scenario the ATE is $\psi = 0$: the treatment helps people with $X = \pm 1$ and harms people with $X = 0$ in exactly offsetting proportions. However, you can also calculate that for this distribution the limit of the linear regression coefficients is $\theta^* = [19/33, -1/33, 1/33]^\top$, so the coefficient on $A$ tends to $\gamma^* = 1/33$. Clearly $1/33 \ne 0$, so we conclude that linear regression is not a consistent estimator of the ATE in a general observational model! Even with infinite data we would get the wrong answer: here, a small apparent benefit of treatment when in truth there is none on average.
Calculations
Since everything is discrete, we can compute every required expectation directly from the table above. For example, $E[Y|X=x,A=a] = \frac{\sum_y y\, P(x,a,y)}{\sum_y P(x,a,y)}$, and unconditional expectations like $E[XY]$ are just sums of the form $\sum_{x,a,y} x\, y\, P(x,a,y)$.
First let's calculate $\psi = E[\mu_1(X) - \mu_0(X)]$. From the table, $\mu_1(-1) = 1$ and $\mu_0(-1) = 0$; $\mu_1(1) = 1$ and $\mu_0(1) = 0$; and $\mu_1(0) = 0$ while $\mu_0(0) = 1$. The marginal distribution of $X$ is $P(X=-1) = 1/6$, $P(X=1) = 1/3$, $P(X=0) = 1/2$. Therefore $\psi = (1/6)(1) + (1/3)(1) + (1/2)(-1) = 0$.
Now we tackle $\theta^* = E[ZZ^\top]^{-1} E[ZY]$, using the same kind of expansion to calculate each expectation in the terms below:
$$E[ZZ^\top] = E\left[\begin{array}{ccc} 1 & X & A \\ X & X^2 & AX \\ A & AX & A^2 \end{array}\right] = \frac{1}{12} \left[\begin{array}{ccc} 12 & 2 & 5 \\ 2 & 6 & 1 \\ 5 & 1 & 5 \end{array}\right] \implies E[ZZ^\top]^{-1} = \frac{2}{33} \left[\begin{array}{rrr} 29 & -5 & -28 \\ -5 & 35 & -2 \\ -28 & -2 & 68 \end{array}\right]$$
$$E[ZY] = E\left[\begin{array}{c} Y \\ XY \\ AY \end{array}\right] = \frac{1}{12} \left[\begin{array}{c} 7 \\ 1 \\ 3 \end{array}\right]$$
And multiplying these two together gives $\theta^* = \frac{1}{33}[19, -1, 1]^\top$, so the limiting coefficient on $A$ is $\gamma^* = 1/33$, not the true ATE $\psi = 0$.
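If you'd rather not take my arithmetic on faith, here's a short script (just a sketch to double-check the numbers) that computes $\psi$ exactly and $\theta^*$ numerically for this distribution:

```python
# Check psi and theta* for the discrete example above.
from fractions import Fraction as F
import numpy as np

pmf = {(-1, 1, 1): F(1, 12), (-1, 0, 0): F(1, 12),
       ( 1, 1, 1): F(1, 6),  ( 1, 0, 0): F(1, 6),
       ( 0, 1, 0): F(1, 6),  ( 0, 0, 1): F(1, 3)}

def mu(x, a):  # E[Y | X = x, A = a]
    num = sum(p * y for (xx, aa, y), p in pmf.items() if (xx, aa) == (x, a))
    den = sum(p for (xx, aa, _), p in pmf.items() if (xx, aa) == (x, a))
    return num / den

p_x = {x: sum(p for (xx, _, _), p in pmf.items() if xx == x) for x in (-1, 0, 1)}
psi = sum(p * (mu(x, 1) - mu(x, 0)) for x, p in p_x.items())
print(psi)  # 0

EZZ, EZY = np.zeros((3, 3)), np.zeros(3)
for (x, a, y), p in pmf.items():
    z = np.array([1.0, x, a])
    EZZ += float(p) * np.outer(z, z)
    EZY += float(p) * z * y
print(np.linalg.solve(EZZ, EZY))  # approx [ 0.576, -0.030,  0.030 ] = [19/33, -1/33, 1/33]
```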
Linear Regression in a Randomized Trial
Now let's work with a generic randomized trial. In a randomized trial we take a group of subjects, flip a coin to decide who gets which treatment, and then observe the outcomes later on. Based on that information alone, we can make no assumptions about the distribution of the covariates or how they affect the outcome (e.g. linearly), nor can we assume how the treatment factors into the outcome. What we can say for sure is that the treatment is independent of the measured covariates and of anything else that could possibly affect the outcome.
$$\begin{aligned} X &\sim U_X \\ A &\sim \text{Bernoulli}(1/2) \\ Y &= \mu_A(X) + U_Y \end{aligned}$$
Here we are assured that $A \perp (X, U_Y)$ and $E[U_Y] = 0$. We'll call this statistical model $\mathcal M_{\text{RCT}}$ (randomized controlled trial). Any distribution of the form above is a special case of our other model $\mathcal M_{\text{obs}}$, so we have $\mathcal M_{\text{RCT}} \subset \mathcal M_{\text{obs}}$. The relationship with $\mathcal M_{\text{lin-norm}}$ is also clear: there is some overlap (when $U_Y$ is normal and $\mu_A(X)$ is linear in $A$ and $X$), but there are also distributions in $\mathcal M_{\text{lin-norm}}$ that aren't in $\mathcal M_{\text{RCT}}$ because the latter imposes a very specific form on the distribution of $A$. The two models therefore intersect, but neither is a subset of the other.
We know the linear regression estimator is consistent in $\mathcal M_{\text{lin-norm}}$. We also know there are distributions in $\mathcal M_{\text{obs}}$ for which it is not. But are any of those distributions in $\mathcal M_{\text{RCT}}$?
To figure this out, all we need to do is calculate the linear regression limit $\theta^* = E[ZZ^\top]^{-1}E[ZY]$ under the assumption that $A \perp (X, U_Y)$, and check whether its last element is indeed the ATE $\psi = E[\mu_1(X) - \mu_0(X)]$.
We'll use one additional simplifying "assumption" here, which is that $E[X] = 0$. This isn't really an assumption because we can always center the observed covariates (call them $\tilde X$), and in the limit the empirical centering is equivalent to subtracting the true mean. This is useful because $E[XA] = E[X]E[A] = 0$ (one of the terms in $E[ZZ^\top]$) follows from the independence $A \perp X$ and $E[X] = 0$. Moreover, $E[XX^\top] = V[X]$. Thus
$$E[ZZ^\top] = E\left[\begin{array}{ccc} 1 & X^\top & A \\ X & XX^\top & AX \\ A & AX^\top & A^2 \end{array}\right] = \left[\begin{array}{ccc} 1 & 0 & 1/2 \\ 0 & V[X] & 0 \\ 1/2 & 0 & 1/2 \end{array}\right] \overset{\text{block inversion}}{\implies} E[ZZ^\top]^{-1} = \left[\begin{array}{rcr} 2 & 0 & -2 \\ 0 & V[X]^{-1} & 0 \\ -2 & 0 & 4 \end{array}\right]$$
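The block inversion here is easy because $X$ is uncorrelated with both the intercept and $A$ (thanks to centering and randomization): the middle block inverts to $V[X]^{-1}$ on its own, and the $2\times 2$ corner for $(1, A)$ inverts as

$$\left[\begin{array}{cc} 1 & 1/2 \\ 1/2 & 1/2 \end{array}\right]^{-1} = \frac{1}{\tfrac12 - \tfrac14}\left[\begin{array}{rr} 1/2 & -1/2 \\ -1/2 & 1 \end{array}\right] = \left[\begin{array}{rr} 2 & -2 \\ -2 & 4 \end{array}\right]$$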
Now we turn our attention to $E[ZY]$. Its entries are $[E[Y], E[XY]^\top, E[AY]]^\top$, but since we're only interested in the last element of the vector $E[ZZ^\top]^{-1}E[ZY]$, and there is a $0$ in the middle position of the last row of $E[ZZ^\top]^{-1}$, the term $E[XY]$ doesn't matter. Thus $\gamma^* = -2E[Y] + 4E[AY]$. Now we use the facts that $E[U_Y] = 0$, that $E[U_Y A] = 0$ (by the independence of $A$ and $U_Y$), and that $A \perp X$ with $P(A=1) = 1/2$:
$$\begin{aligned} E[Y] &= E[\mu_A(X) + U_Y] = E[\mu_A(X)] \\ &= E\left[\tfrac{1}{2}\mu_1(X) + \tfrac{1}{2}\mu_0(X)\right] = \frac{1}{2}E[\mu_1(X)] + \frac{1}{2}E[\mu_0(X)] \end{aligned}$$
$$\begin{aligned} E[AY] &= E[A(\mu_A(X) + U_Y)] = E[A\,\mu_1(X)] \\ &= E[A]\,E[\mu_1(X)] = \frac{1}{2}E[\mu_1(X)] \end{aligned}$$
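Substituting these into $\gamma^* = -2E[Y] + 4E[AY]$ gives

$$\gamma^* = -\big(E[\mu_1(X)] + E[\mu_0(X)]\big) + 2E[\mu_1(X)] = E[\mu_1(X)] - E[\mu_0(X)]$$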
And thus $\gamma^* = E[\mu_1(X) - \mu_0(X)]$, which is exactly $\psi$, the ATE. We have therefore proven that the linear regression coefficient on the treatment is a consistent estimator of the ATE in a randomized trial.
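Here's the same fact as a simulation sketch (the nonlinear outcome model below is invented for illustration): the conditional mean is badly nonlinear and the errors aren't normal, yet the OLS coefficient on $A$ still lands on the ATE because treatment is randomized.

```python
# A simulated randomized trial with a deliberately nonlinear conditional mean.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
X = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)                  # randomized: independent of X and U_Y
mu = lambda a, x: np.sin(3 * x) + a * (1 + x**2)  # mu_1(x) - mu_0(x) = 1 + x^2
Y = mu(A, X) + rng.standard_t(df=5, size=n)       # heavy-tailed, non-normal errors

# True ATE: E[1 + X^2] = 2 when X ~ N(0, 1)
Z = np.column_stack([np.ones(n), X, A])
print(np.linalg.solve(Z.T @ Z, Z.T @ Y)[-1])      # ~ 2.0, despite the misspecified linear model
```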
Conclusion
The takeaway message from all of this is that you need to think critically about the estimand, the estimator, and the statistical model together to be able to claim that your method "works". Linear regression is widely used in the observational study literature because introductory statistics classes fail to emphasize that the maximum likelihood results require totally unrealistic assumptions (linearity, normal errors). In fact, many curricula fail to even mention that generally what is of interest is a parameter that is defined outside of a parametric model (e.g. the ATE $\psi$ vs. some fictitious coefficient $\gamma$).
However, we also showed that, in some cases, linear regression still works to estimate the ATE despite the maximum likelihood assumptions not holding. This justifies the use of linear regression to estimate the ATE in randomized trials, which is also common practice. The point is, however, that this should be surprising rather than expected!
We haven't discussed this, but it turns out that the usual way of estimating the standard error (and thus confidence intervals and p-values) is not consistent outside of the linear-normal model, even in a randomized trial! That's why the sandwich estimator exists.
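If you're fitting these models with statsmodels, robust ("sandwich") standard errors are one keyword away. This is just a sketch: HC1 is one of several robust flavors, and the toy data is invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 10_000
X = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = np.sin(3 * X) + A * (1 + X**2) + rng.normal(size=n)
Z = np.column_stack([np.ones(n), X, A])

print(sm.OLS(Y, Z).fit().bse[-1])                # model-based SE for the A coefficient
print(sm.OLS(Y, Z).fit(cov_type="HC1").bse[-1])  # sandwich ("robust") SE
```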
And what if we change the estimand? If $Y$ is binary, we might consider using logistic regression to estimate a marginal odds ratio instead of the ATE. It's common practice to exponentiate the coefficient from a logistic regression fit and call it an odds ratio representing some effect of treatment. But this, too, is a problem, because even in the relatively safe haven of a randomized trial, one can show that this estimator is consistent for a weird estimand called the conditional odds ratio and not the marginal odds ratio that is most often of interest! Things are even worse in survival settings: the hazard ratio is essentially meaningless without the assumption of proportional hazards.
Use your brain!
The fact that inference is not taught from this perspective continues to create confusion in many fields. The big problem isn't p-hacking, or frequentist vs. Bayesian statistics, or this or that model. It's about not thinking clearly about the question and the assumptions.
Is it sometimes difficult to keep all of this in your head? Sure. But we promise it gets much easier with practice! And it's certainly easier in the long run to understand the goal of your analysis and a framework in which to think than it is to memorize a flowchart of a hundred different methods that may all be wrong anyway. So think about the model, think about the estimand, and make sure your estimator makes sense in that context!