3.3 Deriving EIFs
In the previous section we learned that the canonical gradient is also the influence function of the most efficient RAL estimator (the efficient influence function, EIF). So, if we want to know what that estimator is (so we can use it!), the first step is to derive the canonical gradient of our statistical estimand in our statistical model. Since the point of doing this is to arrive at the EIF, we typically just say we're “deriving the EIF”: it's the same thing either way.
Knowing the relationships between the tangent space, pathwise derivative, etc. etc. in abstract generality doesn't immediately tell us what the EIF is for any specific problem. Basically, we know in theory what it is. But, given a particular statistical model and parameter, how do we actually find it? There are many ways to do this.
Things are typically easier in the fully nonparametric (i.e. saturated) setting. Since there are no restrictions on the tangent space, we can “move” in any direction and that ends up making it easier to derive the EIF. We’ll tackle this case first. There are a few tricks we’ll go over to make this even easier.
When the statistical model is semi-parametric (i.e. non-saturated) things are harder because there are certain directions we can't go without departing the tangent space. That makes the math hairier. Thankfully, due to the factorization of the tangent space, there is a simple-enough method that works in most practical cases.
That said, no one method works for all models and estimands. If you’re doing something very different from what’s already out there you might need new and interesting math!
Before we get to any of that, though, we’ll lay out some generic algebraic tools that will help us build up complicated EIFs from simpler ones.
Gradient Algebra
Sometimes your statistical parameter can be expressed as a fixed function of another statistical parameter, or as a sum of two other statistical parameters, etc. In these cases, it's often easier to find the canonical gradient of the component parameters and then combine them to get the canonical gradient for the original target parameter. Thankfully, it's really easy to do that!
The trick is to realize that the pathwise derivative (for a valid path given by $h$) is just an ordinary one-dimensional derivative of a function of $\epsilon$. In brief: $\nabla_h \psi = \frac{d \psi(\tilde P_\epsilon)}{d \epsilon}$. So we can directly apply all of the relevant results from undergraduate calculus: the chain rule, addition of derivatives, etc.

We'll go through the chain rule as an example. Imagine that you know the canonical gradient for a parameter $\psi$ but what you're really interested in estimating is the quantity $g(\psi)$. Well, check this out:

$$\nabla_h g(\psi(P)) \overset{\text{def.}}{=} \frac{d\, g(\psi(\tilde P_\epsilon))}{d \epsilon} \overset{\text{chain rule}}{=} g'(\psi(P)) \frac{d \psi(\tilde P_\epsilon)}{d \epsilon} \overset{\text{Riesz rep.}}{=} \underbrace{g'(\psi(P))}_{\text{constant}} E[h\phi^\dagger] \overset{\text{linearity}}{=} E\big[h \underbrace{ \phi^\dagger g'(\psi(P)) }_{\phi^\dagger_{g \circ \psi}} \big]$$

This looks intimidating, but each step is something we're already familiar with, so it's just about putting it together. At the end of the day, what we've shown is that the function $g'(\psi(P))\phi^\dagger$ is exactly the Riesz representer for $\nabla_h g(\psi(P))$ and is thus by definition the canonical gradient of $g(\psi)$. Note that multiplying by the constant $g'$ is a linear operation, so we can't have left the tangent space (which hasn't changed) and somehow obtained a non-canonical gradient. We therefore have a simple formula (effectively the chain rule) to compute gradients for functions of parameters.

It helps to think of an “EIF operator” $\Phi$ for a given model that takes a parameter $\psi$ and returns its efficient influence function $\phi$. We can write out a few useful algebra rules this way:

$$\begin{align*} \Phi(g(\psi)) &= g'(\psi)\Phi(\psi) & \text{(chain rule)} \\ \Phi(\psi_1 \psi_2) &= \Phi(\psi_1)\psi_2 + \psi_1\Phi(\psi_2) & \text{(product rule)} \\ \Phi(a_1\psi_1+a_2\psi_2) &= a_1\Phi(\psi_1) + a_2\Phi(\psi_2) & \text{(linearity)} \end{align*}$$

We'll call this “gradient algebra”: a generically useful set of tools that we can use to build up canonical gradients from simpler pieces, the same way we use the chain rule, etc. to build up complex derivatives from simpler ones. We'll see some examples in a little bit.
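These rules are easy to sanity-check numerically. Here's a minimal sketch (the toy distribution, score, and choice $g(\psi) = \psi^2$ are all my own, assuming NumPy is available) that verifies the chain rule for $\psi = E[Z]$: the finite-difference pathwise derivative of $g(\psi)$ along a score $h$ matches $E[h\,\phi^\dagger_{g\circ\psi}]$ with $\phi^\dagger_{g\circ\psi} = 2\psi(Z - E[Z])$.

```python
import numpy as np

# Toy discrete distribution for Z and a valid (zero-mean) score h (hypothetical).
z = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.4, 0.3, 0.2])
h = np.array([1.0, -1.0, 0.5, 0.0])
h = h - np.sum(h * p)                  # enforce E[h] = 0

psi = np.sum(z * p)                    # psi = E[Z]
phi = z - psi                          # EIF of the mean (derived later in this section)

# Target g(psi) = psi^2, so the chain rule predicts Phi(g(psi)) = 2 * psi * phi.
phi_g = 2.0 * psi * phi

# Pathwise derivative of g(psi) along p_eps = (1 + eps*h) p, by finite difference.
eps = 1e-6
psi_eps = np.sum(z * (1.0 + eps * h) * p)
deriv = (psi_eps**2 - psi**2) / eps

# Riesz representation: the derivative should equal E[h * phi_g].
print(deriv, np.sum(h * phi_g * p))    # the two numbers agree up to O(eps)
```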
Saturated Models
Life is generally good when our model is fully nonparametric. There is only one influence function, and it’s the efficient one. All we need to do is to find it.
The material in this section closely follows and condenses Kennedy 2022 and Hines 2021, though the methods have been around for much longer!
Point Mass Contamination
The trick most people use in this setting is called point mass contamination (or sometimes the Gateaux derivative method). The idea is to 1) pretend that all the variables in your model are discrete and then 2) consider a particular kind of path where the “destination” $\tilde P$ is a distribution that places all of its mass at some point $\tilde z$ (I'll use $1_{\tilde z}$ to notate such a distribution). In other words, this distribution enforces $\tilde P(Z=\tilde z) = 1$ and the probability that $Z$ takes any other value is 0. We can therefore think of the path $\tilde P_\epsilon$ for small $\epsilon$ as the distribution $P$ “contaminated” with just a little extra mass at the value $\tilde z$.

Because our model is saturated, we can pick any point $\tilde z$ and get a legal path of this type. If our model were not saturated at $P$, then there would be no guarantee that for some given $\tilde z$ a path like this wouldn't immediately take us outside the model space.

Let the score of such a path towards a distribution $1_{\tilde z}$ be denoted $h_{\tilde z}$. For paths of this type, we can argue that the value of the influence function at the point $\tilde z$ is given by the derivative of $\psi$ at 0 in the direction of the score $h_{\tilde z}$. Or, formally (and dropping tildes): $\phi(z) = \nabla_{h_z} \psi$. This is extremely convenient! All we have to do to get an influence function is brute-force compute the pathwise derivative of our parameter along paths defined by point mass contaminants. And since there's only one influence function, if we find one, we've found the efficient one.

Proof of $\phi(z) = \nabla_{h_{\tilde z}} \psi$

Recall that our definition of the score and path is $\tilde P_\epsilon(A) = \int_A (1+\epsilon h)\, dP$. The left-hand side is equivalent to $\int_A d\tilde P_\epsilon$ and we can exploit our change-of-variables formula to give $\int \phi\, d\tilde P_\epsilon = \int \phi (1+\epsilon h)\, dP$ (as long as $\epsilon < 1$, else we lose absolute continuity). Now $\int \phi (1+\epsilon h)\, dP = \int \phi\, dP + \epsilon \int \phi h\, dP = 0 + \epsilon E[\phi h]$ by the zero-mean property of $\phi$ and linearity of the integral. Thus $\int \phi\, d\tilde P_\epsilon = \epsilon E[\phi h]$. The definition of the path in terms of a convex combination of CDFs implies that $\tilde P_{\epsilon} \rightsquigarrow \tilde P$ in the limit as $\epsilon \rightarrow 1$, so by the portmanteau lemma $\int \phi\, d\tilde P_\epsilon \rightarrow \int \phi\, d\tilde P$. Moreover, clearly $\epsilon E[\phi h] \rightarrow E[\phi h]$ in that same limit. So we have $\int \phi\, d\tilde P = E[\phi h]$.

If $\tilde P = 1_{\tilde z}$ is a point mass at $\tilde z$, then the integral on the left-hand side here is exactly $\phi(\tilde z)$ and $h = h_{\tilde z}$. Thus $E[\phi h_z] = \phi(z)$ (dropping the tildes). Finally, by our central identity for influence functions, $\nabla_{h_z}\psi = E[\phi h_z] = \phi(z)$, as desired.
Why is it called an "influence function"?
When we first presented influence functions we weren't in a position to explain why they have that name, but now we are! In a saturated model, our arguments above show that $\phi(z) = \nabla_{h_z} \psi$ for discrete distributions. Using its definition we can expand that derivative as follows:

$$\phi(z) = \lim_{\epsilon \rightarrow 0} \frac{ \psi\big((1-\epsilon)P + \epsilon 1_z\big) - \psi(P) }{ \epsilon }$$

so what $\phi(z)$ tells us is how much the parameter $\psi$ changes if we add an infinitesimal amount of probability mass at the point $z$. Or, you might say, it is the “influence” that the point $z$ exerts on the parameter (at $P$). That explains the name!
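We can watch this limit converge numerically. In the sketch below (a toy discrete distribution of my own choosing, assuming NumPy), contaminating $P$ with a point mass at each support point and differencing recovers $\phi(z) = z - E[Z]$, the influence function of the mean that we derive in the next example:

```python
import numpy as np

# Toy discrete distribution for Z (hypothetical values).
z_vals = np.array([1.0, 2.0, 5.0])
p = np.array([0.5, 0.3, 0.2])

def psi(prob):
    return float(np.sum(z_vals * prob))          # psi = E[Z]

eps = 1e-7
influences = []
for i in range(len(z_vals)):
    point_mass = np.zeros_like(p); point_mass[i] = 1.0   # the distribution 1_z
    p_eps = (1 - eps) * p + eps * point_mass             # contaminated pmf
    influences.append((psi(p_eps) - psi(p)) / eps)

# Matches phi(z) = z - E[Z] at every support point.
print(np.round(influences, 6), z_vals - psi(p))
```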
We started out by assuming that our data were actually discrete, so we now have to go back and check that the influence function we derived still works in the original model. But this is much easier once we have our candidate: just compute $\nabla_{h} \psi$ and $E[\phi h]$ and check that the two are equal for arbitrary $h$ in the tangent space $\mathcal L_2^0$.
The whole procedure may seem slightly abstract at the moment but we will see examples shortly. You also don’t have to worry about doing this manually most of the time because you can use gradient algebra to build up your result from the examples provided here!
Example: Mean
Consider the estimand $\psi(P) = E[Z]$, where $Z$ can have any distribution $P$ in a nonparametric model. What's the efficient influence function?
Point Mass Contamination
First, start by pretending $Z$ is discrete and only takes values $z$ in some set $\mathcal Z$, with probability mass $p(z)$ at each point. Then the formula for our estimand is $E[Z] = \sum_{z\in\mathcal Z} z\,p(z)$.

Now let's add a tiny little bit ($\epsilon$) of mass at a particular point $\tilde z$ to get a new probability mass function: $\tilde p_\epsilon(z) = (1-\epsilon)p(z) + \epsilon 1_{\tilde z}(z)$. (Conveniently, the total mass is still $(1-\epsilon) + \epsilon = 1$, so this remains a legitimate pmf; in any case, all of this is a heuristic to get us a candidate influence function.) Now let's compute our estimand for this perturbed distribution:

$$\psi(\tilde p_\epsilon) = E_{\tilde p_\epsilon}[Z] = \sum_{z\in\mathcal Z} z\big((1-\epsilon)p(z) + \epsilon 1_{\tilde z}(z)\big)$$

Recall our definition $\nabla_{h} \psi = \left. \frac{d}{d\epsilon} \psi(\tilde p_\epsilon) \right|_0$. All we have to do to compute this is take the derivative w.r.t. $\epsilon$ of the right-hand side above and then set $\epsilon = 0$. We get:

$$\begin{align*} \frac{d}{d\epsilon} \psi(\tilde p_\epsilon) &= \sum_z z\big(1_{\tilde z}(z) - p(z)\big) \\ &= \tilde z - \psi(p) \end{align*}$$

This is typical in EIF derivations of this kind: we end up with a term that is the estimand itself (in this case $\sum_z z\,p(z)$) and some other terms where the indicator function at $\tilde z$ cancels all of the terms in the sum over $z$ except the one at $z=\tilde z$.

Thus we arrive at our candidate EIF:

$$\phi(Z) = Z - E[Z]$$
Checking the Candidate
Now we can check that for any $h \in \mathcal L_2^0$ our central identity $\nabla_h \psi = E[\phi h]$ is satisfied.

From the left:

$$\begin{align*} \nabla_h\psi &= \frac{d}{d\epsilon} \psi\big((1+\epsilon h)p\big) \\ &= \frac{d}{d\epsilon} \int z(1+\epsilon h(z))p(z)\,dz \\ &= \int z h(z) p(z)\,dz \\ &= E[Zh(Z)] \end{align*}$$
More generally…
$$\begin{align*} \nabla_h\psi &= \frac{d}{d\epsilon} \psi(\tilde P_\epsilon) \\ &= \frac{d}{d\epsilon} \int Z\, d\tilde P_\epsilon \\ &= \frac{d}{d\epsilon} \int Z(1+\epsilon h(Z))\, dP \\ &= \int Z h(Z)\, dP \\ &= E[Z h(Z)] \end{align*}$$
And from the right:
$$\begin{align*} E[\phi h] &= E[(Z-E[Z])h(Z)] \\ &= E[Zh(Z)] - E[Z]\underbrace{E[h(Z)]}_{0} \end{align*}$$
Why does having a candidate EIF make this easier?
Consider the derivation above. Working from the result on the left-hand side, we could have noticed that $E[Zh(Z)] = E[Zh(Z)] - E[Z]E[h(Z)]$ because $E[h(Z)]=0$, and then proceeded backwards up the derivation on the right-hand side. That would have gotten us our influence function without ever supposing a candidate.

The problem is that this is a “clever trick” that is really only obvious in retrospect. If we were working from the left-hand side, how would we know to subtract zero in the guise of $E[Z]E[h]$? There are a million other things we might have tried if we didn't have the right intuition from solving a lot of these problems in the past. Moreover, this is a very simple example: in the general case the proof might involve some very unintuitive techniques. Basically, if you don't already know where you're going, it's very hard to get to the right answer.
On the other hand, with a candidate EIF we can just plug-and-chug from the right-hand side and meet up in the middle. No creativity or special intuition required.
Thus we are done and have proved that $\phi(Z) = Z - E[Z]$ is the EIF for $\psi(P) = E[Z]$ in the nonparametric model. You may notice this is exactly the influence function of the sample mean estimator, which shows that the sample mean is nonparametrically efficient.
Example: Conditional Mean
Consider $\psi(P) = E[Y|X=x]$ for some discrete variables $Y$ and $X$ and a given value $x$. As in the previous example, we can express this estimand as $\sum_y y\, p(y|x)$. The steps will be the same as before: define $\tilde p_\epsilon$ as the point-mass contaminated version of $p$, compute the derivative of $\psi(\tilde p_\epsilon)$ w.r.t. $\epsilon$, then set $\epsilon = 0$.

Here we can define

$$\begin{align*} \tilde p_\epsilon(y|x) &= \frac{\tilde p_\epsilon(y,x)}{\tilde p_\epsilon(x)} \\ &= \frac{(1-\epsilon)p(y,x) + \epsilon 1_{\tilde y, \tilde x}(y,x)}{(1-\epsilon)p(x) + \epsilon 1_{\tilde x}(x)} \end{align*}$$

Plugging this in, we can compute $\sum_y y\, \tilde p_\epsilon(y|x)$ and take the derivative. I'll spare you a few lines of algebra (see Kennedy 2022 if you need it) and tell you that what we end up with is

$$\phi(Y,X) = \frac{1_x(X)}{p(X=x)}\big(Y - E[Y|X=x]\big)$$

You can verify for yourself that this influence function still holds if $Y$ is allowed to have an arbitrary (non-discrete) distribution by following the same steps from the example above.

We can also slightly relax the restriction on $X$ to allow arbitrary distributions that have nonzero mass at the point $x$. If there is no mass at $x$, the influence function above is undefined. Indeed, it is known that the conditional mean given a continuous covariate is not a pathwise differentiable estimand, so this makes sense.
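As a quick numerical check of this formula (my own toy joint distribution, assuming NumPy), we can perturb a discrete joint pmf for $(X, Y)$ along an arbitrary zero-mean score $h$ and confirm that the finite-difference pathwise derivative of $E[Y|X=x]$ matches $E[\phi h]$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint pmf p[x, y] for discrete X and Y (hypothetical values).
p = rng.random((2, 3)); p /= p.sum()
y_vals = np.arange(3, dtype=float)
x0 = 0                                          # the conditioning point x

def cond_mean(q):
    """E[Y | X = x0] under the pmf q."""
    return float(np.sum(y_vals * q[x0]) / q[x0].sum())

# Candidate EIF on the grid: 1_x(X)/p(x) * (Y - E[Y|X=x]).
px0 = p[x0].sum()
phi = np.zeros_like(p)
phi[x0] = (y_vals - cond_mean(p)) / px0

# A random zero-mean score h(X, Y).
h = rng.standard_normal(p.shape); h -= np.sum(h * p)

# Check the central identity: pathwise derivative equals E[phi h].
eps = 1e-6
deriv = (cond_mean((1.0 + eps * h) * p) - cond_mean(p)) / eps
print(deriv, np.sum(phi * h * p))               # agree up to O(eps)
```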
Building up Influence Functions
This point-mass contamination strategy works perfectly well for complicated estimands but it can take a bunch of algebra (see, e.g. Hines 2022). Instead of reinventing the wheel every time, it’s often easier to use our gradient algebra tricks to build up a complicated influence function from simpler component parts like those in the examples above.
So far we have EIFs for the mean $P \mapsto E[Z]$:

$$\Phi\big(P\mapsto E[Z]\big) = Z - E[Z]$$

and for the conditional mean $P \mapsto E[Y|X=x]$:

$$\Phi\big(P\mapsto E[Y|X=x]\big) = \frac{1_x(X)}{p(X=x)}\big(Y-E[Y|X=x]\big)$$
Let’s put these together with our gradient algebra rules in an example.
Example: ATE in an Observational Study
Here we're interested in the general nonparametric model for an observational study with an outcome $Y$, binary treatment $A$, and a vector of arbitrary covariates $X$. Our parameter of interest is the statistical ATE

$$\psi = \underbrace{E[\mu_1(X)]}_{\psi_1} - \underbrace{E[\mu_0(X)]}_{\psi_0}$$

where $\mu_a(X) = E[Y|X,A=a]$.

By linearity, we can get the EIF of $\psi$ by taking the difference of the EIFs of $\psi_1$ and $\psi_0$. So let's figure out what $\Phi(\psi_a)$ is, keeping $a$ generic.

First, pretend our variables are discrete so that $\psi_a = \sum_x \mu_a(x)p(x)$. Now use the sum and product rules:

$$\Phi(\psi_a) = \sum_x \Phi\big(\mu_a(x)\big)p(x) + \mu_a(x)\Phi\big(p(x)\big)$$

We have an expression for the influence function of $\mu_a(x)$ because that's just a conditional mean. We also have the influence function of $p(x)$ because we can write $p(X=x) = E[1_x(X)]$. The term $1_x(X)$ is just a particular random variable, so we can directly apply our result for the influence function of a mean. As a result:

$$\begin{align*} \Phi(\psi_a) &= \sum_x \left( \frac{1_{a,x}(A,X)}{p(A=a,X=x)}\big(Y-\mu_a(x)\big) \right) p(x) + \sum_x \mu_a(x)\big(1_x(X) - p(X=x)\big) \\ &= \frac{1_a(A)}{\pi_a(X)}\big(Y-\mu_a(X)\big) + \mu_a(X) - \psi_a \end{align*}$$

In the last line we used the fact that $p(a,x) = p(a|x)p(x)$ and we defined the propensity score $\pi_a(x) = P(A=a|X=x)$. As often happens, we collect a term that turns out to equal our estimand: $\sum_x \mu_a(x)p(x) = \psi_a$.
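Before (or instead of) grinding through the formal check, the candidate can be spot-checked numerically (toy discrete distribution of my own, assuming NumPy): perturb the joint pmf of $(X, A, Y)$ along a random zero-mean score and compare the finite-difference derivative of $\psi_a$ to $E[\phi h]$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy joint pmf p[x, a, y] for discrete X, binary A, discrete Y (hypothetical).
p = rng.random((3, 2, 4)); p /= p.sum()
y_vals = np.arange(4, dtype=float)
a = 1                                              # treatment level of interest

def psi_a(q):
    """g-formula parameter: sum_x E[Y | X=x, A=a] q(x)."""
    qx = q.sum(axis=(1, 2))
    mu = (q[:, a, :] @ y_vals) / q[:, a, :].sum(axis=1)
    return float(np.sum(mu * qx))

# Candidate EIF on the grid: 1_a(A)/pi_a(X) (Y - mu_a(X)) + mu_a(X) - psi_a.
px = p.sum(axis=(1, 2))
pi = p[:, a, :].sum(axis=1) / px                   # propensity pi_a(x)
mu = (p[:, a, :] @ y_vals) / p[:, a, :].sum(axis=1)
psi0 = psi_a(p)
phi = np.zeros_like(p) + (mu - psi0)[:, None, None]
phi[:, a, :] += (y_vals[None, :] - mu[:, None]) / pi[:, None]

# Central identity check along a random zero-mean score h.
h = rng.standard_normal(p.shape); h -= np.sum(h * p)
eps = 1e-6
deriv = (psi_a((1.0 + eps * h) * p) - psi0) / eps
print(deriv, np.sum(phi * h * p))                  # agree up to O(eps)
```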
This influence function holds for discrete data. To check that the candidate works for general distributions, the trick is to “factorize” the generic score as $h = h_{Y|X,A} + h_{A|X} + h_X$ using the tangent space factorization we discussed in the previous section. The details involve some algebra, which you can check at your leisure.
Checking the candidate
We'll check for $\psi_0$ since the proof for $\psi_1$ is the same. From here on we'll omit the subscript and just write $\psi = \psi_0$ for notational brevity. First, we write the directional derivative for $\psi$ and a general $h$:

$$\nabla_{h}\psi = \frac{d}{d\epsilon} \int_x \int_y y\,\tilde p_\epsilon(y|x,0)\,dy \;\; \tilde p_\epsilon(x)\,dx$$

where the factors of $\tilde p_\epsilon$ depend on $h$ (see the section on tangent space factorization on the previous page).

For the covariance of our candidate EIF with a general $h$ we have:

$$\begin{align*} E[\phi h] &= E[(\phi+\psi)h] - \cancel{E[\psi h]} \\ &= \int_x \sum_a \int_y \left[ \frac{1_0(a)}{\pi_0(x)}\big(y-\mu_0(x)\big) + \mu_0(x) \right] h \;\; p(y|x,a)\,dy \;\; p(a|x) \;\; p(x)\,dx \\ &= \int_x \sum_a \int_y \left[ \frac{1_0(a)}{\pi_0(x)}\big(y-\mu_0(x)\big) \right] h \;\; p(y|x,a)\,dy \;\; p(a|x) \;\; p(x)\,dx \\ &\quad + \int_x \sum_a \int_y \mu_0(x)\, h \;\; p(y|x,a)\,dy \;\; p(a|x) \;\; p(x)\,dx \\ &= \int_x \int_y \big(y-\mu_0(x)\big)\, h \;\; p(y|x,0)\,dy \;\; p(x)\,dx + \int_x \mu_0(x) \sum_a \int_y h \;\; p(y|x,a)\,dy \;\; p(a|x) \;\; p(x)\,dx \\ &= E\Big[ E\big[ \big(Y-\mu_0(X)\big)h \,\big|\, A=0, X \big] \Big] + E\big[\mu_0(X)\, E[h|X]\big] \end{align*}$$

Since our distribution factorizes as $p = p_{Y|A,X}\, p_{A|X}\, p_X$, we can write $h = h_{Y|X,A} + h_{A|X} + h_X$ for an arbitrary score. Thus:

$$\begin{align*} \nabla_h\psi &= \nabla_{h_{Y|A,X}}\psi + \nabla_{h_{A|X}}\psi + \nabla_{h_{X}}\psi \\ E[\phi h] &= E[\phi h_{Y|A,X}] + E[\phi h_{A|X}] + E[\phi h_{X}] \end{align*}$$
What we'll do is show that each of the vertically stacked terms is equal to its partner. I'll show this in some detail for the $h_X$ terms, but I'll just give you the result for the $h_{A|X}$ and $h_{Y|A,X}$ terms and leave it to you to verify the algebra by using the special properties of the scores in each of these subspaces.

For the $h_X$ terms:

$$\begin{align*} \nabla_{h_X}\psi &= \frac{d}{d\epsilon} \int_x \int_y y\, p(y|x,0)\,dy \;\; (1+\epsilon h_X)p(x)\,dx \\ &= \int_x \int_y y\, p(y|x,0)\,dy \;\; h_X\, p(x)\,dx \\ &= E[\mu_0 h_X] \end{align*}$$

$$\begin{align*} E[\phi h_X] &= E\left[ E\left[ \big(Y-\mu_0(X)\big) \,\middle|\, A=0, X \right] h_X \right] + E\big[\mu_0(X)\, E[h_X|X]\big] \\ &= E\left[ \big(\mu_0(X)-\mu_0(X)\big)h_X \right] + E\big[\mu_0(X)\, h_X\big] \\ &= E[\mu_0 h_X] \end{align*}$$

For the $h_{A|X}$ terms: verify that $\nabla_{h_{A|X}}\psi = 0 = E[\phi h_{A|X}]$.

For the $h_{Y|A,X}$ terms: verify that $\nabla_{h_{Y|A,X}}\psi = E\big[E[Y h_{Y|A,X}\,|\,A=0,X]\big] = E[\phi h_{Y|A,X}]$.

Since all three pairs of terms are equal, this completes the proof that $\phi$ is an influence function for $\psi_0$. We get the same result for the ATE by linearity of gradients. Since the model is saturated, this influence function is the only one and is therefore the EIF.
Non-Saturated Models
If we have a tangent set that is not all of $\mathcal L_2^0$, things get a bit trickier. Here we'll discuss what's most often called the projection approach.

The idea is to first find an existing estimator that is known to be RAL and figure out its influence function $\phi$. Then, taking advantage of the geometry of the problem, we know that the projection of $\phi$ onto the tangent space $\mathcal T$ is the EIF. In this case, finding the EIF is just a matter of computing a projection in $\mathcal L_2^0$. Note that this approach does not apply if the tangent set is all of $\mathcal L_2^0$: in that case there is only a single valid influence function, so if we had an existing RAL estimator it would already be efficient.

By "projection" what we mean is the mathematical decomposition of an element of a vector space into a sum of an element of a particular subspace and a vector orthogonal to that subspace. The properties of Hilbert space guarantee that every element has a unique projection onto any given closed subspace (see: projection in Hilbert space). In the finite-dimensional vector spaces you're probably used to, calculating projections can be tedious, but it is relatively straightforward vector algebra. In $\mathcal L_2^0$ there isn't a general-purpose formula for an arbitrary subspace. Thankfully, however, there are formulas for the kinds of subspaces we're usually interested in. To wit, we'll consider two specific kinds of subspace. Imagine our data $Z$ has two components $X$ and $Y$ (i.e. $Z = [X,Y]$), so the possible scores are zero-mean functions $h(X,Y)$.

One important subspace is the set of functions with $h(X,Y) = h(X)$, i.e. the functions that depend only on $X$. We'll call this subspace $\mathcal T_X$. We already saw a subspace exactly like this come up in the example in the previous section. Let $h_{\langle \mathcal T\rangle}$ denote the projection of $h$ onto a subspace $\mathcal T$ (other authors prefer notation like $h\,\Pi\,\mathcal T$, etc.). It turns out that:

$$[h(X,Y)_{\langle \mathcal T_X\rangle}](X) = E[h|X]$$
The notation is a little confusing, but what I'm trying to say is that when you project the function of two variables onto this space you get back a function of just one variable.
The other important kind of subspace that we'll consider is $\mathcal T^0_{Y|X} = \{h(X,Y) : E[h|X]=0 \}$. In words: we're talking about the space of functions that have mean 0 when conditioned on any value of $X$. We also saw a subspace like this one come up in the example above. Once again we have a handy formula:

$$h_{\langle \mathcal T^0_{Y|X}\rangle} = h - E[h|X]$$

Here the resulting projection is still a function of both $X$ and $Y$, so I've omitted the explicit notation of the arguments.

You should verify both of these identities by checking the two properties of projections: $h_{\langle \mathcal T \rangle} \in \mathcal T$ and $h - h_{\langle \mathcal T \rangle} \perp \mathcal T$.

We can now combine these two formulas. Imagine that we have data of the form $Z = [X,Y,W]$ and we want to project a function $h(X,Y,W)$ onto the space $\mathcal T^0_{Y|X}$. The functions in this subspace only depend on $X$ and $Y$, so first we need to use our first identity to project $h(X,Y,W)$ down to a function of just $X$ and $Y$. Then we can use the second identity to project the result into the space of functions whose conditional expectation given $X$ is zero. The result is:

$$[h(X,Y,W)_{\langle \mathcal T_{Y|X}^0\rangle}](X,Y) = E[h|X,Y] - E[h|X]$$

This is facilitated by the fact that $E\big[E[h|X,Y]\,\big|\,X\big] = E[h|X]$ (i.e. we average over $W$, then $Y$, so all that's left is $X$).
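Both projection identities are easy to verify numerically on a toy discrete distribution (my own example, assuming NumPy) by checking the two defining properties of a projection:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy joint pmf p[x, y] for discrete X and Y (hypothetical example).
p = rng.random((3, 4)); p /= p.sum()
px = p.sum(axis=1)                               # marginal p(x)

# An arbitrary zero-mean score h(X, Y).
h = rng.standard_normal(p.shape); h -= np.sum(h * p)

# Projection onto T_X: a function of x alone, E[h | X = x].
h_TX = (h * p).sum(axis=1) / px
# Projection onto T0_{Y|X}: h - E[h | X].
h_TYX = h - h_TX[:, None]

# Property 1: the projection lands in the subspace: E[h_TYX | X] = 0 for all x.
print((h_TYX * p).sum(axis=1) / px)              # ~ all zeros

# Property 2: the residual h - h_TX (= h_TYX) is orthogonal to any g(X) in T_X.
g = rng.standard_normal(3); g -= np.sum(g * px)  # arbitrary zero-mean g(X)
print(np.sum(h_TYX * g[:, None] * p))            # ~ 0
```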
Example: ATE in a Randomized Trial
Let's go back to our running example. In the previous section we derived the tangent space for $P \in \mathcal M_{\text{RCT}}$. We'd like to find the EIF of the ATE in this model.
We know that this model space is not saturated because any probability distribution in it has to satisfy the known treatment assignment mechanism. We'll therefore have to use the projection strategy to find the canonical gradient.
Our overall plan is as follows:
Find the canonical gradient for $\psi_a = E[Y|A=a]$:
Start with a known RAL estimator of ψa\psi_a
Derive its influence function
Project that influence function onto the tangent space we found previously (this is the hard part):
Project onto each tangent subspace
Sum the projections
Combine the canonical gradients for ψ0\psi_0 and ψ1\psi_1 to get the equivalent for ψ\psi
First things first: do we know any RAL estimator of $\psi_a$? It turns out that we do: $\hat\psi_{\text{IPW}}(\mathbb P_n) = \frac{1}{n} \sum_i \left[\frac{1_a(A_i)Y_i}{\pi_a(X_i)}\right] = \mathbb P_n \left[\frac{1_a(A)Y}{\pi_a(X)}\right]$, where $\pi_a(X) = P(A=a|X)$ is the known randomization mechanism. This is the inverse-probability-weighted (IPW) estimator of the mean of $Y$ conditioned on $A=a$. For a trial with simple randomization, the corresponding ATE estimator reduces to a difference of means between the two treatment groups.

It only takes algebra to check that this estimator is asymptotically linear: just subtract $\psi_a(P)$ from both sides, multiply by $\sqrt n$, and pass the constant $\psi_a$ through the sum to see that

$$\sqrt n \left( \hat\psi_{\text{IPW}}(\mathbb P_n) -\psi_a(P) \right) = \sqrt n\, \mathbb P_n \left[ \underbrace{ \frac{1_a(A)Y}{\pi_a(X)} - \psi_a(P) }_{ \phi_a } \right]$$

By putting it in this form and comparing to our definition of asymptotic linearity, we've managed to pick out the influence function $\phi_a$. It's also possible to prove that this estimator is regular. One way to do that is to essentially repeat the arguments made in the proof of our main theorem of why influence functions for RAL estimators must satisfy $\nabla_h\psi = E[\phi h]$ for all $h$, except in reverse: for regularity to hold, we need precisely to show that this identity holds for all scores in the tangent space.
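In fact, for this estimator the asymptotic-linearity display above holds exactly in finite samples: the centered IPW estimate is precisely the empirical mean of $\phi_a$. A quick simulation illustrates this (the data-generating process is hypothetical and my own, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Simulated trial (hypothetical DGP) with known simple randomization pi_1 = 0.5.
X = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = X + A + rng.normal(size=n)

psi_hat = np.mean((A == 1) * Y / 0.5)        # IPW estimate of psi_1 = E[Y | A=1]
psi_true = 1.0                               # E[X + 1] = 1 under this DGP

# The centered estimate is exactly the empirical mean of the influence function.
phi = (A == 1) * Y / 0.5 - psi_true
print(psi_hat - psi_true, phi.mean())        # identical up to float rounding
```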
Let's project the influence function $\phi_a$ we just derived onto the tangent space we previously identified.

Our strategy will be to project $\phi_a$ onto each of the component subspaces and then take the sum. Thankfully, these subspaces are exactly of the form we discussed above, so we can use the projection identities we posited there (and which you checked, right?). Computing these, we get that

$$\begin{align*} \phi_{a\langle \mathcal T^0_{Y|X,A} \rangle} &= \left( \frac{1_a(A)}{\pi_a(X)}Y - \psi_a \right) - \left( \frac{1_a(A)}{\pi_a(X)} \mu_a(X) - \psi_a \right) \\ \phi_{a\langle \mathcal T_{X} \rangle} &= \mu_a(X) - \psi_a \end{align*}$$

To obtain these results I've just applied the identities in the section above and introduced the notation $\mu_a(X) = E[Y|A=a, X]$. The rest is just calculating conditional expectations. Please try this yourself and make sure it makes sense to you!

Now we sum to obtain $\phi_a^\dagger(Y,A,X) = \frac{1_a(A)}{\pi_a(X)}\big(Y - \mu_a(X)\big) + \big(\mu_a(X) - \psi_a\big)$.
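As a sanity check that the projection bought us something, we can compare the variances of the IPW influence function $\phi_a$ and the projected gradient $\phi^\dagger_a$ under a toy discrete RCT-style distribution (my own construction, assuming NumPy). Since projecting removes an orthogonal component, it can only reduce variance:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy RCT-style joint pmf p[x, a, y] with known propensity (all hypothetical).
nx, ny = 3, 4
px = rng.random(nx); px /= px.sum()               # covariate marginal p(x)
pi1 = np.full(nx, 0.5)                            # known P(A=1 | X=x)
pa = np.stack([1 - pi1, pi1], axis=1)             # p(a | x), shape (nx, 2)
py = rng.random((nx, 2, ny))
py /= py.sum(axis=2, keepdims=True)               # p(y | x, a)
p = px[:, None, None] * pa[:, :, None] * py

y_vals = np.arange(ny, dtype=float)
a = 1
mu = py[:, a, :] @ y_vals                         # mu_a(x) = E[Y | X=x, A=a]
psi = np.sum(mu * px)                             # psi_a = E[mu_a(X)]

# Evaluate both influence functions on the whole grid.
X, A, Y = np.meshgrid(np.arange(nx), np.arange(2), y_vals, indexing="ij")
ind = (A == a).astype(float)
prop = pa[X, a]                                   # pi_a(X) on the grid

phi_ipw = ind * Y / prop - psi                    # IPW influence function
phi_eff = ind / prop * (Y - mu[X]) + mu[X] - psi  # projected (canonical) gradient

var_ipw = np.sum(phi_ipw**2 * p)
var_eff = np.sum(phi_eff**2 * p)
print(var_eff <= var_ipw)                         # projection only shrinks variance
```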
This proves the perhaps surprising fact that the canonical gradient of the average treatment effect in a randomized trial is the same as in an observational study, at least if one is not willing to make any distributional assumptions about the data-generating mechanism (aside from the known treatment assignment in the RCT).
The important difference you should notice, however, is that this canonical gradient is the only gradient in the nonparametric observational study model, whereas for RCTs there is a whole space of gradients. That has the immediate implication that there are many RAL estimators one could use for an RCT (only one of which is efficient), but there is only one valid RAL estimator (up to asymptotic equivalence) in the observational setting.
Alternative proof exploiting orthogonal decomposition of tangent space
What we'll do instead of finding something to project is actually quite clever and leverages the fact that we've already figured out the canonical gradient for an observational study. Let's start with some facts that we know: (1) we can write any score in $\mathcal M_{\text{obs}}$ as a unique orthogonal sum $h = h_{\langle \mathcal T_{\text{RCT}}\rangle} + h_{\langle \mathcal T^0_{A|X}\rangle}$ since those two tangent subspaces are orthogonal, and (2) $\phi^\dagger_{a,\text{RCT}} \perp h_{\langle \mathcal T^0_{A|X}\rangle}$, because the canonical gradient in the RCT model is in the RCT tangent space, which is orthogonal to $\mathcal T^0_{A|X}$.

Now let's calculate the pathwise derivative of $\psi_a$ in the direction $h = h_{\langle \mathcal T_{\text{RCT}}\rangle} + h_{\langle \mathcal T^0_{A|X}\rangle}$ through the observational study model. Our hope is that we can somehow end up writing the pathwise derivative as an inner product between $h$ and some function, which must then be the canonical gradient.
Because the pathwise derivative is a linear function of the score, we can break it up as follows:
$$\nabla_h \psi_a = \nabla_{h_{\langle \mathcal T_{\text{RCT}}\rangle}} \psi_a + \nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a$$

The first term is nothing but the pathwise derivative of $\psi_a$ in the RCT model, for which we already have the canonical gradient $\phi^\dagger_\text{RCT}$. So we can represent that term as $E[h_{\langle \mathcal T_{\text{RCT}}\rangle}\phi^\dagger_\text{RCT}]$. But by our fact (2), this is equivalent to $E[h \phi^\dagger_\text{RCT}]$: adding $h_{\langle \mathcal T^0_{A|X}\rangle}$ inside the expectation does nothing because it's orthogonal to the RCT canonical gradient. So now we've got

$$\nabla_h \psi_a = E[h \phi^\dagger_\text{RCT}] + \nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a$$
Now we'll brute-force calculate the second term:
hTAX0ψa=limϵ0P~ϵ[YA=a]P[YA=a]ϵ\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a = \lim_{\epsilon\rightarrow 0} \frac{\tilde P_{\epsilon}[Y|A=a] - P[Y|A=a]}{\epsilon}
Let's drop the ϵ\epsilon subscripts for the moment. We also have that
p~=(1+ϵhTAX0(A,X))p=(1+ϵhTAX0(A,X))(pYA,X×pAX×pX)=pYA,Xundefinedp~YA,X×(1+ϵhTAX0(A,X))pAXundefinedp~AX×pXundefinedp~X\begin{align*} \tilde p &= (1+\epsilon h_{\langle \mathcal T^0_{ A|X}\rangle}(A,X))p \\ &= (1+\epsilon h_{\langle \mathcal T^0_{A|X} \rangle}(A,X)) \left(p_{Y|A,X} \times p_{A|X} \times p_{X} \right) \\ &= \underbrace{ p_{Y|A,X} }_{\tilde p_{Y|A,X}} \times \underbrace{ (1+\epsilon h_{\langle \mathcal T^0_{A|X} \rangle}(A,X))p_{A|X} }_{\tilde p_{A|X}} \times \underbrace{ p_{X} }_{\tilde p_X} \end{align*}
Because $h_{\langle\mathcal T_{A|X}^0\rangle}$ is in the tangent space of the densities $p_{A|X}$, the only way to distribute that term and get three legal densities of the form above (that factorize $\tilde p$) is to fold the $h_{\langle\mathcal T_{A|X}^0\rangle}$ term into $\tilde p_{A|X}$. What this shows is that walking along any path in $\mathcal T^0_{A|X}$ doesn't change $p_{Y|A,X}$ or $p_X$: the fluctuated density has the same marginal density for the covariates and the same conditional density for the outcome given treatment and covariates. The only thing walking along paths in $\mathcal T^0_{A|X}$ can change is the treatment assignment mechanism. This should make sense, because these are exactly the paths we are forbidden to walk if we're constrained to the RCT model, in which we must hold the treatment assignment mechanism constant but are allowed to change anything else.
Going back to the expression for the pathwise derivative, we can show using iterated expectations that $\tilde P_{\epsilon}[Y|A=a] = \int \tilde \mu_a(X)\,d\tilde P_X$ where $\tilde \mu_a(X) = \tilde P_\epsilon[Y|A=a,X]$. However, by the argument above, $\tilde \mu_a = \mu_a$ and $d\tilde P_X = dP_X$, so, in fact, $\tilde P_{\epsilon}[Y|A=a] = P[Y|A=a]$. In words, we've shown that moving along paths in $\mathcal T^0_{A|X}$ does nothing to the parameter $\psi_a$. Therefore $\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a = 0$, and we can plug that into our calculation of the pathwise derivative for general $h$:
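This claim is easy to check numerically. Below is a minimal sketch on a hypothetical discrete toy distribution (all the specific numbers are made up for illustration): we fluctuate the joint density with a conditionally mean-zero score $h(A,X)$, i.e. a direction in $\mathcal T^0_{A|X}$, and confirm that only the treatment mechanism moves while the plug-in value of $\psi_a$ stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete toy distribution: X, A, Y all binary.
p_x = np.array([0.4, 0.6])                        # p(X=x)
p_ax = np.array([[0.7, 0.2],                      # p(A=a | X=x); columns sum to 1
                 [0.3, 0.8]])
p_yax = rng.dirichlet([1.0, 1.0], size=(2, 2))    # p(Y=y | A=a, X=x), indexed [a, x, y]

# Joint density p(y, a, x) = p(y|a,x) p(a|x) p(x)
p = np.einsum('axy,ax,x->yax', p_yax, p_ax, p_x)

# A score h(A, X) with E[h(A,X) | X] = 0, i.e. a direction in T^0_{A|X}
h = rng.normal(size=(2, 2))                       # indexed [a, x]
h -= (h * p_ax).sum(axis=0, keepdims=True)        # subtract the conditional mean

# Fluctuate: p~(y,a,x) = (1 + eps*h(a,x)) p(y,a,x)
eps = 1e-3
p_tilde = (1 + eps * h[None, :, :]) * p

# Recover the three factors of p~
pt_x = p_tilde.sum(axis=(0, 1))                             # p~(x)
pt_ax = p_tilde.sum(axis=0) / pt_x[None, :]                 # p~(a|x)
pt_y_given = p_tilde / p_tilde.sum(axis=0, keepdims=True)   # p~(y|a,x)

# Only the treatment mechanism changed:
assert np.allclose(pt_x, p_x)
assert np.allclose(pt_y_given, p / p.sum(axis=0, keepdims=True))
assert not np.allclose(pt_ax, p_ax)

# So the plug-in parameter psi_a = sum_x mu_a(x) p(x) (taking a=1; Y is binary,
# so mu_a(x) = p(Y=1 | A=a, X=x)) doesn't move at all:
psi = (p_x * p_yax[1, :, 1]).sum()
psi_tilde = (pt_x * pt_y_given[1, 1, :]).sum()
assert np.isclose(psi, psi_tilde)
```

The same check with a score that is *not* conditionally mean-zero in $A$ would move $\tilde p_{Y|A,X}$ or $\tilde p_X$ and, in general, the parameter itself.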
$$\nabla_h \psi_a = E[h \phi^\dagger_\text{RCT}] + 0$$
Since we've succeeded in expressing the pathwise derivative of this parameter along any path $h$ as an inner product between $h$ and the function $\phi^\dagger_\text{RCT}$, this is in fact the canonical gradient of the conditional mean $\psi_a$ in the nonparametric observational study model $\mathcal M_{\text{obs}}$.
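We can also sanity-check the final identity $\nabla_h \psi_a = E[h\,\phi^\dagger_\text{RCT}]$ numerically. In the sketch below I assume $\phi^\dagger_\text{RCT}$ takes the familiar AIPW form $\frac{\mathbb 1\{A=a\}}{p(a|X)}(Y - \mu_a(X)) + \mu_a(X) - \psi_a$; that form comes from the earlier RCT derivation, so treat it (along with the made-up toy distribution) as an assumption of this example. For an arbitrary mean-zero score $h$, a finite-difference derivative of $\psi_a$ along the path $(1+\epsilon h)p$ should match the inner product $E[h\phi^\dagger]$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete toy distribution: X, A, Y all binary.
p_x = np.array([0.4, 0.6])
p_ax = np.array([[0.7, 0.2], [0.3, 0.8]])         # p(a|x)
p_yax = rng.dirichlet([1.0, 1.0], size=(2, 2))    # p(y|a,x), indexed [a, x, y]
p = np.einsum('axy,ax,x->yax', p_yax, p_ax, p_x)  # joint p(y, a, x)

def psi(joint):
    """Plug-in psi_a for a = 1: sum_x p(x) * E[Y | A=1, X=x] (Y is binary)."""
    px = joint.sum(axis=(0, 1))
    p_y_given = joint / joint.sum(axis=0, keepdims=True)
    return (px * p_y_given[1, 1, :]).sum()

# An arbitrary mean-zero score h(Y, A, X)
h = rng.normal(size=(2, 2, 2))
h -= (h * p).sum()

# Numerical pathwise derivative along p_eps = (1 + eps*h) p
eps = 1e-5
d_numeric = (psi((1 + eps * h) * p) - psi((1 - eps * h) * p)) / (2 * eps)

# Assumed AIPW-style canonical gradient phi(y, a, x) for a = 1
mu1 = p_yax[1, :, 1]                              # E[Y | A=1, X=x]
psi0 = (p_x * mu1).sum()
y = np.arange(2)[:, None, None]                   # Y axis
a = np.arange(2)[None, :, None]                   # A axis; 1{A=1} equals a here
phi = a / p_ax[1][None, None, :] * (y - mu1[None, None, :]) + mu1[None, None, :] - psi0

# The inner product E[h * phi] matches the finite-difference derivative
d_inner = (h * phi * p).sum()
assert np.isclose(d_numeric, d_inner, atol=1e-6)
```

Because the distribution is discrete, the whole check is exact arithmetic up to the $O(\epsilon^2)$ error of the central difference.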
Other Methods
Combined with gradient algebra, point mass contamination and projection are two very useful strategies for deriving efficient influence functions.
Unfortunately, there are some rare cases in which they don't work. For example, if the tangent space factors, but not orthogonally, it becomes much more difficult to find the projection (though methods for doing so exist). Another common approach is to find the influence function in the full-data (causal) model and then figure out a way to map it to the observed-data (statistical) model (see Tsiatis 2006 for details). Also, in rare cases, the tangent set isn't even a full space (i.e. instead of being a hyperplane, it's some sort of "triangle" within that hyperplane), which can also pose difficulties.
Nuisance Tangent Space
In many texts and resources you'll find mention of something called the "nuisance tangent space". This is a tool that is sometimes helpful in characterizing the set of influence functions, but it is by no means always necessary. Indeed, we don't use it in any of the examples in this chapter. Historically it played a much larger role in efficiency theory, which is why you'll see it mentioned a lot in the literature.
The definition of this space that you'll see most often relies on a semiparametric construction of the model, where every distribution is assumed to be uniquely described by a finite-dimensional vector of parameters $\psi$ that are of interest and some infinite-dimensional vector of parameters $\eta$ that are not of interest. For example, you might consider a model like $Y = A\psi + \mathcal N(\eta_1(X), \eta_2)$. When you have this kind of construction, you can calculate scores for each parameter and define the nuisance tangent space as the completed span of the scores for $\eta$.
I think this construction is somewhat artificial. The definition I prefer is that the nuisance tangent space is the completed span of all scores $h$ such that $\nabla_h \psi = 0$. This more general definition is due to Mark van der Laan. The nuisance tangent space is usually denoted $\mathcal T_\eta$ or $\Lambda$, and it is a subset of $\mathcal T$. It's immediate from our definition that any influence function is orthogonal to every element of $\Lambda$. If we denote the orthogonal complement of $\Lambda$ by $\Lambda^\perp$, then we have that $\phi \in \Lambda^\perp$. Knowing this is sometimes useful in deriving the efficient influence function, but we don't use that fact in any of the examples in this book.
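The orthogonality $\phi \in \Lambda^\perp$ is easy to illustrate numerically under this definition. Continuing the hypothetical toy setup used earlier in this section (binary $X, A, Y$ with made-up probabilities, and the assumed AIPW form of the canonical gradient), a score that only perturbs $p_{A|X}$ satisfies $\nabla_h \psi_a = 0$, so it lies in $\Lambda$, and its inner product with the canonical gradient should vanish:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical discrete toy distribution: X, A, Y all binary.
p_x = np.array([0.4, 0.6])
p_ax = np.array([[0.7, 0.2], [0.3, 0.8]])         # p(a|x)
p_yax = rng.dirichlet([1.0, 1.0], size=(2, 2))    # p(y|a,x), indexed [a, x, y]
p = np.einsum('axy,ax,x->yax', p_yax, p_ax, p_x)  # joint p(y, a, x)

# A nuisance score: h(A, X) with E[h | X] = 0 only moves the treatment
# mechanism, a direction along which (as shown earlier) psi_a does not move.
h = rng.normal(size=(2, 2))                       # indexed [a, x]
h -= (h * p_ax).sum(axis=0, keepdims=True)

# Assumed AIPW-style canonical gradient for psi_a with a = 1
mu1 = p_yax[1, :, 1]                              # E[Y | A=1, X=x]
psi0 = (p_x * mu1).sum()
y = np.arange(2)[:, None, None]
a = np.arange(2)[None, :, None]
phi = a / p_ax[1][None, None, :] * (y - mu1[None, None, :]) + mu1[None, None, :] - psi0

# The inner product E[h * phi] vanishes: phi is orthogonal to Lambda
inner = (h[None, :, :] * phi * p).sum()
assert abs(inner) < 1e-12
```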
None of this is anything you should worry about unless you plan to work on esoteric new parameters and model spaces for which the canonical gradient has not yet been derived. You probably won't come up against anything that requires tools beyond what's in this section, but your mileage may vary!