3.3 Deriving EIFs

Table of Contents

Gradient Algebra

Saturated Models

Point Mass Contamination

Example: Mean

Example: Conditional Mean

Building up Influence Functions

Example: ATE in an Observational Study

Non-Saturated Models

Example: ATE in a Randomized Trial

Other Methods

In the previous section we learned that the canonical gradient is also the influence function of the most efficient RAL estimator (the efficient influence function, EIF). So, if we want to know what that estimator is (so we can use it!), the first step is to derive the canonical gradient of our statistical estimand in our statistical model. Since the point of doing this is to arrive at the EIF, we typically just say we’re “deriving the EIF”; it’s the same thing either way.

Knowing the relationships between the tangent space, pathwise derivative, etc. etc. in abstract generality doesn't immediately tell us what the EIF is for any specific problem. Basically, we know in theory what it is. But, given a particular statistical model and parameter, how do we actually find it? There are many ways to do this.

Things are typically easier in the fully nonparametric (i.e. saturated) setting. Since there are no restrictions on the tangent space, we can “move” in any direction and that ends up making it easier to derive the EIF. We’ll tackle this case first. There are a few tricks we’ll go over to make this even easier.

When the statistical model is semi-parametric (i.e. non-saturated) things are harder because there are certain directions we can’t go without departing the tangent space. That makes the math more hairy. Thankfully, due to the factorization of the tangent space, there is a simple-enough method that works in most practical cases.

That said, no one method works for all models and estimands. If you’re doing something very different from what’s already out there you might need new and interesting math!

Before we get to any of that, though, we’ll lay out some generic algebraic tools that will help us build up complicated EIFs from simpler ones.

Gradient Algebra

Sometimes your statistical parameter can be expressed as a fixed function of another statistical parameter, or as a sum of two other statistical parameters, etc. In these cases, it's often easier to find the canonical gradient of the component parameters and then combine them to get the canonical gradient for the original target parameter. Thankfully, it's really easy to do that! 

The trick is to realize that the pathwise derivative (for a valid path given by $h$) is just an ordinary unidimensional derivative for a function of $\epsilon$. In brief: $\nabla_h \psi = \left.\frac{d \psi(\tilde P_\epsilon)}{d \epsilon}\right|_{\epsilon=0}$. So we can directly apply all of the relevant results from undergraduate calculus: the chain rule, addition of derivatives, etc.

We’ll go through the chain rule as an example. Imagine that you know the canonical gradient for a parameter $\psi$ but what you're really interested in estimating is the quantity $g(\psi)$. Well, check this out:

\nabla_h g(\psi(P)) \overset{\text{def.}}{=} \frac{d g(\psi(\tilde P_\epsilon))}{d \epsilon} \overset{\text{chain rule}}{=} g'(\psi(P)) \frac{d \psi(\tilde P_\epsilon)}{d \epsilon} \overset{\text{Riesz rep.}}{=} \underbrace{g'(\psi(P))}_{\text{constant}} E[h\phi^\dagger] \overset{\text{linearity}}{=} E\big[h \underbrace{ \phi^\dagger g'(\psi(P)) }_{\phi^\dagger_{g \circ \psi}} \big]

This looks intimidating but each step is something we're already familiar with, so it's just about putting it together. At the end of the day, what we've shown is that the function $g'(\psi(P))\phi^\dagger$ is exactly the Riesz representer for $\nabla_h g(\psi(P))$ and is thus by definition the canonical gradient of $g(\psi)$. Note that multiplying by the constant $g'$ is a linear operation so we can't have left the tangent space (which hasn't changed) and somehow obtained a non-canonical gradient. We therefore have a simple formula (effectively the chain rule) to compute gradients for functions of parameters.

It helps to think of an “EIF operator” $\Phi$ for a given model that takes a parameter $\psi$ and returns its efficient influence function $\phi$. We can write out a few useful algebra rules this way:

\begin{align*} \Phi(g(\psi)) &= g'(\psi)\Phi(\psi) & \quad \text{(chain rule)} \\ \Phi(\psi_1 \psi_2) &= \Phi(\psi_1)\psi_2 + \psi_1\Phi(\psi_2) & \quad \text{(product rule)} \\ \Phi(a_1\psi_1+a_2\psi_2) &= a_1\Phi(\psi_1) + a_2\Phi(\psi_2) & \quad \text{(linearity)} \end{align*}

We’ll call this “gradient algebra”: it’s a generically useful set of tools that we can use to build up canonical gradients from simpler pieces, the same way we can use the chain rule, etc. to build up complex derivatives from simpler ones. We’ll see some examples in a little bit.
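The product rule follows from the same argument as the chain rule. Here is a sketch, where $\phi_1^\dagger$ and $\phi_2^\dagger$ are notation introduced only for this illustration and stand for the canonical gradients of $\psi_1$ and $\psi_2$:

```latex
\begin{align*}
\nabla_h (\psi_1\psi_2)
&= \left.\frac{d}{d\epsilon}\,\psi_1(\tilde P_\epsilon)\,\psi_2(\tilde P_\epsilon)\right|_{\epsilon=0} \\
&= \psi_2(P)\,\nabla_h\psi_1 + \psi_1(P)\,\nabla_h\psi_2 \\
&= \psi_2(P)\,E[h\phi_1^\dagger] + \psi_1(P)\,E[h\phi_2^\dagger] \\
&= E\big[h\,\big(\psi_2(P)\,\phi_1^\dagger + \psi_1(P)\,\phi_2^\dagger\big)\big]
\end{align*}
```

The function multiplying $h$ in the last line is a linear combination of two elements of the tangent space, so it stays in the tangent space and is the canonical gradient of the product.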

Saturated Models

Life is generally good when our model is fully nonparametric. There is only one influence function, and it’s the efficient one. All we need to do is to find it. 

The material in this section closely follows and condenses Kennedy 2022 and Hines 2021, though the methods have been around for much longer!

Point Mass Contamination

The trick most people use in this setting is called point mass contamination (or sometimes the Gateaux derivative method). The idea is to 1) pretend that all the variables in your model are discrete and then 2) to consider a particular kind of path where the "destination" $\tilde P$ is a distribution that places all of its mass at some point $\tilde z$ (I'll use $1_{\tilde z}$ to notate such a distribution). In other words, this distribution enforces $\tilde P(Z=\tilde z) = 1$ and the probability that $Z$ takes any other value is 0. We can therefore think of the path $\tilde P_\epsilon$ for small $\epsilon$ as the distribution $P$ "contaminated" with just a little extra mass at the value $\tilde z$.

Because our model is saturated, we can pick any point $\tilde z$ and get a legal path of this type. If our model were not saturated at $P$, then there would be no guarantee that for some given $\tilde z$ a path like this wouldn't immediately take us outside the model space.

Let the score of such a path towards a distribution $1_{\tilde z}$ be denoted $h_{\tilde z}$. For paths of this type, we can argue that the value of the influence function at the point $\tilde z$ is given by the derivative of $\psi$ at 0 in the direction of the score $h_{\tilde z}$. Or, formally (and dropping tildes): $\phi(z) = \nabla_{h_z} \psi$. This is extremely convenient! All we have to do to get an influence function is brute-force compute the pathwise derivative of our parameter along paths defined by point mass contaminants. And since there’s only one influence function, if we find one, we’ve found the efficient one.

Proof of $\phi(z) = \nabla_{h_{\tilde z}} \psi$

Recall that our definition of the score and path is $\tilde P_\epsilon(A) = \int_A (1+\epsilon h) dP$. The left-hand side is equivalent to $\int_A d\tilde P_\epsilon$ and we can exploit our change-of-variables formula to give $\int \phi d\tilde P_\epsilon = \int \phi (1+\epsilon h) dP$ (as long as $\epsilon < 1$, else we lose absolute continuity). Now $\int \phi (1+\epsilon h) dP = \int \phi dP + \epsilon \int \phi h dP = 0 + \epsilon E[\phi h]$ by the zero-mean property of $\phi$ and linearity of the integral. Thus $\int \phi d\tilde P_\epsilon = \epsilon E[\phi h]$. The definition of the path in terms of a convex combination of CDFs implies that $\tilde P_{\epsilon} \rightsquigarrow \tilde P$ in the limit as $\epsilon \rightarrow 1$, so by the portmanteau lemma $\int \phi d\tilde P_\epsilon \rightarrow \int \phi d\tilde P$. Moreover, clearly $\epsilon E[\phi h] \rightarrow E[\phi h]$ in that same limit. So we have $\int \phi d\tilde P = E[\phi h]$.

If $\tilde P = 1_{\tilde z}$ is a point mass at $\tilde z$, then the integral on the left-hand side here is exactly $\phi(\tilde z)$ and $h = h_{\tilde z}$. Thus $E[\phi h_z] = \phi(z)$ (dropping the tildes). Finally, by our central identity for influence functions, $\nabla_{h_z}\psi = E[\phi h_z] = \phi(z)$ as desired.

Why is it called an "influence function"?

When we first presented influence functions we weren't in a position to explain why they have that name, but now we are! In a saturated model, our arguments above show that $\phi(z) = \nabla_{h_z} \psi$ for discrete distributions. Using its definition we can expand that derivative as follows:

\phi(z) = \lim_{\epsilon \rightarrow 0} \frac{ \psi\big((1-\epsilon)P + \epsilon 1_z\big) - \psi(P) }{ \epsilon }

so what $\phi(z)$ tells us is how much the parameter $\psi$ changes if we add an infinitesimal amount of probability mass at the point $z$. Or, you might say, it is the "influence" that the point $z$ exerts on the parameter (at $P$). That explains the name!

We started out by assuming that our data were actually discrete, so we now have to actually go back and check that the influence function we derived still works in the original model. But this is much easier once we have our candidate: just compute $\nabla_{h} \psi$ and $E[\phi h]$ and check the two are equal for arbitrary $h$ in the tangent space $\mathcal L_2^0$.

The whole procedure may seem slightly abstract at the moment but we will see examples shortly. You also don’t have to worry about doing this manually most of the time because you can use gradient algebra to build up your result from the examples provided here!

Example: Mean

Consider the estimand $\psi(P) = E[Z]$ where $Z$ can have any distribution $P$ in a nonparametric model. What’s the efficient influence function?

Point Mass Contamination

First, start by pretending $Z$ is discrete and only takes values $z$ in some set $\mathcal Z$, with probability mass $p(z)$ at each point. Then the formula for our estimand is $E[Z] = \sum_{z\in\mathcal Z} zp(z)$.

Now let’s add a tiny little bit ($\epsilon$) of mass to a particular point $\tilde z$ to get a new probability mass function: $\tilde p_\epsilon(z) = (1-\epsilon)p(z) + \epsilon1_{\tilde z}(z)$. This is still a valid probability mass function, since the probabilities sum to $(1-\epsilon) + \epsilon = 1$. Now let’s compute our estimand for this perturbed distribution:

\psi(\tilde p_\epsilon) = E_{\tilde p_\epsilon}[Z] = \sum_{z\in\mathcal Z} z((1-\epsilon)p(z) + \epsilon1_{\tilde z}(z))

Recall our definition $\nabla_{h} \psi = \left. \frac{d}{d\epsilon} \psi(\tilde p_\epsilon) \right|_0$. All we have to do to compute this is take the derivative w.r.t. $\epsilon$ of the right-hand side above and then set $\epsilon = 0$. We get:

\begin{align*} \frac{d}{d\epsilon} \psi(\tilde p_\epsilon) &= \sum_z z(1_{\tilde z}(z) - p(z)) \\ &= \tilde z - \psi(p) \\ \end{align*}

This is typical in EIF derivations of this kind: we end up with a term that is the estimand itself (in this case $\sum zp(z)$) and some other terms where the indicator function at $\tilde z$ cancels out all of the terms in the sum over $z$ except for the term at $z=\tilde z$.

Thus we arrive at our candidate EIF:

\phi(Z) = Z - E[Z]
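The recipe above can also be checked numerically. The following is a minimal sketch, assuming a made-up three-point distribution; the helper names `contaminate` and `gateaux_derivative` are hypothetical and exist only for this illustration. It approximates the derivative in $\epsilon$ by a finite difference and compares it with the candidate $\tilde z - E[Z]$.

```python
# Sketch: finite-difference check of point-mass contamination for the mean.
# The pmf below is a made-up example, not taken from the text.

def mean(pmf):
    """psi(p) = sum_z z * p(z) for a pmf stored as {value: probability}."""
    return sum(z * p for z, p in pmf.items())

def contaminate(pmf, z_tilde, eps):
    """Return the mixture (1 - eps) * p + eps * (point mass at z_tilde)."""
    mixed = {z: (1 - eps) * p for z, p in pmf.items()}
    mixed[z_tilde] = mixed.get(z_tilde, 0.0) + eps
    return mixed

def gateaux_derivative(param, pmf, z_tilde, eps=1e-6):
    """One-sided finite-difference approximation to d/d(eps) param(p_eps) at eps = 0."""
    return (param(contaminate(pmf, z_tilde, eps)) - param(pmf)) / eps

pmf = {-1.0: 0.25, 0.5: 0.5, 4.0: 0.25}
psi = mean(pmf)

# Compare the numerical derivative with the candidate EIF z - E[Z] at each support point.
comparison = {z: (gateaux_derivative(mean, pmf, z), z - psi) for z in pmf}
for z, (numeric, candidate) in comparison.items():
    print(f"z = {z:5.2f}   numeric = {numeric:8.5f}   z - E[Z] = {candidate:8.5f}")
```

Because the mean is linear in $\epsilon$, the finite difference agrees with the candidate up to floating-point error. For nonlinear estimands the agreement is only approximate and improves as `eps` shrinks.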

Checking the Candidate

Now we can check that for any $h \in \mathcal L_2^0$ our central identity $\nabla_h \psi = E[\phi h]$ is satisfied:

From the left:

\begin{align*} \nabla_h\psi &= \frac{d}{d\epsilon} \psi((1+\epsilon h)p) \\ &= \frac{d}{d\epsilon} \int z(1+\epsilon h(z))p(z)dz \\ &= \int z h(z) p(z)dz \\ &= E[Zh(Z)] \end{align*}

More generally…

\begin{align*} \nabla_h\psi &= \frac{d}{d\epsilon} \psi(\tilde P_\epsilon) \\ &= \frac{d}{d\epsilon} \int Z d\tilde P_\epsilon \\ &= \frac{d}{d\epsilon} \int Z(1+\epsilon h(Z)) dP \\ &= \int Z h(Z) dP \\ &= E[ Z h(Z)] \end{align*}

And from the right:

\begin{align*} E[\phi h] &= E[(Z-E[Z])h(Z)] \\&= E[Zh(Z)]-E[Z]\underbrace{E[h(Z)]}_0 \end{align*}

Why does having a candidate EIF make this easier?

Consider the derivation above. Working from the result on the left-hand side we could have noticed that $E[Zh(Z)] = E[Zh(Z)]-E[Z]E[h(Z)]$ because $E[h(Z)]=0$ and then proceeded backwards up the derivation on the right-hand side. That would have gotten us our influence function without ever supposing a candidate.

The problem is that this is a “clever trick” that is really only obvious in retrospect. If we were working from the left-hand side, how would we know to subtract zero in the guise of $E[Z]E[h]$? There are a million other things we might have tried if we didn’t have the right intuition from solving a lot of these problems in the past. Moreover this is a very simple example; in the general case the proof might involve some very unintuitive techniques. Basically: if you don’t already know where you’re going, it’s very hard to get to the right answer.

On the other hand, with a candidate EIF we can just plug-and-chug from the right-hand side and meet up in the middle. No creativity or special intuition required.

Thus we are done and have proved that $\phi(Z) = Z - E[Z]$ is the EIF for $\psi(P) = E[Z]$ in the nonparametric model. You may notice this is exactly the influence function for the sample mean estimator, which shows it is nonparametrically efficient.

Example: Conditional Mean

Consider $\psi(P) = E[Y|X=x]$ for some discrete variables $Y$ and $X$ and a given value $x$. As in the previous example, we can express this estimand as $\sum_y y p(y|x)$. The steps will be the same as before: define $\tilde p_\epsilon$ as the point-mass contaminated version of $p$, compute the derivative of $\psi(\tilde p_\epsilon)$ w.r.t. $\epsilon$, then set $\epsilon = 0$.

Here we can define

\begin{align*} \tilde p_\epsilon(y|x) &= \frac{\tilde p_\epsilon(y,x)}{\tilde p_\epsilon(x)} \\&= \frac {(1-\epsilon)p(y,x) + \epsilon 1_{\tilde y, \tilde x}(y,x)} {(1-\epsilon)p(x) + \epsilon 1_{\tilde x}(x)} \end{align*}

Plugging this in, we can compute $\sum_y y \tilde p_\epsilon(y|x)$ and take the derivative. I’ll spare you a few lines of algebra (see Kennedy 2022 if you need it) and tell you that what we end up with is

\phi(Y,X) = \frac{1_x(X)}{p(X=x)}(Y-E[Y|X=x])
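For readers who want the intermediate algebra, here is one way to fill it in using the quotient rule. The shorthand $N(\epsilon)$, $D(\epsilon)$ and $\mu = E[Y|X=x]$ is introduced only for this sketch:

```latex
\begin{align*}
\psi(\tilde p_\epsilon) &= \frac{N(\epsilon)}{D(\epsilon)}, \qquad
N(\epsilon) = \sum_y y\big[(1-\epsilon)p(y,x) + \epsilon 1_{\tilde y,\tilde x}(y,x)\big], \qquad
D(\epsilon) = (1-\epsilon)p(x) + \epsilon 1_{\tilde x}(x) \\
N(0) &= \mu\,p(x), \qquad N'(0) = \tilde y\,1_{\tilde x}(x) - \mu\,p(x), \qquad
D(0) = p(x), \qquad D'(0) = 1_{\tilde x}(x) - p(x) \\
\left.\frac{d}{d\epsilon}\psi(\tilde p_\epsilon)\right|_{0}
&= \frac{N'(0)D(0) - N(0)D'(0)}{D(0)^2} \\
&= \frac{\big(\tilde y\,1_{\tilde x}(x) - \mu\,p(x)\big)p(x) - \mu\,p(x)\big(1_{\tilde x}(x) - p(x)\big)}{p(x)^2} \\
&= \frac{1_{\tilde x}(x)}{p(x)}\big(\tilde y - \mu\big)
\end{align*}
```

Replacing the contamination point $(\tilde y, \tilde x)$ by the random variables $(Y, X)$ gives the expression above, since $1_{\tilde x}(x) = 1_x(\tilde x)$.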

You can verify for yourself that this influence function still holds if $Y$ is allowed to have an arbitrary (non-discrete) distribution by following the same steps from the example above.

We can also slightly relax the restriction on $X$ to allow arbitrary distributions that have nonzero mass at the point $x$. If there is no mass at $x$, the above influence function is undefined. Indeed it is known that the general conditional mean is not a pathwise differentiable estimand so this makes sense.

Building up Influence Functions

This point-mass contamination strategy works perfectly well for complicated estimands but it can take a bunch of algebra (see, e.g. Hines 2022). Instead of reinventing the wheel every time, it’s often easier to use our gradient algebra tricks to build up a complicated influence function from simpler component parts like those in the examples above. 

So far we have EIFs for the mean $P(Z) \mapsto E[Z]$

\begin{align*} \Phi\big(P\mapsto E[Z]\big) &= Z -E[Z] \end{align*}

and for the conditional mean $P(Y,X) \mapsto E[Y|X=x]$:

\begin{align*} \Phi\big(P\mapsto E[Y|X=x]\big) &= \frac{1_x(X)}{p(X=x)}(Y-E[Y|X=x]) \end{align*}
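As a small warm-up before the main example, the rules can be tried on the variance, written as $E[Z^2] - (E[Z])^2$. The mean rule applied to the random variable $Z^2$, the chain rule with $g(u) = u^2$, and linearity together suggest $(Z - E[Z])^2 - \mathrm{Var}(Z)$. The sketch below compares that with a finite-difference point-mass contamination derivative on a made-up distribution; the variance example and all helper names are illustrative assumptions, not something taken from the text.

```python
# Sketch: gradient algebra for Var(Z) = E[Z^2] - (E[Z])^2 on a made-up pmf.

def mean(pmf):
    return sum(z * p for z, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((z - m) ** 2 * p for z, p in pmf.items())

def contaminate(pmf, z_tilde, eps):
    """Return the mixture (1 - eps) * p + eps * (point mass at z_tilde)."""
    mixed = {z: (1 - eps) * p for z, p in pmf.items()}
    mixed[z_tilde] = mixed.get(z_tilde, 0.0) + eps
    return mixed

pmf = {0.0: 0.2, 1.0: 0.5, 3.0: 0.3}
m = mean(pmf)
v = variance(pmf)
second_moment = sum(z ** 2 * p for z, p in pmf.items())

def eif_variance(z):
    eif_second_moment = z ** 2 - second_moment   # mean rule applied to Z^2
    eif_squared_mean = 2 * m * (z - m)           # chain rule with g(u) = u^2
    return eif_second_moment - eif_squared_mean  # linearity

eps = 1e-6
results = {
    z: ((variance(contaminate(pmf, z, eps)) - v) / eps, eif_variance(z))
    for z in pmf
}
for z, (numeric, algebra) in results.items():
    print(f"z = {z:4.1f}   numeric = {numeric:9.5f}   gradient algebra = {algebra:9.5f}")
```

The variance is quadratic in $\epsilon$, so the two columns agree only up to an error of order `eps`.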

Let’s put these together with our gradient algebra rules in an example.

Example: ATE in an Observational Study

Here we're interested in the general nonparametric model for an observational study with an outcome $Y$, binary treatment $A$, and vector of arbitrary covariates $X$. Our parameter of interest is the statistical ATE

\psi = \underbrace{E[\mu_1(X)]}_{\psi_1} - \underbrace{E[\mu_0(X)]}_{\psi_0}

where $\mu_a(X) = E[Y|X,A=a]$.

By linearity, we can get the EIF of $\psi$ by taking the difference of the EIFs of $\psi_1$ and $\psi_0$. So let’s figure out what $\Phi(\psi_a)$ is, keeping $a$ generic.

First, pretend our variables are discrete so $\psi_a = \sum_x \mu_a(x)p(x)$. Now use the sum and product rules:

\Phi(\psi_a) = \sum_x \Phi\big(\mu_a(x)\big)p(x) + \mu_a(x)\Phi\big(p(x)\big)

We have an expression for the influence function of $\mu_a(x)$ because that’s just a conditional mean. We also have the influence function of $p(x)$ because we can write $p(X=x) = E[1_x(X)]$. The term $1_x(X)$ is just a particular random variable so we can directly apply our result for the influence function of a mean. As a result:

\begin{align*} \Phi(\psi_a) &= \sum_x \left( \frac{1_{a,x}(A,X)}{p(A=a,X=x)}(Y-\mu_a(x)) \right) p(x) \\ &\quad + \sum_x \mu_a(x)\big(1_x(X) - p(X=x)\big) \\ &= \frac{1_a(A)}{\pi_a(X)}\big(Y-\mu_a(X)\big) + \mu_a(X) - \psi_a \end{align*}

In the last line we used the fact that $p(a,x) = p(a|x)p(x)$ and we defined the propensity score $\pi_a(x) = P(A=a|X=x)$. As often happens, we collect a term that turns out to be equal to our estimand: $\sum_x \mu_a(x) p(x) = \psi_a$.
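For discrete data the result can be sanity-checked by brute force. The sketch below builds a made-up joint distribution over binary $(Y, A, X)$ (the weights and helper names are assumptions for illustration only), contaminates it with a point mass at each cell, and compares the finite-difference derivative of $\psi_a$ with the closed-form expression above.

```python
from itertools import product

# Made-up joint pmf over cells (y, a, x) with strictly positive mass everywhere.
cells = list(product((0.0, 1.0), (0, 1), (0, 1)))
raw = [1, 2, 3, 4, 5, 6, 7, 8]
pmf = {cell: w / sum(raw) for cell, w in zip(cells, raw)}

def prob(pmf, a=None, x=None):
    """Marginal probability of {A = a, X = x}; None means 'sum over that variable'."""
    return sum(p for (_, aa, xx), p in pmf.items()
               if (a is None or aa == a) and (x is None or xx == x))

def mu(pmf, a, x):
    """Outcome regression E[Y | A = a, X = x]."""
    num = sum(y * p for (y, aa, xx), p in pmf.items() if aa == a and xx == x)
    return num / prob(pmf, a=a, x=x)

def psi(pmf, a):
    """psi_a = sum_x mu_a(x) p(x)."""
    xs = sorted({x for (_, _, x) in pmf})
    return sum(mu(pmf, a, x) * prob(pmf, x=x) for x in xs)

def eif(pmf, a, point):
    """Closed-form candidate: 1{A=a}/pi_a(X) * (Y - mu_a(X)) + mu_a(X) - psi_a."""
    y, a_obs, x = point
    propensity = prob(pmf, a=a, x=x) / prob(pmf, x=x)
    return float(a_obs == a) / propensity * (y - mu(pmf, a, x)) + mu(pmf, a, x) - psi(pmf, a)

def contaminate(pmf, point, eps):
    """Return the mixture (1 - eps) * p + eps * (point mass at `point`)."""
    mixed = {cell: (1 - eps) * p for cell, p in pmf.items()}
    mixed[point] += eps
    return mixed

eps = 1e-7
checks = {}
for a in (0, 1):
    for point in pmf:
        numeric = (psi(contaminate(pmf, point, eps), a) - psi(pmf, a)) / eps
        checks[(a, point)] = (numeric, eif(pmf, a, point))

worst = max(abs(n - c) for n, c in checks.values())
print(f"largest discrepancy across all cells and both arms: {worst:.2e}")
```

The discrepancy is of order `eps` because $\psi_a$ is a nonlinear function of the contaminated distribution.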

This influence function holds for discrete data. To check that the candidate works for general distributions, the trick is to “factorize” the generic score $h = h_{Y|X,A} + h_{A|X} + h_X$ using the tangent space factorization we discussed in the previous section. The details involve some algebra, which you can check at your leisure.

Checking the candidate

We’ll check for $\psi_0$ since the proof for $\psi_1$ is the same. From here on we’ll omit the subscript and just write $\psi = \psi_0$ for notational brevity. First, we write the directional derivative for $\psi$ and a general $h$

\begin{align*} \nabla_{h}\psi &= \frac{d}{d\epsilon} \int_x \int_y y\tilde p_\epsilon(y|x,0) dy \ \ \tilde p_\epsilon(x)dx \\ \end{align*}

where the factors of $\tilde p_\epsilon$ will depend on $h$ (see the section on tangent space factorization on the previous page).

For the covariance of our candidate EIF with general $h$ we have:

\begin{align*} E[\phi h] &= E[(\phi+\psi)h] - \cancel{E[\psi h]} \\ &= \int_x \sum_a \int_y \left[ \frac{1_0(a)}{\pi_0(x)}\big(y-\mu_0(x)\big) + \mu_0(x) \right] h \ \ p(y|x,a)dy \ \ p(a|x) \ \ p(x)dx \\ &= \int_x \sum_a \int_y \left[ \frac{1_0(a)}{\pi_0(x)}\big(y-\mu_0(x)\big) \right] h \ \ p(y|x,a)dy \ \ p(a|x) \ \ p(x)dx \\ &+ \int_x \sum_a \int_y \mu_0(x) h \ \ p(y|x,a)dy \ \ p(a|x) \ \ p(x)dx \\ &= \int_x \int_y \left[ \big(y-\mu_0(x)\big) \right] h \ \ p(y|x,0)dy \ \ p(x)dx \\ &+ \int_x \mu_0(x)\sum_a \int_y h \ \ p(y|x,a)dy \ \ p(a|x) \ \ p(x)dx \\ &= E\left[E\left[\left.\big(Y-\mu_0(X)\big)h\ \right| A=0, X\right]\right] + E\big[\mu_0(X) E[h|X]\big] \end{align*}

Since our distribution factorizes as $p = p_{Y|A,X}p_{A|X}p_X$, we can write $h = h_{Y|X,A} + h_{A|X} + h_X$ for an arbitrary score. Thus:

\begin{align*} \nabla_h\psi &= &\nabla_{h_{Y|A,X}}\psi&\quad +& &\nabla_{h_{A|X}}\psi& +& &\nabla_{h_{X}}\psi \\ E[\phi h] &= &E[\phi h_{Y|A,X}]&\quad +& &E[\phi h_{A|X}]& +& &E[\phi h_{X}] \end{align*}

What we’ll do is show that each pair of vertically stacked terms is equal. I’ll show you in some detail for the $h_X$ terms but I’ll just give you the result for the $h_{A|X}$ and $h_{Y|A,X}$ terms and leave it to you to verify the algebra by using the special properties of the scores in each of these subspaces.

For the $h_X$ terms:

\begin{align*} \nabla_{h_X}\psi &= \frac{d}{d\epsilon} \int_x \int_y y p(y|x,0) dy \ \ (1+\epsilon h_X)p(x)dx \\ &= \int_x \int_y y p(y|x,0) dy \ \ h_Xp(x)dx \\ &= E[\mu_0h_X] \end{align*}

\begin{align*} E[\phi h_X] &= E\left[E\left[\left.\big(Y-\mu_0(X)\big)\ \right| A=0, X\right]h_X\right] + E\big[\mu_0(X) E[h_X|X]\big] \\ &= E\left[\big(\mu_0(X)-\mu_0(X)\big)h_X\right] + E\big[\mu_0(X) h_X\big] \\ &= E[\mu_0 h_X] \end{align*}

For the $h_{A|X}$ terms: verify that $\nabla_{h_{A|X}}\psi = 0 = E[\phi h_{A|X}]$

For the $h_{Y|A,X}$ terms: verify that $\nabla_{h_{Y|A,X}}\psi = E\big[E[Yh_{Y|A,X}|A=0,X]\big] = E[\phi h_{Y|A,X}]$

Since all three sets of terms are equal, this completes the proof that $\phi$ is an influence function for $\psi_0$. Thus we have the same result for the ATE because of gradient summation. Since the model is saturated, this influence function is the only one and is therefore the EIF.

Non-Saturated Models

If we have a tangent set that is not all of $\mathcal L_2^0$ things get a bit trickier. Here we’ll discuss what’s most often called the projection approach.

The idea is to first find an existing estimator that is known to be RAL and figure out its influence function $\phi$. Then, taking advantage of the geometry of the problem, we know that the projection of $\phi$ onto the tangent space $\mathcal T$ is the EIF. In this case, finding the EIF is just a matter of computing a projection in $\mathcal L_2^0$. Note that this approach does not apply if the tangent set is $\mathcal L_2^0$ because in that case there is only a single valid influence function. Therefore if we had an existing RAL estimator, it would already be efficient.

By "projection" what we mean is the mathematical decomposition of an element of a vector space into a sum of an element of a particular subspace and a vector orthogonal to that subspace. The properties of Hilbert space guarantee that every element has a unique projection onto any given closed subspace (see: projection in Hilbert space). In the finite-dimensional vector spaces you're probably used to, calculating projections can be tedious, but it is relatively straightforward vector algebra. In $\mathcal L_2^0$ there isn't a general purpose formula given an arbitrary subspace. Thankfully, however, there are a few for the kinds of subspaces we're usually interested in. To wit, we'll consider two kinds of specific subspaces. Imagine our data $Z$ has two components $X$ and $Y$ (i.e. $Z = [X,Y]$) so the possible scores are zero-mean functions $h(X,Y)$.

One important subspace is the set of functions where $h(X,Y) = h(X)$, i.e. the functions that just depend on $X$. We'll call this subspace $\mathcal T_X$. We already saw a subspace exactly like this come up in the example in the previous section. Let $h_{\langle \mathcal T\rangle}$ denote the projection of $h$ onto a subspace $\mathcal T$ (other authors prefer notation like $h \Pi \mathcal T$, etc.). It turns out that:

[h(X,Y)_{\langle \mathcal T_X\rangle}](X) = E[h|X]

The notation is a little confusing, but what I'm trying to say is that when you project the function of two variables onto this space you get back a function of just one variable.

The other important kind of subspace that we'll consider is $\mathcal T^0_{Y|X} = \{h(X,Y) : E[h|X]=0 \}$. In words: we're talking about the space of functions that have mean 0 when conditioned on any value of $X$. We also saw a subspace like this one come up in the example above. Once again we have a handy formula:

h_{\langle \mathcal T^0_{Y|X}\rangle} = h - E[h|X]

Here the resulting projection is still a function of both $X$ and $Y$ so I've omitted the explicit notation of the arguments.

You should verify both of these identities by checking the two properties of projections: $h_{\langle \mathcal T \rangle} \in \mathcal T$ and $h - h_{\langle \mathcal T \rangle} \perp \mathcal T$.
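A numerical version of that verification can be sketched on a small discrete distribution. Everything specific below (the weights, the function `raw_h`, the test function `k`) is made up for illustration; expectations are computed exactly by summing over the support, so each check should come out as zero up to floating-point error.

```python
from itertools import product

XS, YS = (0, 1, 2), (0, 1)
raw = {(x, y): 1 + x + 2 * y + x * y for x, y in product(XS, YS)}
total = sum(raw.values())
pmf = {cell: w / total for cell, w in raw.items()}

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in pmf.items())

def cond_mean_given_x(f):
    """Table x -> E[f(X, Y) | X = x]."""
    table = {}
    for x in XS:
        p_x = sum(pmf[(x, y)] for y in YS)
        table[x] = sum(f(x, y) * pmf[(x, y)] for y in YS) / p_x
    return table

def raw_h(x, y):
    return (x - 1) ** 2 + 3 * y - x * y

h_bar = expect(raw_h)

def h(x, y):
    """An arbitrary zero-mean function of (X, Y)."""
    return raw_h(x, y) - h_bar

e_h_given_x = cond_mean_given_x(h)

def proj_tx(x, y):
    """Claimed projection onto T_X: E[h | X]."""
    return e_h_given_x[x]

def proj_ty_given_x(x, y):
    """Claimed projection onto T^0_{Y|X}: h - E[h | X]."""
    return h(x, y) - e_h_given_x[x]

checks = {}
# E[h | X] depends on x alone by construction; it should also have mean zero.
checks["mean of E[h|X]"] = expect(proj_tx)
# The residual h - E[h | X] should be orthogonal to every function of X.
# Indicators of each value of x span those functions on a finite support.
for x0 in XS:
    checks[f"residual vs 1[X={x0}]"] = expect(
        lambda x, y: (h(x, y) - proj_tx(x, y)) * (x == x0))
# h - E[h | X] should have conditional mean zero given each value of X.
for x0, value in cond_mean_given_x(proj_ty_given_x).items():
    checks[f"E[h - E[h|X] | X={x0}]"] = value
# Its residual, E[h | X], should be orthogonal to an element of T^0_{Y|X}.
def k(x, y):
    return y * (x + 1) ** 2

e_k = cond_mean_given_x(k)
checks["E[h|X] vs element of T0_{Y|X}"] = expect(
    lambda x, y: proj_tx(x, y) * (k(x, y) - e_k[x]))

for name, value in checks.items():
    print(f"{name:32s} {value: .2e}")
```

This only spot-checks the orthogonality conditions for one choice of `h` and `k`; it is a sanity check, not a proof.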

We can now combine these two formulas. Imagine that we have data of the form $Z = [X,Y,W]$ and we want to project a function $h(X,Y,W)$ to the space $\mathcal T^0_{Y|X}$. The functions in this subspace only depend on $X$ and $Y$ so first we need to use our first identity to project $h(X,Y,W)$ down to a function of just $X$ and $Y$. Then we can use the second identity to project the result into the space of functions where the conditional expectation on $X$ is zero. The result is:

[h(X,Y,W)_{\langle \mathcal T_{Y|X}^0\rangle}](X,Y) = E[h|X,Y] - E[h|X]

This is facilitated by the fact that $E[E[h|X,Y]|X] = E[h|X]$ (i.e. we average over $W$, then $Y$, so all that's left is $X$).

Example: ATE in a Randomized Trial

Let's go back to our running example. In the previous section we derived the tangent space for $P \in \mathcal M_{\text{RCT}}$. We'd like to find the EIF of the ATE in this model.

We know that this model space is not saturated because any probability distribution in it has to satisfy the known treatment assignment mechanism. We'll therefore have to use the projection strategy to find the canonical gradient.

Our overall plan is as follows:

1. Find the canonical gradient for $\psi_a = E[Y|A=a]$:
   a. Start with a known RAL estimator of $\psi_a$
   b. Derive its influence function
   c. Project that influence function onto the tangent space we found previously (this is the hard part):
      - Project onto each tangent subspace
      - Sum the projections
2. Combine the canonical gradients for $\psi_0$ and $\psi_1$ to get the equivalent for $\psi$

First things first: do we know any RAL estimator of $\psi_a$? It turns out that we do: $\hat\psi_{\text{IPW}}(\mathbb P_n) = \frac{1}{n} \sum_i \left[\frac{1_a(A_i)Y_i}{\pi_a(X_i)}\right] = \mathbb P_n \left[\frac{1_a(A)Y}{\pi_a(X)}\right]$ where $\pi_a(X) = P(A=a|X)$ is the known randomization mechanism. This is the inverse-probability-weighted (IPW) estimator of the mean of $Y$ conditioned on $A=a$. For a trial with simple randomization this reduces to a difference of means in the two treatment groups.

It only takes algebra to check that this estimator is asymptotically linear: just subtract $\psi_a(P)$ from both sides, multiply by $\sqrt n$, and pass the constant $\psi_a$ through the sum to see that

\sqrt n \left( \hat\psi_{\text{IPW}}(\mathbb P_n) -\psi_a(P) \right) = \sqrt n \mathbb P_n \left[ \underbrace{ \frac{1_a(A)Y}{\pi_a(X)} - \psi_a(P) }_{ \phi_a } \right]

By putting it in this form and comparing to our definition of asymptotic linearity we've managed to pick out the influence function $\phi_a$. It's also possible to prove that this estimator is regular. One way to do that is to essentially repeat the arguments made in the proof of our main theorem of why influence functions for RAL estimators must satisfy $\nabla_h\psi = E[\phi h]$ for all $h$, except in reverse: for regularity to hold, we need precisely to show that this identity holds for all scores in the tangent space.
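To make the IPW estimator and its influence function concrete, here is a small simulation sketch. The data-generating process (a binary covariate, a covariate-dependent but known randomization probability, and a Gaussian outcome) is entirely made up for illustration. The standard error uses the plug-in influence function values $\phi_a(Z_i)$ with $\psi_a$ replaced by its estimate.

```python
import math
import random

random.seed(0)

# Hypothetical known randomization mechanism P(A = 1 | X = x).
PROPENSITY = {0: 0.5, 1: 0.3}

def simulate(n):
    """Draw n observations (y, a, x) from a made-up randomized trial."""
    data = []
    for _ in range(n):
        x = 1 if random.random() < 0.5 else 0
        a = 1 if random.random() < PROPENSITY[x] else 0
        y = 1.0 + 2.0 * a + x + random.gauss(0.0, 1.0)
        data.append((y, a, x))
    return data

def ipw_terms(data, a):
    """Per-observation terms 1{A_i = a} * Y_i / pi_a(X_i)."""
    terms = []
    for y, a_obs, x in data:
        pi_a = PROPENSITY[x] if a == 1 else 1.0 - PROPENSITY[x]
        terms.append((a_obs == a) * y / pi_a)
    return terms

data = simulate(20000)
terms = ipw_terms(data, a=1)
estimate = sum(terms) / len(terms)

# Plug-in influence function values: phi_a(Z_i) = term_i - estimate.
phi = [t - estimate for t in terms]
std_error = math.sqrt(sum(v * v for v in phi) / len(phi)) / math.sqrt(len(phi))

# Under this simulation, mu_1(0) = 3 and mu_1(1) = 4, each covariate value with probability 1/2.
true_psi_1 = 0.5 * 3.0 + 0.5 * 4.0

print(f"IPW estimate of psi_1: {estimate:.3f}")
print(f"influence-function standard error: {std_error:.3f}")
print(f"value implied by the simulation design: {true_psi_1:.3f}")
```

An approximate 95% interval is `estimate ± 1.96 * std_error`, which is the usual way an influence function is turned into inference.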

Let's project the influence function $\phi_a$ we just derived above into the tangent space we previously identified:

\mathcal T_{\text{RCT}} = \mathcal T^0_{Y|X,A} \oplus \mathcal T_X

Our strategy will be to project $\phi_a$ into each of these component subspaces and then take the sum. Thankfully, these subspaces are exactly of the form we discussed above, so we can use the projection identities we posited above (and which you checked, right?). Computing these, we get that

\begin{align*} \phi_{a\langle \mathcal T^0_{Y|X,A} \rangle} &= \left( \frac{1_a(A)}{\pi_a(X)}Y - \psi_a \right) - \left( \frac{1_a(A)}{\pi_a(X)} \mu_a(X) - \psi_a \right) \\ \phi_{a\langle \mathcal T_{X} \rangle} &= \mu_a(X) - \psi_a \end{align*}

To obtain these results I've just applied the identities in the section above and introduced the notation $\mu_a(X) = E[Y|A=a, X]$. The rest is just calculating conditional expectations. Please try this yourself and make sure it makes sense to you!

Now we sum to obtain $\phi_a^\dagger(Y,A,X) = \frac{1_a(A)}{\pi_a(X)}(Y - \mu_a(X)) + (\mu_a(X) - \psi_a)$.

This proves the perhaps surprising fact that the canonical gradient for the average treatment effect in an observational study is the same as it is in a randomized trial if one is not willing to make any distributional assumptions about the data-generating mechanisms (aside from the treatment assignment in the RCT). 

The important difference you should notice, however, is that this canonical gradient is the only gradient in the nonparametric observational study model, whereas for RCTs there is a whole space of gradients. That has the immediate implication that there are many RAL estimators one could use for an RCT (only one of which is efficient), but there is only one valid RAL estimator (up to asymptotic equivalence) in the observational setting. 
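The gap between "a gradient" and "the canonical gradient" in the RCT model can be seen numerically. The sketch below uses a made-up discrete trial (all probabilities are illustrative assumptions) and computes exact expectations to compare the IPW influence function with its projection. The difference between the two should be orthogonal to the projection, and the projection should have the smaller variance.

```python
from itertools import product

# Made-up discrete randomized trial; every number here is an assumption.
P_X = {0: 0.4, 1: 0.6}
PI_1 = {0: 0.5, 1: 0.3}            # known P(A = 1 | X = x)
P_Y = {                            # P(Y = y | A = a, X = x) for y in (0, 1, 2), keyed by (a, x)
    (0, 0): (0.5, 0.3, 0.2),
    (0, 1): (0.2, 0.5, 0.3),
    (1, 0): (0.3, 0.3, 0.4),
    (1, 1): (0.1, 0.3, 0.6),
}
Y_VALUES = (0, 1, 2)

pmf = {}
for x, a, y in product((0, 1), (0, 1), Y_VALUES):
    p_a = PI_1[x] if a == 1 else 1.0 - PI_1[x]
    pmf[(y, a, x)] = P_X[x] * p_a * P_Y[(a, x)][y]

# Target: the treated arm (a = 1).
mu_1 = {x: sum(y * P_Y[(1, x)][y] for y in Y_VALUES) for x in (0, 1)}
psi_1 = sum(mu_1[x] * P_X[x] for x in (0, 1))

def phi_ipw(y, a, x):
    """Influence function of the IPW estimator."""
    return (a == 1) * y / PI_1[x] - psi_1

def phi_eif(y, a, x):
    """Projection of phi_ipw onto the RCT tangent space (the canonical gradient)."""
    return (a == 1) / PI_1[x] * (y - mu_1[x]) + mu_1[x] - psi_1

def expect(f):
    return sum(f(*cell) * p for cell, p in pmf.items())

mean_ipw = expect(phi_ipw)
mean_eif = expect(phi_eif)
var_ipw = expect(lambda y, a, x: phi_ipw(y, a, x) ** 2)
var_eif = expect(lambda y, a, x: phi_eif(y, a, x) ** 2)
cross = expect(lambda y, a, x: (phi_ipw(y, a, x) - phi_eif(y, a, x)) * phi_eif(y, a, x))
gap = expect(lambda y, a, x: (phi_ipw(y, a, x) - phi_eif(y, a, x)) ** 2)

print(f"variance of IPW influence function: {var_ipw:.4f}")
print(f"variance of canonical gradient:     {var_eif:.4f}")
print(f"inner product of difference with canonical gradient: {cross:.2e}")
```

By Pythagoras, `var_ipw` equals `var_eif + gap`, which is the geometric reason the projected influence function cannot have larger variance.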

Alternative proof exploiting orthogonal decomposition of tangent space

What we'll do instead of finding something to project is actually quite clever and leverages the fact that we've already figured out the canonical gradient for an observational study. Let's start with some facts that we know: (1) we can write any score in $\mathcal M_{\text{obs}}$ as a unique orthogonal sum $h = h_{\langle \mathcal T_{RCT}\rangle} + h_{\langle \mathcal T^0_{A|X}\rangle}$ since those two tangent subspaces are orthogonal, and (2) $\phi^\dagger_{a,\text{RCT}} \perp h_{\langle \mathcal T^0_{A|X}\rangle}$ because the canonical gradient in the RCT model is in the RCT tangent space, which is orthogonal to $\mathcal T^0_{A|X}$.

Now let's calculate the pathwise derivative of $\psi_a$ in the direction $h = h_{\langle \mathcal T_{RCT}\rangle} + h_{\langle \mathcal T^0_{A|X}\rangle}$ through the observational study model. Our hope is that we can somehow end up writing the pathwise derivative as an inner product between $h$ and some function, which must then be the canonical gradient.

Because the pathwise derivative is a linear function of the score, we can break it up as follows:

\nabla_h \psi_a =\nabla_{h_{\langle \mathcal T_{RCT}\rangle}} \psi_a + \nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a

The first term is nothing but the pathwise derivative of $\psi_a$ in the RCT model for which we already have the canonical gradient $\phi^\dagger_\text{RCT}$. So we can represent that term as $E[h_{\langle \mathcal T_{RCT}\rangle}\phi^\dagger_\text{RCT}]$. But by our fact (2), this is equivalent to $E[h \phi^\dagger_\text{RCT}]$. Adding $h_{\langle \mathcal T^0_{A|X}\rangle}$ inside the expectation does nothing because it's orthogonal to the RCT canonical gradient. So now we've got

\nabla_h \psi_a =E[h \phi^\dagger_\text{RCT}]+ \nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a
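For readers who want the use of fact (2) spelled out, the replacement of the RCT component by the full score $h$ inside the expectation is just linearity of the expectation plus orthogonality. Written out, using only symbols already defined above:

\begin{align*} E[h \phi^\dagger_\text{RCT}] &= E[(h_{\langle \mathcal T_{RCT}\rangle} + h_{\langle \mathcal T^0_{A|X}\rangle}) \phi^\dagger_\text{RCT}] \\ &= E[h_{\langle \mathcal T_{RCT}\rangle} \phi^\dagger_\text{RCT}] + \underbrace{ E[h_{\langle \mathcal T^0_{A|X}\rangle} \phi^\dagger_\text{RCT}] }_{= 0 \text{ by fact (2)}} \\ &= E[h_{\langle \mathcal T_{RCT}\rangle} \phi^\dagger_\text{RCT}] \end{align*}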

Now we'll brute-force calculate the second term:

\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a = \lim_{\epsilon\rightarrow 0} \frac{\tilde P_{\epsilon}[Y|A=a] - P[Y|A=a]}{\epsilon}

Let's drop the $\epsilon$ subscripts for the moment. We also have that

\begin{align*} \tilde p &= (1+\epsilon h_{\langle \mathcal T^0_{ A|X}\rangle}(A,X))p \\ &= (1+\epsilon h_{\langle \mathcal T^0_{A|X} \rangle}(A,X)) \left(p_{Y|A,X} \times p_{A|X} \times p_{X} \right) \\ &= \underbrace{ p_{Y|A,X} }_{\tilde p_{Y|A,X}} \times \underbrace{ (1+\epsilon h_{\langle \mathcal T^0_{A|X} \rangle}(A,X))p_{A|X} }_{\tilde p_{A|X}} \times \underbrace{ p_{X} }_{\tilde p_X} \end{align*}

Because $h_{\langle\mathcal T_{A|X}^0\rangle}$ is in the tangent space of densities $p_{A|X}$, the only way to distribute that term to get three legal densities of the form above (that factorize $\tilde p$) is to put the $h_{\langle\mathcal T_{A|X}^0\rangle}$ term into $\tilde p_{A|X}$. What this shows is that walking along any paths that are in $\mathcal T^0_{A|X}$ actually doesn't change $p_{Y|A,X}$ or $p_X$. The fluctuated density has the same marginal density for the covariates, and the same conditional density for the outcome given treatment and covariates. The only thing walking along paths in $\mathcal T^0_{A|X}$ can change is the treatment assignment mechanism. This should make sense to you because these are exactly the paths we are forbidden to walk if we're constrained to the RCT model in which we must hold the treatment assignment mechanism constant but we're allowed to change anything else.

Going back to the expression for the pathwise gradient, we can show using iterated expectations that $\tilde P_{\epsilon}[Y|A=a] = \int \tilde \mu_a(X)d\tilde P_X$ where $\tilde \mu_a(X) = \tilde P_\epsilon[Y|A=a,X]$. However, by the argument above, $\tilde \mu_a = \mu_a$ and $d\tilde P_X = dP_X$ so, in fact, $\tilde P_{\epsilon}[Y|A=a] = P[Y|A=a]$. In words, we've shown that moving along paths in $\mathcal T^0_{A|X}$ does nothing to the parameter $\psi_a$. Therefore $\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a = 0$ and we can plug that into our calculation of the pathwise derivative for general $h$:

\nabla_h \psi_a =E[h \phi^\dagger_\text{RCT}]+ 0
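The vanishing second term can also be seen numerically. The sketch below is an illustration under assumptions, not part of the proof: it uses a random toy pmf over binary $X$, $A$, $Y$, builds a fluctuation $\tilde p = (1+\epsilon h)p$ with $h$ a score for $p_{A|X}$, and takes the parameter to be $\int \mu_a(X)\,dP_X$ as in the iterated-expectations step above. The helper names `factors` and `psi` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint pmf p[x, a, y] over binary X, A, Y (random illustrative numbers).
p = rng.random((2, 2, 2))
p /= p.sum()

def factors(q):
    """Return q(x), q(a | x), q(y | x, a) for a joint pmf q[x, a, y]."""
    q_x = q.sum(axis=(1, 2))
    q_xa = q.sum(axis=2)
    return q_x, q_xa / q_x[:, None], q / q_xa[:, :, None]

def psi(q, a):
    """Sum over x of E_q[Y | A=a, X=x] * q(x), for binary Y."""
    q_x, _, q_y_given_xa = factors(q)
    mu_a = q_y_given_xa[:, a, 1]   # P(Y=1 | A=a, X=x) equals E[Y | A=a, X=x]
    return float((mu_a * q_x).sum())

p_x, p_a_given_x, p_y_given_xa = factors(p)

# Score for p(a | x): a function of (x, a) with conditional mean zero given X.
g = rng.normal(size=(2, 2))
h = g - (g * p_a_given_x).sum(axis=1, keepdims=True)
h /= np.abs(h).max()               # bound |h| by 1 so that 1 + eps * h stays positive

eps = 0.1
p_tilde = (1 + eps * h[:, :, None]) * p   # fluctuated joint pmf

t_x, t_a_given_x, t_y_given_xa = factors(p_tilde)

print(np.allclose(t_x, p_x))                    # covariate marginal is unchanged
print(np.allclose(t_y_given_xa, p_y_given_xa))  # outcome regression is unchanged
print(np.allclose(t_a_given_x, p_a_given_x))    # treatment mechanism does change
print(np.isclose(psi(p_tilde, 1), psi(p, 1)))   # so the parameter does not move
```

Because the parameter is identical at every $\epsilon$ for which $\tilde p$ is a valid pmf, the difference quotient defining the pathwise derivative is exactly zero in this toy setting, not just zero in the limit.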

Since we've succeeded in expressing the pathwise derivative of this parameter along any path $h$ as an inner product between $h$ and a function $\phi^\dagger_\text{RCT}$, we have that this is in fact the canonical gradient of the conditional mean $\psi_a$ in the nonparametric observational study model $\mathcal M_{\text{obs}}$. Because that model is saturated, it has only one gradient, so $\phi^\dagger_\text{RCT}$ must be exactly the observational-study canonical gradient we already derived.

Other Methods

Combined with gradient algebra, point mass contamination and projection are two very useful strategies for deriving efficient influence functions. 

Unfortunately, there are some rare cases in which they don’t work. For example, if the tangent space factors, but not orthogonally, it becomes much more difficult to find the projection (though methods do exist for doing this). Another common approach is to find the influence function in the full-data (causal) model and then figure out a way to map it to the observed data (statistical) model (see Tsiatis 2006 for details). Also, in rare cases, the tangent set isn't even a full space (i.e. instead of being a hyperplane, it's some sort of "triangle" in that hyperplane) and this can also pose difficulties.

Nuisance Tangent Space

In many texts and resources you'll find mention of something called the "nuisance tangent space". This is a tool that is sometimes helpful in characterizing the set of influence functions, but it is by no means always necessary. Indeed, we don't use it in any of the examples in this chapter. Historically it played a much larger role in efficiency theory, which is why you'll see it mentioned a lot in the literature.

The definition of this space that you'll see most often relies on a semiparametric construction of the model, where every distribution is assumed to be uniquely described by a finite-dimensional vector of parameters $\psi$ that are of interest and some infinite-dimensional vector of parameters $\eta$ that are not of interest. For example, you might consider a model like $Y = A\psi + \mathcal N(\eta_1(X), \eta_2)$. When you have this kind of construction, you can calculate scores for each parameter and define the nuisance tangent space as the completed span of the scores for $\eta$.

I think this construction is sort of artificial. The definition I like more is that the nuisance tangent space is the completed span of all scores $h$ such that $\nabla_h \psi = 0$. This more general definition is due to Mark van der Laan. The nuisance tangent space is usually denoted $\mathcal T_\eta$ or $\Lambda$, and it is a subspace of $\mathcal T$. It's immediate from our definition that any influence function is orthogonal to any element of $\Lambda$. If we denote the orthogonal complement of $\Lambda$ using the notation $\Lambda^\perp$, then we have that $\phi \in \Lambda^\perp$. Knowing this is sometimes useful in deriving the efficient influence function, but we don't use that fact in any of the examples in this book.
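To see why the orthogonality is immediate, recall that any influence function $\phi$ is a gradient, so it represents the pathwise derivative as an inner product with the score. A short worked version, for scores $h, h_1, \dots, h_k$ with zero pathwise derivative and arbitrary constants $c_j$ (the $h_j$ and $c_j$ are notation introduced here only for illustration):

\begin{align*} E[\phi h] &= \nabla_h \psi = 0 \\ E\Big[\phi \sum_{j=1}^k c_j h_j\Big] &= \sum_{j=1}^k c_j \nabla_{h_j} \psi = 0 \end{align*}

Limits of such linear combinations inherit the same property by continuity of the inner product, which covers the completed span.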

None of this is anything you should worry about unless you plan to work on esoteric new parameters and model spaces for which the canonical gradient has not yet been derived. You probably won’t come up against anything that requires tools beyond what’s in this section, but your mileage may vary!
