In the previous section we learned that the canonical gradient is also the influence function of the most efficient RAL estimator (the efficient influence function, EIF). So, if we want to know what that estimator is (so we can use it!), the first step is to derive the canonical gradient of our statistical estimand in our statistical model. Since the point of doing this is to arrive at the EIF, we typically just say we’re “deriving the EIF”; it’s the same thing either way.
Knowing the relationships between the tangent space, pathwise derivative, etc. etc. in abstract generality doesn't immediately tell us what the EIF is for any specific problem. Basically, we know in theory what it is. But, given a particular statistical model and parameter, how do we actually find it? There are many ways to do this.
Things are typically easier in the fully nonparametric (i.e. saturated) setting. Since there are no restrictions on the tangent space, we can “move” in any direction and that ends up making it easier to derive the EIF. We’ll tackle this case first. There are a few tricks we’ll go over to make this even easier.
When the statistical model is semiparametric (i.e. nonsaturated) things are harder because there are certain directions we can’t go without departing the tangent space. That makes the math more hairy. Thankfully, due to the factorization of the tangent space, there is a simple-enough method that works in most practical cases.
That said, no one method works for all models and estimands. If you’re doing something very different from what’s already out there you might need new and interesting math!
Before we get to any of that, though, we’ll lay out some generic algebraic tools that will help us build up complicated EIFs from simpler ones.
Gradient Algebra
Sometimes your statistical parameter can be expressed as a fixed function of another statistical parameter, or as a sum of two other statistical parameters, etc. In these cases, it's often easier to find the canonical gradient of the component parameters and then combine them to get the canonical gradient for the original target parameter. Thankfully, it's really easy to do that!
The trick is to realize that the pathwise derivative (for a valid path given by $h$) is just an ordinary unidimensional derivative for a function of $\epsilon$. In brief: $\nabla_h \psi = \frac{d \psi(\tilde P_\epsilon)}{d \epsilon}$. So we can directly apply all of the relevant results from undergraduate calculus: the chain rule, addition of derivatives, etc.
We’ll go through the chain rule as an example. Imagine that you know the canonical gradient for a parameter $\psi$ but what you're really interested in estimating is the quantity $g(\psi)$. Well, check this out:
$\nabla_h g(\psi(P))
\overset{\text{def.}}{=}
\frac{d g(\psi(\tilde P_\epsilon))}{d \epsilon}
\overset{\text{chain rule}}{=}
g'(\psi(P))
\frac{d \psi(\tilde P_\epsilon)}{d \epsilon}
\overset{\text{Riesz rep.}}{=}
\underbrace{g'(\psi(P))}_{\text{constant}} E[h\phi^\dagger]
\overset{\text{linearity}}{=}
E\big[h
\underbrace{
\phi^\dagger g'(\psi(P))
}_{\phi^\dagger_{g \circ \psi}}
\big]$
This looks intimidating but each step is something we're already familiar with, so it's just about putting it together. At the end of the day, what we've shown is that the function $g'(\psi(P))\phi^\dagger$ is exactly the Riesz representer for $\nabla_h g(\psi(P))$ and is thus by definition the canonical gradient of $g(\psi)$. Note that multiplying by the constant $g'$ is a linear operation so we can't have left the tangent space (which hasn't changed) and somehow obtained a noncanonical gradient. We therefore have a simple formula (effectively the chain rule) to compute gradients for functions of parameters.
It helps to think of an “EIF operator” $\Phi$ for a given model that takes a parameter $\psi$ and returns its efficient influence function $\phi$. We can write out a few useful algebra rules this way:
$\begin{align*}
\Phi(g(\psi)) &= g'(\psi)\Phi(\psi)
& \quad \text{(chain rule)}
\\
\Phi(\psi_1 \psi_2) &= \Phi(\psi_1)\psi_2 + \psi_1\Phi(\psi_2)
& \quad \text{(product rule)}
\\
\Phi(a_1\psi_1+a_2\psi_2) &= a_1\Phi(\psi_1) + a_2\Phi(\psi_2)
& \quad \text{(linearity)}
\end{align*}$
We’ll call this “gradient algebra”: it’s a generically useful set of tools that we can use to build up canonical gradients from simpler pieces, the same way we can use the chain rule, etc. to build up complex derivatives from simpler ones. We’ll see some examples in a little bit.
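To make this concrete, here is a small numerical sketch (plain NumPy; all names are illustrative) that checks the chain rule on a discrete distribution: the finite-difference pathwise derivative of $g(\psi)$ along a random score should match $E[h\,g'(\psi)\phi^\dagger]$. It borrows the fact, derived later in this section, that $Z - E[Z]$ is the canonical gradient of the mean.

```python
import numpy as np

# Numerical sanity check of the chain rule Phi(g(psi)) = g'(psi) Phi(psi),
# using psi(P) = E[Z] on a small discrete distribution and g(t) = t^2.
z = np.array([0.0, 1.0, 2.0])        # support of Z
p = np.array([0.2, 0.5, 0.3])        # pmf of P

rng = np.random.default_rng(0)
h = rng.normal(size=3)
h -= np.sum(h * p)                   # center h so it is a valid score: E[h] = 0

def psi(q):                          # psi = E[Z] under the pmf q
    return np.sum(z * q)

g = lambda t: t ** 2                 # the function of the parameter

# pathwise derivative of g(psi) along the path (1 + eps*h) * p
eps = 1e-6
lhs = (g(psi((1 + eps * h) * p)) - g(psi(p))) / eps

# chain rule prediction: E[h g'(psi) phi] with phi = Z - E[Z] and g'(t) = 2t
phi = z - psi(p)
rhs = np.sum(h * 2 * psi(p) * phi * p)
print(abs(lhs - rhs))                # tiny: just finite-difference error
```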
Saturated Models
Life is generally good when our model is fully nonparametric. There is only one influence function, and it’s the efficient one. All we need to do is to find it.
The material in this section closely follows and condenses Kennedy 2022 and Hines 2021, though the methods have been around for much longer!
Point Mass Contamination
The trick most people use in this setting is called point mass contamination (or sometimes the Gateaux derivative method). The idea is to 1) pretend that all the variables in your model are discrete and then 2) consider a particular kind of path where the "destination" $\tilde P$ is a distribution that places all of its mass at some point $\tilde z$ (I'll use $1_{\tilde z}$ to notate such a distribution). In other words, this distribution enforces $\tilde P(Z=\tilde z) = 1$ and the probability that $Z$ takes any other value is 0. We can therefore think of the path $\tilde P_\epsilon$ for small $\epsilon$ as the distribution $P$ "contaminated" with just a little extra mass at the value $\tilde z$.
Because our model is saturated, we can pick any point $\tilde z$ and get a legal path of this type. If our model were not saturated at $P$, then there would be no guarantee that for some given $\tilde z$ a path like this wouldn't immediately take us outside the model space.
Let the score of such a path towards a distribution $1_{\tilde z}$ be denoted $h_{\tilde z}$. For paths of this type, we can argue that the value of the influence function at the point $\tilde z$ is given by the derivative of $\psi$ at 0 in the direction of the score $h_{\tilde z}$. Or, formally (and dropping tildes): $\phi(z) = \nabla_{h_z} \psi$. This is extremely convenient! All we have to do to get an influence function is brute-force compute the pathwise derivative of our parameter along paths defined by point mass contaminants. And since there’s only one influence function, if we find one, we’ve found the efficient one.
Proof of $\phi(z) = \nabla_{h_{\tilde z}} \psi$
Recall that our definition of the score and path is $\tilde P_\epsilon(A) = \int_A (1+\epsilon h) dP$. The left-hand side is equivalent to $\int_A d\tilde P_\epsilon$ and we can exploit our change-of-variables formula to give $\int \phi d\tilde P_\epsilon = \int \phi (1+\epsilon h) dP$ (as long as $\epsilon < 1$, else we lose absolute continuity). Now $\int \phi (1+\epsilon h) dP = \int \phi dP + \epsilon \int \phi h dP = 0 + \epsilon E[\phi h]$ by the zero-mean property of $\phi$ and linearity of the integral. Thus $\int \phi d\tilde P_\epsilon = \epsilon E[\phi h]$. The definition of the path in terms of a convex combination of CDFs implies that $\tilde P_{\epsilon} \rightsquigarrow \tilde P$ in the limit as $\epsilon \rightarrow 1$, so by the portmanteau lemma $\int \phi d\tilde P_\epsilon \rightarrow \int \phi d\tilde P$. Moreover, clearly $\epsilon E[\phi h] \rightarrow E[\phi h]$ in that same limit. So we have $\int \phi d\tilde P = E[\phi h]$.
If $\tilde P = 1_{\tilde z}$ is a point mass at $\tilde z$, then the integral on the left-hand side here is exactly $\phi(\tilde z)$ and $h = h_{\tilde z}$. Thus $E[\phi h_z] = \phi(z)$ (dropping the tildes). Finally, by our central identity for influence functions, $\nabla_{h_z}\psi = E[\phi h_z] = \phi(z)$ as desired.
Why is it called an "influence function"?
When we first presented influence functions we weren't in a position to explain why they have that name, but now we are! In a saturated model, our arguments above show that $\phi(z) = \nabla_{h_z} \psi$ for discrete distributions. Using its definition we can expand that derivative as follows:
$\phi(z) = \lim_{\epsilon \rightarrow 0}
\frac{
\psi\big((1-\epsilon)P + \epsilon 1_z\big) - \psi(P)
}{
\epsilon
}$
so what $\phi(z)$ tells us is how much the parameter $\psi$ changes if we add an infinitesimal amount of probability mass at the point $z$. Or, you might say, it is the "influence" that the point $z$ exerts on the parameter (at $P$). That explains the name!
We started out by assuming that our data were actually discrete, so we now have to actually go back and check that the influence function we derived still works in the original model. But this is much easier once we have our candidate: just compute $\nabla_{h} \psi$ and $E[\phi h]$ and check the two are equal for arbitrary $h$ in the tangent space $\mathcal L_2^0$.
The whole procedure may seem slightly abstract at the moment but we will see examples shortly. You also don’t have to worry about doing this manually most of the time because you can use gradient algebra to build up your result from the examples provided here!
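For discrete data the whole recipe can even be automated numerically. Here is a minimal sketch (illustrative code, not a general-purpose algorithm): contaminate the pmf at each support point, take a finite-difference derivative, and read off a candidate influence function.

```python
import numpy as np

def gateaux_eif(psi_fn, p, eps=1e-6):
    """Candidate EIF at each support point via point mass contamination.

    psi_fn maps a pmf (a 1-d array over the support) to the parameter
    value. This is a heuristic numerical sketch for discrete data only.
    """
    base = psi_fn(p)
    out = np.empty_like(p)
    for i in range(len(p)):
        p_tilde = (1 - eps) * p      # shrink P a little ...
        p_tilde[i] += eps            # ... and add mass eps at point i
        out[i] = (psi_fn(p_tilde) - base) / eps
    return out

# e.g. for psi(P) = E[Z] this recovers the familiar Z - E[Z]:
z = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])
candidate = gateaux_eif(lambda q: np.sum(z * q), p)
print(candidate)                     # ≈ [-1.1, -0.1, 0.9], i.e. z - E[Z]
```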
Example: Mean
Consider the estimand $\psi(P) = E[Z]$ where $Z$ can have any distribution $P$ in a nonparametric model. What’s the efficient influence function?
Point Mass Contamination
First, start by pretending $Z$ is discrete and only takes values $z$ in some set $\mathcal Z$, with probability mass $p(z)$ at each point. Then the formula for our estimand is $E[Z] = \sum_{z\in\mathcal Z} zp(z)$.
Now let’s add a tiny little bit ($\epsilon$) of mass to a particular point $\tilde z$ to get a new probability mass function: $\tilde p_\epsilon(z) = (1-\epsilon)p(z) + \epsilon 1_{\tilde z}(z)$. Technically a point-mass path like this isn’t quite legal (we’re supposed to move smoothly through the model), but we can ignore that: all of this is a heuristic to get us a candidate influence function, which we verify afterwards. Now let’s compute our estimand for this perturbed distribution:
$\psi(\tilde p_\epsilon) = E_{\tilde p_\epsilon}[Z] = \sum_{z\in\mathcal Z} z\big((1-\epsilon)p(z) + \epsilon 1_{\tilde z}(z)\big)$
Recall our definition $\nabla_{h} \psi = \left. \frac{d}{d\epsilon} \psi(\tilde p_\epsilon) \right|_{\epsilon=0}$. All we have to do to compute this is take the derivative w.r.t. $\epsilon$ of the right-hand side above and then set $\epsilon = 0$. We get:
$\begin{align*}
\frac{d}{d\epsilon} \psi(\tilde p_\epsilon)
&=
\sum_z z(1_{\tilde z}(z) - p(z)) \\
&= \tilde z - \psi(p)
\end{align*}$
This is typical in EIF derivations of this kind: we end up with a term that is the estimand itself (in this case $\sum zp(z)$) and some other terms where the indicator function at $\tilde z$ cancels out all of the terms in the sum over $z$ except for the term at $z=\tilde z$.
Thus we arrive at our candidate EIF:
$\phi(Z) = Z - E[Z]$
Checking the Candidate
Now we can check that for any $h \in \mathcal L_2^0$ our central identity $\nabla_h \psi = E[\phi h]$ is satisfied:
From the left:
$\begin{align*}
\nabla_h\psi &=
\frac{d}{d\epsilon} \psi((1+\epsilon h)p)
\\ &=
\frac{d}{d\epsilon}
\int z(1+\epsilon h(z))p(z)dz
\\ &=
\int z h(z) p(z)dz
\\ &= E[Zh(Z)]
\end{align*}$
More generally…
$\begin{align*}
\nabla_h\psi &=
\frac{d}{d\epsilon} \psi(\tilde P_\epsilon)
\\ &=
\frac{d}{d\epsilon}
\int Z d\tilde P_\epsilon
\\ &=
\frac{d}{d\epsilon}
\int Z(1+\epsilon h(Z)) dP
\\ &=
\int Z h(Z) dP
\\ &=
E[ Z h(Z)]
\end{align*}$
And from the right:
$\begin{align*}
E[\phi h]
&=
E[(Z-E[Z])h(Z)]
\\&=
E[Zh(Z)]-E[Z]\underbrace{E[h(Z)]}_0
\end{align*}$
Why does having a candidate EIF make this easier?
Consider the derivation above. Working from the result on the left-hand side we could have noticed that $E[Zh(Z)] = E[Zh(Z)]-E[Z]E[h(Z)]$ because $E[h(Z)]=0$ and then proceeded backwards up the derivation on the right-hand side. That would have gotten us our influence function without ever supposing a candidate.
The problem is that this is a “clever trick” that is really only obvious in retrospect. If we were working from the left-hand side, how would we know to subtract zero in the guise of $E[Z]E[h]$? There are a million other things we might have tried if we didn’t have the right intuition from solving a lot of these problems in the past. Moreover, this is a very simple example; in the general case the proof might involve some very unintuitive techniques. Basically: if you don’t already know where you’re going, it’s very hard to get to the right answer.
On the other hand, with a candidate EIF we can just plug and chug from the right-hand side and meet up in the middle. No creativity or special intuition required.
Thus we are done and have proved that $\phi(Z) = Z - E[Z]$ is the EIF for $\psi(P) = E[Z]$ in the nonparametric model. You may notice this is exactly the influence function of the sample mean estimator, which shows that the sample mean is nonparametrically efficient.
Example: Conditional Mean
Consider $\psi(P) = E[Y|X=x]$ for some discrete variables $Y$ and $X$ and a given value $x$. As in the previous example, we can express this estimand as $\sum_y y p(y|x)$. The steps will be the same as before: define $\tilde p_\epsilon$ as the point-mass contaminated version of $p$, compute the derivative of $\psi(\tilde p_\epsilon)$ w.r.t. $\epsilon$, then set $\epsilon = 0$.
Here we can define
$\begin{align*}
\tilde p_\epsilon(y|x)
&= \frac{\tilde p_\epsilon(y,x)}{\tilde p_\epsilon(x)}
\\&=
\frac
{(1-\epsilon)p(y,x) + \epsilon 1_{\tilde y, \tilde x}(y,x)}
{(1-\epsilon)p(x) + \epsilon 1_{\tilde x}(x)}
\end{align*}$
Plugging this in, we can compute $\sum_y y \tilde p_\epsilon(y|x)$ and take the derivative. I’ll spare you a few lines of algebra (see Kennedy 2022 if you need it) and tell you that what we end up with is
$\phi(Y,X) =
\frac{1_x(X)}{p(X=x)}(Y-E[Y|X=x])$
You can verify for yourself that this influence function still holds if $Y$ is allowed to have an arbitrary (nondiscrete) distribution by following the same steps from the example above.
We can also slightly relax the restriction on $X$ to allow arbitrary distributions that have nonzero mass at the point $x$. If there is no mass at $x$, the above influence function is undefined. Indeed, it is known that the general conditional mean is not a pathwise differentiable estimand, so this makes sense.
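If you'd rather not trust the algebra, here is a numerical sketch (a made-up discrete joint pmf; illustrative names) that checks the conditional-mean formula against a finite-difference Gateaux derivative at every support point.

```python
import numpy as np

# Numerically verify the conditional-mean EIF
#   phi(Y, X) = 1{X = x}/P(X = x) * (Y - E[Y | X = x])
# on a small discrete joint pmf.
ys = np.array([0.0, 1.0, 2.0])
p = np.array([[0.10, 0.05],      # p[y, x] = P(Y = ys[y], X = x)
              [0.20, 0.15],
              [0.30, 0.20]])
x0 = 1                           # the conditioning value x

def psi(q):                      # E[Y | X = x0] under joint pmf q
    return np.sum(ys * q[:, x0]) / np.sum(q[:, x0])

eps = 1e-7
numeric = np.empty_like(p)
for iy in range(p.shape[0]):
    for ix in range(p.shape[1]):
        q = (1 - eps) * p
        q[iy, ix] += eps         # point mass contamination at (y, x)
        numeric[iy, ix] = (psi(q) - psi(p)) / eps

px0 = p[:, x0].sum()             # P(X = x0)
mu = psi(p)                      # E[Y | X = x0]
formula = np.array([[(1.0 if ix == x0 else 0.0) / px0 * (ys[iy] - mu)
                     for ix in range(2)] for iy in range(3)])
print(np.max(np.abs(numeric - formula)))   # small finite-difference error
```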
Building up Influence Functions
This point-mass contamination strategy works perfectly well for complicated estimands but it can take a bunch of algebra (see, e.g., Hines 2022). Instead of reinventing the wheel every time, it’s often easier to use our gradient algebra tricks to build up a complicated influence function from simpler component parts like those in the examples above.
So far we have EIFs for the mean $P(Z) \mapsto E[Z]$
$\begin{align*}
\Phi\big(P\mapsto E[Z]\big) &= Z - E[Z]
\end{align*}$
and for the conditional mean $P(Y,X) \mapsto E[Y|X=x]$:
$\begin{align*}
\Phi\big(P\mapsto E[Y|X=x]\big) &= \frac{1_x(X)}{p(X=x)}(Y-E[Y|X=x])
\end{align*}$
Let’s put these together with our gradient algebra rules in an example.
Example: ATE in an Observational Study
Here we're interested in the general nonparametric model for an observational study with an outcome $Y$, binary treatment $A$, and vector of arbitrary covariates $X$. Our parameter of interest is the statistical ATE
$\psi = \underbrace{E[\mu_1(X)]}_{\psi_1} - \underbrace{E[\mu_0(X)]}_{\psi_0}$
where $\mu_a(X) = E[Y|X,A=a]$.
By linearity, we can get the EIF of $\psi$ by taking the difference of the EIFs of $\psi_1$ and $\psi_0$. So let’s figure out what $\Phi(\psi_a)$ is, keeping $a$ generic.
First, pretend our variables are discrete so $\psi_a = \sum_x \mu_a(x)p(x)$. Now use the sum and product rules:
$\Phi(\psi_a) = \sum_x \Big[\Phi\big(\mu_a(x)\big)p(x) + \mu_a(x)\Phi\big(p(x)\big)\Big]$
We have an expression for the influence function of $\mu_a(x)$ because that’s just a conditional mean. We also have the influence function of $p(x)$ because we can write $p(X=x) = E[1_x(X)]$. The term $1_x(X)$ is just a particular random variable so we can directly apply our result for the influence function of a mean. As a result:
$\begin{align*}
\Phi(\psi_a) &=
\sum_x \left(
\frac{1_{a,x}(A,X)}{p(A=a,X=x)}(Y-\mu_a(x))
\right) p(x) \\
&\quad + \sum_x \mu_a(x)\big(1_x(X) - p(X=x)\big) \\
&= \frac{1_a(A)}{\pi_a(X)}\big(Y-\mu_a(X)\big) + \mu_a(X) - \psi_a
\end{align*}$
In the last line we used the fact that $p(a,x) = p(a|x)p(x)$ and we defined the propensity score $\pi_a(x) = P(A=a|X=x)$. As often happens, we collect a term that turns out to be equal to our estimand: $\sum_x \mu_a(x) p(x) = \psi_a$.
This influence function holds for discrete data. To check that the candidate works for general distributions, the trick is to “factorize” the generic score $h = h_{Y|A,X} + h_{A|X} + h_X$ using the tangent space factorization we discussed in the previous section. The details involve some algebra, which you can check at your leisure.
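Before the formal check, here is a numerical sketch of the discrete case (binary $Y$, $A$, $X$ with a made-up joint pmf; illustrative code): the finite-difference Gateaux derivative at every support point should match the candidate above.

```python
import numpy as np
from itertools import product

# Numerically check the candidate EIF for psi_a = E[mu_a(X)] with binary
# Y, A, X via point mass contamination of the joint pmf.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 1.0, size=(2, 2, 2))    # p[y, a, x], a random pmf
p /= p.sum()
a_star = 1                                    # the treatment level "a"

def psi(q):                                   # sum_x mu_a(x) q(x)
    px = q.sum(axis=(0, 1))                   # marginal pmf of X
    mu = q[1, a_star, :] / q[:, a_star, :].sum(axis=0)  # E[Y|A=a,X=x], Y binary
    return np.sum(mu * px)

eps = 1e-6
base = psi(p)
max_err = 0.0
for y, a, x in product(range(2), repeat=3):
    q = (1 - eps) * p
    q[y, a, x] += eps                         # contaminate at (y, a, x)
    numeric = (psi(q) - base) / eps
    # candidate: 1{A=a}/pi_a(x) * (y - mu_a(x)) + mu_a(x) - psi_a
    px = p.sum(axis=(0, 1))
    pi = p[:, a_star, x].sum() / px[x]        # propensity P(A=a | X=x)
    mu = p[1, a_star, x] / p[:, a_star, x].sum()
    formula = (a == a_star) / pi * (y - mu) + mu - base
    max_err = max(max_err, abs(numeric - formula))
print(max_err)                                # small finite-difference error
```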
Checking the candidate
We’ll check for $\psi_0$ since the proof for $\psi_1$ is the same. From here on we’ll omit the subscript and just write $\psi = \psi_0$ for notational brevity. First, we write the directional derivative for $\psi$ and a general $h$
$\begin{align*}
\nabla_{h}\psi
&=
\frac{d}{d\epsilon}
\int_x \int_y
y\tilde p_\epsilon(y|x,0) dy
\ \
\tilde p_\epsilon(x)dx
\end{align*}$
where the factors of $\tilde p_\epsilon$ will depend on $h$; see the section on tangent space factorization on the previous page.
For the covariance of our candidate EIF with general $h$ we have:
$\begin{align*}
E[\phi h] &= E[(\phi+\psi)h] - \cancel{E[\psi h]} \\
&=
\int_x \sum_a \int_y
\left[ \frac{1_0(a)}{\pi_0(x)}\big(y-\mu_0(x)\big) + \mu_0(x)
\right]
h \ \
p(y|x,a)dy \ \
p(a|x) \ \
p(x)dx
\\
&=
\int_x \sum_a \int_y
\left[ \frac{1_0(a)}{\pi_0(x)}\big(y-\mu_0(x)\big)
\right]
h \ \
p(y|x,a)dy \ \
p(a|x) \ \
p(x)dx
\\
&+
\int_x \sum_a \int_y
\mu_0(x)
h \ \
p(y|x,a)dy \ \
p(a|x) \ \
p(x)dx
\\
&=
\int_x \int_y
\left[ \big(y-\mu_0(x)\big)
\right]
h \ \
p(y|x,0)dy \ \
p(x)dx
\\
&+
\int_x \mu_0(x)\sum_a \int_y
h \ \
p(y|x,a)dy \ \
p(a|x) \ \
p(x)dx
\\
&=
E\left[E\left[\left.\big(Y-\mu_0(X)\big)h\ \right| A=0, X\right]\right]
+
E\big[\mu_0(X) E[h|X]\big]
\end{align*}$
Since our distribution factorizes as $p = p_{Y|A,X}\,p_{A|X}\,p_X$, we can write $h = h_{Y|A,X} + h_{A|X} + h_X$ for an arbitrary score. Thus:
$\begin{align*}
\nabla_h\psi
&=
&\nabla_{h_{Y|A,X}}\psi&\quad +&
&\nabla_{h_{A|X}}\psi& +&
&\nabla_{h_{X}}\psi
\\
E[\phi h]
&=
&E[\phi h_{Y|A,X}]&\quad +&
&E[\phi h_{A|X}]& +&
&E[\phi h_{X}]
\end{align*}$
What we’ll do is show that each of the vertically stacked terms are equal to each other. I’ll show you in some detail for the $h_X$ terms, but I’ll just give you the result for the $h_{A|X}$ and $h_{Y|A,X}$ terms and leave it to you to verify the algebra by using the special properties of the scores in each of these subspaces.
For the $h_X$ terms:
$\begin{align*}
\nabla_{h_X}\psi
&=
\frac{d}{d\epsilon}
\int_x \int_y
y p(y|x,0) dy
\ \
(1+\epsilon h_X)p(x)dx
\\
&=
\int_x \int_y
y p(y|x,0) dy
\ \
h_X p(x)dx
\\
&= E[\mu_0 h_X]
\end{align*}$
$\begin{align*}
E[\phi h_X]
&=
E\left[E\left[\left.\big(Y-\mu_0(X)\big)\ \right| A=0, X\right]h_X\right]
+
E\big[\mu_0(X) E[h_X|X]\big]
\\
&=
E\left[\big(\mu_0(X)-\mu_0(X)\big)h_X\right]
+
E\big[\mu_0(X) h_X\big]
\\
&=
E[\mu_0 h_X]
\end{align*}$
For the $h_{A|X}$ terms: verify that $\nabla_{h_{A|X}}\psi = 0 = E[\phi h_{A|X}]$
For the $h_{Y|A,X}$ terms: verify that $\nabla_{h_{Y|A,X}}\psi = E\big[E[Y h_{Y|A,X} \mid A=0, X]\big] = E[\phi h_{Y|A,X}]$
Since all three sets of terms are equal, this completes the proof that $\phi$ is an influence function for $\psi_0$. Thus we have the same result for the ATE because of gradient summation. Since the model is saturated, this influence function is the only one and is therefore the EIF.
Non-Saturated Models
If we have a tangent set that is not all of $\mathcal L_2^0$ things get a bit trickier. Here we’ll discuss what’s most often called the projection approach.
The idea is to first find an existing estimator that is known to be RAL and figure out its influence function $\phi$. Then, taking advantage of the geometry of the problem, we know that the projection of $\phi$ onto the tangent space $\mathcal T$ is the EIF. In this case, finding the EIF is just a matter of computing a projection in $\mathcal L _2^0$. Note that this approach does not apply if the tangent set is $\mathcal L_2^0$ because in that case there is only a single valid influence function. Therefore if we had an existing RAL estimator, it would already be efficient.
By "projection" what we mean is the mathematical decomposition of an element of a vector space into a sum of an element of a particular subspace and a vector orthogonal to that subspace. The properties of Hilbert space guarantee that every element has a unique projection onto any given subspace (see: projection in Hilbert space). In the finite-dimensional vector spaces you're probably used to, calculating projections can be tedious, but it is relatively straightforward vector algebra. In $\mathcal L_2^0$ there isn't a general-purpose formula for an arbitrary subspace. Thankfully, however, there are a few for the kinds of subspaces we're usually interested in. To wit, we'll consider two kinds of specific subspaces. Imagine our data $Z$ has two components $X$ and $Y$ (i.e. $Z = [X,Y]$) so the possible scores are zero-mean functions $h(X,Y)$.
One important subspace is the set of functions where $h(X,Y) = h(X)$, i.e. the functions that just depend on $X$. We'll call this subspace $\mathcal T_X$. We already saw a subspace exactly like this come up in the example in the previous section. Let $h_{\langle \mathcal T\rangle}$ denote the projection of $h$ onto a subspace $\mathcal T$ (other authors prefer notation like $h \Pi \mathcal T$, etc.). It turns out that:
$[h(X,Y)_{\langle \mathcal T_X\rangle}](X)
=
E[h|X]$
The notation is a little confusing, but what I'm trying to say is that when you project the function of two variables onto this space you get back a function of just one variable.
The other important kind of subspace that we'll consider is $\mathcal T^0_{Y|X} = \{h(X,Y) : E[h|X]=0 \}$. In words: we're talking about the space of functions that have mean 0 when conditioned on any value of $X$. We also saw a subspace like this one come up in the example above. Once again we have a handy formula:
$h_{\langle \mathcal T^0_{Y|X}\rangle}
=
h - E[h|X]$
Here the resulting projection is still a function of both $X$ and $Y$ so I've omitted the explicit notation of the arguments.
You should verify both of these identities by checking the two properties of projections: $h_{\langle \mathcal T \rangle} \in \mathcal T$ and $h - h_{\langle \mathcal T \rangle} \perp \mathcal T$.
We can now combine these two formulas. Imagine that we have data of the form $Z = [X,Y,W]$ and we want to project a function $h(X,Y,W)$ onto the space $\mathcal T^0_{Y|X}$. The functions in this subspace only depend on $X$ and $Y$, so first we need to use our first identity to project $h(X,Y,W)$ down to a function of just $X$ and $Y$. Then we can use the second identity to project the result into the space of functions where the conditional expectation given $X$ is zero. The result is:
$[h(X,Y,W)_{\langle \mathcal T_{Y|X}^0\rangle}](X,Y)
=
E[h|X,Y] - E[h|X]$
This is facilitated by the fact that $E[E[h|X,Y]|X] = E[h|X]$ (i.e. we average over $W$, then $Y$, so all that's left is $X$).
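These projection identities are easy to sanity-check numerically. The sketch below (a made-up discrete triple $(X,Y,W)$; illustrative code) verifies both defining properties of the combined projection: it lies in $\mathcal T^0_{Y|X}$, and the residual is orthogonal to that subspace.

```python
import numpy as np

# Check the combined projection h -> E[h|X,Y] - E[h|X] on discrete data.
rng = np.random.default_rng(5)
p = rng.uniform(0.1, 1.0, size=(3, 3, 3))   # p[x, y, w], a random joint pmf
p /= p.sum()
h = rng.normal(size=(3, 3, 3))
h -= np.sum(h * p)                           # make h a score: E[h] = 0

pxy = p.sum(axis=2)                          # p(x, y)
px = pxy.sum(axis=1)                         # p(x)
E_h_xy = (h * p).sum(axis=2) / pxy           # E[h | X, Y]
E_h_x = (E_h_xy * pxy).sum(axis=1) / px      # E[h | X] via the tower property
proj = E_h_xy - E_h_x[:, None]               # the claimed projection

# (1) proj lies in the subspace: E[proj | X] = 0 for every x
in_subspace = np.max(np.abs((proj * pxy).sum(axis=1) / px))
print(in_subspace)                           # ~0

# (2) the residual is orthogonal to every g(X, Y) with E[g | X] = 0
g = rng.normal(size=(3, 3))
g -= ((g * pxy).sum(axis=1) / px)[:, None]   # center g within each x
resid = h - proj[:, :, None]
orth = abs(np.sum(resid * g[:, :, None] * p))
print(orth)                                  # ~0
```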
Example: ATE in a Randomized Trial
Let's go back to our running example. In the previous section we derived the tangent space for $P \in \mathcal M_{\text{RCT}}$. We'd like to find the EIF of the ATE in this model.
We know that this model space is not saturated because any probability distribution in it has to satisfy the known treatment assignment mechanism. We'll therefore have to use the projection strategy to find the canonical gradient.
Our overall plan is as follows:

1. Find the canonical gradient for $\psi_a = E[Y|A=a]$:
    1. Start with a known RAL estimator of $\psi_a$
    2. Derive its influence function
    3. Project that influence function onto the tangent space we found previously (this is the hard part):
        1. Project onto each tangent subspace
        2. Sum the projections
2. Combine the canonical gradients for $\psi_0$ and $\psi_1$ to get the equivalent for $\psi$
First things first: do we know any RAL estimator of $\psi_a$? It turns out that we do: $\hat\psi_{\text{IPW}}(\mathbb P_n) = \frac{1}{n} \sum_i \left[\frac{1_a(A_i)Y_i}{\pi_a(X_i)}\right] = \mathbb P_n \left[\frac{1_a(A)Y}{\pi_a(X)}\right]$ where $\pi_a(X) = P(A=a|X)$ is the known randomization mechanism. This is the inverse-probability-weighted (IPW) estimator of the mean of $Y$ conditioned on $A=a$. For a trial with simple randomization, the corresponding ATE estimator reduces to a difference of means in the two treatment groups.
It only takes algebra to check that this estimator is asymptotically normal: just subtract $\psi(P)$ from both sides, multiply by $\sqrt n$, and pass the constant $\psi_a$ through the sum to see that
$\sqrt n \left(
\hat\psi_{\text{IPW}}(\mathbb P_n)
- \psi_a(P)
\right)
=
\sqrt n
\mathbb P_n
\left[
\underbrace{
\frac{1_a(A)Y}{\pi_a(X)}
- \psi_a(P)
}_{
\phi_a
}
\right]$
By putting it in this form and comparing to our definition of asymptotic linearity we've managed to pick out the influence function $\phi_a$. It's also possible to prove that this estimator is regular. One way to do that is to essentially repeat the arguments made in the proof of our main theorem of why influence functions for RAL estimators must satisfy $\nabla_h\psi = E[\phi h]$ for all $h$, except in reverse: for regularity to hold, we need precisely to show that this identity holds for all scores in the tangent space.
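As a concrete illustration, here is a toy simulation (the data-generating process is made up for this example) that computes the IPW estimate of $\psi_1$ and the standard error implied by its influence function $\phi_a$.

```python
import numpy as np

# Simulate a simple randomized trial and compute the IPW estimate of
# psi_1 = E[Y | A = 1] together with its influence-function-based SE.
rng = np.random.default_rng(3)
n = 100_000
X = rng.normal(size=n)
pi1 = 0.5                                  # known randomization: P(A=1|X) = 0.5
A = rng.binomial(1, pi1, size=n)
Y = 1.0 + 0.5 * X + A + rng.normal(size=n) # made-up outcome model; psi_1 = 2

phi_hat = A * Y / pi1                      # the 1_a(A) Y / pi_a(X) terms
psi_hat = phi_hat.mean()                   # the IPW estimator
se = (phi_hat - psi_hat).std(ddof=1) / np.sqrt(n)   # SE from the IF
print(psi_hat, se)                         # psi_hat close to 2.0
```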
Let's project the influence function $\phi_a$ we just derived above onto the tangent space we previously identified.
Our strategy will be to project $\phi_a$ into each of these component subspaces and then take the sum. Thankfully, these subspaces are exactly of the form we discussed above, so we can use the projection identities we posited above (and which you checked, right?). Computing these, we get that
$\begin{align*}
\phi_{a\langle \mathcal T^0_{Y|X,A} \rangle}
&=
\left(
\frac{1_a(A)}{\pi_a(X)}Y
- \psi_a
\right)
-
\left(
\frac{1_a(A)}{\pi_a(X)}
\mu_a(X)
-
\psi_a
\right)
\\
\phi_{a\langle \mathcal T_{X} \rangle}
&= \mu_a(X) - \psi_a
\end{align*}$
To obtain these results I've just applied the identities in the section above and introduced the notation $\mu_a(X) = E[Y|A=a, X]$. The rest is just calculating conditional expectations. Please try this yourself and make sure it makes sense to you!
Now we sum to obtain $\phi_a^\dagger(Y,A,X)
=
\frac{1_a(A)}{\pi_a(X)}(Y - \mu_a(X))
+
(\mu_a(X) - \psi_a)$.
This proves the perhaps surprising fact that the canonical gradient for the average treatment effect in an observational study is the same as it is in a randomized trial if one is not willing to make any distributional assumptions about the datagenerating mechanisms (aside from the treatment assignment in the RCT).
The important difference you should notice, however, is that this canonical gradient is the only gradient in the nonparametric observational study model, whereas for RCTs there is a whole space of gradients. That has the immediate implication that there are many RAL estimators one could use for an RCT (only one of which is efficient), but there is only one valid RAL estimator (up to asymptotic equivalence) in the observational setting.
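This difference is easy to see in simulation. The sketch below (made-up DGP; the outcome regression is assumed known for simplicity, whereas a real estimator would have to estimate it) compares the plain IPW estimator with the one built from the canonical gradient.

```python
import numpy as np

# In the RCT model there are many influence functions, hence many RAL
# estimators of psi_1. Compare the plain IPW estimator with the estimator
# built from the canonical gradient derived above.
rng = np.random.default_rng(4)
n, pi1, reps = 20_000, 0.5, 200
ipw, eff = [], []
for _ in range(reps):
    X = rng.normal(size=n)
    A = rng.binomial(1, pi1, size=n)
    Y = 1.0 + 2.0 * X + A + rng.normal(size=n)
    mu1 = 2.0 + 2.0 * X                  # true E[Y | A=1, X] in this DGP
    ipw.append(np.mean(A * Y / pi1))
    eff.append(np.mean(A / pi1 * (Y - mu1) + mu1))
print(np.var(ipw) > np.var(eff))         # the canonical-gradient estimator
                                         # has smaller variance
```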
Alternative proof exploiting orthogonal decomposition of tangent space
What we'll do instead of finding something to project is actually quite clever and leverages the fact that we've already figured out the canonical gradient for the RCT model. Let's start with some facts that we know: (1) we can write any score in $\mathcal M_{\text{obs}}$ as a unique orthogonal sum $h
=
h_{\langle \mathcal T_{RCT}\rangle}
+
h_{\langle \mathcal T^0_{A|X}\rangle}$ since those two tangent subspaces are orthogonal, and (2) $\phi^\dagger_{a,\text{RCT}} \perp h_{\langle \mathcal T^0_{A|X}\rangle}$ because the canonical gradient in the RCT model is in the RCT tangent space, which is orthogonal to $\mathcal T^0_{A|X}$.
Now let's calculate the pathwise derivative of $\psi_a$ in the direction $h
=
h_{\langle \mathcal T_{RCT}\rangle}
+
h_{\langle \mathcal T^0_{A|X}\rangle}$ through the observational study model. Our hope is that we can somehow end up writing the pathwise derivative as an inner product between $h$ and some function, which must then be the canonical gradient.
Because the pathwise derivative is a linear function of the score, we can break it up as follows:
$\nabla_h \psi_a
=\nabla_{h_{\langle \mathcal T_{RCT}\rangle}} \psi_a
+
\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a$
The first term is nothing but the pathwise derivative of $\psi_a$ in the RCT model, for which we already have the canonical gradient $\phi^\dagger_\text{RCT}$. So we can represent that term as $E[h_{\langle \mathcal T_{RCT}\rangle}\phi^\dagger_\text{RCT}]$. But by our fact (2), this is equivalent to $E[h \phi^\dagger_\text{RCT}]$. Adding $h_{\langle \mathcal T^0_{A|X}\rangle}$ inside the expectation does nothing because it's orthogonal to the RCT canonical gradient. So now we've got
$\nabla_h \psi_a
=E[h \phi^\dagger_\text{RCT}]+
\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a$
Now we'll brute-force calculate the second term:
$\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a
=
\lim_{\epsilon\rightarrow 0}
\frac{\tilde P_{\epsilon}[Y|A=a] - P[Y|A=a]}{\epsilon}$
Let's drop the $\epsilon$ subscripts for the moment. We also have that
$\begin{align*}
\tilde p
&=
(1+\epsilon h_{\langle \mathcal T^0_{A|X}\rangle}(A,X))p
\\
&=
(1+\epsilon h_{\langle \mathcal T^0_{A|X} \rangle}(A,X))
\left(p_{Y|A,X} \times p_{A|X} \times p_{X} \right)
\\
&=
\underbrace{
p_{Y|A,X}
}_{\tilde p_{Y|A,X}}
\times
\underbrace{
(1+\epsilon h_{\langle \mathcal T^0_{A|X} \rangle}(A,X))p_{A|X}
}_{\tilde p_{A|X}}
\times
\underbrace{
p_{X}
}_{\tilde p_X}
\end{align*}$
Because $h_{\langle\mathcal T_{A|X}^0\rangle}$ is in the tangent space of the densities $p_{A|X}$, the only way to distribute that term to get three legal densities of the form above (that factorize $\tilde p$) is to put the $h_{\langle\mathcal T_{A|X}^0\rangle}$ term into $\tilde p_{A|X}$. What this shows is that walking along any paths that are in $\mathcal T^0_{A|X}$ actually doesn't change $p_{Y|A,X}$ or $p_X$. The fluctuated density has the same marginal density for the covariates, and the same conditional density for the outcome given treatment and covariates. The only thing walking along paths in $\mathcal T^0_{A|X}$ can change is the treatment assignment mechanism. This should make sense to you because these are exactly the paths we are forbidden to walk if we're constrained to the RCT model, in which we must hold the treatment assignment mechanism constant but are allowed to change anything else.
Going back to the expression for the pathwise gradient, we can show using iterated expectations that $\tilde P_{\epsilon}[Y|A=a] = \int \tilde \mu_a(X)d\tilde P_X$ where $\tilde \mu_a(X) = \tilde P_\epsilon[Y|A=a,X]$. However, by the argument above, $\tilde \mu_a = \mu_a$ and $d\tilde P_X = dP_X$ so, in fact, $\tilde P_{\epsilon}[Y|A=a] = P[Y|A=a]$. In words, we've shown that moving along paths in $\mathcal T^0_{A|X}$ does nothing to the parameter $\psi_a$. Therefore $\nabla_{h_{\langle \mathcal T^0_{A|X}\rangle}} \psi_a = 0$ and we can plug that into our calculation of the pathwise derivative for general $h$:
$\nabla_h \psi_a
=E[h \phi^\dagger_\text{RCT}]+
0$
Since we've succeeded in expressing the pathwise derivative of this parameter along any path $h$ as an inner product between $h$ and a function $\phi^\dagger_\text{RCT}$ we have that this is in fact the canonical gradient of the conditional mean $\psi_a$ in the nonparametric observational study model $\mathcal M_{\text{obs}}$.
Other Methods
Combined with gradient algebra, point mass contamination and projection are two very useful strategies for deriving efficient influence functions.
Unfortunately, there are some rare cases in which they don’t work. For example, if the tangent space factors, but not orthogonally, it becomes much more difficult to find the projection (though methods do exist for doing this). Another common approach is to find the influence function in the full-data (causal) model and then figure out a way to map it to the observed-data (statistical) model (see Tsiatis 2006 for details). Also, in rare cases, the tangent set isn't even a full space (i.e. instead of being a hyperplane, it's some sort of "triangle" in that hyperplane) and this can also pose difficulties.
Nuisance Tangent Space
In many texts and resources you'll find mention of something called the "nuisance tangent space". This is a tool that is sometimes helpful in characterizing the set of influence functions, but it is by no means always necessary. Indeed, we don't use it in any of the examples in this chapter. Historically it played a much larger role in efficiency theory, which is why you'll see it mentioned a lot in the literature.
The definition of this space that you'll see most often relies on a semiparametric construction of the model, where every distribution is assumed to be uniquely described by a finite-dimensional vector of parameters $\psi$ that are of interest and some infinite-dimensional vector of parameters $\eta$ that are not of interest. For example, you might consider a model like $Y = A\psi + \mathcal N(\eta_1(X), \eta_2)$. When you have this kind of construction, you can calculate scores for each parameter and define the nuisance tangent space as the completed span of the scores for $\eta$.
I think this construction is sort of artificial. The definition I like more is that the nuisance tangent space is the completed span of all scores $h$ such that $\nabla_h \psi = 0$. This more general definition is due to Mark van der Laan. The nuisance tangent space is usually denoted $\mathcal T_\eta$ or $\Lambda$, which is a subset of $\mathcal T$. It's immediate from our definition that any influence function is orthogonal to any element of $\Lambda$. If we denote the orthogonal complement of $\Lambda$ using the notation $\Lambda^\perp$, then we have that $\phi \in \Lambda^\perp$. Knowing this is sometimes useful in deriving the efficient influence function, but we don't use that fact in any of the examples in this book.
None of this is anything you should worry about unless you plan to work on esoteric new parameters and model spaces for which the canonical gradient has not yet been derived. You probably won’t come up against anything that requires tools beyond what’s in this section, but your mileage may vary!