Now that we’ve gotten rid of all the estimators we don’t want (i.e. those that aren’t RAL), we need to figure out how to sort through all the remaining estimators we might have at our disposal:
So let's do just that. By imposing asymptotic linearity, we've actually made our job super easy. As $n$ gets bigger and bigger, we know that the sampling distribution of any RAL estimator $\hat\psi_k$ minus truth (scaled by root $n$) goes to a normal with mean zero and variance given by $V[\phi_k(Z)]$ where $\phi_k$ is the influence function for that particular estimator. That means that the only difference between any two RAL estimators (in large enough samples) is that they have sampling variances of different magnitude. Obviously we would prefer an estimator with less variance because that will give us smaller (but still valid) confidence intervals and p-values. In other words, given the same data, we are more certain of our estimate if we use an estimator that has a smaller sampling variance.
The implication is that, if given the choice, we want to pick an estimator with an influence function that has the smallest possible variance.
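To make this concrete, here's a small simulation sketch (assuming normally distributed data; the sample size, number of replications, and seed are arbitrary). The sample mean and the sample median are both RAL estimators of the center of a normal distribution, but the median's influence function, $\text{sign}(z-\mu)/(2f(\mu))$, has variance $\pi/2 \approx 1.57$ times that of the mean's influence function $z - \mu$, so its sampling distribution is wider:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 4000

# Draw many samples from N(0, 1) and estimate the center two ways.
# Both estimators are RAL for the center of a normal distribution,
# but their influence functions have different variances.
samples = rng.normal(0.0, 1.0, size=(reps, n))
mean_est = samples.mean(axis=1)          # IF: z - mu,               V = sigma^2
median_est = np.median(samples, axis=1)  # IF: sign(z-mu)/(2 f(mu)), V = (pi/2) sigma^2

# Scaled sampling variances: n * Var(estimator) estimates V[phi(Z)].
var_mean = n * mean_est.var()
var_median = n * median_est.var()

print(var_mean)    # close to 1.0
print(var_median)  # close to pi/2 ~ 1.57
```

With the same data, confidence intervals built from the median would therefore be about $\sqrt{\pi/2} \approx 1.25$ times wider than those built from the mean.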
However, instead of picking an estimator out of a lineup, why don't we make our own estimator that is guaranteed to beat anything that anyone else could come up with? Here is a strategy that will let us do that:
1. Figure out the set of all possible influence functions: $\Phi = \{\phi: \phi \text{ is an IF for some RAL estimator}\}$
2. Pick the one that has the smallest variance: $\phi^\dagger = \argmin_{\phi \in \Phi} V[\phi]$
3. Build an estimator that has that influence function.
This section will cover steps 1 and 2 of this strategy. There are a few different strategies to build an estimator that has the influence function with smallest variance and these will be the subject of the next chapter.
It's important to come away with an understanding of both the what and the why of what we're going to discuss. In practice, unless you're working with brand-new model spaces or parameters, you will never have to actually do any of what is described in this section. Nonetheless, it's critical to understand it so you have an idea of how the tools we have today were developed and why they work.
Characterizing the Set of Influence Functions
Not every possible function corresponds to a valid RAL estimator in a particular statistical model. For example, it would be great if we could find a RAL estimator with $\phi(Z) = 0$ because this estimator would have no variance at all in large samples. Clearly that's a pipe dream that's not going to happen in general, so we must conclude that $\phi = 0$ is not an influence function. For a given parameter and statistical model, what functions $\phi(Z)$ are influence functions of RAL estimators, and what functions aren't?
Using nothing but the definitions of regularity and asymptotic linearity, we arrive at the following result. For any RAL estimator with influence function $\phi$, the following holds for all scores $h$ corresponding to paths in the model:
$\lim_{\epsilon \rightarrow 0} \frac{\psi(\tilde P_\epsilon) - \psi(P)}{\epsilon} = E[\phi h]$
Before giving a proof, which sadly relies on some technical results, let's understand what this is even saying. The term on the left is what we call the pathwise derivative of $\psi$ at $P$ in the direction $h$ (evaluated at 0). From now on we'll abbreviate this with the notation $\nabla_h \psi(P)$. The term on the right is the covariance between $h(Z)$ and $\phi(Z)$ because both are mean-zero. So what the result above says is that
[Riesz representation, central identity for influence functions] If $\phi$ is an influence function for a RAL estimator of a parameter $\psi$ and $h$ is the score for any legal path at $P\in \mathcal M$, the $h$-direction pathwise derivative of $\psi$ is equal to the covariance of $h$ and $\phi$: $\nabla_h \psi = E[\phi h]$
This is a fascinating connection between what seem to be two very different things. The pathwise derivative describes how quickly the estimand (parameter) changes as we move along a particular path. The expectation of the influence function times the score effectively tells us what the angle is between the path's score and the estimator's influence function (see the section below on the space $\mathcal L_2$ if this confuses you). The left-hand side (pathwise derivative) depends on the parameter $\psi$, the direction $h$, and the true distribution $P$. The right-hand side (covariance of the influence function with the score) depends on the choice of estimator (which implies $\phi$), the direction $h$, and the true distribution $P$ (under which the covariance is taken).
This is cool in and of itself, but it's also the fundamental key to characterizing the set of all influence functions, and therefore the key link that holds everything we're talking about together.
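Before diving into the proof, the identity is easy to check numerically in a toy example. Here's a sketch (the discrete distribution, grid, and score below are made up for illustration) with the parameter $\psi(P) = E[Z]$, whose influence function is $\phi(z) = z - E[Z]$: a finite-difference pathwise derivative along the path $\tilde p_\epsilon = (1+\epsilon h)p$ agrees with the covariance $E[\phi h]$.

```python
import numpy as np

# A toy discrete distribution: Z takes values z with probabilities p.
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
p = np.array([0.1, 0.25, 0.3, 0.2, 0.15])

psi = lambda q: np.sum(z * q)  # parameter: psi(P) = E[Z]
phi = z - psi(p)               # its influence function: phi(z) = z - E[Z]

# A valid score: mean zero under P (centered so that E_P[h] = 0).
h = z**2 - np.sum(z**2 * p)

# Pathwise derivative via finite differences along p_eps = (1 + eps*h) p ...
eps = 1e-6
p_eps = (1 + eps * h) * p
pathwise = (psi(p_eps) - psi(p)) / eps

# ... matches the covariance E[phi * h] predicted by the identity.
cov = np.sum(phi * h * p)
print(pathwise, cov)
```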
😱 Proof that $\nabla_h \psi = E[h\phi]$
We'll start with our definition of a path and our definition of asymptotic linearity. Using just these definitions and some esoteric theorems, we'll amazingly be able to characterize how our estimator should behave as we move along the path towards $P$. That's a little surprising because asymptotic linearity is a property that only holds at $P$ and doesn't say anything explicit about behavior along paths. Nonetheless, we'll see that there is an implication for how the estimator behaves along paths. However, we've also assumed the estimator is regular, which is already a statement about how the estimator behaves along paths. In comparing the derived behavior from asymptotic linearity and the assumed behavior from regularity, we will see there is a difference. The difference in the two behaviors can only be made to go away (as it must if an estimator is both asymptotically linear and regular) if $\nabla_h \psi = E[h\phi]$, so that's what we conclude.
In proving that asymptotic linearity and our definition of a path by themselves imply some behavior of the estimator along such a path we'll have to use two advanced, technical results. Unfortunately I haven't found an alternative way to prove this that is both rigorous and intuitive. It's not at all impossible to understand these theorems (just read the appropriate sections of vdV 1998) but if you're not interested in the math for its own sake it's probably not worth your time. In terms of understanding the arc of the proof you just need to grasp how we've used these tools to get what we need, not how they work.
Throughout I'll abbreviate $\hat\psi_n =\hat\psi(\mathbb P_n)$ (the estimate as we draw increasing samples from $P$) and $\psi = \psi(P)$ (the true parameter at $P$) to keep the notation light. I'll also abbreviate $\hat{\tilde\psi}_n =\hat\psi(\tilde{\mathbb P}_{1/\sqrt n})$, which is the estimate as we draw increasing samples from distributions moving along our path closer to $P$, and $\tilde\psi_n = \psi(\tilde P_{1/\sqrt n})$, which are the true parameter values at each of these distributions. Anything that has a hat is an estimate, anything that has a tilde is along a path.
Alright, go time. Let's see what we can say about how our estimator behaves along sequences of distributions like $\tilde P_n = \tilde P_{\epsilon = 1/\sqrt{n}}$, since these are the only ones regularity has anything to say about. By theorem 7.2 of vdV 1998 the fact that our paths are differentiable in quadratic mean ensures that we get something called local asymptotic normality 🤷:
$\log
\frac{
d\tilde P_{1/\sqrt n}^n
}{
dP^n
}(Z_1, \dots, Z_n)
=
\frac{1}{\sqrt n}\sum_{i=1}^n h(Z_i)
- \frac{1}{2}E[h(Z)^2] + o_P(1)$
It really doesn't matter what this means because we're just going to use it to satisfy a particular technical condition in a minute. The important part is to see that we've defined some random variable on the left: the derivative-looking thing is just some fixed function of the data, so let's rename it as some random variable $T_n$. What the right-hand side says is that in large-enough samples, this $T_n$ thing looks like a sum of IID variables plus some constant. We combine this with the assumed asymptotic linearity of our estimator:
$\sqrt{n}(\hat\psi(\mathbb P_n) - \psi(P)) = \frac{1}{\sqrt n}\sum_{i=1}^n\phi(Z_i) + o_P(1)$
Which is also a random variable that's approximately equal to an IID sum when $n$ gets big. Here we stack the two above equations on top of each other and by the central limit theorem we have that
$\left[
\begin{array}{c}
T_n \\ \sqrt n (\hat\psi_n - \psi)
\end{array}
\right]
\overset{P}\rightsquigarrow
\mathcal N
\left(
\left[
\begin{array}{c}
-\frac{1}{2}P[h^2] \\ 0
\end{array}
\right]
,
\left[
\begin{array}{cc}
P[h^2] & P[h\phi] \\
P[h\phi] & P[\phi^2] \\
\end{array}
\right]
\right)$
Now we use another technical result 🧐 (Le Cam's third lemma; see vdV 1998 ex. 6.7), which says that when we have exactly the situation above, we can infer
$\begin{align*}
\sqrt n \left(\hat{\tilde \psi}_n - \psi \right)
&\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow}
\mathcal N\left(P[h \phi], P[\phi^2]\right)
\end{align*}$
Again it's not important to understand the technical device. The idea is that we're trying to say something about how our estimator behaves as we change the distribution along our path of interest. At first this seems impossible because asymptotic normality only holds at each $P$ and doesn't say anything about what happens along paths. However, with the help of these technical devices, we've actually managed to say something about the difference between the estimate as we change the underlying distribution and the truth at $P$. In particular, this difference converges to a normal with the same variance as if we had not been moving along the path towards $P$ but had instead sat still at $P$. Crazy. The limiting normal distribution now has some mean $P[h\phi]$, which we'll pull out in the course of some algebraic manipulation during which we also add and subtract $\sqrt n \tilde\psi_n$ on the left:
$\begin{align*}
\sqrt n \left(\hat{\tilde \psi}_n - \psi \right)
&\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow}
\mathcal N\left(P[h \phi], P[\phi^2]\right)
\\
-P[h \phi]
+
\sqrt n
\left(
\hat{\tilde \psi}_n - \tilde\psi_n + \tilde\psi_n - \psi
\right)
&\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow}
\mathcal N
\left(
0, P[\phi^2]
\right)
\\
\end{align*}$
Moving terms around, we arrive at
$\left[
\sqrt n \left(
\tilde\psi_n - \psi
\right)
-
P[h \phi]
\right]
+
\sqrt n
\left(
\hat{\tilde \psi}_n - \tilde\psi_n
\right)
\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow}
\mathcal N
\left(
0, P[\phi^2]
\right)
\tag{AL}$
Notice that the term in brackets on the left is a constant for each $n$, i.e. it's not random.
Now, finally, we recall our definition of regularity, which was
$\sqrt n
\left(
\hat{\tilde \psi}_n - \tilde\psi_n
\right)
\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow}
\mathcal N
\left(
0, P[\phi^2]
\right)
\tag{R}$
An estimator that is asymptotically linear must satisfy the behavior in the display above denoted (AL). An estimator that is regular must satisfy the behavior in the display denoted (R). A RAL estimator must satisfy both. But the only way that both of those can be true is if they are actually saying the same thing, which only happens if the bracketed term in (AL) goes to zero. Thus, for RAL estimators,
$\begin{align*}
\lim_{n\rightarrow \infty}\sqrt n
\left(
\tilde\psi_n - \psi
\right)
&=
P[h \phi]
\\
\lim_{\epsilon_n \rightarrow 0}
\frac{
\tilde\psi_n - \psi
}{
\epsilon_n
}
&=
P[h \phi]
\\
\nabla_h \psi
&=
P[h \phi]
\end{align*}$
where we've recalled $\epsilon_n = 1/\sqrt{n}$ from our original definition of our sequence along the path.
The key consequence is this: since the left-hand side of this identity doesn't depend on the influence function $\phi$, the same identity holds (keeping $h$ and $P$ fixed) for any two estimators with influence functions $\phi_1$, $\phi_2$ (if two such estimators exist). Specifically: $E[\phi_1 h] = \nabla_h \psi = E[\phi_2 h]$. Eliminating the middleman, we get $E[(\phi_1 - \phi_2)h] = 0$. Therefore the difference of the influence functions of any two RAL estimators is orthogonal to every score! If you don't understand why this is the implication, you should review the section on $\mathcal L_2$ space. Moreover, if we take any function $h^\perp$ that is orthogonal to all scores and add it to the influence function from a RAL estimator, the result $\phi + h^\perp$ still satisfies the above requirement, so it is an influence function for a different RAL estimator.
The Space $\mathcal L_2(P)$
We can treat any score $h$ and any influence function $\phi$ as an element of a set we call $\mathcal L_2(P)$ (or just $\mathcal L_2$ when the measure $P$ is clear). This is the set of all functions $f(Z)$ which satisfy the condition $\int f(Z)^2 dP < \infty$ (i.e. $f(Z)$ is a random variable with finite variance). This space is a lot like the vector space $\mathbb R^p$ in many important ways:
$\mathcal L_2$ is a linear space: $f,g \in \mathcal L_2, \alpha \in \mathbb R \implies f + \alpha g \in \mathcal L_2$
$\mathcal L_2$ has an inner product: in $\mathbb R^p$, the inner product between two vectors $x$ and $y$ is given by $\sum x_i y_i$. The equivalent in $\mathcal L_2$ for two functions $f$ and $g$ is $\int fg \, dP = E[fg]$. Note how the integral of a product of two functions is a lot like the sum of the elementwise product of vectors. It happens that the two operations obey all the same rules that define something as an "inner product". When we don't have a specific space in mind, we usually write the inner product between two elements as $\langle u, v \rangle$. If two elements have an inner product equal to 0 we say that the two are orthogonal. This generalizes the notion of two vectors being at a right angle in $\mathbb R^p$. More generally, the inner product also defines what the "angle" is between two vectors: $\theta_{u,v} = \cos^{-1}\left(\frac{\langle u, v \rangle}{\|u\|\|v\|}\right)$, where $\|u\| = \sqrt{\langle u,u \rangle}$ is the norm of $u$ in this space.
$\mathcal L_2$ is complete. This is a technical term that means that the space contains all of its limit points (i.e. a sequence of convergent elements in the space can't converge to a limit that is outside the space).
Together, these conditions are the definition of something called a Hilbert space. Indeed, $\mathbb R^p$ and $\mathcal L_2$ are the usual examples of Hilbert spaces. The main difference between the two is that $\mathbb R^p$ is finite-dimensional, whereas $\mathcal L_2$ is infinite-dimensional. What do I mean by this? Well, a vector in, say, $\mathbb R^3$ clearly has 3 components $[x_1, x_2, x_3]$. A "vector" in $\mathcal L_2$ has one component for every value its function can take as an argument, because you can think of a function $f:\mathbb R \rightarrow \mathbb R$ like this: $f = [\dots f(-200) \dots f(-0.1) \dots f(0) \dots f(1) \dots f(1.5) \dots ]$. The only difference is that I put the "vector index" in parentheses instead of as a subscript and now I call it an "argument". We also now allow for indices that are anywhere in the domain of $f$ instead of just the integers $1,2,3$. Alternatively, we can think of the vector $x$ as a function that maps $i \in \{1,2,3\} \mapsto x_i$. So vectors, functions, whatever. It's kind of the same thing in most of the ways that matter!
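If it helps, inner products and angles in $\mathcal L_2(P)$ are easy to approximate by Monte Carlo. A sketch (taking $P = \mathcal N(0,1)$ and two arbitrary test functions, chosen here because they happen to be orthogonal under the standard normal):

```python
import numpy as np

# Monte Carlo inner products in L2(P) with P = N(0, 1).
rng = np.random.default_rng(1)
z = rng.normal(size=500_000)

def inner(f, g):
    return np.mean(f(z) * g(z))  # <f, g> = E[f(Z) g(Z)]

f = lambda t: t           # first Hermite polynomial
g = lambda t: t**2 - 1    # second Hermite polynomial

# Orthogonality: <f, g> = E[Z^3 - Z] = 0 under the standard normal.
print(inner(f, g))        # ~ 0

# Angle between two non-orthogonal elements, f and f + g:
s = lambda t: f(t) + g(t)
cos_angle = inner(f, s) / np.sqrt(inner(f, f) * inner(s, s))
angle_deg = np.degrees(np.arccos(cos_angle))
print(angle_deg)
```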
Tangent Spaces
If influence functions are orthogonal to every score $h$, they must also be orthogonal to any linear combination of scores, or any limit of a sequence thereof. So it makes everything more concise if we start talking about the tangent space $\mathcal T$ (which is exactly the set of all scores, their linear combinations, and limits) instead of having to continually refer to "all scores". Since the tangent space comes from the set of scores, it has nothing to do with either the estimator or the parameter. It depends purely on the statistical model and the true distribution $P$.
Why is it called the tangent space?
Honestly, I don't think it's the greatest name, but let's explain it.
The key is to 1) think about each point $P \in \mathcal M$ as having density $p$ w.r.t. some dominating measure $P\degree$ and 2) realize that $\sqrt p \in \mathcal L_2(P\degree)$. Because $p$ has to integrate to 1, the norm of $\sqrt{p}$ (that is, $\sqrt{E_{P\degree}[\sqrt p \sqrt p]}$) is of course 1. So we can identify $\mathcal M$ with some subset of the unit ball in $\mathcal L_2(P\degree)$. It's now easy to show that $\sqrt p$ is orthogonal to $h \sqrt p$:
$\int \sqrt p (\sqrt p h) dP\degree = \int h p dP\degree = \int hdP = 0$
so the set of functions $h\sqrt p$, ranging over scores $h$, is orthogonal (tangent) to the point $\sqrt p$ in the model space. Or, equivalently, $p$ is orthogonal to all scores $h$ in the space $\mathcal L_2(P\degree)$.
One nice thing that this picture shows is that the tangent space clearly depends on where $P$ is within the model.
Nonetheless, it'd probably be more informative to call it the "score space", since the fact that the scores are tangent to the density $p$ in $\mathcal L_2(P\degree)$ doesn't seem to matter that much in terms of the role the space ends up playing in the theory we're developing. Alas, we're stuck with the names we have.
Saturated vs. Non-Saturated Models
Sometimes the tangent space is all of $\mathcal L_2^0$. This happens if we put no restrictions on the distributions in our statistical model. If any distribution can be in the model, then starting at any point $P$, any function $h$ that has zero mean and finite variance defines a valid path at $P$ because $\tilde p_\epsilon = (1+\epsilon h)p$ will still be a density for any such score $h$.
When this happens, we say that the model is nonparametric saturated, or just saturated (at $P$). Intuitively, this means that we have a model space such that, standing at $P$, we can move in any direction and still stay inside the model. If this is not the case, then we say the model is not saturated (at $P$).
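A quick numerical sketch of why saturation holds (using an arbitrary, randomly generated discrete density): any mean-zero $h$, scaled so that $1 + \epsilon h > 0$, perturbs $p$ into something that is still a density.

```python
import numpy as np

# In a saturated model, any bounded mean-zero h gives a valid path:
# p_eps = (1 + eps*h) * p is still a density for small enough eps.
rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(10))   # an arbitrary discrete density

h = rng.normal(size=10)
h = h - np.sum(h * p)            # center so E_P[h] = 0

eps = 0.5 / np.max(np.abs(h))    # keep 1 + eps*h > 0
p_eps = (1 + eps * h) * p

print(np.sum(p_eps))             # sums to 1: still a density
print(np.all(p_eps > 0))         # nonnegative: still a density
```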
Factorizing Tangent Spaces
While we're on the subject of tangent spaces, it turns out that if we can factorize our distribution $P(Y, X) = P(Y|X)P(X)$ then all scores $h$ end up being the sum of scores for each factor, treating these as living in statistical models of their own, i.e. $P(Y|X) \in \mathcal M_{Y|X}$. Moreover, the scores for each factor end up being orthogonal, i.e. $h = h_X(x) + h_{Y|X}(x, y)$ and $h_X \perp h_{Y|X}$. Lastly, these scores satisfy $E[h_X] = 0$ and $E[h_{Y|X}|X]=0$. A short proof of all of this is given below. The argument also generalizes to densities that have more than two factors (just factor one of the two factors again).
Proof
Recall that $h = \frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0}$. For $\epsilon$ small enough $\tilde p_\epsilon$ is still in the model, so we can factor it into $\tilde p_{Z_2|Z_1} \tilde p_{Z_1}$ (omitting the $\epsilon$ subscripts). The log of a product is the sum of logs and the derivative is linear over a sum, so we get $h(Z_1, Z_2)
=
\underbrace{
\frac{d}{d\epsilon} \log \tilde p_{Z_2|Z_1} \big|_{\epsilon=0}
}_{h_2(Z_1, Z_2)}
+
\underbrace{
\frac{d}{d\epsilon} \log \tilde p_{Z_1} \big|_{\epsilon=0}
}_{h_1(Z_1)}$
Now we'd like to show that $h_2 \perp h_1$ in $\mathcal L_2(P)$.
For that we have to notice that $h_2$ is a score at $p_{Z_2|Z_1}$ in the nonparametric model $\mathcal M_{Z_2|Z_1} = \{\text{all densities of } Z_2 \text{ given } Z_1\}$ (by definition). If we start at $p_{Z_2|Z_1}$ and move along a path defined by $h_2$, the only way we stay within $\mathcal M_{Z_2|Z_1}$ is if $E[h_2|Z_1] = 0$. Otherwise the resulting perturbation will not be a conditional density. Similarly, we need $E[h_1] = 0$.
Now $E[h_1 h_2] = E[E[h_1 h_2|Z_1]] = E[h_1(Z_1) E[h_2|Z_1]] = E[h_1 \cdot 0] = 0$ and we've shown $h_1$ and $h_2$ are orthogonal.
We can also go the other way: if I propose $h_1$ and $h_2$ that satisfy the above, then $h = h_1 + h_2$ must be a valid score at $P$ in the original model.
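The whole argument can be checked numerically on a discrete joint distribution. In this sketch (grid sizes and randomness are arbitrary) we build $h_1(Z_1)$ with $E[h_1]=0$ and $h_2(Z_1,Z_2)$ with $E[h_2|Z_1]=0$, and confirm the two scores are orthogonal:

```python
import numpy as np

# Discrete joint distribution of (Z1, Z2) on a 3 x 4 grid.
rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(12)).reshape(3, 4)  # p[i, j] = P(Z1 = i, Z2 = j)
p1 = p.sum(axis=1)                            # marginal of Z1

# h1 depends only on Z1, centered so E[h1] = 0.
h1 = rng.normal(size=3)
h1 = h1 - np.sum(h1 * p1)

# h2 depends on both, centered so E[h2 | Z1 = i] = 0 for every i.
h2 = rng.normal(size=(3, 4))
cond = (h2 * p).sum(axis=1) / p1              # E[h2 | Z1 = i]
h2 = h2 - cond[:, None]

# The two scores are orthogonal in L2(P): E[h1 * h2] = 0.
print(np.sum(h1[:, None] * h2 * p))           # ~ 0

# And the total score h = h1 + h2 is mean-zero, as a score must be.
print(np.sum((h1[:, None] + h2) * p))         # ~ 0
```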
Variational Independence
We showed that the tangent space can be broken up into an orthogonal sum when every distribution in the model factors and those factors can vary independently in their own model spaces. There are cases where this breaks down, though. If you look at the following picture, you'll surmise that all the scores at $P$ (blue arrows) can indeed be created as orthogonal sums of scores from $\mathcal M_{Z_2|Z_1}$ and $\mathcal M_{Z_1}$. However, there are arrows that can be constructed the same way (grey, dotted) that take us outside of the model space: these are not scores because they don't correspond to a legal path inside the model. Therefore in this case $\mathcal T \subset \mathcal T_1 \oplus \mathcal T_2$, but we don't have equality. When this happens, we say that our submodels are not variationally independent. In other words, there are some points in the model space where we can't legally make a change in one of the submodels without having to make a change in another (i.e. at $P$ in the picture we can't go just up; we also have to move a bit to the right along the diagonal).
When the submodels are variationally independent, though (as is usually the case), we can construct tangent spaces for each factor separately and the tangent space for the whole model is the orthogonal sum of the component tangent spaces (we write this $\mathcal T = \mathcal T_1 \oplus \mathcal T_2$).
Consider what happens if we perturb a density $p$ along a path defined by a score $h_X$ that satisfies $h_X(y,x) = h_X(x)$ and $E[h_X] = 0$. We know that $\tilde p_\epsilon(y,x) = (1+\epsilon h_X)p(y,x)$, but what can we say about the resulting factors $\tilde p_\epsilon(y|x)$ and $\tilde p_\epsilon(x)$? A bit of algebra using the properties of $h_X$ and the definition of the conditional density shows that $\tilde p_\epsilon(y|x) = p(y|x)$ and also $\tilde p_\epsilon(x) = (1+\epsilon h_X)p(x)$. In other words, moving along a path in $\mathcal M_X$ only affects the factor $p(x)$. Similarly, if you repeat this exercise with a score $h_{Y|X}$ satisfying $E[h_{Y|X}|X] = 0$ you get that $\tilde p_\epsilon(y|x) = (1+\epsilon h_{Y|X})p(y|x)$ and also $\tilde p_\epsilon(x) = p(x)$. Thus moving along a path in $\mathcal M_{Y|X}$ only affects the factor $p(y|x)$. This should make some intuitive sense to you, and, naturally, everything generalizes cleanly when there are more than two factors.
This is useful when we evaluate the directional derivative $\nabla_h \psi$. If $h$ is the sum of two other functions we can always write $\nabla_h\psi = \nabla_{h_X} \psi + \nabla_{h_{Y|X}}\psi$ because the derivative is a linear operator in the score. The resulting terms $\nabla_{h_X} \psi$ and $\nabla_{h_{Y|X}}\psi$ are now easier to evaluate because for each of them we just need to consider a single perturbation in $p(x)$ or in $p(y|x)$, respectively. We’ll see this come in handy in the next section of this chapter.
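Here is a sketch of the perturbation claim on a discrete joint density (randomly generated, for illustration): fluctuating along a score that depends only on $x$ moves the marginal $p(x)$ but leaves the conditional $p(y|x)$ untouched.

```python
import numpy as np

# Joint density p(y, x) on a grid; rows index y, columns index x.
rng = np.random.default_rng(4)
p = rng.dirichlet(np.ones(12)).reshape(4, 3)
px = p.sum(axis=0)                   # marginal p(x)

# A score that depends only on x, centered so E[h_X] = 0.
hX = rng.normal(size=3)
hX = hX - np.sum(hX * px)

eps = 0.1 / np.max(np.abs(hX))
p_eps = (1 + eps * hX[None, :]) * p  # perturb the joint density

# The conditional p(y | x) is unchanged; only the marginal p(x) moves.
cond = p / px[None, :]
cond_eps = p_eps / p_eps.sum(axis=0)[None, :]
print(np.allclose(cond, cond_eps))                          # True
print(np.allclose(p_eps.sum(axis=0), (1 + eps * hX) * px))  # True
```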
Example: tangent space for a randomized controlled trial
Let's give a concrete example of a tangent space at $P$ for some particular statistical model. For our model, we'll use the space $\mathcal M_{\text{RCT($\pi$)}}$ where RCT($\pi$) stands for "randomized controlled trial" with treatment mechanism $\pi$. This model is characterized by a joint distribution between some vector of observed covariates $X$, a binary treatment $A$, and an outcome $Y$. Any distribution of three variables can always factor as $P(Y,A,X) = P(Y|A,X)P(A|X)P(X)$. The treatment mechanism is defined as $\pi(X) = P(A=1|X)$, the probability of receiving treatment given observed covariates $X$.
What makes the RCT model space different from the more general nonparametric space where each of these three distributions can be anything is that in the RCT the distribution of $A|X$ is fixed. Specifically, we know $P(A=1|X) = \pi(X)$. For example, in a trial with a simple 50:50 randomization we have $\pi(X) = 0.5$.
So what are the possible scores? The density factors into three components, only two of which can actually vary. Thus any score will be the sum of two orthogonal components, one of which corresponds to $p_{X} \in \mathcal M_X$ and one of which corresponds to $p_{Y|A,X} \in \mathcal M_{Y|A,X}$ (recall $p_{A|X}$ is fixed so we can't move in that "direction").
Now we're basically done. Since there are no more restrictions on our model, our tangent space is the orthogonal sum of the two tangent subspaces. Let $\mathcal L_2^0$ be the subspace of all zero-mean functions in $\mathcal L_2$ (recall all scores and all influence functions have mean zero). Now, formally:
$\begin{align*}
\mathcal T_X
&=
\left\{
h :
\begin{split}
&E[h]=0 \\
&h(Y,A,X) = h(X)
\end{split}
\right\}
\\
\mathcal T_{Y|A,X}
&=
\left\{
h :
E[h|A,X]=0
\right\}
\\
\mathcal T_{\text{RCT}}
&=
\mathcal T_X \oplus \mathcal T_{Y|A,X}
\end{align*}$
Can you think of a function that is in $\mathcal L_2^0$ but not in $\mathcal T_{\text{RCT}}$?
Example: tangent space for an observational study
The difference between the observational study model and the RCT model is that now we don't know the treatment mechanism $P(A|X)$. This model (we'll call it $\mathcal M_\text{obs}$) therefore contains $\mathcal M_{\text{RCT($\pi$)}}$ for any treatment mechanism $\pi$.
Since this model is larger than $\mathcal M_{\text{RCT}}$, it should make some intuitive sense to you that the tangent space is bigger too. This is because at any point $P$ in the model, we have more directions we can move in than we previously did. Specifically, we can now move in directions $h$ that end up changing the treatment mechanism. To be specific, we can have paths through $\mathcal M_{\text{obs}}$ (specified by some $h$) such that $\tilde P_\epsilon(A|X) \ne P(A|X) = \pi(X)$. But this same $h$ cannot define a path through $\mathcal M_{\text{RCT}(\pi)}$ because for the fluctuated distribution $\tilde P_\epsilon$ to remain inside of $\mathcal M_{\text{RCT}(\pi)}$ we have to require that $\tilde P_\epsilon(A|X) = \pi(X)$. Therefore for any $P$ with some treatment mechanism $\pi$, all paths through $P$ in $\mathcal M_{\text{RCT}(\pi)}$ are also paths through $P$ in $\mathcal M_\text{obs}$ but not vice-versa.
We can be even more specific about this. Since any density in $\mathcal M_\text{obs}$ factors as $P(Y,A,X) = P(Y|A,X)P(A|X)P(X)$ we immediately have that the tangent space is given by the orthogonal sum $\mathcal T_{\text{obs}} = \mathcal T_{Y|A,X}^0 \oplus \mathcal T^0_{A|X} \oplus \mathcal T_{X}$. Compare this to $\mathcal T_{\text{RCT}} = \mathcal T_{Y|A,X}^0 \oplus \mathcal T_{X}$ and you immediately see that scores in the observational study have an additional set of degrees of freedom that scores in the RCT don't have. Namely: all directions in $\mathcal T^0_{A|X}$. In fact, $\mathcal T_{\text{obs}} = \mathcal L_2^0$, which is to say that any mean-zero function with finite variance is a legal score for this model. That's because we've put literally no restrictions on what $P(Y,A,X)$ can be, so $\tilde p_\epsilon = (1+\epsilon h)p$ is a legitimate density in our model for any such $h$.
The Efficient Influence Function
Starting with nothing but the definition of a RAL estimator, we've shown that the set of influence functions of RAL estimators is (after shifting it to the origin) orthogonal to the tangent space. Since RAL estimators basically only differ by their variance (which is the variance of their influence functions), we get the best RAL estimator by finding one that has the influence function with the smallest variance. The point of everything we've done in the section above is that at least now we know the space we have to look in!
Thankfully, our characterization of the tangent space and the set of influence functions makes it easy to find the influence function with the smallest variance.
Pathwise Differentiability
There are combinations of parameters and models for which the pathwise derivative doesn't exist or isn't a bounded linear operator. For example, consider densities of $Y,X$ and let our parameter be $\psi(P) = E[Y|X=x_0]$ for a particular point of interest $x_0$. You can evaluate the derivative and show that $\nabla_h \psi(P) = \int y \, h(y,x_0) \, p_{Y|X}(y|x_0) \, dy$. The problem here is that I can always pick $h$ with norm 1 that takes huge values at $x=x_0$ (by compensating with large negative values at other $x$). So I can pick $h$ (in the unit ball) such that the integral blows up as large as I want. That's what "unbounded" means for a linear operator. The consequence is that the Riesz representation theorem no longer applies and we cannot guarantee that there is an element $\phi^\dagger$ that is in the tangent space. This also destroys the argument based on the difference of two influence functions of RAL estimators because we can't guarantee that even a single RAL estimator exists.
To find it, we can argue that there is some influence function that is itself in the tangent space, because we can always project any influence function $\phi$ onto $\mathcal T$ to get $\phi = \phi^\dagger + h^\perp$. By definition, $\phi^\dagger$ is in the tangent space, and by what we know about the relationship of $\Phi$ to the tangent space, $\phi^\dagger$ is the difference of an influence function and something that's orthogonal to all scores, meaning that it too must be an influence function. We can reach the same conclusion with the Riesz representation theorem: if the pathwise derivative, viewed as a function of the score $h$, is bounded and linear among scores in $\mathcal T$, then there is some element of $\mathcal T$ that satisfies our requirement of being an influence function for a RAL estimator. This is usually the case, although there are some parameters and models for which it isn't. But as long as we have pathwise differentiability, there is always exactly one influence function that is also in the tangent space.
Riesz Representation
Besides giving us a definition of orthogonality, Hilbert spaces are super nice to work in because of an important property called Riesz representability. The Riesz representation theorem says that if you have a bounded linear function $F$ that maps an element of the Hilbert space to a real number, then that function can always be represented as an inner product between the argument to the function and some other fixed element of the Hilbert space. In other words, for every bounded linear function $F$, there is some element $h_F$ so that $F(h) = \langle h_F, h \rangle$.
This is a very surprising result at first but it's pretty easy to convince yourself of it in $\mathbb R^p$. Consider vectors $\vec x, \vec y \in \mathbb R^p$ (I'll use the vector arrow $\vec x$ in this section to be very explicit). The definition of a linear function $F: \mathbb R^p \rightarrow \mathbb R$ is that $F(\vec x + \alpha \vec y) = F(\vec x) + \alpha F(\vec y)$ for any scalar $\alpha$. Any linear function in this particular space is bounded so we don't need to worry about what that means. To show that $F(\vec x)$ is actually just taking an inner product between $\vec x$ and some other vector we can apply $F$ to the unit vectors $\vec e_{1 \dots p}$ and see what it returns. Say $F(\vec e_i) = a_i$. Now we can express any vector as the weighted sum of the unit vectors $\vec x = \sum x_i\vec e_i$, and by linearity of $F$ notice now that $F(\vec x) = \sum x_iF(\vec e_i) = \sum x_i a_i = \langle \vec x, \vec a\rangle$. The conclusion is that the operation of the function $F$ on $\vec x$ is really just the inner product between $\vec x$ and some other vector $\vec a$, which is also in $\mathbb R^p$. The other direction is also obvious: given $\vec a$, define $F_{\vec a}(\vec x) = \langle \vec a, \vec x\rangle$ and since the inner product is a linear operation in both arguments, we have that the function is linear as desired.
What's amazing is that this property actually extends to any Hilbert space (i.e. a complete linear space with an inner product). In particular, if we have a bounded linear function $F$ that maps functions $f \in \mathcal L_2$ to real numbers, then we know there is some fixed function $g \in \mathcal L_2$ such that $F(f) = E[fg]$. If you're unfamiliar with functions that have functions as arguments you can think of $F$ as "assigning" a number to each function $f$.
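The finite-dimensional argument above translates directly into code. A sketch (the functional and its hidden representer are made up for illustration): evaluate $F$ on the unit vectors to recover the representer, then verify $F(\vec x) = \langle \vec a, \vec x\rangle$.

```python
import numpy as np

# A linear functional on R^5, handed to us only as a black box.
w = np.array([2.0, -1.0, 0.5, 3.0, 0.0])  # (secretly) its representer
F = lambda x: float(w @ x)

# Recover the Riesz representer by evaluating F on the unit vectors.
a = np.array([F(e) for e in np.eye(5)])

# Now F(x) = <a, x> for every x.
rng = np.random.default_rng(5)
x = rng.normal(size=5)
print(np.isclose(F(x), a @ x))  # True
print(np.allclose(a, w))        # True: we recovered the hidden representer
```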
Projection in Hilbert Space
Almost all Hilbert spaces contain other Hilbert spaces within them. For example, we can think of any plane in 3D space as a 2D space in its own right. If we add the restriction that the contained space must have the origin in it, we call it a subspace and a useful result follows. Namely: for every vector $v$ in the Hilbert space $\mathcal H$, there is always exactly one vector $v_*$ in the desired subspace $\mathcal H_*$ and a vector $v^\perp$ that's orthogonal to every vector in $\mathcal H_*$ such that $v = v_* + v^\perp$. Since the projection is unique, it's easy to check if $v_*$ is indeed a projection by simply checking (1) whether $v_* \in \mathcal H_*$ and (2) whether $v - v_* \perp \mathcal H_*$.
If a subspace is the orthogonal sum of two other subspaces, we can always obtain the projection by projecting onto each of the two component subspaces and then summing the results.
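Both of these facts are easy to verify in $\mathbb R^3$. The sketch below (hypothetical numbers, NumPy for the linear algebra) projects a vector onto the $xy$-plane, certifies the projection with the two checks above, and confirms that projecting onto each coordinate axis separately and summing gives the same answer.

```python
import numpy as np

v = np.array([3.0, -1.0, 2.0])

# Subspace H*: the plane through the origin spanned by e1 and e2.
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
A = np.column_stack([e1, e2])

# Project v onto H* by least squares (the closest point in the subspace).
coef, *_ = np.linalg.lstsq(A, v, rcond=None)
v_star = A @ coef
v_perp = v - v_star

# Certify the projection: (1) v* is in H* by construction, and
# (2) v - v* is orthogonal to everything in H*.
assert np.isclose(v_perp @ e1, 0.0) and np.isclose(v_perp @ e2, 0.0)

# H* is the orthogonal sum of span{e1} and span{e2}: projecting onto each
# piece separately and summing recovers the projection onto H*.
assert np.allclose((v @ e1) * e1 + (v @ e2) * e2, v_star)
```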
And, in fact, this unique influence function $\phi^\dagger$ in the tangent space is the one with the smallest variance. The proof of this is relatively simple: $\Phi$ and $\mathcal T$ are subsets of $\mathcal L_2^0$. In this space, the squared norm is $\|f\|^2 = E[f^2] = V[f]$ since by definition $f$ must have mean zero to be in $\mathcal L_2^0$. Therefore looking for the smallest-variance influence function is the same as looking for the influence function with the smallest norm in $\mathcal L_2^0$, or, equivalently, the point in $\Phi$ that's closest to the origin. We can write any point $\phi \in \Phi$ as the sum of the influence function that is in the tangent space plus some function that is orthogonal to the tangent set: $\phi = \phi^\dagger + h^\perp$. But since those two components are at a right angle, there's no way for the length of the "vector" $\phi$ to be less than the length of the "vector" $\phi^\dagger$ because the Pythagorean theorem says that $\|\phi\|^2 = \|\phi^\dagger\|^2 + \|h^\perp\|^2$.
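A toy simulation makes the Pythagorean argument concrete. The two components below are stand-ins (not derived from any real model): a mean-zero "efficient" piece and an independent, hence orthogonal, remainder. Any influence function built by adding an orthogonal remainder can only gain variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Toy stand-ins: a candidate efficient component and an orthogonal
# remainder, both mean-zero. Independence makes them orthogonal in L2^0.
u, w = rng.standard_normal((2, n))
phi_dagger = u                 # plays the role of the EIF
h_perp = 0.7 * w               # orthogonal to phi_dagger

phi = phi_dagger + h_perp      # some other (inefficient) influence function

# Pythagoras in L2^0: ||phi||^2 = ||phi_dagger||^2 + ||h_perp||^2,
# i.e. V[phi] = V[phi_dagger] + V[h_perp] >= V[phi_dagger].
assert np.isclose(np.var(phi), np.var(phi_dagger) + np.var(h_perp), atol=0.01)
assert np.var(phi) >= np.var(phi_dagger)
```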
Because $\phi^\dagger$ has the smallest variance of any influence function, and therefore any RAL estimator that has it will make the most efficient possible use of the data, we call $\phi^\dagger$ the efficient influence function (EIF; sometimes also referred to as efficient influence curve or EIC).
If we identify $\phi^\dagger$ as the unique element in the tangent space for which the pathwise derivative in a particular direction can be represented as the inner product between that direction and this element (i.e. the Riesz representation), then it also makes sense to call $\phi^\dagger$ the canonical gradient.
Why is it called the canonical gradient?
If we have a function $f:\mathbb R^p \rightarrow \mathbb R$ that maps vectors to real numbers, the directional derivative at $\vec x$ in the direction of a vector $\vec h$ is defined as $\nabla_{\vec h} f(\vec x) = \lim_{\epsilon \rightarrow 0} \frac{f(\vec x + \epsilon \vec h) - f(\vec x)}{\epsilon}$. However, it's well-known that we can also write this as the sum of the partial derivatives times the components of $\vec h$: $\nabla_{\vec h} f(\vec x) = \frac{\partial f(\vec x)}{\partial x_1} h_1 + \dots + \frac{\partial f(\vec x)}{\partial x_p} h_p$. In fact, the proof of this follows from noticing that the directional derivative is a linear operator in $\vec h$ and applying the Riesz representation theorem that we derived for finite-dimensional $\mathbb R^p$. Of course, this sum is the same as the inner product between $\vec h$ and the vector of partial derivatives, which we call the gradient $\nabla f(\vec x) = [\partial f(\vec x)/\partial x_1 \dots \partial f(\vec x)/\partial x_p]$. Thus $\nabla_{\vec h} f = \vec h \cdot \nabla f = \langle \vec h, \nabla f \rangle_{\mathbb R^p}$.
Now it should make sense why we refer to $\phi^\dagger$ as a gradient. It's exactly the same as $\nabla f$, except now we're in $\mathcal L_2^0$ instead of $\mathbb R^p$. Since we can write the directional derivative $\nabla_h \psi(P)$ as an inner product $E[h\phi^\dagger] = \langle h, \phi^\dagger \rangle_{\mathcal L_2}$, the object $\phi^\dagger$ is serving exactly the same role as the gradient $\nabla f$ is above. To reiterate:
$\begin{array}{ccc}
\nabla_{\vec h} f(\vec x) &=& \langle \vec h, \nabla f \rangle_{\mathbb R^p} \\
\underbrace{\nabla_h \psi(P)}_{\text{directional derivative}} &=& \underbrace{\langle h, \phi^\dagger \rangle_{\mathcal L_2(P)}}_{\text{inner product b/t direction and gradient}}
\end{array}$
We call $\phi^\dagger$ the canonical gradient because there are usually many functions in $\mathcal L_2$ that satisfy the above (namely: every other influence function). However, $\phi^\dagger$ is the only one that is in the tangent space, and is the only one that's guaranteed to exist according to the Riesz representation theorem. That's what makes it "canonical".
In the context we've described here it might make more intuitive sense to use the notation $\nabla \psi$ to represent the canonical gradient. However, we use $\phi^\dagger$ to make the connection to influence functions for RAL estimators. A gradient (which depends on the parameter) and an influence function (which depends on the estimator) are actually totally different things. It just so happens that for RAL estimators they occupy exactly the same space.
If a model is saturated, these geometrical arguments imply that there is only one valid influence function (and it is therefore efficient). To see why, we'll assume that there are two different influence functions and then show that they are actually the same. If there are two IFs, we know that $E[(\phi_1 - \phi_2)h] = 0$ for all $h$ in the tangent set, which in the case of a saturated model is all of $\mathcal L_2^0$. However, $\phi_1 - \phi_2$ is itself a zero-mean, finite-variance function (i.e. an element of the tangent set) so there is some score $h^* = \phi_1 - \phi_2$. But by the orthogonality result, we must have that $E[(\phi_1 - \phi_2)h^*] = E[h^*h^*] = 0$. Since nothing can be orthogonal to itself unless it is 0, we conclude that $\phi_1 - \phi_2 = 0$ and our two "different" influence functions are in fact the same one. This, in turn, means there exists only one RAL estimator (or, technically, one class of RAL estimators that are all asymptotically equivalent).