3.2 Efficiency Among RAL Estimators

Table of Contents

Characterizing the Set of Influence Functions

Tangent Spaces

Saturated vs. Non-Saturated Models

Factorizing Tangent Spaces

Example: tangent space for a randomized controlled trial

Example: tangent space for an observational study

The Efficient Influence Function

Now that we’ve gotten rid of all the estimators we don’t want (i.e. those that aren’t RAL), we need to figure out how to sort through all the remaining estimators we might have at our disposal.

So let's do just that. By imposing asymptotic linearity, we've actually made our job super easy. As $n$ gets bigger and bigger, we know that the sampling distribution of any RAL estimator $\hat\psi_k$ minus truth (scaled by root $n$) goes to a normal with mean zero and variance given by $V[\phi_k(Z)]$ where $\phi_k$ is the influence function for that particular estimator. That means that the only difference between any two RAL estimators (in large enough samples) is that they have sampling variances of different magnitude. Obviously we would prefer an estimator with less variance because that will give us smaller (but still valid) confidence intervals and p-values. In other words, given the same data, we are more certain of our estimate if we use an estimator that has a smaller sampling variance.

The implication is that, if given the choice, we want to pick an estimator with an influence function that has the smallest possible variance.
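To make the comparison concrete, here is a small simulation sketch. It assumes data drawn from a standard normal (my choice, not something fixed by the text), where the sample mean and the sample median both target the same value, and compares the variance of the scaled estimation error to the variance of each estimator's influence function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 4000

# Assumption for illustration: Z ~ N(0, 1), so the mean and the median
# both target psi = 0.
z = rng.standard_normal((reps, n))

# Scaled estimation errors sqrt(n) * (estimate - truth), one per replication.
err_mean = np.sqrt(n) * z.mean(axis=1)
err_median = np.sqrt(n) * np.median(z, axis=1)

# Variances of the influence functions at P = N(0, 1):
#   mean:   phi(z) = z                     so V[phi] = 1
#   median: phi(z) = sign(z) / (2 * f(0))  so V[phi] = pi / 2
var_if_mean = 1.0
var_if_median = np.pi / 2

print("mean:   simulated", err_mean.var(), "predicted", var_if_mean)
print("median: simulated", err_median.var(), "predicted", var_if_median)
```

In this particular model the mean has the smaller influence-function variance, so it gives the tighter intervals of the two.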

However, instead of picking an estimator out of a lineup, why don't we make our own estimator that is guaranteed to beat anything that anyone else could come up with? Here is a strategy that will let us do that:

1. Figure out the set of all possible influence functions: $\Phi = \{\phi: \phi \text{ is an IF for some RAL estimator}\}$

2. Pick the one that has the smallest variance: $\phi^\dagger = \operatorname*{arg\,min}_{\phi \in \Phi} V[\phi]$

3. Build an estimator that has that influence function.

This section will cover steps 1 and 2 of this strategy. There are a few different strategies to build an estimator that has the influence function with smallest variance and these will be the subject of the next chapter.

It's important to come away with an understanding of the what and the why of what we're going to discuss. In practice, unless you're working with brand-new model spaces or parameters, you will never have to actually do any of what is described in this section. Nonetheless it's critical to understand it to have an idea of how the tools we have today were developed and why they work.

Characterizing the Set of Influence Functions

Not every possible function corresponds to a valid RAL estimator in a particular statistical model. For example, it would be great if we could find a RAL estimator with $\phi(Z) = 0$ because this estimator would have no variance at all in large samples. Clearly that's a pipe dream that's not going to happen in general, so we must conclude that $\phi = 0$ is not an influence function. For a given parameter and statistical model, what functions $\phi(Z)$ are influence functions of RAL estimators, and what functions aren't?

Using nothing but the definitions of regularity and asymptotic linearity, we arrive at the following result. For any RAL estimator with influence function $\phi$, the following holds for all scores $h$ corresponding to paths in the model:

\lim_{\epsilon \rightarrow 0} \frac{\psi(\tilde P_\epsilon) - \psi(P)}{\epsilon} = E[\phi h]

Before giving a proof, which sadly relies on some technical results, let's understand what this is even saying. The term on the left is what we call the pathwise derivative of $\psi$ at $P$ in the direction $h$ (evaluated at 0). From now on we'll abbreviate this with the notation $\nabla_h \psi(P)$. The term on the right is the covariance between $h(Z)$ and $\phi(Z)$ because both are mean-zero. So what the result above says is that

🌈

[Riesz representation, central identity for influence functions] If $\phi$ is an influence function for a RAL estimator of a parameter $\psi$ and $h$ is the score for any legal path at $P\in \mathcal M$, the $h$-direction pathwise derivative of $\psi$ is equal to the covariance of $h$ and $\phi$: $\nabla_h \psi = E[\phi h]$

This is a fascinating connection between what seem to be two very different things. The pathwise derivative describes how quickly the estimand (parameter) changes as we move along a particular path. The expectation of the influence function times the score effectively tells us what the angle is between the path's score and the estimator's influence function (see section below on the space $\mathcal L_2$ if this confuses you). The left-hand side (pathwise derivative) depends on the parameter $\psi$, the direction $h$, and the true distribution $P$. The right-hand side (influence function covariance w/ score) depends on the choice of estimator, implying $\phi$, the direction $h$, and the true distribution $P$ (under which the covariance is taken).

This is cool in and of itself, but it's also the fundamental key to characterizing the set of all influence functions, and therefore the key link that holds everything we're talking about together. 
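The identity can be checked numerically in a toy setting. The sketch below makes several assumptions that are mine rather than the text's: a discrete distribution on four points, the variance as the parameter, the path $\tilde p_\epsilon = (1+\epsilon h)p$, and the influence function $\phi(z) = (z-\mu)^2 - \sigma^2$ of the sample variance. It compares a finite-difference pathwise derivative with $E[\phi h]$.

```python
import numpy as np

# Hypothetical discrete distribution P on four support points.
z = np.array([-1.0, 0.0, 2.0, 5.0])
p = np.array([0.1, 0.4, 0.3, 0.2])

def psi(q):
    """Parameter of interest: the variance of Z under the pmf q."""
    m = np.sum(z * q)
    return np.sum((z - m) ** 2 * q)

# Build a score by centering an arbitrary function g under P, so E[h] = 0.
g = np.array([1.0, -2.0, 0.5, 3.0])
h = g - np.sum(g * p)

def path(eps):
    """Path through P with score h: p_eps = (1 + eps * h) * p."""
    return (1.0 + eps * h) * p

# Left-hand side: pathwise derivative via a central finite difference.
eps = 1e-5
lhs = (psi(path(eps)) - psi(path(-eps))) / (2 * eps)

# Right-hand side: covariance of the score with the influence function
# phi(z) = (z - mu)^2 - sigma^2 of the sample variance.
mu = np.sum(z * p)
phi = (z - mu) ** 2 - psi(p)
rhs = np.sum(phi * h * p)

print(lhs, rhs)
```

Swapping in a different centered `g` changes both numbers, but they should keep agreeing, which is the content of the identity.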

😱 Proof that $\nabla_h \psi = E[h\phi]$

We'll start with our definition of a path and our definition of asymptotic normality. Using just these definitions and some esoteric theorems, we'll amazingly be able to characterize how our estimator should behave as we move along the path towards $P$. That's a little surprising because asymptotic normality is a property that only holds at $P$ and doesn't say anything explicit about behavior along paths. Nonetheless, we'll see that there is an implication for how the estimator behaves along paths. However, we've also assumed the estimator is regular, which is already a definition of how the estimator behaves along paths. In comparing the derived behavior from asymptotic normality and the assumed behavior from regularity, we will see there is a difference. The difference in the two behaviors can only be made to go away (as it must if an estimator is both asymptotically normal and regular) if $\nabla_h \psi = E[h\phi]$, so that's what we conclude.

In proving that asymptotic normality and our definition of a path by themselves imply some behavior of the estimator along such a path we'll have to use two advanced, technical results. Unfortunately I haven't found an alternative way to prove this that is both rigorous and intuitive. It's not at all impossible to understand these theorems (just read the appropriate sections of vdV 1998) but if you're not interested in the math for its own sake it's probably not worth your time. In terms of understanding the arc of the proof you just need to grasp how we've used these tools to get what we need, not how they work.

Throughout I'll abbreviate $\hat\psi_n = \hat\psi(\mathbb P_n)$ (the estimate as we draw increasing samples from $P$) and $\psi = \psi(P)$ (the true parameter at $P$) to keep the notation light. I'll also abbreviate $\hat{\tilde\psi}_n = \hat\psi(\tilde{\mathbb P}_{1/\sqrt n})$, which is the estimate as we draw increasing samples from distributions moving along our path closer to $P$, and $\tilde\psi_n = \psi(\tilde P_{1/\sqrt n})$, which are the true parameter values at each of these distributions. Anything that has a hat is an estimate, anything that has a tilde is along a path.

Alright, go time. Let's see what we can say about how our estimator behaves along sequences of distributions like $\tilde P_n = \tilde P_{\epsilon = 1/\sqrt{n}}$, since these are the only ones regularity has anything to say about. By theorem 7.2 of vdV 1998 the fact that our paths are differentiable in quadratic mean ensures that we get something called local asymptotic normality 🤷:

\log \frac{ d\tilde P_{1/\sqrt n}^n }{ dP^n }(Z_1, \dots Z_n) = \frac{1}{\sqrt n}\sum^n h(Z_i) - \frac{1}{2}E[h(Z)^2] + o_P(1)

It really doesn't matter what this means because we're just going to use it to satisfy a particular technical condition in a minute. The important part is to see that we've defined some random variable on the left: the derivative-looking thing is just some fixed function of the data, so let's rename that as some random variable $T_n$. What the right-hand side says is that in large-enough samples, this $T_n$ thing looks like a sum of IID variables plus some constant. We combine this with the assumed asymptotic linearity of our estimator:

\sqrt{n}(\hat\psi(\mathbb P_n) - \psi(P)) = \frac{1}{\sqrt n}\sum^n\phi(Z_i) + o_P(1)

Which is also a random variable that's approximately equal to an IID sum when $n$ gets big. Here we stack the two above equations on top of each other and by the central limit theorem we have that

\left[ \begin{array}{c} T_n \\ \sqrt n (\hat\psi_n -\psi) \end{array} \right] \overset{P}\rightsquigarrow \mathcal N \left( \left[ \begin{array}{c} -\frac{1}{2}P[h^2] \\ 0 \end{array} \right] , \left[ \begin{array}{cc} P[h^2] & P[h\phi] \\ P[h\phi] & P[\phi^2] \\ \end{array} \right] \right)

Now we use another technical result 🧐 (Le Cam's 3rd lemma- see vdV 1998 ex. 6.7), which says that when we have exactly the situation above, we can infer

\begin{align*} \sqrt n \left(\hat{\tilde \psi}_n - \psi \right) &\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow} \mathcal N\left(P[h \phi], P[\phi^2]\right) \end{align*}

Again it's not important to understand the technical device. The idea is that we're trying to say something about how our estimator behaves as we change the distribution along our path of interest. At first this seems impossible because asymptotic normality only holds at each $P$ and doesn't say anything about what happens along paths. However, with the help of these technical devices, we've actually managed to say something about the difference between the estimate as we change the underlying distribution and the truth at $P$. In particular, this difference converges to a normal with the same variance as if we had not been moving along the path towards $P$ but had instead sat still at $P$. Crazy. The limiting normal distribution now has some mean $P[h\phi]$, which we'll pull out in the course of some algebraic manipulation during which we also add and subtract $\sqrt n \tilde\psi_n$ on the left:

\begin{align*} \sqrt n \left(\hat{\tilde \psi}_n - \psi \right) &\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow} \mathcal N\left(P[h \phi], P[\phi^2]\right) \\ -P[h \phi] + \sqrt n \left( \hat{\tilde \psi}_n - \tilde\psi_n + \tilde\psi_n - \psi \right) &\overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow} \mathcal N \left( 0, P[\phi^2] \right) \\ \end{align*}

Moving terms around, we arrive at 

\left[ \sqrt n \left( \tilde\psi_n - \psi \right) - P[h \phi] \right] + \sqrt n \left( \hat{\tilde \psi}_n - \tilde\psi_n \right) \overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow} \mathcal N \left( 0, P[\phi^2] \right) \tag{AL}

Notice that the term in brackets on the left is a constant for each $n$, i.e. it's not random.

Now, finally, we recall our definition of regularity, which was

\sqrt n \left( \hat{\tilde \psi}_n - \tilde\psi_n \right) \overset{\tilde P_{1/\sqrt n}}{\rightsquigarrow} \mathcal N \left( 0, P[\phi^2] \right) \tag{R}

An estimator that is asymptotically linear must satisfy the behavior in the display above denoted (AL). An estimator that is regular must satisfy the behavior in the display denoted (R). A RAL estimator must satisfy both. But the only way that both of those can be true is if they are actually saying the same thing, which only happens if the bracketed term in (AL) goes to zero. Thus, for RAL estimators,

\begin{align*} \lim_{n\rightarrow \infty}\sqrt n \left( \tilde\psi_n - \psi \right) &= P[h \phi] \\ \lim_{\epsilon_n \rightarrow 0} \frac{ \tilde\psi_n - \psi }{ \epsilon_n } &= P[h \phi] \\ \nabla_h \psi &= P[h \phi] \end{align*}

where we've recalled $\epsilon_n = 1/\sqrt{n}$ from our original definition of our sequence along the path.
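A quick simulation can make the Le Cam step less mysterious. The sketch below rests on assumptions of my own choosing: $P = N(0,1)$, the location-shift path $N(\epsilon, 1)$ whose score at $\epsilon = 0$ is $h(z) = z$, and the sample median as the estimator, with influence function $\phi(z) = \operatorname{sign}(z)\sqrt{\pi/2}$. For this setup $P[h\phi] = 1$ and $P[\phi^2] = \pi/2$, so $\sqrt n(\hat{\tilde\psi}_n - \psi)$ should be roughly $N(1, \pi/2)$ when sampling from the perturbed distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 2000, 4000
eps = 1.0 / np.sqrt(n)

# Samples from the perturbed distribution N(eps, 1).
# The truth at P = N(0, 1) is psi = 0.
z = eps + rng.standard_normal((reps, n))

# sqrt(n) * (estimate along the path - truth at P), one value per replication.
scaled_err = np.sqrt(n) * (np.median(z, axis=1) - 0.0)

# Predicted limit: mean P[h * phi] = sqrt(pi/2) * E|Z| = 1, variance P[phi^2] = pi/2.
predicted_mean = 1.0
predicted_var = np.pi / 2

print("mean:", scaled_err.mean(), "predicted", predicted_mean)
print("var: ", scaled_err.var(), "predicted", predicted_var)
```

Here the median of $N(\epsilon_n, 1)$ is $\epsilon_n$, so $\sqrt n(\tilde\psi_n - \psi) = 1$ as well, which matches the conclusion of the proof for this regular estimator.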

Here is what we mean when we say this identity characterizes the set of influence functions: since the left-hand side doesn't depend on the influence function $\phi$, the same identity holds (keeping constant $h$ and $P$) for any two estimators with influence functions $\phi_1$, $\phi_2$ (if two such estimators exist). Specifically: $E[\phi_1 h] = \nabla_h \psi = E[\phi_2 h]$. Eliminating the middleman, we get $E[(\phi_1 - \phi_2)h] = 0$. Therefore the difference of influence functions for any two RAL estimators is orthogonal to any score! If you don't understand why this is the implication, you should review the section on $\mathcal L_2$ space. Moreover, if we take any function $h^\perp$ that is orthogonal to all scores and add it to the influence function from a RAL estimator, the result $\phi + h^\perp$ still satisfies the above requirement and so this is an influence function for a different RAL estimator.

The Space $\mathcal L_2(P)$

We can treat any score $h$ and any influence function $\phi$ as an element of a set we call $\mathcal L_2(P)$ (or just $\mathcal L_2$ when the measure $P$ is clear). This is the set of all functions $f(Z)$ which satisfy the condition $\int f(Z)^2 dP < \infty$ (i.e. $f(Z)$ is a random variable with finite variance). This space is a lot like the vector space $\mathbb R^p$ in many important ways:

$\mathcal L_2$ is a linear space: $f,g \in \mathcal L_2, \alpha \in \mathbb R \implies f + \alpha g \in \mathcal L_2$

$\mathcal L_2$ has an inner product: in $\mathbb R^p$, the inner product between two vectors $x$ and $y$ is given by $\sum x_i y_i$. The equivalent in $\mathcal L_2$ for two functions $f$ and $g$ is $\int fg \, dP = E[fg]$. Note how the integral of a product of two functions is a lot like the sum of the element-wise product of vectors. It happens that the two operations obey all the same rules that define something as an "inner product". When we don't have a specific space in mind, we usually write the inner product between two elements as $\langle u, v \rangle$. If two elements have an inner product equal to 0 we say that the two are orthogonal. This generalizes the notion of two vectors being at a right angle in $\mathbb R^p$. More generally, the inner product also defines what the "angle" is between two vectors: $\theta_{u,v} = \cos^{-1}\left(\frac{\langle u, v \rangle}{\|u\| \|v\|}\right)$, where $\|u\| = \sqrt{\langle u,u \rangle}$ is the norm of $u$ in this space.

$\mathcal L_2$ is complete. This is a technical term that means that the space contains all of its limit points (i.e. a convergent sequence of elements in the space can't converge to a limit that is outside the space).

Together, these conditions are the definition of something called a Hilbert space. Indeed, $\mathbb R^p$ and $\mathcal L_2$ are the usual examples of Hilbert spaces. The main difference between the two is that $\mathbb R^p$ is finite-dimensional, whereas $\mathcal L_2$ is infinite-dimensional. What do I mean by this? Well, a vector in, say $\mathbb R^3$ clearly has 3 components $[x_1, x_2, x_3]$. A "vector" in $\mathcal L_2$ has as many components as its functions have possible argument values because you can think of a function $f:\mathbb R \rightarrow \mathbb R$ like this: $f = [\dots f(-200) \dots f(-0.1) \dots f(0) \dots f(1) \dots f(1.5) \dots ]$. The only difference is that I put the "vector index" in parentheses instead of as a subscript and now I call it an "argument". We also now allow for indices that are anywhere in the domain of $f$ instead of just the integers $1,2,3$. Alternatively, we can think of the vector $x$ as a function that maps $i \in \{1,2,3\} \mapsto x_i$. So vectors, functions, whatever. It's kind of the same thing in most of the ways that matter!
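When $Z$ is discrete, functions of $Z$ literally are vectors and the $\mathcal L_2(P)$ inner product is a probability-weighted dot product. The sketch below assumes a hypothetical five-point symmetric distribution and computes the inner product, norm, and angle defined above.

```python
import numpy as np

# Hypothetical discrete P: Z takes the values z with probabilities p.
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

def inner(f, g):
    """L2(P) inner product <f, g> = E[f(Z) g(Z)]."""
    return np.sum(f * g * p)

def norm(f):
    """L2(P) norm: the square root of <f, f>."""
    return np.sqrt(inner(f, f))

def angle(f, g):
    """Angle between f and g implied by the inner product, in radians."""
    return np.arccos(inner(f, g) / (norm(f) * norm(g)))

f = z                              # an odd function of z
g = z ** 2 - np.sum(z ** 2 * p)    # a centered even function of z

print(inner(f, g))   # zero here because P is symmetric about 0
print(angle(f, g))   # a right angle, pi / 2
```

Because both `f` and `g` are mean-zero, their inner product is also their covariance, which is the reading used throughout this section.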

Tangent Spaces

If the difference of two influence functions is orthogonal to every score $h$, it must also be orthogonal to any linear combination of scores, or any limit of a sequence thereof. So it makes everything more concise if we start talking about the tangent space $\mathcal T$ (which is exactly the set of all scores, their linear combinations, and limits) instead of having to continually refer to "all scores". Since the tangent space comes from the set of scores, it has nothing to do with either the estimator or the parameter. It depends purely on the statistical model and the true distribution $P$.

Why is it called the tangent space?

Honestly, I don't think it's the greatest name, but let's explain it. 

The key is to 1) think about each point $P \in \mathcal M$ as having density $p$ w.r.t. some dominating measure $P\degree$ and 2) realize that $\sqrt p \in \mathcal L_2(P\degree)$. Because $p$ has to integrate to 1, the squared norm of $\sqrt{p}$ (that is, $\int \sqrt p \sqrt p \, dP\degree$) is of course 1, and therefore so is its norm. So we can identify $\mathcal M$ with some subset of the unit sphere in $\mathcal L_2(P\degree)$. It's now easy to show that $\sqrt p$ is orthogonal to $h \sqrt p$:

\int \sqrt p (\sqrt p h) \, dP\degree = \int h p \, dP\degree = \int h \, dP = 0

so the set of functions $h\sqrt p$ ranging over scores $h$ is orthogonal (tangent) to the point $\sqrt p$ in the model space. Or, equivalently, $p$ is orthogonal to all scores $h$ in the space $\mathcal L_2(P\degree)$.

One nice thing that this picture shows is that the tangent space clearly depends on where $P$ is within the model.

Nonetheless, it'd probably be more informative to call it the "score space", since the fact that the scores are tangent to the density $p$ in $\mathcal L_2(P\degree)$ doesn't seem to matter that much in terms of the role the space ends up playing in the theory we're developing. Alas, we're stuck with the names we have.

Saturated vs. Non-Saturated Models

Sometimes the tangent space is all of $\mathcal L_2^0$. This happens if we put no restrictions on the distributions in our statistical model. If any distribution can be in the model, then starting at any point $P$, any function $h$ that has zero mean and finite variance defines a valid path at $P$ because $\tilde p_\epsilon = (1+\epsilon h)p$ will still be a density for any such score $h$.

When this happens, we say that the model is nonparametric saturated, or just saturated (at $P$). Intuitively, this means that we have a model space such that, standing at $P$, we can move in any direction and still stay inside the model. If this is not the case, then we say the model is not saturated (at $P$).
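As a sanity check on the saturated case, the sketch below takes a hypothetical three-point distribution and an arbitrary bounded function, centers it to get $h$, and confirms that $\tilde p_\epsilon = (1+\epsilon h)p$ is still a density for a small $\epsilon$ and that its score at $\epsilon = 0$ is $h$. Boundedness of $h$ is an extra assumption I make here so that $1 + \epsilon h$ stays positive.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])      # hypothetical pmf on three points
g = np.array([1.0, -1.0, 4.0])     # arbitrary bounded function of z
h = g - np.sum(g * p)              # centered so that E_P[h] = 0

def perturb(eps):
    """Perturbed pmf along the path with score h."""
    return (1.0 + eps * h) * p

# For small eps the perturbation is still a valid pmf.
p_eps = perturb(0.05)
is_density = bool(np.all(p_eps > 0) and np.isclose(p_eps.sum(), 1.0))

# The score is the derivative of log p_eps at eps = 0 (central difference).
d = 1e-6
score_fd = (np.log(perturb(d)) - np.log(perturb(-d))) / (2 * d)

print(is_density)
print(score_fd, h)
```

Nothing about `g` was special, which is the sense in which every mean-zero direction is available at $P$ in a saturated model.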

Factorizing Tangent Spaces

While we're on the subject of tangent spaces, it turns out that if we can factorize our distribution $P(Y, X) = P(Y|X)P(X)$ then all scores $h$ end up being the sum of scores for each factor, treating these as living in statistical models of their own, i.e. $P(Y|X) \in \mathcal M_{Y|X}$. Moreover, the scores for each factor end up being orthogonal, i.e. $h = h_X(x) + h_{Y|X}(x, y)$ and $h_X \perp h_{Y|X}$. Lastly, these scores satisfy $E[h_X] = 0$ and $E[h_{Y|X}|X]=0$. A short proof of all of this is given below. This argument also generalizes to densities that have more than two factors (just factor one of the two factors).

Proof

Recall that $h = \frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0}$. For $\epsilon$ small enough $\tilde p_\epsilon$ is still in the model, so we can factor it into $\tilde p_{Z_2|Z_1} \tilde p_{Z_1}$ (omitting the $\epsilon$ subscripts). The log of a product is the sum of logs and the derivative is linear over a sum, so we get

h(Z_1, Z_2) = \underbrace{ \frac{d}{d\epsilon} \log \tilde p_{Z_2|Z_1} \big|_{\epsilon=0} }_{h_2(Z_1, Z_2)} + \underbrace{ \frac{d}{d\epsilon} \log \tilde p_{Z_1} \big|_{\epsilon=0} }_{h_1(Z_1)}

Now we'd like to show that $h_2 \perp h_1$ in $\mathcal L_2(P)$.

For that we have to notice that $h_2$ is a score at $p_{Z_2|Z_1}$ in the nonparametric model $\mathcal M_{Z_2|Z_1} = \{\text{all densities of } Z_2 \text{ given } Z_1\}$ (by definition). If we start at $p_{Z_2|Z_1}$ and move along a path defined by $h_2$, the only way we stay within $\mathcal M_{Z_2|Z_1}$ is if $E[h_2|Z_1] = 0$. Otherwise the resulting perturbation will not be a density. Similarly, we need $E[h_1] = 0$.

Now $E[h_1 h_2] = E[E[h_1 h_2|Z_1]] = E[h_1(Z_1) E[h_2|Z_1]] = E[h_1 \cdot 0] = 0$ and we've shown $h_1$ and $h_2$ are orthogonal.

We can also go the other way: if I propose $h_1$ and $h_2$ that satisfy the above, then $h = h_1 + h_2$ must be a valid score at $P$ in the original model.

Variational Independence

We showed that the tangent space can be broken up into an orthogonal sum when every distribution in the model factors and those factors can vary independently in their own model spaces. There are cases where this breaks down, though. If you look at the following picture, you'll surmise that all the scores at $P$ (blue arrows) can indeed be created as orthogonal sums of scores from $\mathcal M_{Z_2|Z_1}$ and $\mathcal M_{Z_1}$. However, there are arrows that can be constructed the same way (grey, dotted) that take us outside of the model space. These are not scores because they don't correspond to a legal path inside the model. Therefore in this case $\mathcal T \subset \mathcal T_1 \oplus \mathcal T_2$, but we don't have the equality. When this happens, we say that our submodels are not variationally independent. In other words, there are some points in the model space where I can't legally change in one of the submodels without having to make a change in another (i.e. at $P$ in the picture we can't go just up, we also have to move a bit to the right along the diagonal).

We can therefore usually construct tangent spaces for each factor separately and then the tangent space for the whole model is the orthogonal sum of the component tangent spaces (we write this $\mathcal T = \mathcal T_1 \oplus \mathcal T_2$).

Consider what happens if we perturb a density $p$ along a path defined by a score $h_X$ that satisfies $h_X(y,x) = h_X(x)$ and $E[h_X] = 0$. We know that $\tilde p_\epsilon(y,x) = (1+\epsilon h_X)p(y,x)$, but what can we say about the resulting factors $\tilde p_\epsilon(y|x)$ and $\tilde p_\epsilon(x)$? A bit of algebra using the properties of $h_X$ and the definition of the conditional density shows that $\tilde p_\epsilon(y|x) = p(y|x)$ and also $\tilde p_\epsilon(x) = (1+\epsilon h_X)p(x)$. In other words, moving along a path in $\mathcal M_X$ only affects the factor $p(x)$. Similarly, if you repeat this exercise with a score $h_{Y|X}$ satisfying $E[h_{Y|X}|X] = 0$ you get that $\tilde p_\epsilon(y|x) = (1+\epsilon h_{Y|X})p(y|x)$ and also $\tilde p_\epsilon(x) = p(x)$. Thus moving along a path in $\mathcal M_{Y|X}$ only affects the factor $p(y|x)$. This should make some intuitive sense to you, and, naturally, everything generalizes cleanly when there are more than two factors.
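The "bit of algebra" can also be checked numerically. The sketch below uses a hypothetical joint pmf on a 3 by 2 grid and two made-up scores, one depending on $x$ only and one centered conditionally on $x$, and confirms that each perturbation moves only its own factor.

```python
import numpy as np

# Hypothetical joint pmf p(y, x): rows index y (3 values), columns index x (2 values).
p = np.array([[0.10, 0.20],
              [0.30, 0.05],
              [0.15, 0.20]])
p_x = p.sum(axis=0)          # marginal p(x)
p_y_given_x = p / p_x        # conditional p(y | x); each column sums to 1

eps = 0.1

# Score depending on x only, centered so that E[h_X] = 0.
g_x = np.array([1.0, -2.0])
h_x = g_x - np.sum(g_x * p_x)
p_eps_a = (1.0 + eps * h_x) * p          # h_x broadcasts across the y rows
marg_a = p_eps_a.sum(axis=0)             # perturbed marginal of x
cond_a = p_eps_a / marg_a                # perturbed conditional of y given x

# Score with E[h_{Y|X} | X] = 0: center g within each column using p(y | x).
g_yx = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [-1.0, 1.0]])
h_yx = g_yx - np.sum(g_yx * p_y_given_x, axis=0)
p_eps_b = (1.0 + eps * h_yx) * p
marg_b = p_eps_b.sum(axis=0)
cond_b = p_eps_b / marg_b

print(np.allclose(cond_a, p_y_given_x), np.allclose(marg_b, p_x))
```

Both checks come out true: the first path leaves $p(y|x)$ alone and the second leaves $p(x)$ alone.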

This is useful when we evaluate the directional derivative $\nabla_h \psi$. If $h$ is the sum of two other functions we can always write $\nabla_h\psi = \nabla_{h_X} \psi + \nabla_{h_{Y|X}}\psi$ because the derivative is a linear operator in the score. The resulting terms $\nabla_{h_X} \psi$ and $\nabla_{h_{Y|X}}\psi$ are now easier to evaluate because for each of them we just need to consider a single perturbation in $p(x)$ or in $p(y|x)$, respectively. We’ll see this come in handy in the next section of this chapter.

Example: tangent space for a randomized controlled trial

Let's give a concrete example of a tangent space at $P$ for some particular statistical model. For our model, we'll use the space $\mathcal M_{\text{RCT}(\pi)}$ where RCT($\pi$) stands for "randomized controlled trial" with treatment mechanism $\pi$. This model is characterized by a joint distribution between some vector of observed covariates $X$, a binary treatment $A$, and an outcome $Y$. Any distribution of three variables can always factor as $P(Y,A,X) = P(Y|A,X)P(A|X)P(X)$. The treatment mechanism is defined as $\pi(X) = P(A=1|X)$, the probability of receiving treatment given observed covariates $X$.

What makes the RCT model space different from the more general nonparametric space where each of these three distributions can be anything is that in the RCT the distribution of $A|X$ is fixed. Specifically, we know $P(A=1|X) = \pi(X)$. For example, in a trial with a simple 50:50 randomization we have $\pi(X) = 0.5$.

So what are the possible scores? The density factors into three components, only two of which can actually vary. Thus any score will be the sum of two orthogonal components, one of which corresponds to $p_{X} \in \mathcal M_X$ and one of which corresponds to $p_{Y|A,X} \in \mathcal M_{Y|A,X}$ (recall $p_{A|X}$ is fixed so we can't move in that "direction").

Now we're basically done. Since there are no more restrictions on our model, our tangent space is the orthogonal sum of the two tangent subspaces. Let $\mathcal L_2^0$ be the subspace of all zero-mean functions in $\mathcal L_2$ (recall all scores and all influence functions have mean zero). Now, formally:

\begin{align*} \mathcal T_X &= \left\{ h : \begin{split} &E[h]=0 \\ &h(Y,A,X) = h(X) \end{split} \right\} \\ \mathcal T_{Y|A,X} &= \left\{ h : E[h|A,X]=0 \right\} \\ \mathcal T_{\text{RCT}} &= \mathcal T_X \oplus \mathcal T_{Y|A,X} \end{align*}

Can you think of a function that is in $\mathcal L_2^0$ but not in $\mathcal T_{\text{RCT}}$?

Example: tangent space for an observational study

The difference between the observational study model and the RCT model is that now we don't know the treatment mechanism $P(A|X)$. This model (we'll call it $\mathcal M_{\text{obs}}$) therefore contains $\mathcal M_{\text{RCT}(\pi)}$ for any treatment mechanism $\pi$.

Since this model is larger than $\mathcal M_{\text{RCT}(\pi)}$, it should make some intuitive sense to you that the tangent space is bigger too. This is because at any point $P$ in the model, we have more directions we can move in than we previously did. Specifically, we can now move in directions $h$ that end up changing the treatment mechanism. To be specific, we can have paths through $\mathcal M_{\text{obs}}$ (specified by some $h$) such that $\tilde P_\epsilon(A|X) \ne P(A|X) = \pi(X)$. But this same $h$ cannot be a path through $\mathcal M_{\text{RCT}(\pi)}$ because for the fluctuated distribution $\tilde P_\epsilon$ to remain inside of $\mathcal M_{\text{RCT}(\pi)}$ we have to require that $\tilde P_\epsilon(A|X) = \pi(X)$. Therefore for any $P$ with some treatment mechanism $\pi$, all paths through $P$ in $\mathcal M_{\text{RCT}(\pi)}$ are also paths through $P$ in $\mathcal M_{\text{obs}}$ but not vice-versa.

We can be even more specific about this. Since any density in $\mathcal M_{\text{obs}}$ factors as $P(Y,A,X) = P(Y|A,X)P(A|X)P(X)$ we immediately have that the tangent space is given by the orthogonal sum $\mathcal T_{\text{obs}} = \mathcal T_{Y|A,X} \oplus \mathcal T_{A|X} \oplus \mathcal T_{X}$. Compare this to $\mathcal T_{\text{RCT}} = \mathcal T_{Y|A,X} \oplus \mathcal T_{X}$ and you immediately see that scores in the observational study have an additional set of degrees of freedom that the scores in the RCT don't have. Namely: all directions in $\mathcal T_{A|X}$. In fact, $\mathcal T_{\text{obs}} = \mathcal L_2^0$, which is to say that any mean-zero function with finite variance is a legal score for this model. That's because we've put literally no restrictions on what $P(Y,A,X)$ can be so $\tilde p_\epsilon = (1+\epsilon h)p$ is a legitimate density in our model for any such $h$.
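To see the three-way orthogonal sum in action, the sketch below splits an arbitrary mean-zero function of binary $(X, A, Y)$ into pieces $h_X = E[h|X]$, $h_{A|X} = E[h|A,X] - E[h|X]$, and $h_{Y|A,X} = h - E[h|A,X]$. The joint distribution and the function are randomly generated placeholders, and the use of conditional expectations as the projections is my own way of constructing the pieces.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical strictly positive joint pmf p[x, a, y] for binary X, A, Y.
p = rng.uniform(0.5, 1.5, size=(2, 2, 2))
p /= p.sum()

# An arbitrary function of (x, a, y), centered so that E[h] = 0.
g = rng.normal(size=(2, 2, 2))
h = g - np.sum(g * p)

def cond_mean(f, axes):
    """Conditional mean of f given the variables NOT in `axes` (dims are kept)."""
    return np.sum(f * p, axis=axes, keepdims=True) / np.sum(p, axis=axes, keepdims=True)

def inner(f1, f2):
    """L2(P) inner product E[f1 * f2]; broadcasting handles the kept dims."""
    return np.sum(f1 * f2 * p)

e_h_ax = cond_mean(h, axes=(2,))     # E[h | X, A]
e_h_x = cond_mean(h, axes=(1, 2))    # E[h | X]

h_x = e_h_x                # piece in T_X: depends on x only, mean zero
h_a_x = e_h_ax - e_h_x     # piece in T_{A|X}: mean zero given X
h_y_ax = h - e_h_ax        # piece in T_{Y|A,X}: mean zero given (A, X)

print(np.allclose(h_x + h_a_x + h_y_ax, h))
print(inner(h_x, h_a_x), inner(h_x, h_y_ax), inner(h_a_x, h_y_ax))
```

In the RCT model the middle piece `h_a_x` is exactly the part of a direction that is not allowed, since it would change the treatment mechanism.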

The Efficient Influence Function

Starting with nothing but the definition of a RAL estimator, we've shown that the set of influence functions of RAL estimators is (after shifting it to the origin) orthogonal to the tangent space. Since RAL estimators basically only differ by their variance (which is the variance of their influence functions), we get the best RAL estimator by finding one that has the influence function with the smallest variance. The point of everything we've done in the section above is that at least now we know the space we have to look in!

Thankfully, our characterization of the tangent space and the set of influence functions makes it easy to find the influence function with the smallest variance. 

Pathwise Differentiability

There are combinations of parameters and models for which the pathwise derivative doesn't exist or isn't a bounded linear operator. For example, consider densities of $Y,X$ and let our parameter be $\psi(P) = E[Y |X=x_0]$ for a particular point of interest $x_0$. You can evaluate the derivative and show that $\nabla_h \psi(P) = \int y h(y,x_0)p_{Y|X}(y,x_0) dy$. The problem here is that I can always pick $h$ with norm 1 but that takes huge values at $x=x_0$ (by compensating with large negative values at other $x$). So I can pick $h$ (in the unit ball) such that the integral blows up as large as I want. That's what "unbounded" means for a linear operator. The consequence of this is that the Riesz representation theorem no longer applies and we cannot guarantee that there is an element $\phi^\dagger$ that is in the tangent space. This also destroys the argument based on the difference of two influence functions of RAL estimators because we can't guarantee that even a single RAL estimator exists.

To do this we can argue that there is some influence function that is itself in the tangent space because we can always project some influence function $\phi$ into $\mathcal T$ to get $\phi = \phi^\dagger + h^\perp$. By definition, $\phi^\dagger$ is in the tangent space, and by what we know about the relationship of $\Phi$ to the tangent space, $\phi^\dagger$ is the difference of an influence function with something that's orthogonal to all scores, meaning that it too must be an influence function. We can reach the same conclusion if we use the Riesz representation theorem. If the pathwise derivative, viewed as a function of the score $h$, is bounded and linear among scores in $\mathcal T$, then there is some element of $\mathcal T$ that satisfies our requirement of being an influence function for a RAL estimator. This is usually the case, although there are some parameters and models for which it isn't. But as long as we have pathwise differentiability, there is always exactly one influence function that is also in the tangent space.

Riesz Representation

Besides giving us a definition of orthogonality, Hilbert spaces are super nice to work in because of an important property called Riesz representability. The Riesz representation theorem says that if you have a bounded linear function $F$ that maps an element of the Hilbert space to a real number, then that function can always be represented as an inner product between the argument to the function and some other element of the Hilbert space. In other words, for every bounded linear function $F$, there is some element $h_F$ so that $F(h) = \langle h_F, h \rangle$.

This is a very surprising result at first but it's pretty easy to convince yourself of it in $\mathbb R^p$. Consider vectors $\vec x, \vec y \in \mathbb R^p$ (I'll use the vector arrow $\vec x$ in this section to be very explicit). The definition of a linear function $F: \mathbb R^p \rightarrow \mathbb R$ is that $F(\vec x + \alpha \vec y) = F(\vec x) + \alpha F(\vec y)$ for any scalar $\alpha$. Any linear function in this particular space is bounded so we don't need to worry about what that means. To show that $F(\vec x)$ is actually just taking an inner product between $\vec x$ and some other vector we can apply $F$ to the unit vectors $\vec e_1, \dots, \vec e_p$ and see what it returns. Say $F(\vec e_i) = a_i$. Now we can express any vector as the weighted sum of the unit vectors $\vec x = \sum x_i\vec e_i$, and by linearity of $F$ notice now that $F(\vec x) = \sum x_iF(\vec e_i) = \sum x_i a_i = \langle \vec x, \vec a\rangle$. The conclusion is that the operation of the function $F$ on $\vec x$ is really just the inner product between $\vec x$ and some other vector $\vec a$, which is also in $\mathbb R^p$. The other direction is also obvious: given $\vec a$, define $F_{\vec a}(\vec x) = \langle \vec a, \vec x\rangle$ and since the inner product is a linear operation in both arguments, we have that the function is linear as desired.
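
The argument above is easy to check numerically. The following is a minimal sketch in Python with NumPy; the function `F` and the helper `riesz_representer` are hypothetical names made up for illustration. It recovers the representing vector by applying a linear function to each unit vector and then confirms that the function is just an inner product with that vector.

```python
import numpy as np

def riesz_representer(F, p):
    """Recover the vector a with F(x) = <x, a> by applying F to each unit vector."""
    # The rows of the identity matrix are the unit vectors e_1, ..., e_p.
    return np.array([F(e) for e in np.eye(p)])

def F(x):
    # A hypothetical linear function on R^4, written without an explicit inner product.
    return 2.0 * x[0] - x[1] + 0.5 * (x[2] + x[3]) + 3.0 * x[3]

p = 4
a = riesz_representer(F, p)

rng = np.random.default_rng(0)
x = rng.normal(size=p)

# F(x) and <x, a> agree up to floating point error.
assert np.isclose(F(x), x @ a)
```

Nothing about `F` is special here: any function that satisfies the linearity condition would be represented the same way.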

What's amazing is that this property actually extends to any Hilbert space (i.e. a complete linear space with an inner product). In particular, if we have a bounded linear function $F$ that maps functions $f \in \mathcal L_2$ to real numbers, then we know there is some fixed function $g \in \mathcal L_2$ such that $F(f) = E[fg]$. If you're unfamiliar with functions that have functions as arguments you can think of $F$ as "assigning" a number to each function $f$.

Projection in Hilbert Space

Almost all Hilbert spaces contain other Hilbert spaces within them. For example, we can think of any plane in 3D space as a 2D space in its own right. If we add the restriction that the contained space must have the origin in it, we call it a subspace and a useful result follows. Namely: for every vector $v$ in the Hilbert space $\mathcal H$, there is always exactly one vector $v_*$ in the desired subspace $\mathcal H_*$ and a vector $v^\perp$ that's orthogonal to every vector in $\mathcal H_*$ such that $v = v_* + v^\perp$. We call $v_*$ the projection of $v$ into $\mathcal H_*$. Since the projection is unique, it's easy to check if $v_*$ is indeed a projection by simply checking (1) whether $v_* \in \mathcal H_*$ and (2) whether $v - v_* \perp \mathcal H_*$.

If a subspace is the orthogonal sum of two other subspaces, we can always obtain the projection by projecting into each sub-subspace and then summing the result.
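
Both facts about projections can be checked in $\mathbb R^5$. This is a small sketch in Python with NumPy: the subspaces `H1` and `H2` and the helper `project` are hypothetical choices for illustration, and least squares is used here as one convenient way to compute an orthogonal projection.

```python
import numpy as np

def project(v, basis):
    """Orthogonal projection of v onto the span of the columns of `basis`."""
    coef, *_ = np.linalg.lstsq(basis, v, rcond=None)
    return basis @ coef

rng = np.random.default_rng(1)
v = rng.normal(size=5)

# Two mutually orthogonal subspaces of R^5 (hypothetical example):
# H1 is spanned by e1 and e2, H2 is spanned by e3 + e4.
H1 = np.eye(5)[:, :2]
H2 = np.array([[0.0, 0.0, 1.0, 1.0, 0.0]]).T
H = np.hstack([H1, H2])  # columns span the orthogonal sum of H1 and H2

v_star = project(v, H)
v_perp = v - v_star

# (1) v_star is in the subspace by construction;
# (2) the remainder is orthogonal to every basis vector of the subspace.
assert np.allclose(H.T @ v_perp, 0.0)

# Projecting into each sub-subspace and summing gives the same projection.
assert np.allclose(v_star, project(v, H1) + project(v, H2))
```

The second check only works because `H1` and `H2` are orthogonal to each other; for two subspaces that are not orthogonal, summing the separate projections would generally give a different vector.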

And, in fact, this unique influence function $\phi^\dagger$ in the tangent space is the one with the smallest variance. The proof of this is relatively simple: $\Phi$ and $\mathcal T$ are subsets of $\mathcal L_2^0$. In this space, the squared norm is $||f||^2 = E[f^2] = V[f]$ since by definition $f$ must have mean zero to be in $\mathcal L_2^0$. Therefore looking for the smallest variance influence function is the same as looking for the influence function with the smallest norm in $\mathcal L_2^0$, or, equivalently, the point in $\Phi$ that's closest to the origin. We can write any point $\phi \in \Phi$ as the sum of the influence function that is in the tangent space plus some function that is orthogonal to the tangent space: $\phi = \phi^\dagger + h^\perp$. But since those two components are at a right angle, there's no way for the length of the "vector" $\phi$ to be less than the length of the "vector" $\phi^\dagger$ because the Pythagorean theorem says that $||\phi||^2 = ||\phi^\dagger||^2 + ||h^\perp||^2$.
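
The Pythagorean argument can be seen numerically on a small discrete probability space, where a "function" is just a vector of values and the inner product is $E[fg] = \sum_k p_k f_k g_k$. The sketch below (Python with NumPy) is illustrative only: the tangent space `S` is a made-up two-dimensional subspace and `phi` is an arbitrary mean-zero function standing in for an influence function, not one derived from a real estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 6
p = rng.dirichlet(np.ones(K))  # P(Z = z_k) on a hypothetical 6-point sample space

def E(f):
    return np.sum(p * f)

def center(f):
    return f - E(f)

def var(f):
    # For mean-zero f the variance is E[f^2], i.e. the squared norm.
    return E(f * f)

# Hypothetical tangent space: the span of two mean-zero functions.
S = np.column_stack([center(rng.normal(size=K)), center(rng.normal(size=K))])

# An arbitrary mean-zero function standing in for an influence function.
phi = center(rng.normal(size=K))

# Project phi into the span of S under the inner product E[fg]
# by solving the weighted normal equations.
W = np.diag(p)
b = np.linalg.solve(S.T @ W @ S, S.T @ W @ phi)
phi_dagger = S @ b
h_perp = phi - phi_dagger

# The remainder is orthogonal to every element of the tangent space...
assert np.allclose(S.T @ W @ h_perp, 0.0)
# ...so the variances add up, and phi_dagger can never have more variance than phi.
assert np.isclose(var(phi), var(phi_dagger) + var(h_perp))
assert var(phi_dagger) <= var(phi)
```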

🌈

Because $\phi^\dagger$ has the smallest variance of any influence function, and therefore any RAL estimator that has it will make the most efficient possible use of the data, we call $\phi^\dagger$ the efficient influence function (EIF; sometimes also referred to as efficient influence curve or EIC).

If we identify $\phi^\dagger$ as the unique element in the tangent space for which the pathwise derivative in a particular direction can be represented as the inner product between that direction and this element (i.e. the Riesz representation), then it also makes sense to call $\phi^\dagger$ the canonical gradient.

Why is it called the canonical gradient?

If we have a function $f:\mathbb R^p \rightarrow \mathbb R$ that maps vectors to real numbers, the directional derivative at $\vec x$ in the direction of a vector $\vec h$ is defined as $\nabla_{\vec h} f(\vec x) = \lim_{\epsilon \rightarrow 0} \frac{f(\vec x + \epsilon \vec h) - f(\vec x)}{\epsilon}$. However, it's well-known that we can also write this as the sum of the partial derivatives times the components of $\vec h$: $\nabla_{\vec h} f(\vec x) = \frac{\partial f(\vec x)}{\partial x_1} h_1 + \dots + \frac{\partial f(\vec x)}{\partial x_p} h_p$. In fact, the proof of this follows from noticing that the directional derivative is a linear operator in $\vec h$ and applying the Riesz representation theorem that we derived for finite dimensional $\mathbb R^p$. Of course, this is the same as the inner product between the vector of partial derivatives, which we call the gradient $\nabla f(\vec x) = [\partial f(\vec x)/\partial x_1 \dots \partial f(\vec x)/\partial x_p]$, and $\vec h$. Thus $\nabla_{\vec h} f = \vec h \cdot \nabla f = \langle \vec h, \nabla f \rangle_{\mathbb R^p}$.
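
A quick numerical check of this identity, as a sketch in Python with NumPy: the function `f`, the point `x`, and the direction `h` are arbitrary choices for illustration, and a central difference is used to approximate the limit in the definition.

```python
import numpy as np

def f(x):
    # A hypothetical smooth function on R^3.
    return np.sin(x[0]) + x[1] ** 2 * x[2]

def grad_f(x):
    # Vector of partial derivatives of f, worked out by hand.
    return np.array([np.cos(x[0]), 2.0 * x[1] * x[2], x[1] ** 2])

def directional_derivative(f, x, h, eps=1e-6):
    # Central difference approximation to the limit definition.
    return (f(x + eps * h) - f(x - eps * h)) / (2.0 * eps)

x = np.array([0.3, -1.2, 2.0])
h = np.array([1.0, 0.5, -2.0])

lhs = directional_derivative(f, x, h)  # derivative along h, from the definition
rhs = h @ grad_f(x)                    # inner product of h with the gradient

assert np.isclose(lhs, rhs, atol=1e-6)
```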

Now it should make sense why we refer to $\phi^\dagger$ as a gradient. It's exactly the same as $\nabla f$, except now we're in $\mathcal L_2^0$ instead of $\mathbb R^p$. Since we can write the directional derivative $\nabla_h \psi(P)$ as an inner product $E[h \phi^\dagger] = \langle h, \phi^\dagger \rangle_{\mathcal L_2}$, the object $\phi^\dagger$ is serving exactly the same role as the gradient $\nabla f$ is above. To reiterate:

$$\begin{array}{ccc} \nabla_{\vec h} f(\vec x) &=& \langle \vec h, \nabla f \rangle_{\mathbb R^p} \\ \underbrace{ \nabla_h \psi(P) }_ {\text{directional derivative}} &=& \underbrace{ \langle h, \phi^\dagger \rangle_{\mathcal L_2(P)} }_ {\text{inner product b/t direction and gradient}} \end{array}$$

We call $\phi^\dagger$ the canonical gradient because there are usually many functions in $\mathcal L_2$ that satisfy the above (namely: every other influence function). However, $\phi^\dagger$ is the only one that is in the tangent space, and is the only one that's guaranteed to exist according to the Riesz representation theorem. That's what makes it "canonical".

In the context we've described here it might make more intuitive sense to use the notation $\nabla \psi$ to represent the canonical gradient. However, we use $\phi^\dagger$ to make the connection to influence functions for RAL estimators. A gradient (depends on the parameter) and an influence function (depends on the estimator) are actually totally different things. It just so happens that for RAL estimators they occupy exactly the same space.

If a model is saturated, these geometrical arguments imply that there is only one valid influence function (and it is therefore efficient). To see why, we'll assume that there are two different influence functions and then show that they are actually the same. If there are two IFs, we know that $E[(\phi_1 - \phi_2)h] = 0$ for all $h$ in the tangent space, which in the case of a saturated model is all of $\mathcal L_2^0$. However, $\phi_1 - \phi_2$ is itself a zero-mean, finite variance function (i.e. an element of the tangent space) so there is some score $h^* = \phi_1 - \phi_2$. But by the orthogonality result, we must have that $E[(\phi_1 - \phi_2)h^*] = E[h^*h^*] = 0$. Since nothing can be orthogonal to itself unless it is 0, we conclude that $\phi_1-\phi_2 = 0$ and our two "different" influence functions are in fact the same one. This, in turn, means there exists only one RAL estimator (or, technically, one class of RAL estimators that are all asymptotically equivalent).

⬅️

BACK: 3.1 Regular and Asymptotically Linear Estimators

➡️

NEXT: 3.3 Deriving EIFs