4.1 Naive Plug-In Estimation
The naive plug-in estimate is fairly straightforward. We start with the data $\mathbb P_n$ (which is IID from $P$) and we use some kind of algorithm to come up with a guess for the probability distribution that generates it. We call that guess $\hat P$. We then evaluate our parameter at $\hat P$ to get the estimate. In other words, we define $\hat\psi(\mathbb P_n) = \psi(\hat P)$.
Recall that the efficient influence function (and influence functions in general) depends on what point $P$ we're at in the statistical model $\mathcal M$. We don't know $P$, so we don't know the efficient influence function at $P$, which is $\phi_P^\dagger$. However, we do know $\hat P$, so we can use the techniques from the previous chapter to find its influence function $\phi_{\hat P}^\dagger$, or $\hat\phi^\dagger$ for short.
Example: ATE for Observational Study
For example, let's work with the $(Y,A,X)$ data structure that we'd find in an observational study and assume that our parameter of interest is the average treatment effect $\psi(P) = E_P[Y|A=1] - E_P[Y|A=0]$. As before, we'll notate $\mu_a(X) = E_P[Y|A=a,X]$ so we can also write the target parameter as $\psi(P) = P[\mu_1(X) - \mu_0(X)]$. For simplicity let's assume the outcome $Y$ is binary.
We need to use the data to come up with an estimate of the joint distribution. We can break that down by factoring $P(Y,A,X) = P(Y|A,X)P(A|X)P(X)$ and coming up with estimates for each factor. For example, maybe we use the empirical distribution of $X$ as our estimate $\mathbb P_n(X) = \hat P(X)$. We might also use a probabilistic classification algorithm (e.g. random forest) trained by regressing $A$ onto $X$ to estimate the probability of treatment $\hat\pi_a(X) = \hat P(A=a|X)$ and the same algorithm trained by regressing $Y$ onto $(A,X)$ to estimate the probability of an outcome $\hat\mu_a(X) = \hat P(Y=1|A=a,X)$.
Now we just need to apply the definition $\psi(P) = E_P[Y|A=1] - E_P[Y|A=0]$ but take the expectations with respect to our estimate $\hat P$ instead of the unknown $P$. Based on the estimates we defined above and the definition of our parameter, we get
$$\hat\psi(\mathbb P_n) = \psi(\hat P) = \mathbb P_n[\hat\mu_1(X) - \hat\mu_0(X)]$$
Or, in words: take the sample average value of our estimate for the treatment outcome and subtract off the sample average value of our estimate for the control outcome. You might think of this as "imputing" the expected probability of the outcome under each treatment, for each subject in the study, and then averaging and taking the difference.
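To make this concrete, here is a minimal sketch of the naive plug-in ATE estimator in Python. The arrays `y`, `a`, `X` and the choice of `RandomForestClassifier` as the outcome learner are hypothetical stand-ins for illustration; any probabilistic classifier could play the role of $\hat\mu_a$.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def plugin_ate(y, a, X):
    """Naive plug-in estimate of the ATE for a binary outcome y (coded 0/1).

    y, a: arrays of shape (n,); X: array of shape (n, p). Hypothetical inputs.
    """
    # Estimate mu_a(x) = P(Y=1 | A=a, X=x) by regressing Y onto (A, X)
    # with a probabilistic classifier (the random forest here is just one choice).
    outcome_model = RandomForestClassifier(n_estimators=500, random_state=0)
    outcome_model.fit(np.column_stack([a, X]), y)

    # "Impute" each subject's outcome probability under treatment and control...
    mu1_hat = outcome_model.predict_proba(np.column_stack([np.ones_like(a), X]))[:, 1]
    mu0_hat = outcome_model.predict_proba(np.column_stack([np.zeros_like(a), X]))[:, 1]

    # ...then average over the empirical distribution of X and take the difference:
    # psi_hat = P_n[mu1_hat(X) - mu0_hat(X)].
    return np.mean(mu1_hat - mu0_hat)
```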
Asymptotic Analysis
A naive plug-in estimator is nice because it's easy to understand and construct, but since we haven't said anything about what kind of algorithm we use to estimate $\hat P$, we can't ensure that the plug-in will behave nicely. As a stupid example, consider an algorithm that completely ignores the data and always returns some fixed distribution $\hat P = P^\circ$. The plug-in estimate based on that distribution will always be the same number $\psi(P^\circ)$, so this estimator won't have a normal sampling distribution. Nor will it be centered at the true parameter unless we get lucky and $\psi(P) = \psi(P^\circ)$.
Our goal, therefore, is to come up with some "rules" that our estimator $\hat P$ has to follow in order to make the plug-in estimator asymptotically linear with the efficient influence function:
$$\left[\psi(\hat P) - \psi(P)\right] = \mathbb P_n\phi^\dagger + o_P\left(\frac{1}{\sqrt n}\right)$$
Notation: $o_P(\cdot)$
Often in asymptotic statistics we want to say something like "$Y = X + U_n$ where $U_n \overset{P}{\to} 0$". This is kind of annoying to write, so instead you'll more often see $Y = X + o_P(1)$. You can read the term $o_P(1)$ as "some sequence of random variables that converges in probability to 0". The benefit of this is that we don't even have to give this sequence a name (like $U_n$, for example) if we don't care to. Moreover, we've hidden away any limit arrows and are left with an equation that we can manipulate algebraically.
This notation is extended to handle cases where we want to say "$Y = X + U_n$ where $r_nU_n \overset{P}{\to} 0$". The abbreviation for this is $Y = X + o_P(1/r_n)$. The way I think of this is that $r_nU_n = o_P(1)$, so we divide by $r_n$ to get $U_n = o_P(1)/r_n \equiv o_P(1/r_n)$. The last equality here defines what we mean when we say some variable $U_n$ is $o_P(1/r_n)$: namely that $U_n$ times $r_n$ converges in probability to 0. When this is the case we say that "$U_n$ converges at the rate $1/r_n$".
For example, in our definition of the influence function we had that $\sqrt n\left([\hat\psi - \psi] - \mathbb P_n\phi\right) \overset{P}{\to} 0$. If we let $R = [\hat\psi - \psi] - \mathbb P_n\phi$, then this just says $\sqrt n R \overset{P}{\to} 0$, which we can write as $R = o_P\left(\frac{1}{\sqrt n}\right)$. Now, spelling out $R$ gives us the alternative definition $[\hat\psi - \psi] = \mathbb P_n\phi + o_P\left(\frac{1}{\sqrt n}\right)$.
The intuition for "rates" is that often we're interested in random variables that go to 0 so quickly that even if they get blown up by some increasing sequence, the product still goes to 0. For example, if some variable $U_n$ is $o_P(1/\sqrt n)$, that means that $\sqrt n U_n$ still goes to 0, even though $\sqrt n$ is an increasing sequence that gets bigger and bigger.
Understanding this about rates also shows that anything that is $o_P(n^{-1/2})$ is also $o_P(1)$, but not the other way around, and similar facts of that nature. That's because something that converges to 0 even if it's blown up by $\sqrt n$ must go to 0 (even faster) if left alone.
You can learn more about this notation in chapter 2 of vdV98 and in a number of online resources (google "little oP" or "little oh p"). A thorough understanding requires a grasp of convergence in probability (see the recommended resources to learn more).
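As a quick numerical illustration of rates (a made-up toy, unrelated to our estimation problem): the squared sample mean of mean-zero noise is $o_P(1/\sqrt n)$, since even after multiplying by $\sqrt n$ it still heads to 0 as $n$ grows.

```python
import numpy as np

# U_n = (sample mean of n standard normals)^2 is O_P(1/n), hence o_P(1/sqrt(n)):
# sqrt(n) * U_n still converges to 0.
rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    u_n = rng.normal(size=n).mean() ** 2
    print(n, np.sqrt(n) * u_n)  # this product typically shrinks toward 0
```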
From here on I'm going to use $\hat\psi$ and $\psi$ interchangeably with $\psi(\hat P)$ and $\psi(P)$ to keep things uncluttered when necessary. I'll also omit the dagger ($\dagger$) as a superscript of the efficient influence function, since from here on out we won't be talking about any influence functions that aren't efficient.
We want to get an idea of the magnitude of the estimation error $\hat\psi - \psi$. We actually have a pretty nice way to do this if we construct a path from $\hat P$ to $P$ and then use the pathwise derivative along this path, exploiting the fact that we know how to express the pathwise derivative at 0 as the covariance between the path's direction (the score $h$) and the canonical gradient of the parameter $\psi$ at the point where the path starts (here $\hat P$, so the gradient is the efficient influence function $\phi_{\hat P}$). To wit:
$$\begin{align*} \frac{\psi(P) - \psi(\hat P)}{\Delta \epsilon} &\approx E_{\hat P}[\phi_{\hat P} h] \\ \implies \psi(\hat P) - \psi(P) &\approx -E_{\hat P}[\phi_{\hat P} h](1-0) \\ &= -P\hat\phi \end{align*}$$
In this case $\Delta\epsilon = 1 - 0$. The equality in the last line is because $E_{\hat P}[\hat\phi h] = \int \hat\phi h \, d\hat P = \int \hat\phi \, dP = P\hat\phi$ by the definition of the score $h$ along a path from $\hat P$ to $P$ and the change-of-variables formula.
This is called the von Mises expansion of our estimate, which is a lot like the first-order Taylor expansion you may have seen in calculus. The idea is to use a derivative at a point and the value at that point to estimate the value of the function at a nearby point. It is an approximation, however. Because we let $\Delta\epsilon = 1$ be some finite value, the linear approximation won't be perfectly accurate. We'll call the approximation error $R$, so we have that $\hat\psi - \psi = R - P\hat\phi$.
This gets us a little closer to where we want to go. Starting from here, we can add and subtract $P\phi$, $\mathbb P_n\phi$, and $\mathbb P_n\hat\phi$ on the right-hand side and rearrange terms to get
$$[\hat\psi - \psi] = \mathbb P_n\phi \ \ \underbrace{-\ \mathbb P_n\hat\phi + (\mathbb P_n - P)(\hat\phi - \phi) + R}_{\text{must be } o_P(1/\sqrt n) \text{ for efficiency}}$$
Now we're good. Comparing this to the characterization of an efficient estimator $[\hat\psi - \psi] = \mathbb P_n\phi + o_P\left(\frac{1}{\sqrt n}\right)$, we see that for our estimator to be efficient, the sum of the three terms after the "efficient central limit theorem (CLT) term" $\mathbb P_n\phi$ needs to be $o_P(1/\sqrt n)$ to make the two characterizations match up. This will be the case if each of the three terms is $o_P(1/\sqrt n)$, so that's what we need to show. We'll give each of these terms names and get to know them more intimately one-by-one:
Plug-In Bias: $-\mathbb P_n\hat\phi$
Empirical Process: $(\mathbb P_n - P)(\hat\phi - \phi)$
Second-Order Remainder: $R = [\hat\psi - \psi] + P\hat\phi$
Plug-In Bias
The term $-\mathbb P_n\hat\phi$ is known as the plug-in bias term because, in general, it's not possible to show that it goes to zero at all for a generic plug-in estimate $\hat P$. If it doesn't go to zero, that makes our parameter estimate $\hat\psi = \psi(\hat P)$ biased, which is not at all what we want.
Control of plug-in bias?
The three different strategies for building efficient estimators that we're going to learn about differ only in the way that they ensure this term is eliminated. All of these methods eliminate the plug-in bias exactly without needing to make any additional assumptions about the initial plug-in estimate $\hat P$. We'll talk a lot more about that later.
Empirical Process
The term $(\mathbb P_n - P)(\hat\phi - \phi)$ is known as the empirical process term because there is an entire field called empirical process theory that studies things that look exactly like $\sqrt n(\mathbb P_n - P)f$. These creatures are called empirical processes because they are a special kind of stochastic process based on the empirical distribution $\mathbb P_n$.
We'll see there are two ways we can make sure this term is $o_P(1/\sqrt n)$: sample splitting and Donsker conditions. We'll discuss what these are shortly, but for both of them we will need to assume that $\|\hat\phi(Z) - \phi(Z)\|^2 \overset{P}{\to} 0$, where the norm is the $\mathcal L_2$ norm $\|f\|^2 = \int f^2 \, dP$. This is nothing but saying that the true mean-squared error of $\hat\phi$, viewed as a prediction of $\phi$, goes to zero. Note that $\hat\phi$ is itself a random function (because $\hat\phi$ comes from $\hat P$, which comes from the random data). In this notation we are not averaging over the variability of $\hat\phi$ itself when we take the norm, so $\|\hat\phi(Z) - \phi(Z)\|^2$ is itself a random variable and the convergence is in probability. We call this condition $\mathcal L_2$-consistency (of the estimated influence function).
$\mathcal L_2$-Consistency: $\|\hat\phi(Z) - \phi(Z)\|^2 \overset{P}{\to} 0$ and one of:
Use sample splitting
Estimate $\hat P$ such that $\hat\phi$ is in a Donsker class
This is the first restriction that we need to impose on the plug-in estimate $\hat P$ to ensure we get an efficient estimator of $\psi$. In fact, without this assumption, not only could our estimator be inefficient, it might not even be asymptotically normal.
You may wonder whether this $\mathcal L_2$-consistency condition is reasonable to assume. The truth is that in most cases it's very easy to satisfy! In other words, it's not even an assumption, it's just a required condition that we have to make sure we meet.
For example, when estimating the ATE in an observational study, we know the influence function depends on $\hat P$ through the functions $\hat\pi_a$ and $\hat\mu_a$. If you do a little bit of "limit algebra" with tools like Slutsky's theorem and the continuous mapping theorem, you can quickly show that all that is needed for $\mathcal L_2$-consistency of $\hat\phi$ is $\mathcal L_2$-consistency of $\hat\mu_a$ and $\hat\pi_a$, plus a uniform bound keeping $\hat\pi_a$ away from 0. The bound is not a problem because a) we already assume it for $\pi_a$ for identification, so it's very reasonable, and b) it's easy to enforce by truncating the output of the estimate (see the sketch below). Why is that good? Well, a large number of commonly used machine learning algorithms have been shown to be nonparametrically $\mathcal L_2$-consistent (e.g. highly-adaptive lasso, nearest neighbors, deep learning, boosting, and random forests). So if we estimate $\hat\mu_a$ and $\hat\pi_a$ using any of those machine learning algorithms, or any cross-validated ensemble of them, then we know we satisfy the $\mathcal L_2$-consistency condition and we have nothing to worry about from the empirical process term: it will go to zero quickly in large samples as long as we also meet one of the two following criteria.
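The truncation itself is a one-liner. Here's a sketch; the lower bound `0.01` is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np

def truncate_propensity(pi_hat, lower=0.01):
    # Keep estimated propensity scores away from 0 (and, symmetrically, 1)
    # so that 1 / pi_hat stays uniformly bounded.
    return np.clip(pi_hat, lower, 1 - lower)
```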
Control Via Sample Splitting
The simplest way to make sure this term goes away is to use sample splitting (also called cross-fitting or cross-estimation). The idea is to estimate all the parts of $P$ that go into the influence function using one sample and then to evaluate the estimator using another sample.
For example, when estimating the ATE in an observational study, we know the influence function depends on $\hat P$ through the functions $\hat\pi_a$ and $\hat\mu_a$. If we were using sample splitting, we'd split our sample $(Y,A,X)_{1,\dots,n}$ in two, say into datasets (1) $(Y,A,X)_{1,\dots,m}$ and (2) $(Y,A,X)_{m+1,\dots,n}$. The first of these we use to estimate $\hat\pi_a$ and $\hat\mu_a$, which we then take as fixed. We then evaluate the plug-in estimate using only the second sample: $\hat\psi^{(2)} = \frac{1}{n-m}\sum_{i=m+1}^n \hat\mu_1(X_i) - \hat\mu_0(X_i)$ (in this case we don't need $\hat\pi_a$ for the plug-in estimate). We can then reverse the roles of the two datasets to get $\hat\psi^{(1)}$ and obtain our final estimate as the average of $\hat\psi^{(1)}$ and $\hat\psi^{(2)}$.
If this sounds sort of like cross-validation to you, that's exactly right. It's the same idea and the same procedure. The only difference is that we're using the "validation sample" to get an estimate of the parameter $\psi$ instead of specifically using the validation sample to estimate the MSE of some prediction function.
Just as we do in cross-validation, this process can be extended to multiple ($K$) splits instead of just two. When we do this we use notation like $\hat\phi^{(-k)}$ or $\hat\mu_a^{(-k)}$ to denote the influence function and its components estimated using all the data except for the $k$th fold, and we use $\mathcal I_k$ to denote the set of indices $i$ that are included in the $k$th fold.
A relatively brief proof shows that if we use sample splitting and we assume the $\mathcal L_2$-consistency condition on $\hat\phi$, then the empirical process term is indeed $o_P(1/\sqrt n)$ as desired. This is remarkable because, as we've already discussed, in most cases it's pretty easy to prove $\mathcal L_2$-consistency for $\hat\phi$. So all we need to do to ensure that the empirical process term goes away is to use sample splitting, something we can implement with a simple for loop in code.
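Here's a sketch of that loop for the cross-fitted plug-in ATE, continuing the hypothetical `y`, `a`, `X` arrays and random forest learner from before and using scikit-learn's `KFold` to form the folds. (This is just the plug-in with sample splitting; the bias correction comes later.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def crossfit_plugin_ate(y, a, X, n_splits=5):
    """Cross-fitted plug-in ATE: fit mu_a on the other folds, evaluate on fold k."""
    fold_estimates = []
    for train_idx, eval_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        # Estimate mu_a(x) = P(Y=1 | A=a, X=x) using every fold except the k-th...
        model = RandomForestClassifier(n_estimators=500, random_state=0)
        model.fit(np.column_stack([a[train_idx], X[train_idx]]), y[train_idx])

        # ...then evaluate the plug-in estimate on the held-out fold only.
        mu1_hat = model.predict_proba(np.column_stack([np.ones(len(eval_idx)), X[eval_idx]]))[:, 1]
        mu0_hat = model.predict_proba(np.column_stack([np.zeros(len(eval_idx)), X[eval_idx]]))[:, 1]
        fold_estimates.append(np.mean(mu1_hat - mu0_hat))

    # The final estimate is the average of the per-fold estimates.
    return float(np.mean(fold_estimates))
```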
Proof: sample splitting controls the empirical process term
Throughout this proof we'll be considering the random variable $U = (\mathbb P_n^{(k)} - P)[\hat\phi(Z) - \phi(Z)]$, where $\mathbb P_n^{(k)}$ is the empirical measure over the data in fold $k$, which we'll denote $Z_{\{\mathcal I_k\}}$. We'll start by computing the mean and variance of this random variable conditioned on having observed a fixed dataset $Z_{\{\mathcal I_{(-k)}\}}$ comprising all the folds except the $k$th. By conditioning on $Z_{\{\mathcal I_{(-k)}\}}$, $\hat\phi$ becomes the fixed function $\hat\phi^{(-k)}$:
$$\begin{align*} E\left[U \,\big|\, Z_{\{\mathcal I_{(-k)}\}}\right] &= E\left[(\mathbb P_n^{(k)} - P)(\hat\phi - \phi) \,\big|\, Z_{\{\mathcal I_{(-k)}\}}\right] \\ &= E\left[\mathbb P_n^{(k)}(\hat\phi - \phi) \,\big|\, Z_{\{\mathcal I_{(-k)}\}}\right] - E\left[P(\hat\phi - \phi) \,\big|\, Z_{\{\mathcal I_{(-k)}\}}\right] \\ &= E\left[\hat\phi^{(-k)} - \phi\right] - E\left[\hat\phi^{(-k)} - \phi\right] \\ &= 0 \end{align*}$$
$$\begin{align*} V\left[U \,\big|\, Z_{\{\mathcal I_{(-k)}\}}\right] &= V\left[(\mathbb P_n^{(k)} - P)(\hat\phi - \phi) \,\big|\, Z_{\{\mathcal I_{(-k)}\}}\right] \\ &= V\left[\mathbb P_n^{(k)}(\hat\phi^{(-k)} - \phi) - \underbrace{P(\hat\phi^{(-k)} - \phi)}_{\text{constant}}\right] \\ &= V\left[\frac{1}{n_k}\sum^{n_k} \underbrace{\left(\hat\phi^{(-k)}(Z_i) - \phi(Z_i)\right)}_{\text{IID since fixed fns of IID vars}}\right] \\ &= \frac{1}{n_k} V\left[\hat\phi^{(-k)} - \phi\right] \\ &\le \frac{1}{n_k}\left\|\hat\phi^{(-k)} - \phi\right\|^2 \end{align*}$$
Now we apply Chebyshev's inequality, which generally quantifies the probability that a random variable takes a (de-meaned) value that's larger than its standard deviation: $P\left\{\frac{|U - E[U]|}{\sqrt{V[U]}} \ge a\right\} \le \frac{1}{a^2}$. Instead of using $U$, though, we use $U \,\big|\, Z_{\{\mathcal I_{(-k)}\}}$, which gives
$$P\left\{\sqrt{n_k}\, \frac{\left|(\mathbb P_n^{(k)} - P)(\hat\phi - \phi)\right|}{\|\hat\phi - \phi\|} \ge a \,\bigg|\, Z_{\{\mathcal I_{(-k)}\}}\right\} \le 1/a^2$$
Now we can take the expectation on both sides (averaging over $Z_{\{\mathcal I_{(-k)}\}}$) to get
$$P\left\{\sqrt{n_k}\, \frac{\left|(\mathbb P_n^{(k)} - P)(\hat\phi - \phi)\right|}{\|\hat\phi - \phi\|} \ge a\right\} \le 1/a^2$$
Which by the definition of stochastic boundedness shows that
$$\begin{align*} (\mathbb P_n^{(k)} - P)(\hat\phi - \phi) &= O_P(\|\hat\phi - \phi\|/\sqrt{n_k}) \\ &= O_P(\|\hat\phi - \phi\|/\sqrt n) \\ &= \frac{\|\hat\phi - \phi\|}{\sqrt n} O_P(1) \\ &= \frac{o_P(1)}{\sqrt n} O_P(1) \\ &= o_P(1/\sqrt n) \end{align*}$$
Here we've used the big-O notation for stochastic boundedness. This notation works the same way as the little-o, but instead of meaning that the term converges in probability to 0 it means that the term is bounded in probability (i.e. the tails don't get too fat). You can read more about this in chapter 2 of vdV98 or in other sources. In the second-to-last equality we used the $\mathcal L_2$-consistency assumption to say that $\|\hat\phi - \phi\| = o_P(1)$. In the last equality we used the fact that $o_P(1)O_P(1) = o_P(1)$.
This shows that we've controlled the empirical process term that shows up in the estimate from the $k$th fold, i.e. $\hat\psi^{(k)}$. Since this holds for all folds and the final estimate is the average of these estimates, the empirical process term for the full estimate is the average of the equivalent terms for the fold estimates. Since all of these are $o_P(1/\sqrt n)$, it follows that the empirical process term for the full estimate is also $o_P(1/\sqrt n)$.
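If you want to see this conclusion in action, here's a small Monte Carlo sketch in a made-up toy problem (all functions and constants are arbitrary choices for illustration). The only estimated piece of $\hat\phi$ here is a conditional mean fit by least squares on an independent "training" sample, and the $P$-expectation is approximated by a large independent draw; the scaled term $\sqrt n(\mathbb P_n^{(k)} - P)(\hat\phi - \phi)$ shrinks toward 0 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def m_true(x):
    # True conditional mean E[Y | X = x]; a polynomial, so the polynomial fit
    # below is a consistent estimator of it.
    return 1.0 + x - x**2

def fit_m_hat(x, y, degree=5):
    # Toy nuisance estimator: least-squares polynomial regression of Y on X.
    coefs = np.polyfit(x, y, deg=degree)
    return lambda x_new: np.polyval(coefs, x_new)

# Large independent draw used to approximate the P-expectation below.
x_big = rng.uniform(-3, 3, size=2_000_000)

for n in [500, 5_000, 50_000]:
    # "Other folds": training data used to build the nuisance (and hence phi_hat).
    x_tr = rng.uniform(-3, 3, size=n)
    y_tr = m_true(x_tr) + rng.normal(size=n)
    m_hat = fit_m_hat(x_tr, y_tr)

    # "Fold k": an independent evaluation sample of the same size.
    x_ev = rng.uniform(-3, 3, size=n)

    # With phi(z) = y - m(x) and phi_hat(z) = y - m_hat(x), the difference
    # phi_hat - phi reduces to m(x) - m_hat(x), so the empirical process term
    # is (P_n^(k) - P)(m - m_hat); the P part is approximated with x_big.
    emp_proc = np.mean(m_true(x_ev) - m_hat(x_ev)) - np.mean(m_true(x_big) - m_hat(x_big))

    print(n, np.sqrt(n) * emp_proc)  # typically shrinks toward 0 as n grows
```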
The only downside of using sample-splitting to control the empirical process term is that you might be afraid of some finite-sample bias that comes from effectively using less data. This should never be an issue in large samples, but may be of concern in small studies.
Control Via Donsker Conditions
The alternative to using sample splitting is to assume that our estimate of the influence function $\hat\phi$ falls into something called a Donsker class. A Donsker class is just a family of functions that are not "too complicated". There are many examples; for instance, all functions of bounded variation (think: bounded total "elevation gain" + "elevation loss" for a function of 1 variable) form a Donsker class. This example in particular is a very big class of functions, especially if the fixed bound is large.
If $\hat\phi$ is Donsker and $\mathcal L_2$-consistent, it follows immediately from an important result in empirical process theory (lemma 19.24 in vdV98) that $(\mathbb P_n - P)(\hat\phi - \phi) = o_P(1/\sqrt n)$. The proof of that requires more asymptotic statistics than I'd like to assume you already know, but if you're curious (and comfortable with the material in chapter 2 of vdV98), then by all means go ahead and read chapter 19 of vdV98!
So how do we make sure we satisfy the Donsker condition (i.e. make sure $\hat\phi$ is in a Donsker class)? Well, thankfully there are theorems that show that Donsker properties are preserved through most algebraic operations (again, see vdV98 ch. 19). The upshot of that is that if we can estimate the components of $\hat P$ that we need for $\hat\phi$ in a way that guarantees they fall in Donsker classes, then $\hat\phi$ will be Donsker. In our observational study ATE example, that means that whatever machine learning algorithms we use to learn the functions $\hat\mu_a$ and $\hat\pi_a$ need to always return functions that are in a known Donsker class if we want $\hat\phi$ to be Donsker.
There are two major problems with this. Firstly, the vast majority of machine learning algorithms are not guaranteed to return learned functions that are in some Donsker class (though some, like the highly-adaptive lasso, are). Secondly, for $\mathcal L_2$-consistency to hold at the same time as the Donsker condition, we must assume that the true influence function also falls into some Donsker class (technically the same one as our estimates, but since unions of Donsker classes are Donsker we can just join the two). Donsker classes are very often quite big, so this isn't that crazy of an assumption to make. Nonetheless it's something to be aware of.
Overall, Donsker conditions are a less robust way of controlling the empirical process term relative to sample splitting because they impose stricter requirements on the algorithms you can use to estimate the components of $\hat P$ and also because they imply a restriction on the model space $\mathcal M$. However, in small data settings these may be reasonable costs to pay to ensure asymptotically efficient inference without having to do sample splitting.
Second-Order Remainder
The term $R = [\hat\psi - \psi] + P\hat\phi$ is known as the second-order remainder because, like the empirical process term, it can usually be shown to disappear at a "second-order" rate like $o_P(1/\sqrt n)$ or better under certain conditions. The difference is that these conditions are often specific to the exact parameter and model we're working with, unlike the empirical process term, which we've seen can be bounded in a very general way. Therefore the most general condition we can impose on $\hat P$ is that the estimate behaves in a way such that $R = o_P(n^{-1/2})$.
Depending on the problem, this may require us to be able to estimate regression functions (like $\mu_a$ and $\pi_a$ in our running example) at fast rates. This is a strong condition that we'll say more about below.
Second-Order Remainder is second order: $[\hat\psi - \psi] + P\hat\phi = o_P(n^{-1/2})$
May require fast (e.g. $o_P(n^{-1/4})$) $\mathcal L_2$-convergence of regression estimators
Example: ATE in Observational Study
For example, consider again the ATE in an observational study, or, to simplify things, just the conditional mean $\psi_a = E[Y|A=a] = E[\mu_a(X)]$. We can now exactly compute the second-order remainder using the expression for the influence function we previously derived:
$$\begin{align*} R &= \hat\psi_a - \psi_a + P\hat\phi_a \\ &= \cancel{\hat\psi_a} - E[\mu_a(X)] + E\left[\frac{1_a(A)}{\hat\pi_a(X)}\left(Y - \hat\mu_a(X)\right) + \left(\hat\mu_a(X) - \cancel{\hat\psi_a}\right)\right] \\ &= E\left[\frac{1}{\hat\pi_a(X)}\bigg(\pi_a(X) - \hat\pi_a(X)\bigg)\bigg(\mu_a(X) - \hat\mu_a(X)\bigg)\right] \\ &\lesssim \|\pi_a(X) - \hat\pi_a(X)\|\, \|\mu_a(X) - \hat\mu_a(X)\| \end{align*}$$
Derivation details
To get to the second-to-last equation we use the fact that $E\left[\frac{1_a(A)}{\hat\pi_a(X)}Y\right] = E\left[E\left[\frac{1_a(A)}{\hat\pi_a(X)}Y \,\big|\, X\right]\right] = E\left[\frac{1}{\hat\pi_a(X)}E\left[1_a(A)Y \,\big|\, X\right]\right] = E\left[\frac{1}{\hat\pi_a(X)}E\left[1_a(A)\,\big|\, X\right]E\left[Y \,\big|\, A=a, X\right]\right] = E\left[\frac{\pi_a(X)}{\hat\pi_a(X)}\mu_a(X)\right]$. The last inequality comes via Cauchy-Schwarz and assumes $\hat\pi_a$ is bounded away from 0 (the bound contributes a constant factor, which is why we write $\lesssim$).
The notation doesn't make it terribly clear, but the expectations and norms here are with respect to $(Y,A,X)$ alone: we're not averaging over the randomness in the estimation of $\hat\mu_a$ or $\hat\pi_a$ (remember these come from $\hat P$, which is something we estimated from the data). One way to think of this is to imagine that we've used a separate sample to estimate $\hat\mu_a$ and $\hat\pi_a$ and these expectations are conditional on the (independent) data in that sample. As a consequence, $R$ is indeed a random variable.
Given the final expression we've arrived at, a set of sufficient conditions for $R = o_P(1/\sqrt n)$ is that $\|\pi_a(X) - \hat\pi_a(X)\| = o_P(n^{-1/4})$ and $\|\mu_a(X) - \hat\mu_a(X)\| = o_P(n^{-1/4})$. This is because $o_P(n^{-1/4})\, o_P(n^{-1/4}) = o_P(n^{-1/2})$. We could also require that one of these goes faster and the other slower, as long as the product of the rates is $n^{-1/2}$.
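To convince yourself of the algebra above, here's a Monte Carlo sketch that checks the identity numerically in a made-up data-generating process with deliberately wrong nuisance "estimates" (all functions and constants below are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true nuisances for the a = 1 arm.
pi1 = lambda x: 0.3 + 0.4 * x        # P(A=1 | X=x)
mu1 = lambda x: 0.2 + 0.6 * x        # P(Y=1 | A=1, X=x)
# Deliberately wrong "estimates" of both nuisances (fixed functions, as if fit
# on a separate, independent sample).
pi1_hat = lambda x: pi1(x) - 0.10
mu1_hat = lambda x: mu1(x) - 0.20

# Simulate a large sample from P to approximate the expectations.
n = 5_000_000
x = rng.uniform(size=n)
a = rng.binomial(1, pi1(x))
y = rng.binomial(1, mu1(x))          # only enters below through the A = 1 arm

psi1 = np.mean(mu1(x))               # psi_1 = E[mu_1(X)]
psi1_hat = np.mean(mu1_hat(x))       # a plug-in value; it cancels inside R anyway

# Left-hand side: R = psi1_hat - psi1 + P(phi1_hat), where phi1_hat is the
# estimated influence function for E[Y | A=1].
phi1_hat = (a / pi1_hat(x)) * (y - mu1_hat(x)) + mu1_hat(x) - psi1_hat
lhs = psi1_hat - psi1 + np.mean(phi1_hat)

# Right-hand side: the product-of-nuisance-errors form derived above.
rhs = np.mean((pi1(x) - pi1_hat(x)) * (mu1(x) - mu1_hat(x)) / pi1_hat(x))

print(lhs, rhs)  # the two agree up to Monte Carlo error
```

Setting `pi1_hat = pi1` (or `mu1_hat = mu1`) in this sketch drives both sides to zero, which anticipates the double-robustness discussion below.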
Attaining $o_P(n^{-1/4})$ Rates and the Highly Adaptive Lasso
Asking for an $o_P(n^{-1/4})$ rate on the estimation of the functions $\hat\mu_a$ and $\hat\pi_a$ is much stronger than the consistency we required to control the empirical process term (that's just an $o_P(1)$ requirement). For the latter, we just need those functions to go to their true values in the limit as we get more and more training data. For the former, we need that limiting behavior to happen fast enough. If the rates are too slow, we can't guarantee that the $R$ term gets small fast enough for the efficient central limit term to dominate in large samples. As a result, our estimator might not even have a normal sampling distribution, which would crush any chance we have at performing inference.
Without putting unrealistic smoothness conditions on the true functions $\mu_a$ and $\pi_a$, it is very, very difficult to attain the required $o_P(n^{-1/4})$ rate, even with a machine learning algorithm. In fact, there are some general theorems that show that attaining such a rate is impossible in general high-dimensional function spaces without such smoothness conditions. This is called the curse of dimensionality.
The curse of dimensionality made efficient inference in observational studies difficult if not impossible in the general nonparametric model. The strategy in the past was to use a cross-validated ensemble of machine learning methods and pray that the true function was smooth or low-dimensional enough that one of these methods actually could attain the $o_P(n^{-1/4})$ rate or better. Cross-validation is known to asymptotically select the best learner, which justifies this strategy as long as prayers are granted.
The highly-adaptive lasso (HAL) is a recent method for supervised learning that effectively solves this problem. Under non-restrictive assumptions, HAL can attain the $o_P(n^{-1/4})$ rate (better, actually!). This is astounding. HAL still requires an assumption to be made on the regression function of interest (thus restricting the model space), but this assumption is much more believable than asking for low-dimensionality or smoothness. Specifically, HAL asks that the true regression function (e.g. $\mu_a$) is right-continuous with left limits (cadlag) with bounded variation norm. This norm can be arbitrarily large, as long as it is finite. But this is barely an assumption at all because it's difficult to imagine a real-world scenario with a regression function that isn't cadlag with bounded variation norm. The genius of HAL, and the trick to how it "breaks" the curse of dimensionality, is that it places no restrictions on the local behavior of the regression function (e.g. how wiggly/smooth it is) but instead restricts the global behavior (how much it can go up and down in total).
Double Robustness
In the observational study example we saw that the second-order remainder could be controlled if the product of two rates is $o_P(n^{-1/2})$. This happens in several other estimation problems, so there is a general name for the phenomenon. When the second-order remainder takes such a form, we say that the estimation problem is doubly robust.
The reason this term is appropriate is because we have two chances to completely cancel out the second-order remainder term. For instance, consider the ATE observational study example, but now presume that $\pi_a$ is known (i.e. we're in an RCT). Then $\|\pi_a(X) - \hat\pi_a(X)\| = 0$, so the whole $R$ term is exactly 0. The same thing happens if we know $\mu_a$.
Double robustness is a property of the estimation problem, not of any particular estimator. Some estimators that use both $\hat\pi_a$ and $\hat\mu_a$ (or equivalents for the estimation problem in question) are sometimes called "doubly robust", but it's most accurate to say that these estimators exploit double robustness to attain efficiency and asymptotic normality under certain conditions (e.g. $o_P(n^{-1/4})$ rates on the regression estimates). Moreover there are estimation problems that may have triple, quadruple, etc. robustness, or only single robustness (i.e. only a single function needs to be estimated), etc.
Double robustness was historically more important than it is today because parametric models were used more often to estimate nuisance components. It's rarely if ever the case that a parametric model can capture the truth, but at least if you have two (or more) chances to be right then there's a little more hope. On the other hand, double robustness says nothing about what happens when both models are wrong, even if one or the other is still "close" enough in some sense. Since that's more realistically the case with parametric models, some have argued that the entire premise is somewhat meaningless. Either way, it's rarely justifiable not to leverage modern, flexible estimators like HAL in current practice, so assuming consistency of nuisance components is no longer a matter of hope. This makes the entire question of double robustness more of a moot point than it has ever been.
Summary
By analyzing the naive plug-in estimator, we managed to come to the following decomposition of estimation error that shows us exactly what conditions must be satisfied in order for our plug-in to be efficient.
$$[\hat\psi - \psi] = \mathbb P_n\phi \ \ \underbrace{-\ \mathbb P_n\hat\phi + (\mathbb P_n - P)(\hat\phi - \phi) + R}_{\text{must be } o_P(1/\sqrt n) \text{ for efficiency}}$$
Specifically, we need the following three terms to be $o_P(n^{-1/2})$:
Plug-In Bias: $-\mathbb P_n\hat\phi$
Empirical Process: $(\mathbb P_n - P)(\hat\phi - \phi)$
Second-Order Remainder: $R = [\hat\psi - \psi] + P\hat\phi$
And we came up with the following conditions to ensure that these terms do indeed go away:
Control of plug-in bias?
Empirical process is $o_P(n^{-1/2})$ if we have $\mathcal L_2$-Consistency: $\|\hat\phi(Z) - \phi(Z)\|^2 \overset{P}{\to} 0$ and one of:
Use sample splitting
Estimate $\hat\phi$ such that it is in a Donsker class
Requires $\mathcal L_2$-convergence of regression estimators (e.g. $\hat\mu_a$)
Second-Order Remainder is second order: $[\hat\psi - \psi] + P\hat\phi = o_P(n^{-1/2})$
May require fast (e.g. $o_P(n^{-1/4})$) $\mathcal L_2$-consistency of regression estimators
Effectively, we have found reliable ways to control both the empirical process and second-order remainder terms, as long as a) the components of our plug-in $\hat P$ are modeled with powerful machine learning techniques that converge quickly to the truth as data are added and b) we use sample splitting or algorithms that guarantee their estimates are not too flexible.
The only thing that we're missing, therefore, are conditions or methods that control the plug-in bias. There are three ways that have been proposed over the years to handle this and that's exactly what we're going to discuss in the subsequent sections. Once this is handled, we've got an estimator that has the efficient influence function and therefore attains the minimum possible asymptotic variance!
The conditions required for control of the empirical process and 2nd order remainder terms are all totally attainable in practice, meaning that we can build efficient estimators for most estimation problems without making any scientifically meaningful statistical assumptions. Remember, however, that we usually do need scientifically meaningful causal assumptions in order to link our statistical parameter to a causal parameter of interest! These are separate, though, and apply uniformly regardless of how we choose to estimate the statistical parameter.