4.4 Targeted Maximum Likelihood
The final and most modern approach to the construction of efficient estimators is called targeted maximum likelihood estimation (or TMLE). TMLE avoids the problems with the bias-correction and estimating equation approaches because it goes back to the plug-in philosophy. That is, a TMLE estimator is always a plug-in estimate of the parameter evaluated at some plug-in estimate $\hat P^*$. Formally, $\hat\psi = \psi(\hat P^*)$. We'll discuss more below about why being a plug-in is a nice thing.
But if TMLE is a plug-in estimate, how does it avoid plug-in bias while still eliminating the empirical process and 2nd-order remainder terms asymptotically? The trick is that instead of evaluating the plug-in at the initial estimate $\hat P$, which would incur bias, TMLE iteratively perturbs this estimate in a special way to obtain another estimate $\hat P^*$ that gets rid of the bias.
First we'll describe the algorithm that takes the initial estimate $\hat P$ and perturbs it to obtain $\hat P^*$. We'll call this the targeting step. This will show why the plug-in bias of our final plug-in estimator is zero: $\mathbb P_n \hat\phi^* = 0$ (I'll abbreviate $\hat\phi^* = \phi_{\hat P^*}$). Then we have to argue that by transforming $\hat P$ into $\hat P^*$ we haven't sacrificed the conditions that help us bound the empirical process and 2nd-order remainder terms. At that point, since we've handled all three terms in the plug-in expansion (leaving only the efficient CLT term), we'll have all we need to show that the TMLE $\psi(\hat P^*)$ is efficient.
Targeting the initial plug-in estimate $\hat P$ produces a new plug-in $\hat P^*$ that makes the plug-in estimator more accurate.
Targeting the Plug-In
The only practical difference between a TMLE and a naive plug-in estimator is the targeting step that takes the initial estimator $\hat P$ and transforms it into the plug-in $\hat P^*$, which we say is targeted towards the parameter $\psi$. The way this is done is as follows.
First, we grab the efficient influence function at our initial estimate, which is $\hat\phi$. We then construct a path starting at $\hat P$ going in the direction defined by $\hat\phi$ (remember the efficient influence function is in the tangent space and is thus also a score, i.e. a valid direction for a path). In terms of densities, we can write this path as $\tilde p_\epsilon = (1 + \epsilon \hat\phi)\hat p$. This path corresponding to the efficient influence function is called the (locally) least-favorable or hardest submodel.
Why is it called the "hardest submodel"?
The term "hardest submodel" is appropriate because, firstly, any path is just a collection of probability distributions that are all in M\mathcal M, though not all of them. Another term for "path" is therefore submodel, though "path" is more specific. The synonymous terms "hardest" or "least favorable" take a little more explaining. These have to do with the fact that estimating ψ\psi is "hardest" within the path that goes in the direction of the canonical gradient, relative to all other possible paths.
"Hardest" is defined in terms of the asymptotic variance of our best possible estimator. In other words, if we were to forget about the larger model M\mathcal M and only consider statistical estimation within the submodel {p~ϵ:ϵ[0,1]}\{\tilde p_\epsilon: \epsilon \in[0,1]\}, what could we say about the minimum attainable asymptotic variance if we were trying to estimate ψ(P^)\psi(\hat P)?
As argued below, this is a parametric model with a single parameter ϵ\epsilon. For the moment, let's consider the problem of estimating the maximizer of the true log-likelihood under P^\hat P. In other words, we're looking for the distribution in the path that is most like P^\hat P. Obviously that will be P^\hat P itself (i.e. ϵ=0\epsilon=0). But if we didn't know that a-priori and we had to estimate ϵ\epsilon using the data, what could we say about the asymptotic variance of our estimate ϵ^\hat\epsilon?
Since this is a parametric model, we can use the Cramer-Rao lower bound, which says that the minimum possible asymptotic variance that a RAL estimator of a univariate parameter $\epsilon$ can attain is $E\left[-\frac{d^2}{d\epsilon^2} \log \tilde p_\epsilon \big|_{\epsilon = 0}\right]^{-1}$. This is the inverse Fisher information for the parameter $\epsilon$ at 0 (i.e. the true value). Using the definition of the path you can see that this is the same as $E\left[h^2\right]^{-1}$ (for a path that goes in some direction $h$ / has score $h$).
The reason this is useful is because now we can consider the plug-in $\psi(\tilde P_{\hat\epsilon})$. What is the asymptotic variance of this plug-in? First off, we have to realize that within the path, $\psi(\tilde P_\epsilon)$ is a univariate function of $\epsilon$ alone. The delta method tells us that the asymptotic variance of a function of an estimated parameter is given by the squared derivative $\left(\frac{d}{d\epsilon} \psi(\tilde P_{\epsilon})\big|_{\epsilon=0}\right)^2$ (i.e. evaluated at the truth) times the asymptotic variance of the estimate for the parameter. Using our fundamental identity relating the pathwise derivative and a covariance between score and gradient, we have that the derivative is the same as $E[h\hat\phi]$ (our path starts at $\hat P$).
Putting it together, our best-case scenario is that the asymptotic variance of $\psi(\tilde P_{\hat\epsilon})$ is $E[h\hat\phi]^2/E[h^2]$, or, geometrically in $\mathcal L_2^0$, $\langle h,\hat\phi \rangle^2 /\|h\|^2$. From this it is immediate (e.g. from Cauchy-Schwarz) that $h = \hat\phi$ gives the maximum value. Since this is the best-case asymptotic variance, and higher is worse in terms of estimation, we have that the path in the direction of $\hat\phi$ gives us the worst possible best-case asymptotic variance among all paths.
Intuitively, what makes certain model-parameter combinations difficult to estimate in parametric models is that there are cases where you can change the distribution just a tiny bit, but the target parameter changes a lot. That's bad because if you try to fit the data and you get some estimated distribution, you still have a lot of uncertainty about what the parameter is: slightly different data would give you a slightly different estimated distribution, but possibly a big difference in the estimated parameter. So that is the "intuitive" meaning behind optimal asymptotic variance for a parameter in a parametric model and explains why "hardness" of the estimation problem is inherently linked to the minimum attainable asymptotic variance.
Since $\hat p$ and $\hat\phi$ are fixed (i.e. we already estimated them), the space of models that lie along this path (i.e. $\{\tilde p_\epsilon: \epsilon > 0\}$) is a parametric model space with a single parameter: $\epsilon$. As such, we can do boring old maximum likelihood estimation to find a value $\hat\epsilon$ that maximizes the log-likelihood of $\tilde p_\epsilon$ along this path:
$$\hat\epsilon = \argmax_{\epsilon} \mathbb P_n \log \tilde p_\epsilon$$
If you prefer, you can also write this out very explicitly as $\hat\epsilon = \argmax_{\epsilon} \frac{1}{n}\sum_i^n\log\left((1+\epsilon\hat\phi(Z_i))\hat p(Z_i)\right)$ so you can see it's clearly a single-dimensional optimization problem in $\epsilon$ (again recall $\hat p$ and $\hat\phi$ are fixed functions by now).
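To make the one-dimensional nature of this optimization concrete, here is a minimal numerical sketch in Python. Everything in it is illustrative: the data are simulated, `p_hat` is a deliberately crude initial density estimate, and `phi_hat` is just a stand-in bounded, mean-zero direction rather than an actual efficient influence function.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
z = rng.normal(loc=0.3, scale=1.0, size=500)   # observed data

# Toy initial estimate p_hat: a standard normal density (deliberately misspecified mean)
def p_hat(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

# Stand-in "direction" phi_hat: bounded and mean-zero under p_hat, so that
# (1 + eps * phi_hat) stays positive for |eps| < 1. In a real TMLE this would be
# the efficient influence function evaluated at p_hat.
def phi_hat(z):
    return np.tanh(z)

# Empirical log-likelihood along the path p_eps = (1 + eps * phi_hat) * p_hat
def neg_loglik(eps):
    return -np.mean(np.log((1 + eps * phi_hat(z)) * p_hat(z)))

# One-dimensional maximum likelihood over eps (bounds keep the density nonnegative)
res = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded")
print(f"eps_hat = {res.x:.4f}")
```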
The game now is to iterate our initial guess $\hat P$ according to the formula
$$\tilde p_{\hat\epsilon} = (1+ \hat\epsilon \hat\phi)\hat p$$
at each step replacing the previous plug-in estimate $\hat p$ and its influence function $\hat\phi$ with the result of the previous iteration ($\tilde p_{\hat\epsilon}$ and the corresponding $\phi_{\tilde P_{\hat\epsilon}}$). At each step we are always moving in the direction of whatever the hardest submodel is at that point and we always increase the empirical log-likelihood relative to the previous plug-in estimate. We do this until the algorithm converges, i.e. $\hat\epsilon \approx 0$. This must occur eventually unless the empirical likelihood is unbounded over the model space, which wouldn't make any sense. At this point we declare $\hat p^* = \tilde p_{\hat\epsilon}$, our final plug-in estimate of the distribution.
Something cool about this is that we've managed to take something that is generally impossible (maximum likelihood in infinite dimensions) and replaced it with a series of maximum likelihood estimation problems, each of which is one-dimensional. This is why this approach is called targeted maximum likelihood: at each step, we're solving a maximum likelihood problem, but we've targeted it in a way that helps estimate our parameter of interest.
What is maximum likelihood estimation?
In case you're not familiar with it, maximum likelihood estimation is a general-purpose technique most commonly used in parametric model spaces. The goal is to find the distribution $\hat P$ in the model that maximizes the likelihood of the observed data. We're trying to answer the question: "of all the possible probability distributions in the model, which is the most likely to have generated the data that I observe?". Or, formally, $\hat P = \argmax_{P\in \mathcal M} p(Z_{1, \dots, n})$. "Likelihood" is just a term that we use to describe the density $p$ when we're talking about it as a quantity that itself isn't fixed (in the sense that we don't know what the true density is so we have to consider multiple options). If $Z_{1,\dots,n}$ are assumed independent and identically distributed, the density factors over $Z_i$ and if we take a log (the argmax is the same either way) the product becomes a sum so we get $\hat P = \argmax_{P \in \mathcal M} \frac{1}{n}\sum_i^n \log p(Z_i)$. We can also use the more condensed notation $\hat P = \argmax_{P \in \mathcal M} \mathbb P_n \log p$.
This strategy is most often only usable for parametric models. If we have a parametric model, the maximum likelihood problem $\hat P = \argmax_{P \in \mathcal M} \mathbb P_n \log p$ can be equivalently stated as $\hat\theta = \argmax_{\theta \in \Theta} \mathbb P_n \log p_\theta$ (with $\hat P = P_{\hat\theta}$). The advantage of this is that we've reduced the problem of looking at every possible distribution in $\mathcal M$ to a problem of looking at all possible values of the parameter in $\Theta$, which is a finite-dimensional vector space like $\mathbb R^p$. If the log-likelihood function is differentiable in $\theta$, this is a classical optimization problem, the kind of which is solvable using tools from calculus. In particular, if $\mathbb P_n \log p_\theta$ as a function of $\theta$ is maximized at some value $\hat\theta$, then we know that the derivative at $\hat\theta$ must be 0, i.e. $0 = \mathbb P_n \frac{d}{d\theta}\log p_{\hat\theta}$. For simple models (e.g. linear regression) this equation often has an analytical solution, but generally you can use algorithms like gradient descent to find the maximum (or minimum if you put in a minus sign). The quantity $\frac{d}{d\theta}\log p_\theta$ is called the score with respect to $\theta$ and exactly corresponds to a path in the model space where we increment $\theta$ by some small amount. You can think of it as the sensitivity of the log-likelihood to small changes in the parameters (which is different at each point $p=p_\theta$). When $\theta$ has multiple components, the score is a vector of partial derivatives $[\frac{\partial}{\partial \theta_1}\log p_\theta, \dots, \frac{\partial}{\partial \theta_p}\log p_\theta]$, each of which may be called the score of $\theta_j$. We call $\mathbb P_n \frac{d}{d\theta}\log p_{\theta} = 0$ a set of score equations (the free variable here is $\theta$) and we say that the maximum likelihood estimator $\hat\theta$ solves these score equations because at $\theta = \hat\theta$ we know we have to have $\mathbb P_n \frac{d}{d\theta}\log p_{\hat\theta} = 0$.
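As a tiny illustration of solving a score equation, here's a hedged Python sketch for the Bernoulli model, where the empirical score equation has the sample mean as its analytical solution. The data and names here are purely illustrative.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
z = rng.binomial(1, 0.35, size=1000)     # i.i.d. Bernoulli(theta) data, theta unknown

# Score of the Bernoulli log-likelihood: d/dtheta log p_theta(z) = z/theta - (1-z)/(1-theta)
def empirical_score(theta):
    return np.mean(z / theta - (1 - z) / (1 - theta))

# The MLE solves the empirical score equation P_n d/dtheta log p_theta = 0 ...
theta_hat = brentq(empirical_score, 1e-6, 1 - 1e-6)

# ... and for this model it coincides with the analytical solution, the sample mean.
print(theta_hat, z.mean())
```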
All of this falls apart very quickly if you work in nonparametric model spaces because we can't transform the maximum likelihood problem into a finite-dimensional optimization. How do you search over an infinite-dimensional space? Even if you have a sensible way of writing the densities in your model as a function of an infinite-dimensional parameter (e.g. another function, say $\theta(\xi)$), then for each "index" $\xi$ you have a score $\frac{\partial}{\partial \theta(\xi)}\log p_\theta$, which means you need to solve an infinite number of score equations.
However, it remains to show that this actually gets us anywhere useful. In particular, we haven't explained why we're always moving in the direction of the hardest submodel. Why not pick some other direction (score) $h$? We'd still be increasing the likelihood at each step, after all.
Well, here's the answer. Remember that at each step, $\hat\epsilon$ is the maximizer of $\mathbb P_n \log \tilde p_\epsilon$, which is a univariate function of $\epsilon$. We know from calculus that the derivative of the maximized function at the maximizing value must be zero: $0 = \frac{d}{d\epsilon} \mathbb P_n \log\tilde p_{\epsilon} \big|_{\epsilon = \hat\epsilon}$. Moreover recall that for a path $\tilde p_\epsilon$ in a direction $h$, we have that $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = h$. After a certain number of iterations, we know that our algorithm converges, so $\hat\epsilon = 0$, $\tilde p_{\epsilon=0} = \hat p^*$ (and the direction is $h=\hat\phi^*$). Putting that all together we get
$$0 = \mathbb P_n \hat\phi^*$$
This is precisely the plug-in bias term for the plug-in estimator $\psi(\hat P^*)$, and we've shown that it is exactly 0. That explains why we need to move in the direction of the gradient $\hat\phi$ at each iteration in the targeting step. If we were to move in a different direction we wouldn't end up cancelling the plug-in bias at the very end.
There is also an intuitive reason for choosing to move in the direction of the gradient, which is (as we've called it) the hardest submodel. This model is the hardest to estimate in because small changes in the distribution can create large changes in the target parameter. So by perturbing our estimate in the direction of the hardest submodel, what we're doing is finding the smallest possible change to our initially estimated distribution that creates the biggest possible debiasing effect in the parameter estimate.
Submodels that Span the Gradient
It's easiest to come to terms with TMLE if you think in terms of one-dimensional submodels where the path is always in the direction of the canonical gradient. However, the algorithm actually works the same if instead of one-dimensional submodels, you pick $d$-dimensional submodels that still span the canonical gradient at each step. What I mean by this is that often you can decompose the canonical gradient $\hat\phi = \hat\phi_1 + \hat\phi_2$ into separate pieces (we saw an example of this in the previous chapter). Then, instead of creating a one-dimensional submodel in the direction $\hat\phi$, we can create a two-dimensional submodel where $\tilde p_{\epsilon_1, \epsilon_2} = (1 + \epsilon_1 \hat\phi_1 + \epsilon_2 \hat\phi_2)\hat p$.
We say this submodel spans the gradient because the 1-dimensional path we were using before is actually included in this model if we let $\epsilon_1 = \epsilon_2$. It turns out that's all we need: the targeting step works equally well as long as the canonical gradient can be made up out of a sum of the scores for each parameter $\epsilon_j$. In the targeting step we now need to replace optimization over $\epsilon$ with optimization over $(\epsilon_1, \epsilon_2)$, but this can in fact be easier than optimization over a single $\epsilon$ because of the way the distribution factors. If the different directions in the submodel correspond to scores that are orthogonal, and each of these scores corresponds to an independent factor of the distribution, then we can perform the search for each $\hat\epsilon_j$ independently from the others (i.e. temporarily setting $\epsilon_{j'}=0$ for $j' \ne j$). You should try to rigorously verify for yourself why that is true (and/or draw a picture!).
At any rate, we still end up with $0 = \mathbb P_n \hat\phi^*$ at the end of the targeting step if we work with these larger submodels. You can again verify that is the case for yourself by following the outline we used above for the one-dimensional case.
Different Paths with the Same Score
Up until now we've only been working with paths (or submodels) that look like $\tilde p_\epsilon = (1 + \epsilon h)\hat p$. However, it's possible to build paths that are defined differently but which still have the same score. This is because the score just needs to determine the initial "direction" of the path; after that it can do whatever it wants. As far as TMLE is concerned (or actually the definition of regularity, etc., for that matter) a path can actually be any function $\tilde p_\epsilon$ of $\epsilon$, $h$, and $\hat p$ such that $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = h$ (i.e. this path has score $h$) and $\tilde p_{\epsilon=0} = \hat p$ (the path starts at $\hat p$). Our paths up until this point, e.g. $\tilde p_\epsilon = (1 + \epsilon h)\hat p$, are just one possible example. An alternative example is $\tilde p_\epsilon = C(\epsilon)e^{\epsilon h}\hat p$ with $C(\epsilon)$ some normalizing constant to make sure that the result integrates to 1. It's actually really nice that we have so much flexibility in choosing the path because different formulations may be much easier to work with in terms of finding $\hat\epsilon$ at each step.
Three different paths that all start at $\hat P$ and have score $\hat\phi$.
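If you want to convince yourself numerically that two differently-defined paths can share the same score, here is a small sketch. It assumes a toy standard-normal $\hat p$ and a stand-in direction $h$ (neither comes from the text above) and compares finite-difference scores at $\epsilon=0$ for the "linear" path $(1+\epsilon h)\hat p$ and the exponentially tilted path $C(\epsilon)e^{\epsilon h}\hat p$.

```python
import numpy as np
from scipy.integrate import quad

# Toy setup: p_hat is a standard normal density, h is a bounded mean-zero direction (score)
def p_hat(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def h(z):
    return np.tanh(z)

# Linear path: (1 + eps*h) * p_hat
def log_p_linear(eps, z):
    return np.log((1 + eps * h(z)) * p_hat(z))

# Exponential path: C(eps) * exp(eps*h) * p_hat, with C(eps) a normalizing constant
def log_p_expo(eps, z):
    norm, _ = quad(lambda u: np.exp(eps * h(u)) * p_hat(u), -np.inf, np.inf)
    return eps * h(z) + np.log(p_hat(z)) - np.log(norm)

# Finite-difference scores at eps = 0, evaluated at a few points z
eps = 1e-5
zs = np.array([-1.0, 0.0, 2.0])
score_linear = (log_p_linear(eps, zs) - log_p_linear(-eps, zs)) / (2 * eps)
score_expo = np.array([(log_p_expo(eps, z) - log_p_expo(-eps, z)) / (2 * eps) for z in zs])
print(score_linear)   # both should be close to h(zs)
print(score_expo)
print(h(zs))
```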
Therefore, no matter the definition of the path, as long as:
$$\tilde p_0 = \hat p$$
$$\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi$$
(i.e. it starts at our current estimate and has score equal to its canonical gradient), then this path will be valid for the targeting step.
Given this understanding, you should also now realize that there isn't one single hardest submodel. It's only after we decide on a fixed functional form of the submodel in $\epsilon$, $h$, and $\hat p$ that a hardest submodel becomes uniquely defined by setting $h=\hat\phi$.
Locally and Universally Least-Favorable
Since we have freedom in choosing whatever form of path is most convenient for us, you might wonder if there is some automated way in which the estimation problem might be able to dictate what path is most convenient. This leads to the idea of a universally least-favorable submodel (or universally hardest path, etc.).
Up until now, the only constraints we've put on our paths have been local constraints in the sense that we only care about how the path behaves at its very beginning: it has to start at $\hat P$ and it has to start out in the direction of $\hat\phi$. After that we don't actually care what the path does. That means that if we want to pick among these many possible paths, we have to decide on what we want the path to look like in a universal sense (i.e. at all points along the path, not just at the beginning).
Well, why not just extend our local score condition to the rest of the path? We already have that at $\epsilon=0$ we want $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi$, which we can write in a different way as $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \phi_{\tilde P_{\epsilon=0}}$. So why not impose that condition universally instead of just locally? That would give us $\frac{d}{d\epsilon} \log \tilde p_\epsilon = \phi_{\tilde P_{\epsilon}}$. Any path starting at $\hat P$ that satisfies this condition is what we call the universally least-favorable submodel.
These two paths are both locally least-favorable, but only the red (rightmost) path is universally least-favorable.
This path is universally least-favorable because it goes in the hardest direction at each point $\tilde P_\epsilon$. To be clear, though, what that direction is can and most often does change as we traverse the path. So it's not that the path is a "straight line" that keeps following the original gradient $\phi_{\hat P}$ no matter what. Instead, this is a path that reevaluates what the hardest direction is after each infinitesimal step and then course corrects to follow it at all times.
For a given estimation problem it might take a little math to be able to express what the universally least-favorable path actually is. But once we have it our lives do get easier in one sense: if we do an iteration of the targeting step along the universally least-favorable path we get to the final targeted estimate after just one step! Why? Because the universally least-favorable path that starts at the update $\tilde P_{\hat\epsilon}$ coincides with the continuation of the universally least-favorable path from $\hat P$. So if $\tilde P_{\hat\epsilon}$ was the maximizer of the log-likelihood along the latter path, it's still the maximizer along the path that starts from there because the two paths coincide. Thus subsequent updates will not change the distribution and this proves that the targeting step converges after just one iteration. For this reason, a TMLE that uses a universally least-favorable submodel is sometimes called a one-step TMLE.
Example: ATE in Observational Study with Binary Outcome
If this all sounds terribly abstract to you, don't worry, you're not alone. Let's work through our familiar example and see where it gets us.
As in our demonstration of the bias-corrected plug-in estimator, presume we've gotten an initial estimate $\hat P$ by modeling $\hat\mu_a(X) \approx P(Y=1|A=a,X)$, $\hat\pi_a(X) \approx P(A=a|X)$, and the marginal density of $X$ with the observed empirical distribution $\mathbb P_n(X)$. That's the full distribution.
As before, we know the efficient influence function for $\psi$ at $\hat P$ is given by
$$\hat\phi = \hat\phi_1 - \hat\phi_0, \qquad \hat\phi_a(Y,A,X) = \frac{1_a(A)}{\hat\pi_a(X)} (Y - \hat\mu_a(X)) + (\hat\mu_a(X) - \mathbb P_n\hat\mu_a(X))$$
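For reference, here is one way this influence function might be computed in code, given arrays of observations and initial nuisance estimates. This is a hedged sketch; the argument names (`mu1`, `mu0`, `pi1`) are my own and not from the text.

```python
import numpy as np

def eif_ate(y, a, mu1, mu0, pi1):
    """Efficient influence function of the ATE, evaluated at the initial estimate P_hat.

    y, a      : observed binary outcome and treatment indicators (arrays)
    mu1, mu0  : estimates of P(Y=1|A=1,X) and P(Y=1|A=0,X) at each subject's X
    pi1       : estimate of P(A=1|X) at each subject's X
    """
    phi1 = a / pi1 * (y - mu1) + (mu1 - np.mean(mu1))
    phi0 = (1 - a) / (1 - pi1) * (y - mu0) + (mu0 - np.mean(mu0))
    return phi1 - phi0
```

In a TMLE, the targeting step described below will drive the empirical mean of this function (the plug-in bias term) to zero.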
Defining a Valid Submodel
We need to construct a submodel $\tilde p_\epsilon$ through $\hat p$ that spans the score $\hat\phi$. How do we do this? Well, we can start brute-force computing stuff, or we can recall that the efficient influence function equals a sum of orthogonal scores in the tangent spaces $\mathcal T_X$, $\mathcal T_{A|X}^0$, and $\mathcal T_{Y|A,X}^0$ because the distribution $\hat P \in \mathcal M$ is a product of independent factors. Specifically, we have:
$$\begin{align*} \hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle} &= \frac{1_1(A)}{\hat\pi_1(X)}(Y - \hat\mu_1(X)) - \frac{1_0(A)}{\hat\pi_0(X)}(Y - \hat\mu_0(X)) \\ &= \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right] \left( Y - \hat\mu_A(X) \right) \\ \hat\phi_{\langle \mathcal T^0_{A|X} \rangle} &= 0 \\ \hat\phi_{\langle \mathcal T_{X} \rangle} &= \left[ \hat\mu_1(X) - \mathbb P_n\hat\mu_1(X) \right] - \left[ \hat\mu_0(X) - \mathbb P_n\hat\mu_0(X) \right] \end{align*}$$
Now in each iteration we can update the densities $\hat p(y|a,x)$, $\hat p(a|x)$, and $\hat p(x)$ separately because each of these scores can only affect its respective density. So although we're really trying to optimize in a three-dimensional submodel with parameters $(\epsilon_{Y|A,X}, \epsilon_{A|X}, \epsilon_X)$, the solution in this particular problem is the same as if we did the optimization separately for each of $\epsilon_{Y|A,X}$, $\epsilon_{A|X}$, and $\epsilon_{X}$. This is what I meant when I said earlier that operating in a higher-dimensional submodel could actually make things easier.
Submodel for $X$
Let's start with the marginal distribution of $X$, which according to our initial estimate is $\mathbb P_n (X)$. The empirical log-likelihood of the data points $X$ is actually already maximized at the empirical distribution (which is why it's sometimes called a nonparametric maximum likelihood estimate). That means that no matter where we are in $\mathcal M$ (or even which direction we build a path along) the optimal perturbation will always be $\hat\epsilon_X = 0$ because we're already at a maximum. Thus the targeting step will never update this part of the distribution so we already have our plug-in for the marginal distribution of the covariates: $\hat P^*(X) = \mathbb P_n(X)$.
Submodel for $A|X$
For the distribution of $A|X$, we see that no matter how we define the path at each point $\hat P$, the fluctuation will always be $0$ because $\hat\phi_{\langle \mathcal T^0_{A|X} \rangle} = 0$. So that distribution will also stay the same from iteration to iteration. Thus $\hat P^*(A|X) = \hat\pi_A(X)$.
Submodel for $Y|A,X$
Now, finally, we come to the distribution of $Y|A,X$. Here we need to find $\hat\epsilon_{Y|A,X}$ (I'll just refer to $\hat\epsilon$ from here on out since the other components don't matter) to maximize $\mathbb P_n \log \tilde p_\epsilon(Y|A,X)$.
Remember that we can construct the path $\tilde p_\epsilon$ any way we like as long as it starts at the initial estimate $\hat p(Y|A,X)$ and the derivative of the log-likelihood with respect to $\epsilon$ at $\epsilon=0$ is the corresponding component of the efficient influence function, $\hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$. Therefore I propose using this path:
$$\text{logit}(\tilde p_\epsilon) = \text{logit}(\hat p) + \epsilon \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right]$$
Obviously if $\epsilon=0$ we have $\tilde p_0 = \hat p$ so this path starts at $\hat p$. It remains to prove that it has the right score, i.e. $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$. I show this in the toggle below if you're curious; it just takes some algebra and derivatives.
Proof that $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$
Abbreviate $U = \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right]$ and write $\tilde\mu_{\epsilon,A}(X)$ for the perturbed conditional probability $\tilde P_\epsilon(Y=1|A,X)$, so that our definition of the path reads $\text{logit}(\tilde\mu_{\epsilon,A}(X)) = \text{logit}(\hat\mu_A(X)) + \epsilon U$. Taking a derivative w.r.t. $\epsilon$ on both sides and applying the chain rule gives $\left(\frac{1}{\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A})} \right) \frac{d}{d\epsilon} \tilde\mu_{\epsilon,A} = U$, or, equivalently, $\frac{d}{d\epsilon} \tilde\mu_{\epsilon,A} = U\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A})$.
The likelihood of the observed $Y$ is $\tilde p_\epsilon(Y|A,X) = \tilde\mu_{\epsilon,A}(X)^Y(1-\tilde\mu_{\epsilon,A}(X))^{1-Y}$, so $\log \tilde p_\epsilon = Y\log\tilde\mu_{\epsilon,A} + (1-Y)\log(1-\tilde\mu_{\epsilon,A})$ and, by the chain rule, $\frac{d}{d\epsilon}\log\tilde p_\epsilon = \left(\frac{Y}{\tilde\mu_{\epsilon,A}} - \frac{1-Y}{1-\tilde\mu_{\epsilon,A}}\right)\frac{d}{d\epsilon}\tilde\mu_{\epsilon,A} = \frac{Y - \tilde\mu_{\epsilon,A}}{\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A})} \cdot U\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A}) = U(Y - \tilde\mu_{\epsilon,A})$. Evaluating at $\epsilon = 0$, where $\tilde\mu_{0,A}(X) = \hat\mu_A(X)$, gives $\frac{d}{d\epsilon} \log \tilde p_\epsilon\big|_{\epsilon=0} = U(Y - \hat\mu_A(X))$, which is exactly $\hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$, as desired.
Targeting $Y|A,X$
Ok, we've got our path, so how do we find $\hat\epsilon =\argmax_{\epsilon} \mathbb P_n \log \tilde p_\epsilon$? This is where things clean up nicely. The way we constructed our submodel, $\hat\epsilon$ is just the coefficient we get if we regress the vector of $Y_i$ onto the "clever covariate" vector of $U_i = \left[ \frac{1_1(A_i)}{\hat\pi_1(X_i)} - \frac{1_0(A_i)}{\hat\pi_0(X_i)} \right]$ with a fixed offset $b_i = \text{logit}(\hat\mu_{A_i}(X_i))$ using logistic regression!
Proof that this is logistic regression with a fixed offset
We can write the conditional likelihood for a binary variable $Y$ where $P(Y=1|A,X) = \tilde\mu_A(X)$ like $\tilde p_\epsilon = \tilde\mu_A(X)^Y(1-\tilde\mu_A(X))^{(1-Y)}$. This shows that the log-likelihood is
$$\mathbb P_n\log\tilde p_\epsilon = \frac{1}{n}\sum_i^n \bigg[ \log( \underbrace{\tilde\mu_A(X)}_p ) Y + \log( \underbrace{ 1-\tilde\mu_A(X) }_{(1-p)} ) (1-Y) \bigg]$$
Moreover we have from our construction of the path that $\text{logit}(\tilde p_\epsilon(Y|A,X)) = \text{logit}(\hat p(Y|A,X)) + \epsilon U$. Plugging in $Y=1$ and the definition of $\mu_A(X)$ (tilde or hat) immediately gives that
$$\text{logit}(\underbrace{\tilde \mu_A(X)}_{p}) = \underbrace{0}_{\beta_0} + \underbrace{ 1\times \text{logit}(\hat \mu_A(X)) }_{\beta_1 X_1} + \underbrace{ \epsilon \times U }_{\beta_2 X_2}$$
Compare this to the definition of logistic regression (e.g. eqs. 12.7 and 12.4 here). I've included underbraced quantities in the display above so you can directly map the notation.
The only difference with the usual logistic regression is that in our case, we have a fixed intercept $\beta_0=0$ and a fixed coefficient $\beta_1=1$. This is known as an "offset" in the literature about logistic regression and options to use offsets exist in practically every software package that implements logistic regression. The coefficient $\beta_2 = \epsilon$ is what we fit using the "data" $(Y, U)_{1,\dots, n}$, which is precisely the solution to $\hat\epsilon =\argmax_{\epsilon} \mathbb P_n \log \tilde p_\epsilon$, as desired.
After fitting our logistic regression and extracting the coefficient $\hat\epsilon$, the final thing we have left to do is to update the distribution. For this we just have to compute $\tilde p_{\hat\epsilon}$, which in terms of $\tilde\mu_A(X)$ means
$$\tilde\mu_A(X) = \text{logit}^{-1}\left( \text{logit}(\hat \mu_A(X)) + \hat\epsilon \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right] \right)$$
This is now a fixed function that we can evaluate at any value of $(A,X)$.
Iterating
We can now repeat what we did above, at each step plugging in the result $\tilde\mu_A(X)$ from the previous iteration into the role played by the current estimate $\hat\mu_A(X)$. However, the way we've worked out the submodel in this case implies that $\hat\epsilon = 0$ after the first iteration. Intuitively, that's because we already regressed $Y$ onto $U$ in the first iteration, so subsequent regressions don't change anything. You can formally prove this to yourself if you like, but it's not actually that important. This suggests that the path we chose is universally least-favorable.
Since we've converged already, our final estimate $\hat\mu^*_A(X)$ is exactly the function shown in the display above. We can now execute the plug-in $\psi(\hat P^*) = \mathbb P_n[\hat\mu_1^*(X) - \hat\mu_0^*(X)]$ and that's our final estimate.
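Here is an end-to-end sketch of this whole procedure on simulated data, using statsmodels for the logistic regressions and its `offset` argument for the targeting step. Everything here (the data-generating process, the variable names, the use of plain logistic regressions as the initial fits) is illustrative rather than prescriptive; in practice you'd plug in whatever machine learning estimates of $\hat\mu_a$ and $\hat\pi_a$ you like.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit, logit

rng = np.random.default_rng(42)
n = 2000

# --- Simulate an observational study with a binary outcome (toy DGP) ----------
x = rng.normal(size=n)
a = rng.binomial(1, expit(0.8 * x))              # treatment depends on X
y = rng.binomial(1, expit(-0.5 + a + 0.7 * x))   # outcome depends on A and X

def design(a_vec, x_vec):
    # columns: intercept, treatment, covariate
    return np.column_stack([np.ones(len(x_vec)), a_vec, x_vec])

# --- Step 1: initial estimates of the nuisances (any ML method would do) ------
mu_fit = sm.GLM(y, design(a, x), family=sm.families.Binomial()).fit()
mu1_hat = mu_fit.predict(design(np.ones(n), x))   # \hat\mu_1(X_i)
mu0_hat = mu_fit.predict(design(np.zeros(n), x))  # \hat\mu_0(X_i)

X_trt = np.column_stack([np.ones(n), x])
pi1_hat = sm.GLM(a, X_trt, family=sm.families.Binomial()).fit().predict(X_trt)

# --- Step 2: targeting step for Y|A,X ------------------------------------------
# Clever covariate U_i and fixed offset logit(\hat\mu_{A_i}(X_i))
u = a / pi1_hat - (1 - a) / (1 - pi1_hat)
offset = logit(np.where(a == 1, mu1_hat, mu0_hat))

# One-parameter logistic regression of Y on U with the fixed offset
eps_hat = sm.GLM(y, u.reshape(-1, 1), family=sm.families.Binomial(),
                 offset=offset).fit().params[0]

# --- Step 3: update the outcome regressions along the path ---------------------
mu1_star = expit(logit(mu1_hat) + eps_hat / pi1_hat)
mu0_star = expit(logit(mu0_hat) - eps_hat / (1 - pi1_hat))

# --- Step 4: plug in ------------------------------------------------------------
psi_tmle = np.mean(mu1_star - mu0_star)

# The plug-in bias term P_n \hat\phi^* should now be numerically zero
phi_star = (a / pi1_hat * (y - mu1_star) + mu1_star - np.mean(mu1_star)) \
         - ((1 - a) / (1 - pi1_hat) * (y - mu0_star) + mu0_star - np.mean(mu0_star))

print(f"eps_hat = {eps_hat:.4f}, TMLE ATE = {psi_tmle:.4f}, P_n phi* = {np.mean(phi_star):.1e}")
```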
Efficiency
We've now explained the TMLE algorithm and how it guarantees that the plug-in bias for the targeted plug-in $\hat P^*$ is 0. Of course, since TMLE is a plug-in, we can analyze its efficiency using the exact same expansion as we did for the naive plug-in:
$$\psi(\hat P^*) - \psi(P) = \mathbb P_n\phi + \cancel{ \mathbb P_n\hat\phi^* } + (\mathbb P_n - P)(\hat\phi^* - \phi) + R$$
It remains to show that the empirical process term and second-order remainder are both still $o_P(n^{-1/2})$ even though we've replaced our initial plug-in $\hat P$ (for which we know these terms are $o_P(n^{-1/2})$ as long as we do things sensibly) with the targeted plug-in $\hat P^*$. In other words, we need to make sure that by eliminating the plug-in bias via targeting we haven't messed up the good things we already had going for us.
Thankfully, it turns out that we're still in business. The heuristic argument for why is that at each iteration of the targeting step, we're always increasing the empirical log-likelihood $\mathbb P_n \log \hat p$ of our plug-in estimate. Since each update is done with a parametric model, we're also guaranteed that this contributes only an error of order $o_P(n^{-1/2})$ relative to increasing the true log-likelihood $P \log \hat p$. Thus we can claim that the true log-likelihood of our estimated distribution is always increasing (to a first-order approximation). That, in turn, implies that the KL divergence between $\hat P^*$ and $P$ decreases at least at the same rate (in $n$) that the original plug-in $\hat P$ attains. This implies that $\sqrt{\hat p^*}-\sqrt p$ also attains that rate (in $\mathcal L_2$ norm). Finally, using something called the functional delta method, we can infer that nuisance parameters (which are functionals of the density) also maintain their rates of convergence. You'll recall from the section on the naive plug-in that these are the quantities that we need to converge in order to eliminate the empirical process and 2nd-order remainder terms. As a result of all of this, we can say that any conditions required to attain the $n^{-1/2}$ rate for these terms are indeed maintained through the targeting step, and thus TMLE produces an efficient estimator.
Discussion
Deriving and implementing a TMLE for a given estimation problem generally takes more work than doing a bias-corrected or estimating equations estimator. Since all three give asymptotically equivalent results, why should we bother?
Well, as we've already noted, bias-correction and estimating equations each have their own problems. Sometimes they "blow up" or produce estimates that are outside of the natural range of the parameter. Sometimes you can't even construct the estimating equations estimator and sometimes the bias-corrected estimator isn't able to take advantage of the double robustness of the estimation problem.
TMLE, on the other hand, has none of these problems. What makes TMLE so robust is that it goes back to the original plug-in philosophy: all we're doing is estimating a distribution for the data and then asking what the value of the target parameter is for that distribution. There's no fudging based on a bias correction or trying to brute-force solve an estimating equation (though the final TMLE plug-in $\hat P^*$ does satisfy the estimating equation). Since it's a plug-in, the TMLE can't help but respect any bounds on the parameter. That also typically makes it harder to blow up. In general, you can expect estimators built with TMLE to behave more reasonably and consistently in finite samples.
"Blowing up" an estimator
We saw that the bias-corrected and estimating equations approaches give the same estimator for the ATE problem, which we called AIPW. The form of this estimator is a sample average over terms where one of the factors is $1/\hat\pi_a(X)$. If the estimated propensity score $\hat\pi_a(X)$ is close to 0 or 1 for some subjects in the study, that factor ends up being massive and it completely dominates the sample average. Of course, in an infinite sample, that couldn't happen because some other subject would have an equally large weight in the other direction or the large weight would be balanced with many subjects with smaller weights. But we don't have infinite data. We have finite data. That makes the AIPW estimator unstable in finite samples where there are strong predictors of treatment assignment. The only way to address this is to use ad-hoc methods like weight stabilization or trimming.
On the other hand, the TMLE estimator doesn't incorporate terms of the form $1/\hat\pi_a(X)$ directly into a sample average. Instead, it uses them as a covariate in a linear adjustment of the original regression model $\hat\mu_A(X)$. That makes the procedure more stable in the face of small or large propensity scores. This comes automatically. There are also additional variations on TMLE (e.g. collaborative TMLE) that are theoretically justified and which do even more to improve the finite-sample robustness of the resulting estimator.
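To see the blow-up numerically, here is a hedged toy sketch (all numbers invented) that computes the AIPW estimate once with well-behaved estimated propensities and once after forcing a single subject's propensity to be nearly zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
y = rng.binomial(1, 0.4, size=n)
a = rng.binomial(1, 0.5, size=n)
mu1_hat = np.full(n, 0.45)                  # some fitted outcome regressions (held fixed here)
mu0_hat = np.full(n, 0.35)
pi1_hat = rng.uniform(0.3, 0.7, size=n)     # well-behaved estimated propensities

def aipw(y, a, mu1, mu0, pi1):
    # AIPW / bias-corrected plug-in: sample average of the uncentered influence-function terms
    return np.mean(a / pi1 * (y - mu1) + mu1 - ((1 - a) / (1 - pi1) * (y - mu0) + mu0))

print("AIPW with well-behaved propensities:", aipw(y, a, mu1_hat, mu0_hat, pi1_hat))

# Now force one treated subject's estimated propensity to be nearly zero
pi1_bad = pi1_hat.copy()
treated = np.flatnonzero(a == 1)[0]
pi1_bad[treated] = 1e-4                      # 1/pi1 ~ 10,000 for this single subject
print("AIPW after one extreme propensity:  ", aipw(y, a, mu1_hat, mu0_hat, pi1_bad))
```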
Besides the practical benefits, TMLE has a lot going for it in theory. As I touched on earlier, TMLE is the "natural" extension of maximum likelihood estimation to cases where it is impossible to carry out the explicit optimization problem in infinite dimensions. TMLE beautifully reduces this problem to a series of lower-dimensional maximum likelihood problems by correctly identifying the right direction to move in within the statistical model.
Moreover TMLE leaves a lot of room for well-motivated variations. Different choices of submodels or additional scores (remember the submodel just has to span the gradient, but it can include additional directions if desired) can give TMLE more beneficial properties. There is also collaborative TMLE, in which components of the distribution that don't get updated by the targeting step (e.g. $\hat\pi_a$ in our ATE example) are fit in such a way that they maximize the benefit of targeting instead of just minimizing a generic loss function. Finally, TMLE admits natural extensions to attain higher-order efficiency (i.e. cancelling second-order bias terms), which is challenging in the bias-correction and estimating equations frameworks.
In short, TMLE has all of the advantages of historical approaches in the sense that it can leverage flexible machine learning models and effectively minimize or even eliminate statistical assumptions. However, it eliminates most of the drawbacks of those methods and even leaves room for further innovation.