4.4 Targeted Maximum Likelihood
The final and most modern approach to the construction of efficient estimators is called targeted maximum likelihood estimation (or TMLE). TMLE avoids the problems with the bias-correction and estimating equation approaches because it goes back to the plug-in philosophy. That is, a TMLE estimator is always a plug-in estimate of the parameter evaluated at some plug-in estimate $\hat P^*$. Formally, $\hat\psi = \psi(\hat P^*)$. We'll discuss more below about why being a plug-in is a nice thing.
But if TMLE is a plug-in estimate, how does it avoid plug-in bias while still eliminating the empirical process and 2nd-order remainder terms asymptotically? The trick is that instead of evaluating the plug-in at the initial estimate $\hat P$, which would incur bias, TMLE iteratively perturbs this estimate in a special way to obtain another estimate $\hat P^*$ that gets rid of the bias.
First we'll describe the algorithm that takes the initial estimate $\hat P$ and perturbs it to obtain $\hat P^*$. We'll call this the targeting step. This will show why the plug-in bias of our final plug-in estimator is zero: $\mathbb P_n \hat\phi^* = 0$ (I'll abbreviate $\hat\phi^* = \phi_{\hat P^*}$). Then we have to argue that by transforming $\hat P$ into $\hat P^*$ we haven't sacrificed the conditions that help us bound the empirical process and 2nd-order remainder terms. At that point, since we've handled all three terms in the plug-in expansion (leaving only the efficient CLT term), we'll have all we need to show that the TMLE $\psi(\hat P^*)$ is efficient.
Targeting the initial plug-in estimate $\hat P$ produces a new plug-in $\hat P^*$ that makes the plug-in estimator more accurate.
Targeting the Plug-In
The only practical difference between a TMLE and a naive plug-in estimator is the targeting step that takes the initial estimator $\hat P$ and transforms it into the plug-in $\hat P^*$, which we say is targeted towards the parameter $\psi$. The way this is done is as follows.
First, we grab the efficient influence function at our initial estimate, which is $\hat\phi$. We then construct a path starting at $\hat P$ going in the direction defined by $\hat\phi$ (remember the efficient influence function is in the tangent space and is thus also a score, i.e. a valid direction for a path). In terms of densities, we can write this path as $\tilde p_\epsilon = (1 + \epsilon \hat\phi)\hat p$. This path corresponding to the efficient influence function is called the (locally) least-favorable or hardest submodel.
Why is it called the "hardest submodel"?
The term "hardest submodel" is appropriate because, firstly, any path is just a collection of probability distributions that are all in M\mathcal M, though not all of them. Another term for "path" is therefore submodel, though "path" is more specific. The synonymous terms "hardest" or "least favorable" take a little more explaining. These have to do with the fact that estimating ψ\psi is "hardest" within the path that goes in the direction of the canonical gradient, relative to all other possible paths.
"Hardest" is defined in terms of the asymptotic variance of our best possible estimator. In other words, if we were to forget about the larger model M\mathcal M and only consider statistical estimation within the submodel {p~ϵ:ϵ[0,1]}\{\tilde p_\epsilon: \epsilon \in[0,1]\}, what could we say about the minimum attainable asymptotic variance if we were trying to estimate ψ(P^)\psi(\hat P)?
As argued below, this is a parametric model with a single parameter ϵ\epsilon. For the moment, let's consider the problem of estimating the maximizer of the true log-likelihood under P^\hat P. In other words, we're looking for the distribution in the path that is most like P^\hat P. Obviously that will be P^\hat P itself (i.e. ϵ=0\epsilon=0). But if we didn't know that a-priori and we had to estimate ϵ\epsilon using the data, what could we say about the asymptotic variance of our estimate ϵ^\hat\epsilon?
Since this is a parametric model, we can use the Cramer-Rao lower bound, which says that the minimum possible asymptotic variance that a RAL estimator of a univariate parameter $\epsilon$ can attain is $E\left[-\frac{d^2}{d\epsilon^2} \log \tilde p_\epsilon \big|_{\epsilon = 0}\right]^{-1}$. This is the inverse Fisher information for the parameter $\epsilon$ at 0 (i.e. the true value). Using the definition of the path you can see that this is the same as $E\left[h^2\right]^{-1}$ (for a path that goes in some direction $h$ / has score $h$).
The reason this is useful is because now we can consider the plug-in $\psi(\tilde P_{\hat\epsilon})$. What is the asymptotic variance of this plug-in? First off, we have to realize that within the path, $\psi(\tilde P_\epsilon)$ is a univariate function of $\epsilon$ alone. The delta method tells us that the asymptotic variance of a function of an estimated parameter is given by the squared derivative $\left(\frac{d}{d\epsilon} \psi(\tilde P_{\epsilon})\big|_{\epsilon=0}\right)^2$ (i.e. evaluated at the truth) times the asymptotic variance of the estimate for the parameter. Using our fundamental identity relating the pathwise derivative and a covariance between score and gradient, we have that the derivative is the same as $E[h\hat\phi]$ (our path starts at $\hat P$).
Putting it together, our best-case scenario is that the asymptotic variance of $\psi(\tilde P_{\hat\epsilon})$ is $E[h\hat\phi]^2/E[h^2]$, or, geometrically in $\mathcal L_2^0$, $\langle h,\hat\phi \rangle^2 /\|h\|^2$. From this it is immediate (e.g. from Cauchy-Schwarz) that $h = \hat\phi$ gives the maximum value. Since this is the best-case asymptotic variance, and higher is worse in terms of estimation, we have that the path in the direction of $\hat\phi$ gives us the worst possible best-case asymptotic variance among all paths.
Intuitively, what makes certain model-parameter combinations difficult to estimate in parametric models is that there are cases where you can change the distribution just a tiny bit, but the target parameter changes a lot. That's bad because if you try to fit the data and you get some estimated distribution, you still have a lot of uncertainty about what the parameter is: slightly different data would give you a slightly different estimated distribution, but possibly a big difference in the estimated parameter. So that is the "intuitive" meaning behind optimal asymptotic variance for a parameter in a parametric model and explains why "hardness" of the estimation problem is inherently linked to the minimum attainable asymptotic variance.
Since $\hat p$ and $\hat\phi$ are fixed (i.e. we already estimated them), the space of models that lie along this path (i.e. $\{\tilde p_\epsilon: \epsilon > 0\}$) is a parametric model space with a single parameter: $\epsilon$. As such, we can do boring old maximum likelihood estimation to find a value $\hat\epsilon$ that maximizes the log-likelihood of $\tilde p_\epsilon$ along this path:
$$\hat\epsilon = \argmax_{\epsilon} \mathbb P_n \log \tilde p_\epsilon$$
If you prefer, you can also write this out very explicitly as $\hat\epsilon = \argmax_{\epsilon} \frac{1}{n}\sum_i^n\log\left((1+\epsilon\hat\phi(Z_i))\hat p(Z_i)\right)$ so you can see it's clearly a single-dimensional optimization problem in $\epsilon$ (again recall $\hat p$ and $\hat\phi$ are fixed functions by now).
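To make the one-dimensional nature of this optimization concrete, here is a minimal numerical sketch in Python. Everything in it is illustrative: the data are simulated, `p_hat` is a deliberately crude initial density estimate, and `phi_hat` is just a stand-in bounded, mean-zero direction rather than an actual efficient influence function.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
z = rng.normal(loc=0.3, scale=1.0, size=500)   # observed data

# Toy initial estimate p_hat: a standard normal density (deliberately misspecified mean)
def p_hat(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

# Stand-in "direction" phi_hat: bounded and mean-zero under p_hat, so that
# (1 + eps * phi_hat) stays positive for |eps| < 1. In a real TMLE this would be
# the efficient influence function evaluated at p_hat.
def phi_hat(z):
    return np.tanh(z)

# Empirical log-likelihood along the path p_eps = (1 + eps * phi_hat) * p_hat
def neg_loglik(eps):
    return -np.mean(np.log((1 + eps * phi_hat(z)) * p_hat(z)))

# One-dimensional maximum likelihood over eps (bounds keep the density nonnegative)
res = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded")
print(f"eps_hat = {res.x:.4f}")
```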
The game now is to iterate our initial guess $\hat P$ according to the formula
$$\tilde p_{\hat\epsilon} = (1+ \hat\epsilon \hat\phi)\hat p$$
at each step replacing the previous plug-in estimate $\hat p$ and its influence function $\hat\phi$ with the result of the previous iteration ($\tilde p_{\hat\epsilon}$ and the corresponding $\phi_{\tilde P_{\hat\epsilon}}$). At each step we are always moving in the direction of whatever the hardest submodel is at that point and we always increase the empirical log-likelihood relative to the previous plug-in estimate. We do this until the algorithm converges, i.e. $\hat\epsilon \approx 0$. This must occur eventually unless the empirical likelihood is unbounded over the model space, which wouldn't make any sense. At this point we declare $\hat p^* = \tilde p_{\hat\epsilon}$, our final plug-in estimate of the distribution.
Something cool about this is that we've managed to take something that is generally impossible (maximum likelihood in infinite dimensions) and replaced it with a series of maximum likelihood estimation problems, each of which is one-dimensional. This is why this approach is called targeted maximum likelihood: at each step, we're solving a maximum likelihood problem, but we've targeted it in a way that helps estimate our parameter of interest.
What is maximum likelihood estimation?
In case you're not familiar with it, maximum likelihood estimation is a general-purpose technique most commonly used in parametric model spaces. The goal is to find the distribution $\hat P$ in the model that maximizes the likelihood of the observed data. We're trying to answer the question: "of all the possible probability distributions in the model, which is the most likely to have generated the data that I observe?". Or, formally, $\hat P = \argmax_{P\in \mathcal M} p(Z_{1, \dots, n})$. "Likelihood" is just a term that we use to describe the density $p$ when we're talking about it as a quantity that itself isn't fixed (in the sense that we don't know what the true density is so we have to consider multiple options). If $Z_{1,\dots,n}$ are assumed independent and identically distributed, the density factors over $Z_i$ and if we take a log (the argmax is the same either way) the product becomes a sum so we get $\hat P = \argmax_{P \in \mathcal M} \frac{1}{n}\sum_i^n \log p(Z_i)$. We can also use the more condensed notation $\hat P = \argmax_{P \in \mathcal M} \mathbb P_n \log p$.
This strategy is most often only usable for parametric models. If we have a parametric model, the maximum likelihood problem $\hat P = \argmax_{P \in \mathcal M} \mathbb P_n \log p$ can be equivalently stated as $\hat\theta = \argmax_{\theta \in \Theta} \mathbb P_n \log p_\theta$ (with $\hat P = P_{\hat\theta}$). The advantage of this is that we've reduced the problem of looking at every possible distribution in $\mathcal M$ to a problem of looking at all possible values of the parameter in $\Theta$, which is a finite-dimensional vector space like $\mathbb R^p$. If the log-likelihood function is differentiable in $\theta$, this is a classical optimization problem, the kind of which is solvable using tools from calculus. In particular, if $\mathbb P_n \log p_\theta$ as a function of $\theta$ is maximized at some value $\hat\theta$, then we know that the derivative at $\hat\theta$ must be 0, i.e. $0 = \mathbb P_n \frac{d}{d\theta}\log p_{\hat\theta}$. For simple models (e.g. linear regression) this equation often has an analytical solution, but generally you can use algorithms like gradient descent to find the maximum (or minimum if you put in a minus sign). The quantity $\frac{d}{d\theta}\log p_\theta$ is called the score with respect to $\theta$ and exactly corresponds to a path in the model space where we increment $\theta$ by some small amount. You can think of it as the sensitivity of the log-likelihood to small changes in the parameters (which is different at each point $p=p_\theta$). When $\theta$ has multiple components, the score is a vector of partial derivatives $[\frac{\partial}{\partial \theta_1}\log p_\theta, \dots, \frac{\partial}{\partial \theta_p}\log p_\theta]$, each of which may be called the score of $\theta_j$. We call $\mathbb P_n \frac{d}{d\theta}\log p_{\theta} = 0$ a set of score equations (the free variable here is $\theta$) and we say that the maximum likelihood estimator $\hat\theta$ solves these score equations because at $\theta = \hat\theta$ we know we have to have $\mathbb P_n \frac{d}{d\theta}\log p_{\hat\theta} = 0$.
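As a tiny illustration of solving a score equation, here's a hedged Python sketch for the Bernoulli model, where the empirical score equation has the sample mean as its analytical solution. The data and names here are purely illustrative.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
z = rng.binomial(1, 0.35, size=1000)     # i.i.d. Bernoulli(theta) data, theta unknown

# Score of the Bernoulli log-likelihood: d/dtheta log p_theta(z) = z/theta - (1-z)/(1-theta)
def empirical_score(theta):
    return np.mean(z / theta - (1 - z) / (1 - theta))

# The MLE solves the empirical score equation P_n d/dtheta log p_theta = 0 ...
theta_hat = brentq(empirical_score, 1e-6, 1 - 1e-6)

# ... and for this model it coincides with the analytical solution, the sample mean.
print(theta_hat, z.mean())
```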
All of this falls apart very quickly if you work in nonparametric model spaces because we can't transform the maximum likelihood problem into a finite-dimensional optimization. How do you search over an infinite-dimensional space? Even if you have a sensible way of writing the densities in your model as a function of an infinite-dimensional parameter (e.g. another function, say $\theta(\xi)$), then for each "index" $\xi$ you have a score $\frac{\partial}{\partial \theta(\xi)}\log p_\theta$, which means you need to solve an infinite number of score equations.
However, it remains to show that this actually gets us anywhere useful. In particular, we haven't explained why we're always moving in the direction of the hardest submodel. Why not pick some other direction (score) $h$? We'd still be increasing the likelihood at each step, after all.
Well, here's the answer. Remember that at each step, $\hat\epsilon$ is the maximizer of $\mathbb P_n \log \tilde p_\epsilon$, which is a univariate function of $\epsilon$. We know from calculus that the derivative of the maximized function at the maximizing value must be zero: $0 = \frac{d}{d\epsilon} \mathbb P_n \log\tilde p_{\epsilon} \big|_{\epsilon = \hat\epsilon}$. Moreover recall that for a path $\tilde p_\epsilon$ in a direction $h$, we have that $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = h$. After a certain number of iterations, we know that our algorithm converges, so $\hat\epsilon = 0$, $\tilde p_{\epsilon=0} = \hat p^*$ (and the direction is $h=\hat\phi^*$). Putting that all together we get
$$0 = \mathbb P_n \hat\phi^*$$
This is precisely the plug-in bias term for the plug-in estimator $\psi(\hat P^*)$, and we've shown that it is exactly 0. That explains why we need to move in the direction of the gradient $\hat\phi$ at each iteration in the targeting step. If we were to move in a different direction we wouldn't end up cancelling the plug-in bias at the very end.
There is also an intuitive reason for choosing to move in the direction of the gradient, which is (as we've called it) the hardest submodel. This model is the hardest to estimate in because small changes in the distribution can create large changes in the target parameter. So by perturbing our estimate in the direction of the hardest submodel, what we're doing is finding the smallest possible change to our initially estimated distribution that creates the biggest possible debiasing effect in the parameter estimate.
Submodels that Span the Gradient
It's easiest to come to terms with TMLE if you think in terms of one-dimensional submodels where the path is always in the direction of the canonical gradient. However, the algorithm actually works the same if instead of one-dimensional submodels, you pick $d$-dimensional submodels that still span the canonical gradient at each step. What I mean by this is that often you can decompose the canonical gradient $\hat\phi = \hat\phi_1 + \hat\phi_2$ into separate pieces (we saw an example of this in the previous chapter). Then, instead of creating a one-dimensional submodel in the direction $\hat\phi$, we can create a two-dimensional submodel where $\tilde p_{\epsilon_1, \epsilon_2} = (1 + \epsilon_1 \hat\phi_1 + \epsilon_2 \hat\phi_2)\hat p$.
We say this submodel spans the gradient because the 1-dimensional path we were using before is actually included in this model if we let $\epsilon_1 = \epsilon_2$. It turns out that's all we need: the targeting step works equally well as long as the canonical gradient can be made up out of a sum of the scores for each parameter $\epsilon_j$. In the targeting step we now need to replace optimization over $\epsilon$ with optimization over $(\epsilon_1, \epsilon_2)$, but this can in fact be easier than optimization over a single $\epsilon$ because of the way the distribution factors. If the different directions in the submodel correspond to scores that are orthogonal, and each of these scores corresponds to an independent factor of the distribution, then we can perform the search for each $\hat\epsilon_j$ independently from the others (i.e. temporarily setting $\epsilon_{j'}=0$ for $j' \ne j$). You should try to rigorously verify for yourself why that is true (and/or draw a picture!).
At any rate, we still end up with $0 = \mathbb P_n \hat\phi^*$ at the end of the targeting step if we work with these larger submodels. You can again verify that is the case for yourself by following the outline we used above for the one-dimensional case.
Different Paths with the Same Score
Up until now we've only been working with paths (or submodels) that look like $\tilde p_\epsilon = (1 + \epsilon h)\hat p$. However, it's possible to build paths that are defined differently but which still have the same score. This is because the score just needs to determine the initial "direction" of the path; after that it can do whatever it wants. As far as TMLE is concerned (or actually the definition of regularity, etc., for that matter) a path can actually be any function $\tilde p_\epsilon$ of $\epsilon$, $h$, and $\hat p$ such that $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = h$ (i.e. this path has score $h$) and $\tilde p_{\epsilon=0} = \hat p$ (the path starts at $\hat p$). Our paths up until this point, e.g. $\tilde p_\epsilon = (1 + \epsilon h)\hat p$, are just one possible example. An alternative example is $\tilde p_\epsilon = C(\epsilon)e^{\epsilon h}\hat p$ with $C(\epsilon)$ some normalizing constant to make sure that the result integrates to 1. It's actually really nice that we have so much flexibility in choosing the path because different formulations may be much easier to work with in terms of finding $\hat\epsilon$ at each step.
Three different paths that all start at $\hat P$ and have score $\hat\phi$.
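If you want to convince yourself numerically that two differently-defined paths can share the same score, here is a small sketch. It assumes a toy standard-normal $\hat p$ and a stand-in direction $h$ (neither comes from the text above) and compares finite-difference scores at $\epsilon=0$ for the "linear" path $(1+\epsilon h)\hat p$ and the exponentially tilted path $C(\epsilon)e^{\epsilon h}\hat p$.

```python
import numpy as np
from scipy.integrate import quad

# Toy setup: p_hat is a standard normal density, h is a bounded mean-zero direction (score)
def p_hat(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def h(z):
    return np.tanh(z)

# Linear path: (1 + eps*h) * p_hat
def log_p_linear(eps, z):
    return np.log((1 + eps * h(z)) * p_hat(z))

# Exponential path: C(eps) * exp(eps*h) * p_hat, with C(eps) a normalizing constant
def log_p_expo(eps, z):
    norm, _ = quad(lambda u: np.exp(eps * h(u)) * p_hat(u), -np.inf, np.inf)
    return eps * h(z) + np.log(p_hat(z)) - np.log(norm)

# Finite-difference scores at eps = 0, evaluated at a few points z
eps = 1e-5
zs = np.array([-1.0, 0.0, 2.0])
score_linear = (log_p_linear(eps, zs) - log_p_linear(-eps, zs)) / (2 * eps)
score_expo = np.array([(log_p_expo(eps, z) - log_p_expo(-eps, z)) / (2 * eps) for z in zs])
print(score_linear)   # both should be close to h(zs)
print(score_expo)
print(h(zs))
```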
Therefore, no matter the definition of the path, as long as:
$$\tilde p_0 = \hat p$$
$$\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi$$
(i.e. it starts at our current estimate and has score equal to its canonical gradient), then this path will be valid for the targeting step.
Given this understanding, you should also now realize that there isn't one single hardest submodel. It's only after we decide on a fixed functional form of the submodel in $\epsilon$, $h$, and $\hat p$ that a hardest submodel becomes uniquely defined by setting $h=\hat\phi$.
Locally and Universally Least-Favorable
Since we have freedom in choosing whatever form of path is most convenient for us, you might wonder if there is some automated way in which the estimation problem might be able to dictate what path is most convenient. This leads to the idea of a universally least-favorable submodel (or universally hardest path, etc.).
Up until now, the only constraints we've put on our paths have been local constraints in the sense that we only care about how the path behaves at its very beginning: it has to start at $\hat P$ and it has to start out in the direction of $\hat\phi$. After that we don't actually care what the path does. That means that if we want to pick among these many possible paths, we have to decide on what we want the path to look like in a universal sense (i.e. at all points along the path, not just at the beginning).
Well, why not just extend our local score condition to the rest of the path? We already have that at $\epsilon=0$ we want $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi$, which we can write in a different way as $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \phi_{\tilde P_{\epsilon=0}}$. So why not impose that condition universally instead of just locally? That would give us $\frac{d}{d\epsilon} \log \tilde p_\epsilon = \phi_{\tilde P_{\epsilon}}$. Any path starting at $\hat P$ that satisfies this condition is what we call the universally least-favorable submodel.
These two paths are both locally least-favorable, but only the red (rightmost) path is universally least-favorable.
This path is universally least-favorable because it goes in the hardest direction at each point $\tilde P_\epsilon$. To be clear, though, what that direction is can and most often does change as we traverse the path. So it's not that the path is a "straight line" that keeps following the original gradient $\phi_{\hat P}$ no matter what. Instead, this is a path that reevaluates what the hardest direction is after each infinitesimal step and then course corrects to follow it at all times.
For a given estimation problem it might take a little math to be able to express what the universally least-favorable path actually is. But once we have it our lives do get easier in one sense: if we do an iteration of the targeting step along the universally least-favorable path we get to the final targeted estimate after just one step! Why? Because the universally least-favorable path that starts at the update $\tilde P_{\hat\epsilon}$ coincides with the continuation of the universally least-favorable path from $\hat P$. So if $\tilde P_{\hat\epsilon}$ was the maximizer of the log-likelihood along the latter path, it's still the maximizer along the path that starts from there because the two paths coincide. Thus subsequent updates will not change the distribution and this proves that the targeting step converges after just one iteration. For this reason, a TMLE that uses a universally least-favorable submodel is sometimes called a one-step TMLE.
Example: ATE in Observational Study with Binary Outcome
If this all sounds terribly abstract to you, don't worry, you're not alone. Let's work through our familiar example and see where it gets us.
As in our demonstration of the bias-corrected plug-in estimator, presume we've gotten an initial estimate $\hat P$ by modeling $\hat\mu_a(X) \approx P(Y=1|A=a,X)$, $\hat\pi_a(X) \approx P(A=a|X)$, and the marginal density of $X$ with the observed empirical distribution $\mathbb P_n(X)$. That's the full distribution.
As before, we know the efficient influence function for $\psi$ at $\hat P$ is given by
$$\hat\phi = \hat\phi_1 - \hat\phi_0, \qquad \hat\phi_a(Y,A,X) = \frac{1_a(A)}{\hat\pi_a(X)} (Y - \hat\mu_a(X)) + (\hat\mu_a(X) - \mathbb P_n\hat\mu_a(X))$$
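For reference, here is one way this influence function might be computed in code, given arrays of observations and initial nuisance estimates. This is a hedged sketch; the argument names (`mu1`, `mu0`, `pi1`) are my own and not from the text.

```python
import numpy as np

def eif_ate(y, a, mu1, mu0, pi1):
    """Efficient influence function of the ATE, evaluated at the initial estimate P_hat.

    y, a      : observed binary outcome and treatment indicators (arrays)
    mu1, mu0  : estimates of P(Y=1|A=1,X) and P(Y=1|A=0,X) at each subject's X
    pi1       : estimate of P(A=1|X) at each subject's X
    """
    phi1 = a / pi1 * (y - mu1) + (mu1 - np.mean(mu1))
    phi0 = (1 - a) / (1 - pi1) * (y - mu0) + (mu0 - np.mean(mu0))
    return phi1 - phi0
```

In a TMLE, the targeting step described below will drive the empirical mean of this function (the plug-in bias term) to zero.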
Defining a Valid Submodel
We need to construct a submodel $\tilde p_\epsilon$ through $\hat p$ that spans the score $\hat\phi$. How do we do this? Well, we can start brute-force computing stuff, or we can recall that the efficient influence function equals a sum of orthogonal scores in the tangent spaces $\mathcal T_X$, $\mathcal T_{A|X}^0$, and $\mathcal T_{Y|A,X}^0$ because the distribution $\hat P \in \mathcal M$ is a product of independent factors. Specifically, we have:
$$\begin{align*} \hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle} &= \frac{1_1(A)}{\hat\pi_1(X)}(Y - \hat\mu_1(X)) - \frac{1_0(A)}{\hat\pi_0(X)}(Y - \hat\mu_0(X)) \\ &= \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right] \left( Y - \hat\mu_A(X) \right) \\ \hat\phi_{\langle \mathcal T^0_{A|X} \rangle} &= 0 \\ \hat\phi_{\langle \mathcal T_{X} \rangle} &= \left[ \hat\mu_1(X) - \mathbb P_n\hat\mu_1(X) \right] - \left[ \hat\mu_0(X) - \mathbb P_n\hat\mu_0(X) \right] \end{align*}$$
Now in each iteration we can update the densities $\hat p(y|a,x)$, $\hat p(a|x)$, and $\hat p(x)$ separately because each of these scores can only affect its respective density. So although we're really trying to optimize in a three-dimensional submodel with parameters $(\epsilon_{Y|A,X}, \epsilon_{A|X}, \epsilon_X)$, the solution in this particular problem is the same as if we did the optimization separately for each of $\epsilon_{Y|A,X}$, $\epsilon_{A|X}$, and $\epsilon_{X}$. This is what I meant when I said earlier that operating in a higher-dimensional submodel could actually make things easier.
Submodel for $X$
Let's start with the marginal distribution of $X$, which according to our initial estimate is $\mathbb P_n (X)$. The empirical log-likelihood of the data points $X$ is actually already maximized at the empirical distribution (which is why it's sometimes called a nonparametric maximum likelihood estimate). That means that no matter where we are in $\mathcal M$ (or even which direction we build a path along) the optimal perturbation will always be $\hat\epsilon_X = 0$ because we're already at a maximum. Thus the targeting step will never update this part of the distribution so we already have our plug-in for the marginal distribution of the covariates: $\hat P^*(X) = \mathbb P_n(X)$.
Submodel for $A|X$
For the distribution of $A|X$, we see that no matter how we define the path at each point $\hat P$, the fluctuation will always be $0$ because $\hat\phi_{\langle \mathcal T^0_{A|X} \rangle} = 0$. So that distribution will also stay the same from iteration to iteration. Thus $\hat P^*(A|X) = \hat\pi_A(X)$.
Submodel for $Y|A,X$
Now, finally, we come to the distribution of $Y|A,X$. Here we need to find $\hat\epsilon_{Y|A,X}$ (I'll just refer to $\hat\epsilon$ from here on out since the other components don't matter) to maximize $\mathbb P_n \log \tilde p_\epsilon(Y|A,X)$.
Remember that we can construct the path $\tilde p_\epsilon$ any way we like as long as it starts at the initial estimate $\hat p(Y|A,X)$ and the derivative of the log-likelihood with respect to $\epsilon$ at $\epsilon=0$ is the corresponding component of the efficient influence function, $\hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$. Therefore I propose using this path:
$$\text{logit}(\tilde p_\epsilon) = \text{logit}(\hat p) + \epsilon \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right]$$
Obviously if $\epsilon=0$ we have $\tilde p_0 = \hat p$ so this path starts at $\hat p$. It remains to prove that it has the right score, i.e. $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$. I show this in the toggle below if you're curious; it just takes some algebra and derivatives.
Proof that $\frac{d}{d\epsilon} \log \tilde p_\epsilon \big|_{\epsilon=0} = \hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$
Abbreviate $U = \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right]$ and write $\tilde\mu_{\epsilon,A}(X)$ for the perturbed conditional probability $\tilde P_\epsilon(Y=1|A,X)$, so that our definition of the path reads $\text{logit}(\tilde\mu_{\epsilon,A}(X)) = \text{logit}(\hat\mu_A(X)) + \epsilon U$. Taking a derivative w.r.t. $\epsilon$ on both sides and applying the chain rule gives $\left(\frac{1}{\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A})} \right) \frac{d}{d\epsilon} \tilde\mu_{\epsilon,A} = U$, or, equivalently, $\frac{d}{d\epsilon} \tilde\mu_{\epsilon,A} = U\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A})$.
The likelihood of the observed $Y$ is $\tilde p_\epsilon(Y|A,X) = \tilde\mu_{\epsilon,A}(X)^Y(1-\tilde\mu_{\epsilon,A}(X))^{1-Y}$, so $\log \tilde p_\epsilon = Y\log\tilde\mu_{\epsilon,A} + (1-Y)\log(1-\tilde\mu_{\epsilon,A})$ and, by the chain rule, $\frac{d}{d\epsilon}\log\tilde p_\epsilon = \left(\frac{Y}{\tilde\mu_{\epsilon,A}} - \frac{1-Y}{1-\tilde\mu_{\epsilon,A}}\right)\frac{d}{d\epsilon}\tilde\mu_{\epsilon,A} = \frac{Y - \tilde\mu_{\epsilon,A}}{\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A})} \cdot U\tilde\mu_{\epsilon,A}(1-\tilde\mu_{\epsilon,A}) = U(Y - \tilde\mu_{\epsilon,A})$. Evaluating at $\epsilon = 0$, where $\tilde\mu_{0,A}(X) = \hat\mu_A(X)$, gives $\frac{d}{d\epsilon} \log \tilde p_\epsilon\big|_{\epsilon=0} = U(Y - \hat\mu_A(X))$, which is exactly $\hat\phi_{\langle \mathcal T^0_{Y|X,A} \rangle}$, as desired.
Targeting $Y|A,X$
Ok, we've got our path, so how do we find $\hat\epsilon =\argmax_{\epsilon} \mathbb P_n \log \tilde p_\epsilon$? This is where things clean up nicely. The way we constructed our submodel, $\hat\epsilon$ is just the coefficient we get if we regress the vector of $Y_i$ onto the "clever covariate" vector of $U_i = \left[ \frac{1_1(A_i)}{\hat\pi_1(X_i)} - \frac{1_0(A_i)}{\hat\pi_0(X_i)} \right]$ with a fixed offset $b_i = \text{logit}(\hat\mu_{A_i}(X_i))$ using logistic regression!
Proof that this is logistic regression with a fixed offset
We can write the conditional likelihood for a binary variable $Y$ where $P(Y=1|A,X) = \tilde\mu_A(X)$ like $\tilde p_\epsilon = \tilde\mu_A(X)^Y(1-\tilde\mu_A(X))^{(1-Y)}$. This shows that the log-likelihood is
$$\mathbb P_n\log\tilde p_\epsilon = \frac{1}{n}\sum_i^n \bigg[ \log( \underbrace{\tilde\mu_A(X)}_p ) Y + \log( \underbrace{ 1-\tilde\mu_A(X) }_{(1-p)} ) (1-Y) \bigg]$$
Moreover we have from our construction of the path that $\text{logit}(\tilde p_\epsilon(Y|A,X)) = \text{logit}(\hat p(Y|A,X)) + \epsilon U$. Plugging in $Y=1$ and the definition of $\mu_A(X)$ (tilde or hat) immediately gives that
$$\text{logit}(\underbrace{\tilde \mu_A(X)}_{p}) = \underbrace{0}_{\beta_0} + \underbrace{ 1\times \text{logit}(\hat \mu_A(X)) }_{\beta_1 X_1} + \underbrace{ \epsilon \times U }_{\beta_2 X_2}$$
Compare this to the definition of logistic regression (e.g. eqs. 12.7 and 12.4 here). I've included underbraced quantities in the display above so you can directly map the notation.
The only difference with the usual logistic regression is that in our case, we have a fixed intercept $\beta_0=0$ and a fixed coefficient $\beta_1=1$. This is known as an "offset" in the literature about logistic regression and options to use offsets exist in practically every software package that implements logistic regression. The coefficient $\beta_2 = \epsilon$ is what we fit using the "data" $(Y, U)_{1,\dots, n}$, which is precisely the solution to $\hat\epsilon =\argmax_{\epsilon} \mathbb P_n \log \tilde p_\epsilon$, as desired.
After fitting our logistic regression and extracting the coefficient $\hat\epsilon$, the final thing we have left to do is to update the distribution. For this we just have to compute $\tilde p_{\hat\epsilon}$, which in terms of $\tilde\mu_A(X)$ means
$$\tilde\mu_A(X) = \text{logit}^{-1}\left( \text{logit}(\hat \mu_A(X)) + \hat\epsilon \left[ \frac{1_1(A)}{\hat\pi_1(X)} - \frac{1_0(A)}{\hat\pi_0(X)} \right] \right)$$
This is now a fixed function that we can evaluate at any value of $(A,X)$.
Iterating
We can now repeat what we did above, at each step plugging in the result $\tilde\mu_A(X)$ from the previous iteration into the role played by the current estimate $\hat\mu_A(X)$. However, the way we've worked out the submodel in this case implies that $\hat\epsilon = 0$ after the first iteration. Intuitively, that's because we already regressed $Y$ onto $U$ in the first iteration, so subsequent regressions don't change anything. You can formally prove this to yourself if you like, but it's not actually that important. This suggests that the path we chose is universally least-favorable.
Since we've converged already, our final estimate $\hat\mu^*_A(X)$ is exactly the function shown in the display above. We can now execute the plug-in $\psi(\hat P^*) = \mathbb P_n[\hat\mu_1^*(X) - \hat\mu_0^*(X)]$ and that's our final estimate.
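Here is an end-to-end sketch of this whole procedure on simulated data, using statsmodels for the logistic regressions and its `offset` argument for the targeting step. Everything here (the data-generating process, the variable names, the use of plain logistic regressions as the initial fits) is illustrative rather than prescriptive; in practice you'd plug in whatever machine learning estimates of $\hat\mu_a$ and $\hat\pi_a$ you like.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit, logit

rng = np.random.default_rng(42)
n = 2000

# --- Simulate an observational study with a binary outcome (toy DGP) ----------
x = rng.normal(size=n)
a = rng.binomial(1, expit(0.8 * x))              # treatment depends on X
y = rng.binomial(1, expit(-0.5 + a + 0.7 * x))   # outcome depends on A and X

def design(a_vec, x_vec):
    # columns: intercept, treatment, covariate
    return np.column_stack([np.ones(len(x_vec)), a_vec, x_vec])

# --- Step 1: initial estimates of the nuisances (any ML method would do) ------
mu_fit = sm.GLM(y, design(a, x), family=sm.families.Binomial()).fit()
mu1_hat = mu_fit.predict(design(np.ones(n), x))   # \hat\mu_1(X_i)
mu0_hat = mu_fit.predict(design(np.zeros(n), x))  # \hat\mu_0(X_i)

X_trt = np.column_stack([np.ones(n), x])
pi1_hat = sm.GLM(a, X_trt, family=sm.families.Binomial()).fit().predict(X_trt)

# --- Step 2: targeting step for Y|A,X ------------------------------------------
# Clever covariate U_i and fixed offset logit(\hat\mu_{A_i}(X_i))
u = a / pi1_hat - (1 - a) / (1 - pi1_hat)
offset = logit(np.where(a == 1, mu1_hat, mu0_hat))

# One-parameter logistic regression of Y on U with the fixed offset
eps_hat = sm.GLM(y, u.reshape(-1, 1), family=sm.families.Binomial(),
                 offset=offset).fit().params[0]

# --- Step 3: update the outcome regressions along the path ---------------------
mu1_star = expit(logit(mu1_hat) + eps_hat / pi1_hat)
mu0_star = expit(logit(mu0_hat) - eps_hat / (1 - pi1_hat))

# --- Step 4: plug in ------------------------------------------------------------
psi_tmle = np.mean(mu1_star - mu0_star)

# The plug-in bias term P_n \hat\phi^* should now be numerically zero
phi_star = (a / pi1_hat * (y - mu1_star) + mu1_star - np.mean(mu1_star)) \
         - ((1 - a) / (1 - pi1_hat) * (y - mu0_star) + mu0_star - np.mean(mu0_star))

print(f"eps_hat = {eps_hat:.4f}, TMLE ATE = {psi_tmle:.4f}, P_n phi* = {np.mean(phi_star):.1e}")
```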
Efficiency
We've now explained the TMLE algorithm and how it guarantees that the plug-in bias for the targeted plug-in $\hat P^*$ is 0. Of course, since TMLE is a plug-in, we can analyze its efficiency using the exact same expansion as we did for the naive plug-in:
$$\psi(\hat P^*) - \psi(P) = \mathbb P_n\phi + \cancel{ \mathbb P_n\hat\phi^* } + (\mathbb P_n - P)(\hat\phi^* - \phi) + R$$
It remains to show that the empirical process term and second-order remainder are both still $o_P(n^{-1/2})$ even though we've replaced our initial plug-in $\hat P$ (for which we know these terms are $o_P(n^{-1/2})$ as long as we do things sensibly) with the targeted plug-in $\hat P^*$. In other words, we need to make sure that by eliminating the plug-in bias via targeting we haven't messed up the good things we already had going for us.
Thankfully, it turns out that we're still in business. The heuristic argument for why is that at each iteration of the targeting step, we're always increasing the empirical log-likelihood $\mathbb P_n \log \hat p$ of our plug-in estimate. Since each update is done with a parametric model, we're also guaranteed that this contributes only an error of order $o_P(n^{-1/2})$ relative to increasing the true log-likelihood $P \log \hat p$. Thus we can claim that the true log-likelihood of our estimated distribution is always increasing (to a first-order approximation). That, in turn, implies that the KL divergence between $\hat P^*$ and $P$ decreases at least at the same rate (in $n$) that the original plug-in $\hat P$ attains. This implies that $\sqrt{\hat p^*}-\sqrt p$ also attains that rate (in $\mathcal L_2$ norm). Finally, using something called the functional delta method, we can infer that nuisance parameters (which are functionals of the density) also maintain their rates of convergence. You'll recall from the section on the naive plug-in that these are the quantities that we need to converge in order to eliminate the empirical process and 2nd-order remainder terms. As a result of all of this, we can say that any conditions required to attain the $n^{-1/2}$ rate for these terms are indeed maintained through the targeting step, and thus TMLE produces an efficient estimator.
Discussion
Deriving and implementing a TMLE for a given estimation problem generally takes more work than doing a bias-corrected or estimating equations estimator. Since all three give asymptotically equivalent results, why should we bother?
Well, as we've already noted, bias-correction and estimating equations each have their own problems. Sometimes they "blow up" or produce estimates that are outside of the natural range of the parameter. Sometimes you can't even construct the estimating equations estimator and sometimes the bias-corrected estimator isn't able to take advantage of the double robustness of the estimation problem.
TMLE, on the other hand, has none of these problems. What makes TMLE so robust is that it goes back to the original plug-in philosophy: all we're doing is estimating a distribution for the data and then asking what the value of the target parameter is for that distribution. There's no fudging based on a bias correction or trying to brute-force solve an estimating equation (though the final TMLE plug-in $\hat P^*$ does satisfy the estimating equation). Since it's a plug-in, the TMLE can't help but respect any bounds on the parameter. That also typically makes it harder to blow up. In general, you can expect estimators built with TMLE to behave more reasonably and consistently in finite samples.
"Blowing up" an estimator
We saw that the bias-corrected and estimating equations approaches give the same estimator for the ATE problem, which we called AIPW. The form of this estimator is a sample average over terms where one of the factors is $1/\hat\pi_a(X)$. If the estimated propensity score $\hat\pi_a(X)$ is close to 0 or 1 for some subjects in the study, that factor ends up being massive and it completely dominates the sample average. Of course, in an infinite sample, that couldn't happen because some other subject would have an equally large weight in the other direction or the large weight would be balanced with many subjects with smaller weights. But we don't have infinite data. We have finite data. That makes the AIPW estimator unstable in finite samples where there are strong predictors of treatment assignment. The only way to address this is to use ad-hoc methods like weight stabilization or trimming.
On the other hand, the TMLE estimator doesn't incorporate terms of the form $1/\hat\pi_a(X)$ directly into a sample average. Instead, it uses them as a covariate in a linear adjustment of the original regression model $\hat\mu_A(X)$. That makes the procedure more stable in the face of small or large propensity scores. This comes automatically. There are also additional variations on TMLE (e.g. collaborative TMLE) that are theoretically justified and which do even more to improve the finite-sample robustness of the resulting estimator.
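To see the blow-up numerically, here is a hedged toy sketch (all numbers invented) that computes the AIPW estimate once with well-behaved estimated propensities and once after forcing a single subject's propensity to be nearly zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
y = rng.binomial(1, 0.4, size=n)
a = rng.binomial(1, 0.5, size=n)
mu1_hat = np.full(n, 0.45)                  # some fitted outcome regressions (held fixed here)
mu0_hat = np.full(n, 0.35)
pi1_hat = rng.uniform(0.3, 0.7, size=n)     # well-behaved estimated propensities

def aipw(y, a, mu1, mu0, pi1):
    # AIPW / bias-corrected plug-in: sample average of the uncentered influence-function terms
    return np.mean(a / pi1 * (y - mu1) + mu1 - ((1 - a) / (1 - pi1) * (y - mu0) + mu0))

print("AIPW with well-behaved propensities:", aipw(y, a, mu1_hat, mu0_hat, pi1_hat))

# Now force one treated subject's estimated propensity to be nearly zero
pi1_bad = pi1_hat.copy()
treated = np.flatnonzero(a == 1)[0]
pi1_bad[treated] = 1e-4                      # 1/pi1 ~ 10,000 for this single subject
print("AIPW after one extreme propensity:  ", aipw(y, a, mu1_hat, mu0_hat, pi1_bad))
```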
Besides the practical benefits, TMLE has a lot going for it in theory. As I touched on earlier, TMLE is the "natural" extension of maximum likelihood estimation to cases where it is impossible to carry out the explicit optimization problem in infinite dimensions. TMLE beautifully reduces this problem to a series of lower-dimensional maximum likelihood problems by correctly identifying the right direction to move in within the statistical model.
Moreover TMLE leaves a lot of room for well-motivated variations. Different choices of submodels or additional scores (remember the submodel just has to span the gradient, but it can include additional directions if desired) can give TMLE more beneficial properties. There is also collaborative TMLE, in which components of the distribution that don't get updated by the targeting step (e.g. $\hat\pi_a$ in our ATE example) are fit in such a way that they maximize the benefit of targeting instead of just minimizing a generic loss function. Finally, TMLE admits natural extensions to attain higher-order efficiency (i.e. cancelling second-order bias terms), which is challenging in the bias-correction and estimating equations frameworks.
In short, TMLE has all of the advantages of historical approaches in the sense that it can leverage flexible machine learning models and effectively minimize or even eliminate statistical assumptions. However, it eliminates most of the drawbacks of those methods and even leaves room for further innovation.