4.3 Estimating Equations
The second approach to controlling the plug-in bias term is often referred to as the (efficient) estimating equations approach. This approach constructs an estimator that doesn't require a full plug-in estimate $\hat P$ at all. Instead, the estimator works by finding a value for the target parameter and a set of nuisance parameters such that the estimated influence function satisfies $\mathbb P_n \hat \phi = 0$. First we'll describe the approach in a little more detail and give an example, and then we'll return and prove that we get an efficient estimator.
To use the estimating equation approach, we must presume that the efficient influence function at any given distribution depends only on the target parameter $\psi$ along with a (possibly infinite-dimensional) set of nuisance parameters $\eta$. Therefore, instead of writing $\phi_P$ for the efficient influence function at $P$, we can write $\phi_{\psi, \eta}$ (sometimes you'll also see the notation $\phi(Z; \psi, \eta)$). If we can write the influence function in this way, we can proceed by first estimating $\hat\eta$ using some flexible, fast-converging method (and, e.g., sample splitting). After fixing $\hat\eta$, we then find a value $\hat\psi^{(\text{ee})}$ ("ee" standing for "estimating equations") such that when we plug it into $\mathbb P_n \hat\phi = \mathbb P_n \phi_{\hat\psi^{(\text{ee})}, \hat\eta}$, we get 0. That's all there is to it.
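To make the recipe concrete, here is a minimal sketch in Python of the generic procedure. The names (`solve_estimating_equation`, `eif`) and the use of a bracketing root-finder are purely illustrative assumptions, not part of any particular library; in many problems (like the example below) the solution is available in closed form and no numerical solver is needed.

```python
import numpy as np
from scipy.optimize import brentq

def solve_estimating_equation(eif, data, eta_hat, lo, hi):
    """Find psi_hat such that the empirical mean of the estimated influence
    function is zero, i.e. P_n phi_{psi_hat, eta_hat} = 0.

    `eif(data, psi, eta_hat)` should return the influence function evaluated
    at every observation; `lo` and `hi` must bracket the root."""
    estimating_equation = lambda psi: np.mean(eif(data, psi, eta_hat))
    return brentq(estimating_equation, lo, hi)
```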
Example: ATE in an Observational Study
The estimating equation approach is best explained with an example. As before, we'll focus first on the counterfactual mean $\psi_a = E[Y^a]$, identified as $E[E[Y|A=a,X]]$. The first step is to estimate any nuisance parameters that we need for the estimated influence function. In this case, those are the outcome regression $\mu_a(X) = E[Y|A=a,X]$ and the propensity score $\pi_a(X) = P(A=a|X)$, because at a given distribution $P$ we know the efficient influence function for $\psi_a$ is

$$\phi_{\psi_a, (\mu_a, \pi_a)} = \frac{1_a(A)}{\pi_a(X)} \left(Y - \mu_a(X)\right) + \mu_a(X) - \psi_a$$
The fact that we can cleanly write it in this form (with target and nuisance parameters clearly separated) means we can proceed with the estimating equations approach.
As before, we estimate the nuisance parameters $\mu_a$ and $\pi_a$ using algorithms that are powerful enough to control the remainder term, and we use sample splitting to control the empirical process term (or make the Donsker assumption).
We now set $\mathbb P_n \phi_{\hat\psi_a, (\hat\mu_a, \hat\pi_a)} = 0$ and solve for $\hat\psi_a$:
$$\begin{align*} 0 &= \mathbb P_n \left[ \frac{1_a(A)}{\hat\pi_a(X)} \left(Y - \hat\mu_a(X)\right) + \hat\mu_a(X) - \hat\psi_a^{(\text{ee})} \right] \\ \hat\psi_a^{(\text{ee})} &= \mathbb P_n \left[ \frac{1_a(A)}{\hat\pi_a(X)} \left(Y - \hat\mu_a(X)\right) + \hat\mu_a(X) \right] \end{align*}$$

And once again we obtain an estimate of the ATE with $\hat\psi^{(\text{ee})} = \hat\psi_1^{(\text{ee})} - \hat\psi_0^{(\text{ee})}$.
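To make this concrete, here is a minimal sketch of the resulting estimator with cross-fitting (sample splitting). This is an illustration under assumptions, not a reference implementation: the function name `aipw_psi_a`, the gradient-boosting learners, and the two-fold split are arbitrary choices, and the arrays `Y`, `A`, `X` are assumed to hold the observed data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def aipw_psi_a(Y, A, X, a, n_splits=2, seed=0):
    """Estimating equations estimate of psi_a with cross-fitting.

    The nuisances mu_a(X) = E[Y | A=a, X] and pi_a(X) = P(A=a | X) are fit on
    each training fold; the closed-form solution of P_n phi_hat = 0 is then the
    empirical mean of the (uncentered) estimated influence function."""
    Y, A, X = np.asarray(Y, dtype=float), np.asarray(A), np.asarray(X, dtype=float)
    phi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        in_group = train[A[train] == a]                     # training units with A = a
        mu_model = GradientBoostingRegressor().fit(X[in_group], Y[in_group])
        pi_model = GradientBoostingClassifier().fit(X[train], (A[train] == a).astype(int))
        mu = mu_model.predict(X[test])                      # estimated mu_a(X) on held-out fold
        pi = pi_model.predict_proba(X[test])[:, 1]          # estimated pi_a(X) on held-out fold
        ind = (A[test] == a).astype(float)                  # 1_a(A)
        phi[test] = ind / pi * (Y[test] - mu) + mu          # uncentered estimated EIF
    return phi.mean()

# The ATE estimate is the difference of the two counterfactual means:
# psi_hat_ee = aipw_psi_a(Y, A, X, a=1) - aipw_psi_a(Y, A, X, a=0)
```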
Does this look familiar? In this example, the estimating equation approach and the bias correction approach give the same estimator!
Efficiency
Although the estimating equation approach does not require a full plug-in estimate $\hat P$, we can still analyze it in an identical way. To wit,

$$\begin{align*}
0 &= \mathbb P_n \hat\phi \\
\hat\psi^{(\text{ee})} - \psi &= \mathbb P_n \hat\phi + \left[\hat\psi^{(\text{ee})} - \psi\right] + \underbrace{\mathbb P_n(\phi - \phi) + P(\hat\phi - \hat\phi) + P\phi}_{0} \\
&= \mathbb P_n \phi + \underbrace{\underbrace{(\mathbb P_n - P)(\hat\phi - \phi)}_{\text{empirical process}} + \underbrace{\left[\hat\psi^{(\text{ee})} - \psi\right] + P\hat\phi}_{\text{2nd-order remainder}}}_{o_P(n^{-1/2})}
\end{align*}$$
You can see that the empirical process and 2nd-order remainder terms here are exactly what we had before in our analysis of the naive plug-in, so those terms disappear under the same assumptions (e.g. fast-enough estimation of the nuisance parameters) that we can easily meet by using tools from machine learning. As a consequence, the only thing left over is the efficient CLT term $\mathbb P_n \phi$, which proves that $\hat\psi^{(\text{ee})}$ has influence function $\phi$ (the efficient influence function) and is thus an efficient estimator.
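Spelling out the conclusion (a standard consequence of asymptotic linearity, stated here for completeness): applying the central limit theorem to $\mathbb P_n \phi$ gives

$$\sqrt{n}\left(\hat\psi^{(\text{ee})} - \psi\right) \rightsquigarrow N\!\left(0, \ P\phi^2\right)$$

so an asymptotically valid $(1-\alpha)$ Wald interval is $\hat\psi^{(\text{ee})} \pm z_{1-\alpha/2} \sqrt{\mathbb P_n \hat\phi^2 / n}$, where the asymptotic variance $P\phi^2$ is estimated by the empirical second moment of the estimated influence function.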
Discussion
Estimating equations are an extremely powerful and broadly studied approach. The advantage that estimating equations have over bias correction is that the resulting estimator may still be asymptotically normal and unbiased (although not necessarily efficient) even if some of the nuisance parameters don't converge to their true values. On the other hand, the bias-corrected approach may sometimes require convergence of all nuisance parameters because the naive plug-in $\psi(\hat P)$ appears in its remainder term. In a sense, the estimating equations approach can better take advantage of the doubly-robust nature of certain estimation problems. This is not much of a concern these days because we have powerful machine learning methods to estimate nuisance parameters, meaning that the initial plug-in should generally be good enough to ensure all the convergences we want. But, of course, it's possible that this phenomenon could impact finite-sample robustness.
Wait... what??
How can the properties of an estimating equations estimator and a bias-corrected estimator be different if the empirical process and remainder terms are the same for both of them?? Well... they're actually not. In particular, the problem is the 2nd-order remainder.
For the bias-corrected estimator, the 2nd-order remainder is $R = [\hat\psi - \psi] + P\hat\phi$, whereas for the estimating equations estimator it's $R = [\hat\psi^{(\text{ee})} - \psi] + P\hat\phi$. Do you see the difference? The bias-corrected estimator has the original naive plug-in estimate $\hat\psi = \psi(\hat P)$ in the 2nd-order remainder, whereas the estimating equations estimator has its own estimate $\hat\psi^{(\text{ee})}$. In many cases, this difference doesn't matter because whatever estimate is used there gets cancelled with what's in the estimated influence function. But if that doesn't happen, then the 2nd-order remainder will have different convergence properties for the bias-corrected and estimating equations estimators.
This explains why the bias-corrected estimator is sometimes not doubly robust when the estimating equations estimator is. The 2nd-order remainder for the bias-corrected estimator has the naive plug-in estimate in it, so sometimes we need everything in the naive plug-in to converge appropriately.
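To see the double robustness concretely, we can work out the 2nd-order remainder of the estimating equations estimator in the ATE example above (a standard calculation that isn't shown in the text). Plugging $\hat\phi = \frac{1_a(A)}{\hat\pi_a(X)}(Y - \hat\mu_a(X)) + \hat\mu_a(X) - \hat\psi_a^{(\text{ee})}$ into $R = [\hat\psi_a^{(\text{ee})} - \psi_a] + P\hat\phi$ and using $\psi_a = E[\mu_a(X)]$ gives

$$R = E\left[\frac{\pi_a(X) - \hat\pi_a(X)}{\hat\pi_a(X)} \left(\mu_a(X) - \hat\mu_a(X)\right)\right]$$

which vanishes if either $\hat\pi_a = \pi_a$ or $\hat\mu_a = \mu_a$ and, by Cauchy-Schwarz, is bounded by the product of the two nuisance estimation errors. The estimating equations estimator therefore only needs one of the two nuisances to be consistent to be asymptotically unbiased, and only needs the product of their errors to be $o_P(n^{-1/2})$ to be efficient.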
Despite this arguably minor advantage, estimating equations estimators suffer from some of the same drawbacks as the bias-correction approach in finite samples: namely, unstable estimates or estimates that fall outside of their natural range.
Moreover, the estimating equations approach requires that the efficient influence function explicitly include the target parameter in a form that is cleanly separated from the nuisance parameters. In our example above this happened to be the case, so we could proceed. But if the efficient influence function does not admit such a representation, the estimating equation approach can't be used to construct an efficient estimator. And even when it does admit such a representation, there's no guarantee that a solution to the estimating equation exists or that it is unique.
Double Machine Learning (DML)
A relatively recent paper by Chernozhukov et al. has gained some traction in the literature and online. The approach they describe is exactly a subset of the estimating equation approach described in this section when used with sample splitting instead of relying on Donsker conditions. The condition that they refer to as "Neyman orthogonality" of a score $h$ used to construct the score equation $\mathbb P_n h(Z; \hat\psi, \hat\eta) = 0$ is, to my understanding, essentially equivalent to requiring that $h$ is a gradient of $\psi$, i.e. that $\nabla_{h'} \psi = E[h h']$. So we may as well talk about influence functions instead of "Neyman orthogonal scores". That means that DML, for most practical problems, is just the estimating equations approach to efficient estimation using sample splitting.
History
Once again we can turn to Mark van der Laan for some context:
In the 1990s, the corresponding approach of estimating equation-based estimators (EEE) was rigorously developed by [Jamie] Robins and collaborators, in which one constructs efficient estimators as solutions of the [efficient] estimating equation... Fundamental work on EEE was done by Robins and Rotnitzky (circa 1992 and onward), and collaborators, in the context of censored data and causal inference models, involving clever representations and derivations of the [efficient influence function] (e.g., the augmented IPCW representation of the [efficient influence function]). A comprehensive review and treatment of the general efficient estimating equation methodology – going beyond application to censored data and causal inference models, including its theory – is presented in the book Unified Methods for Censored Longitudinal Data and Causality by van der Laan and Robins (2001).