2.3 Unidentified Estimands
As we saw in the last section, identification is the process of mathematically tying a statistical estimand to the causal estimand of interest. Since we can observe data from the statistical distribution, we can infer the value of the statistical estimand. The identification result takes us the final mile and makes that inference causal.
There are cases, however, where we just can’t get the identification result we need to get a causal interpretation. For example, let’s say we have an observational study and we know we’ve missed an important confounder (e.g. some socioeconomic indicator). We also don’t have any special structure we can exploit to identify a different estimand, or perhaps we’re really only interested in the ATE. In that case, we can still proceed with our analysis and get an estimate $\hat\psi$ of the statistical estimand $\psi$, but we can’t say that the statistical estimand represents our causal estimand $\psi^*$.
The difference between these two, i.e. $\psi - \psi^*$, is what we call the causal gap. The causal gap isn’t something you can ever observe. You don’t know the true values of the causal or even statistical estimands. But it’s a concept you should keep in the back of your head when you’re estimating the statistical estimand. Ask yourself: even if I had infinite data, how far off might I be from the true answer to my causal question? Do I really believe in a set of identifying assumptions that would give a causal interpretation to this estimate?
Thinking there is a causal gap isn’t the end of the world, though. You can still often say something useful without identifying your causal estimand. For one, maybe it’s already good to know whether or not there appears to be a statistical association between the outcome and treatment; by itself that could be grounds for further, more careful investigation. There are also pre- and post-hoc methods you can use to interrogate and communicate the causal gap: that’s what this section is about.
Lastly, you should also remember that the causal gap is just one part of the total estimation error:
$$\underbrace{\hat\psi - \psi^*}_\text{total error} = {\color{red} \underbrace{(\hat\psi - \psi)}_\text{statistical error}} + {\color{blue} \underbrace{(\psi - \psi^*)}_\text{causal gap}}$$
Even if your estimand is identified, you still have to estimate it well in practice to get small statistical error. The statistical error is itself decomposable into statistical bias and statistical variance components. So if you use an inconsistent estimator (e.g. linear regression in an observational study), you get statistical bias on top of the bias from the causal gap. And if you have a small dataset, your statistical variance may contribute much more to total error than the causal gap.
$${\color{red} \underbrace{\hat\psi - \psi}_\text{statistical error}} = {\color{red} \underbrace{(\hat\psi - E[\hat\psi])}_\text{statistical variance}} + {\color{red} \underbrace{(E[\hat\psi] - \psi)}_\text{statistical bias}}$$
That said, the statistical error is exactly what we try to control with statistical tools. We use consistent estimators so the statistical bias is (at least asymptotically) zero. We use confidence intervals and p-values to quantify the amount of statistical variance, which we keep small by using efficient estimators (the subject of the next two chapters). Point being: we already have tools to control, interrogate, and communicate statistical error. What we need are the equivalent tools for the causal gap.
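To make the decomposition concrete, here’s a minimal simulation sketch. The data-generating process and all parameter values are made up for illustration: an unobserved confounder $U$ inflates the naive difference-in-means estimand to $\psi = 2.2$, while the true ATE is $\psi^* = 1$, so the causal gap is 1.2 by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up confounded process: U is an unobserved confounder.
#   U ~ Bernoulli(0.5)
#   A ~ Bernoulli(0.2 + 0.6*U)
#   Y = A + 2*U + N(0, 1)
# Causal estimand (ATE): psi_star = 1 (the coefficient on A).
# Statistical estimand (naive difference in means):
#   psi = E[Y|A=1] - E[Y|A=0] = 1 + 2*(0.8 - 0.2) = 2.2
psi_star, psi = 1.0, 2.2

n = 500
U = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.2 + 0.6 * U)
Y = A + 2 * U + rng.normal(size=n)

psi_hat = Y[A == 1].mean() - Y[A == 0].mean()  # estimates psi, not psi_star

print(f"total error:       {psi_hat - psi_star:+.3f}")
print(f"statistical error: {psi_hat - psi:+.3f}")   # shrinks as n grows
print(f"causal gap:        {psi - psi_star:+.3f}")  # fixed at 1.2 forever
```

Cranking up `n` drives the statistical error to zero, but the causal gap of 1.2 is untouched: no amount of data fixes an identification problem.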
Partial Identification
We actually already have an extremely powerful tool to control the causal gap: an identification proof! If we have identification, the causal gap is eliminated completely. But in this section we’re interested in cases where we can’t prove identification of our causal estimand. It turns out identification proofs can still help us sometimes; we just have to shift the goalposts a little bit.
Instead of finding a statistical estimand that is always equal to the causal estimand, sometimes we can find one that is always greater than or equal to the causal estimand (and typically another that is always less than or equal). To make the discussion simpler, let’s assume that we know $\psi^* \ge 0$ so we only care about an upper bound. Then, formally, what we’re looking for is a statistical estimand $\psi^\uparrow$ such that $\psi^\uparrow(\mathcal O(P^*)) \ge \psi^*(P^*)$ for all $P^* \in \mathcal M^*$.
We haven’t eliminated the causal gap, but the point of this is that we have bounded it. For any hypothetical statistical estimand $\psi$ that we pick such that $\psi \le \psi^\uparrow$, we know we get a bounded causal gap: $\psi - \psi^* \le \psi^\uparrow - \psi^*$.
The process of identifying $\psi^\uparrow$ is called partial identification. This is in contrast to point identification, which is what we’ve been discussing so far. In partial identification, we’re formally estimating a set that contains the true causal parameter, not the exact value of the causal parameter (which would be impossible without more assumptions).
Here there are a number of causal distributions that all map to the same statistical distribution, making the causal estimand $\psi^*$ unidentifiable by the statistical parameter $\psi$. In effect, many values of $\psi^*$ are equally compatible with full knowledge of $P$. However, the statistical estimands $\psi^\downarrow$ and $\psi^\uparrow$ can still give us provably accurate bounds on $\psi^*$.
Once we find a bounding statistical estimand $\psi^\uparrow$, we can use data to generate an estimate $\hat\psi^\uparrow$ of it with our usual estimation toolkit. We then have a bound $0 \le \psi^* \le \hat\psi^\uparrow$ where the upper boundary holds with some probability that we can try and quantify with confidence intervals or p-values on $\hat\psi^\uparrow$.
Example: Deterministic Treatment
Consider an observational study with a binary outcome $Y$ where treatment $A$ was deterministically assigned on the basis of a single binary covariate $X$ (i.e. $A = X$). Our target causal parameter is $\psi^* = E[Y(1)]$, the outcome rate if we were to treat everyone in the population.
The problem is that we have absolutely no way of knowing what happens when we treat people with $X=0$ because we have nobody like that in our study sample. Formally, this is a violation of the positivity assumption: $P(A=1|X=0) = 0$. Thus we cannot identify $\psi^*$.
We can, however, identify an upper bound! Consider the structural equations:
$$\begin{align*} X &\sim \text{Binom}(\rho) \\ A &= X \\ Y(a) &\sim \text{Binom}(\mu(a,X)) \end{align*}$$
Then, by iterated expectation,
$$\begin{align*} E[Y(1)] &= E[E[Y(1)|X]] \\ &= \mu(1,1)\rho + \mu(1,0)(1-\rho) \end{align*}$$
The constant $\rho$ we can estimate by the sample proportion with $X=1$ (call it $\hat\rho$). Similarly, we can identify $\mu(1,1) = E[Y|A=1,X=1]$ without difficulty and estimate that as the outcome rate in our sample among those people with $X=1$ and $A=1$ (call it $\hat\mu_{1,1}$). The problem is the remaining term, $\mu(1,0)$, which we cannot identify due to the positivity violation (the conditioning event $(A=1,X=0)$ has zero probability).
What we do know is that $0 \le \mu(1,0) \le 1$, since this number is a probability. From this we can derive bounds on $\psi^* = E[Y(1)]$:
$$\underbrace{\mu(1,1)\rho}_{\psi^\downarrow} \le \psi^* \le \underbrace{\mu(1,1)\rho + (1-\rho)}_{\psi^\uparrow}$$
These two bounds are identified and we can estimate them by plugging in $\hat\mu_{1,1}$ and $\hat\rho$. If the point estimates are, say, $\hat\mu_{1,1} = 0.2$ and $\hat\rho = 0.5$, and we have enough data that their variances are negligible, then the conclusion would be $0.1 \le \psi^* \le 0.6$. This example is very simple and the bound we get isn’t particularly tight, but the point is to help you understand the mechanics of partial identification.
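To make the mechanics concrete, here’s a minimal sketch that simulates data from a process of this form and computes the plug-in bound estimates. The true parameter values ($\rho = 0.5$, $\mu(1,1) = 0.2$, $\mu(0,0) = 0.6$) are made up to match the numbers above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up truth: rho = 0.5, mu(1,1) = 0.2, mu(0,0) = 0.6.
# mu(1,0) is never observed because A = X deterministically.
n = 10_000
X = rng.binomial(1, 0.5, n)
A = X  # deterministic treatment assignment: positivity violated
Y = rng.binomial(1, np.where(A == 1, 0.2, 0.6))

rho_hat = X.mean()
mu11_hat = Y[(A == 1) & (X == 1)].mean()

psi_lower = mu11_hat * rho_hat                  # plug-in estimate of psi_down
psi_upper = mu11_hat * rho_hat + (1 - rho_hat)  # plug-in estimate of psi_up
print(f"{psi_lower:.3f} <= psi* <= {psi_upper:.3f}")  # ~ 0.1 <= psi* <= 0.6
```

At smaller sample sizes you would of course also want confidence intervals on these bound estimates, per the discussion above.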
More examples and a more thorough explanation are given in Tamer 2010. Partial identification is a rich field in and of itself; this section is just a starting point for you to understand the general ideas.
Sensitivity Analyses
Partial identification is a “prospective” way of dealing with an identification problem in the sense that we change what we set out to estimate in order to meter our hubris. An alternative is to perform a sensitivity analysis: in this approach we stubbornly point estimate a statistical estimand $\psi$ and then “retrospectively” attempt to argue away the causal gap. The idea here isn’t necessarily to bound the causal gap per se, but to get an idea of how big the causal gap would have to be in order to qualitatively change our interpretation of the point estimate.
For example, if we do an analysis and obtain a point estimate $\hat\psi = 100$ with a standard error of 1, the causal gap would have to be on the order of $\psi - \psi^* \approx 100$ in order to “wash away” a statistically significant result vis-a-vis the null $\psi^* = 0$. Depending on the application, it might be very difficult if not impossible to imagine what could cause confounding of that magnitude (or violation of whatever relevant assumption is in question). Therefore the qualitative result would stand even if there is some substantial question as to the precise value of the point estimate.
Simulation
The most “brute force” approach to sensitivity analyses is via simulation.
The idea is straightforward: first, pick or make up a set of causal distributions $\{P^*_k \in \mathcal M^* : k \in 1 \dots K\}$ that all seem plausible to you or to domain experts based on what is known about your question, but which don’t enforce any identifying assumptions (e.g. allow $P^*_k \notin \mathcal M^\circ$). In fact, you might even choose distributions that substantially deviate from these assumptions (e.g. that have a strong unobserved confounder).
Since you made up these distributions, you can directly calculate $\psi^*_k$ analytically or, if the distribution is complicated, estimate it to arbitrary precision by drawing an arbitrarily large number of samples and computing $\psi^*(P_{k,n}^*)$; that’s where the simulation comes in.
Next, generate $P_k = \mathcal O(P_k^*)$ for each putative causal distribution and calculate the statistical estimand $\psi_k$ by similar means. You can even generate datasets of realistic sample sizes from $P_k$ and compute estimates $\hat\psi_k$.
Finally, you can compare $\psi_k$ and $\psi_k^*$ for different $k$ to get an idea of the causal gaps you might encounter “in the wild”. You might even compare the empirical distribution of $\hat\psi_k$ to $\psi_k^*$ to quantify the total error in these scenarios. After you’ve performed your actual analysis and obtained an estimate $\hat\psi$ from real data, you can compare its magnitude to the range of causal gaps you got in your simulations. If none of the causal gaps comes close to that magnitude, you can argue that no plausible violation of your identifying assumptions would change your qualitative conclusion.
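Here’s a minimal sketch of this workflow, reusing the made-up confounded process from the error-decomposition example above but now indexing the causal distributions by the strength $b$ of the unobserved confounder. The ATE $\psi^*_k$ is known analytically by construction, and $\psi_k$ is computed from a very large simulated draw:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000  # large draw so psi_k is computed to high precision

# Made-up causal distributions P*_k indexed by the strength b of an
# unobserved confounder U. By construction the ATE psi*_k is always 1.
for b in [0.0, 0.5, 1.0, 2.0]:
    U = rng.binomial(1, 0.5, N)
    A = rng.binomial(1, 0.2 + 0.6 * U)
    Y = A + b * U + rng.normal(size=N)

    psi_star_k = 1.0  # known analytically: the coefficient on A
    psi_k = Y[A == 1].mean() - Y[A == 0].mean()  # naive statistical estimand
    print(f"b = {b:.1f}: causal gap = {psi_k - psi_star_k:+.3f}")
```

If your real-data estimate dwarfs every gap in the printed range, you have an argument that no confounder of plausible strength would overturn your conclusion.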
This approach is straightforward and intuitive, but it has several shortcomings. For one, it’s not clear what is meant by “plausible” causal distributions. Different people will disagree about which simulations are plausible and which are not. Moreover, it’s impossible to satisfy everyone because you can only simulate a finite number of distributions but there are infinitely many possibilities for what might be plausible. Lastly, simulations are typically simple and parametric because that’s what’s easiest to implement in code. It’s hard to argue that this replicates any “realistic” data-generating process, though to some extent this problem can be mitigated by using semi-synthetic data.
Critical Causal Gap
Instead of generating a range of plausible causal gaps via simulation, we can instead directly calculate and report the smallest causal gap that would qualitatively invalidate our finding. This is a fully nonparametric and general-purpose method for any estimator with an asymptotically normal distribution (which is practically every useful estimator).
For example, let’s say our estimated effect was $\hat\psi = 1 \pm 0.5$ (point estimate $\pm$ radius of 95% confidence interval). Presume we are testing against a null hypothesis $\psi^* = 0$. If we imagine there is no causal gap, then $\hat\psi^* = \hat\psi$ and the effect is statistically significant because the confidence interval doesn’t span 0. On the other hand, if there is a causal gap $\psi - \psi^* = 1$, then our estimate of the causal parameter should be adjusted to be $\hat\psi^* = 0 \pm 0.5$, which we see is no longer statistically significant.
Indeed, the exact value of the causal gap for which statistical significance ceases to hold is $\psi - \psi^* = 0.5$. This is the number we would report as the result of our sensitivity analysis (call it $\Delta$, the critical causal gap). It represents the smallest causal gap that could invalidate our finding. More generally, we could also produce a plot showing the estimated causal parameter $\hat\psi^*$ and confidence interval as a function of the hypothetical causal gap:
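Here’s a sketch of how such a plot can be generated for the running example. The numbers are the made-up ones from above; the code simply shifts the point estimate and its interval by each hypothetical gap:

```python
import numpy as np
import matplotlib.pyplot as plt

psi_hat, radius = 1.0, 0.5  # point estimate and 95% CI radius from the example

gaps = np.linspace(0, 1.5, 200)  # hypothetical causal gaps psi - psi*
adjusted = psi_hat - gaps        # implied causal estimates psi_hat - gap

plt.plot(gaps, adjusted, label=r"adjusted estimate $\hat\psi^*$")
plt.fill_between(gaps, adjusted - radius, adjusted + radius,
                 alpha=0.3, label="95% CI")
plt.axhline(0, color="black", lw=0.5)
plt.axvline(psi_hat - radius, ls="--", color="red",
            label=r"$\Delta = 0.5$ (critical causal gap)")
plt.xlabel(r"hypothetical causal gap $\psi - \psi^*$")
plt.ylabel(r"$\hat\psi^*$")
plt.legend()
plt.show()
```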
This entire exercise can just as easily be repeated in a framework where practical significance (instead of statistical) is of interest. For example, let’s say that effects of size $< 0.3$ are not of practical interest. Then perhaps it would make sense to report $\Delta = 0.2$ in our example (since the CI for $\hat\psi^* = 0.8 \pm 0.5$ would overlap $0.3$). Any information of this kind can also easily be read off of a plot like the one shown above.
In contrast to running simulations, this kind of sensitivity analysis puts the onus on the reader to posit a realistic data-generating mechanism that could attain a causal gap greater than $\Delta$. This could be seen as a weakness of the framework: $\Delta$ doesn’t tell you anything specific. Is our result sensitive to some kind of unobserved confounding? Is it sensitive to a positivity violation? In what subpopulation? All these questions go unanswered.
But this vagueness is also a strength: with just one plot, any stakeholder is empowered to come to the conclusion that is appropriate for their interests. The analyst is therefore not in the role of enforcing their value judgements on the reader. Moreover we don’t have to impose any assumptions at all.
Alternatives
There is a huge literature on sensitivity analyses in causal inference but you’ll find very little about purely simulation-driven methods or critical causal gaps in it. That’s because these methods are so simple and so broadly applicable that there’s really not much left to be said about them!
On the other hand, things get more “interesting” if you’re willing to work with specific problems or estimators, or to make a few assumptions. Obviously if you’re willing to believe a specific parametric model you can assign a meaning to the values of different parameters and see how far off your statistical parameter is from the causal parameter. But you might also try to describe part of your system nonparametrically and leave only one relevant part parametrically specified so you can test against violations of your assumptions there. There is also the question of how to quantitatively measure a violation of such an assumption (e.g. the “amount” of unmeasured confounding) without reducing it too crudely to a structural parameter. Basically, there is a large “gray area” between parametric simulations and a critical causal gap analysis.
The point of all this is to be able to more clearly interpret the result of the sensitivity analysis. For example, in some of these approaches, you can say things like “as long as no unobserved confounder has marginal odds ratios greater than $\alpha$ with the outcome and treatment, then observed data support a statistically significant estimate of the causal estimand”.
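One well-known method in this vein (not covered in detail here) is the E-value of VanderWeele and Ding (2017), which, for an estimated risk ratio, reports the minimum strength of association an unmeasured confounder would need with both treatment and outcome to fully explain away the result. A minimal sketch:

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio (VanderWeele & Ding 2017): the
    minimum strength of association, on the risk-ratio scale, that an
    unmeasured confounder would need with both the treatment and the
    outcome to fully explain away the observed association."""
    if rr < 1:
        rr = 1 / rr  # symmetric treatment of protective effects
    return rr + math.sqrt(rr * (rr - 1))

# An observed risk ratio of 2 could only be explained away by a confounder
# associated with both treatment and outcome at risk ratios >= ~3.41.
print(round(e_value(2.0), 2))  # 3.41
```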
Instead of taking you through a zoo of these approaches, I’ll give you a list of a few papers that you can skim to get an idea of what I mean without having to dig into all the details.
The important thing is to focus on what the restrictions are: is the method tied to a particular estimator, estimand, and/or causal model? These are often not stated explicitly! Also assess the benefits: does this method clarify how much certain assumptions can be violated vis-a-vis an agnostic causal gap analysis?
Conclusions
Partial identification and sensitivity analysis are both huge, messy fields. It can be difficult to navigate the literature and decide what to do with whatever problem you’re actually working on. Moreover, all of this requires careful consultation with domain experts who can tell you what is a reasonable assumption and what estimands are really of interest. Since it’s all a bit tricky, we’ll give you some personal guidelines; feel free to seek a second or third opinion.
First off, figure out if the causal estimand you really care about (e.g. ATE) is plausibly identified. If there is substantial doubt in the required assumptions (e.g. you know you’re missing an important confounder) then think about whether a different estimand might be identified (e.g. a complier ATE) because you have some special structure you can take advantage of in your problem (e.g. an instrument). If you can get a more defensible identification of that estimand, then go from there.
If you’re not willing to change your estimand or you can’t exploit any special structure, next see if you have enough to at least partially identify the estimand (i.e. if you can identify bounds on the estimand). There will usually be different approaches to do this that might also exploit some special structure or require particular assumptions, so review the literature and see what you can find and make use of.
If bounds aren’t good enough for you or the partial identification assumptions are themselves too complicated to defend, you should go ahead and estimate the statistical estimand you’ve got but protect yourself by doing a sensitivity analysis. We recommend a combination of a critical causal gap analysis and some simulations with semi-synthetic data since these approaches are so widely applicable and interpretable. However, if your problem has specific structure or you want a specific kind of interpretation of the sensitivity parameter, then you should go look through the literature and see if there’s something that better fits your needs.