You wake up. It’s 2006. You look in the mirror and you see you are Mark van der Laan. You’ve just invented TMLE.
❓ What is meant by “better”? When should it be better and when worse?
Across the board, AIPW and TMLE behave similarly in terms of point estimates and SE estimates in large samples when they have access to the same learners.
Like AIPW, TMLE improves on the MSE of plug-in estimators at moderate sample sizes and allows for good coverage of Wald intervals based on IF SEs.
Fix \(n = 300\), do \(B\) reps from the same 3 DGPs, tables like:
| | plugin | AIPW | TMLE |
|---|---|---|---|
| MSE | | | |
| bias | | | |
| variance | | | |
| coverage | | | |
One table per DGP (or 3 sub-tables). Probably just the ATE; send the other estimand to the appendix.
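To make the table mechanical to fill, it helps to pin down the four metrics as code. A minimal base-R sketch (the name `summarize_metrics` and the toy inputs are illustrative, not part of the pipeline below):

```r
# Compute the four row metrics from B replications of one estimator on one DGP.
summarize_metrics <- function(est, se, truth) {
  c(
    MSE      = mean((est - truth)^2),
    bias     = mean(est) - truth,
    variance = var(est),
    coverage = mean(truth >= est - 1.96 * se & truth <= est + 1.96 * se)
  )
}

# Toy usage: three replications of an unbiased estimator with truth = 1
summarize_metrics(est = c(0.9, 1.1, 1.0), se = c(0.1, 0.1, 0.1), truth = 1)
```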
TMLE is better than AIPW in small samples when the outcome is bounded (?)
The evidence for this goal might be a subset of the evidence for the above. This just suggests that one of our DGPs should have a bounded outcome; we can then point to the relevant rows in the table above to support the claim.
For each bit of evidence you present, you should tell the reader why it is there. Tell them your goal! Then give them the evidence. Then tell them how the evidence supports the goal. Be prosaic and direct.
“In the first part I tell ’em what I am going to tell ’em; in the second part — well, I tell ’em; in the third part I tell ’em what I’ve told ’em.”
Example
The point of this application is to show how fast TMLE is on a real-world dataset of size blah blah, to demonstrate that it works with mixed data types and can be prespecified, and to show that it produces plausible results.
From Morris et al. (2019)
“Using simulation studies to evaluate statistical methods”
Statistics in Medicine, 38(11), 2074–2102.
Before presenting results, clearly describe:

- the DGPs
- the `ATE()` helper
- the `ESTIMATORS`
- the `summarize` step and your tables/figures

| Method | Speed | Notes |
|---|---|---|
| MARS (earth) | ⚡ Super fast | Works well with a single set of hyperparams. Ensure depth is high enough, enable pruning, allow interaction. |
| Random forests | 🏃 Fast | Works well with default hyperparams. Worse for smooth functions. |
| GBT (lightgbm/xgboost) | 🏃 Fast | Beats almost everything. Requires tuning over # trees, depth, learning rate. Use early stopping. |
| Elastic net | 🏃 Fast (ridge) | Loses to others with moderate nonlinearity. Good for baselines. Requires tuning regularization. |
| KRR (kernel ridge) | 🐌 Slow | Good for smooth functions, easy to analyze theoretically. Use only for \(n < 1000\). |
Being able to reason about your simulations and iterate on them is much more important than making them realistic.

Pay attention to:

- a generic `draw` method for your DGPs
- estimators that return `(est, est_se)`
- `furrr` for drop-in parallelization from `map` in R; parallelize across reps

library(tidyverse)  # tibble, purrr, dplyr
library(magrittr)   # for the %$% pipe

simple_dgp <- structure(
list(
rho = \(w1, w2) w1 + w2,
mu = \(a, w1, w2) w1 + w2 + a,
sigma = 1
),
class = "dgp"
)
complex_dgp <- structure(
list(
rho = \(w1, w2) w1 + w2 + w1 * abs(w2),
mu = \(a, w1, w2) w1 + w2 + abs(w2) + 0.5 * w1 * w2,
sigma = 1
),
class = "dgp"
)
draw <- function(x, ...) UseMethod("draw")
draw.dgp <- \(dgp, n) dgp %$% tibble(
W1 = runif(n, -1, 1),
W2 = runif(n, -1, 1),
A = rbinom(n, 1, plogis(rho(W1, W2))),
Y = mu(A, W1, W2) + rnorm(n, 0, sigma)
)
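One property of these DGPs worth noting in a write-up: because W1, W2 ∈ (−1, 1), the propensity score `plogis(W1 + W2)` of `simple_dgp` is trapped in (plogis(−2), plogis(2)) ≈ (0.12, 0.88), so overlap holds by construction. A quick base-R check (reimplementing `rho` inline so it runs standalone):

```r
# Propensity scores for simple_dgp stay bounded away from 0 and 1 (overlap).
w1 <- runif(1e5, -1, 1)
w2 <- runif(1e5, -1, 1)
p <- plogis(w1 + w2)
range(p)  # contained in (plogis(-2), plogis(2)), roughly (0.119, 0.881)
```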
draw_intervene <- function(x, ...) UseMethod("draw_intervene")
draw_intervene.dgp <- \(dgp, n, treatment) dgp %$% tibble(
W1 = runif(n, -1, 1),
W2 = runif(n, -1, 1),
A = treatment(W1, W2),
Y = mu(A, W1, W2) + rnorm(n, 0, sigma)
)

# A tibble: 3 × 4
W1 W2 A Y
<dbl> <dbl> <int> <dbl>
1 0.351 0.133 1 0.780
2 0.966 0.699 1 1.59
3 0.519 -0.621 1 0.846
# A tibble: 3 × 4
W1 W2 A Y
<dbl> <dbl> <int> <dbl>
1 0.439 0.0288 1 1.26
2 -0.984 -0.997 0 -1.22
3 -0.249 0.163 1 -1.31
# A tibble: 3 × 4
W1 W2 A Y
<dbl> <dbl> <dbl> <dbl>
1 0.335 0.866 1 1.77
2 -1.000 0.851 1 1.51
3 -0.583 0.468 1 1.21
ATE <- function(x, ...) UseMethod("ATE")
ATE.dgp <- \(dgp, n = 1e6) {
mu0 <- dgp |> draw_intervene(n, \(w1, w2) 0) %$% mean(Y)
mu1 <- dgp |> draw_intervene(n, \(w1, w2) 1) %$% mean(Y)
mu1 - mu0
}
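For `simple_dgp` the Monte Carlo helper can be checked against a closed form: `mu` is additive in `a` with coefficient 1, so the true ATE is exactly 1. A base-R sanity check that reimplements `mu` inline (so it runs without the tibble machinery):

```r
# simple_dgp's outcome regression, written out directly
mu <- \(a, w1, w2) w1 + w2 + a
w1 <- runif(1e5, -1, 1)
w2 <- runif(1e5, -1, 1)
# mu(1, .) - mu(0, .) is identically 1, so the Monte Carlo mean is exactly 1
mean(mu(1, w1, w2) - mu(0, w1, w2))
```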
R2 <- function(x, ...) UseMethod("R2")
R2.dgp <- \(dgp, n = 1e6) {
pop <- dgp |>
draw(n) |>
mutate(mu = dgp$mu(A, W1, W2))
mu_lin <- lm(mu ~ A + W1 + W2, data = pop)
pop <- pop |>
mutate(mu_lin = predict(mu_lin, pop))
list(
R2 = pop %$% { var(mu) / var(Y) },
R2_lin = pop %$% { var(mu_lin) / var(mu) }
)
}

So that it can be used like this:
library(zeallot)
tmle <- function(data, nuisances) {
c(pi, alpha, mu) %<-% nuisances
eps <- lm(Y - mu(A, W1, W2) ~ alpha(A, W1, W2) - 1, data) |> coef() |> unname()
data |>
mutate(
mu1 = mu(1, W1, W2) + eps * alpha(1, W1, W2),
mu0 = mu(0, W1, W2) + eps * alpha(0, W1, W2),
phi = alpha(A, W1, W2) * (Y - mu(A, W1, W2)) + (mu1 - mu0)
) %$%
list(
est = mean(mu1 - mu0),
est_se = sd(phi) / sqrt(nrow(data))
)
}

library(tidymodels)
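The `lm()` call in `tmle()` is the whole targeting step: regressing the residual on the clever covariate with no intercept picks `eps` so that, by the OLS normal equations, the alpha-weighted residuals sum to exactly zero, i.e. the empirical EIF equation for that term is solved. A toy base-R illustration (random stand-ins for the residual and for `alpha`, not the actual nuisances):

```r
set.seed(42)
alpha <- rnorm(100)              # stand-in for alpha(A, W1, W2)
res   <- rnorm(100, mean = 0.3)  # stand-in for Y - mu(A, W1, W2), deliberately off-center
eps <- coef(lm(res ~ alpha - 1))[[1]]
# After the fluctuation, the weighted residual equation holds exactly:
sum(alpha * (res - eps * alpha))  # 0 up to floating point
```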
fit_learners <- function(data) {
# Convert A to factor for classification
data <- data |> mutate(A = factor(A))
pi_obj <- mars(prod_degree = 3, num_terms = 50) |>
set_engine("earth") |>
set_mode("classification") |>
fit(A ~ W1 + W2, data)
mu_obj <- mars(prod_degree = 3, num_terms = 50) |>
set_engine("earth") |>
set_mode("regression") |>
fit(Y ~ A + W1 + W2, data)
pi <- \(w1, w2) predict(pi_obj, tibble(W1 = w1, W2 = w2), type = "prob")$.pred_1
alpha <- \(a, w1, w2) a / pi(w1, w2) - (1 - a) / (1 - pi(w1, w2))
mu <- \(a, w1, w2) predict(mu_obj, tibble(A = factor(a, levels = c(0, 1)), W1 = w1, W2 = w2))$.pred
list(pi = pi, alpha = alpha, mu = mu)
}

REPS <- 20
N <- 500
DGPS <- list(simple = simple_dgp, complex = complex_dgp)
ESTIMATORS <- list(dm = diff_means, tmle = tmle, aipw = aipw)
1:REPS |> map(\(rep) {
DGPS |> imap(\(dgp, dgp_name) {
nuisances <- dgp |> draw(N) |> fit_learners()
data <- dgp |> draw(N)
ESTIMATORS |> imap(\(estimator, estimator_name) {
res <- estimator(data, nuisances)
tibble(
rep = rep,
DGP = dgp_name,
estimator = estimator_name,
est = res$est,
se = res$est_se
)
}) |> bind_rows()
}) |> bind_rows()
}) |> bind_rows() -> results

Save the `results` object to file, possibly with a pointer to the commit hash, for replicability.

To parallelize with `furrr`:

library(furrr)
plan(multicore, workers = 10)
1:REPS |> furrr::future_map(\(rep) {
DGPS |> imap(\(dgp, dgp_name) {
nuisances <- dgp |> draw(N) |> fit_learners()
data <- dgp |> draw(N)
ESTIMATORS |> imap(\(estimator, estimator_name) {
res <- estimator(data, nuisances)
tibble(
rep = rep,
DGP = dgp_name,
estimator = estimator_name,
est = res$est,
se = res$est_se
)
}) |> bind_rows()
}) |> bind_rows()
}) |> bind_rows() -> results

While iterating, use REPS = 5 and possibly fewer DGPs and estimators. To parallelize, use `future_map` instead of `map` and set up a parallel plan with `plan()`.

true_ATEs <- DGPS |> map_dbl(ATE)
results |>
group_by(DGP, estimator) |>
summarize(
bias = mean(est - true_ATEs[DGP]),
true_se = sd(est),
mean_est_se = mean(se),
rmse = sqrt(mean((est - true_ATEs[DGP])^2)),
coverage = mean(
(est - 1.96 * se <= true_ATEs[DGP]) &
(true_ATEs[DGP] <= est + 1.96 * se)
)
)

# A tibble: 6 × 7
# Groups: DGP [2]
DGP estimator bias true_se mean_est_se rmse coverage
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 complex aipw -0.0137 0.0884 0.116 0.0873 1
2 complex dm 0.663 0.105 0.115 0.671 0
3 complex tmle -0.00608 0.0742 0.116 0.0726 1
4 simple aipw -0.0874 0.430 0.175 0.428 0.9
5 simple dm 0.594 0.130 0.112 0.608 0
6 simple tmle -0.00738 0.103 0.175 0.101 0.9
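The note above suggested saving the `results` object to file with a pointer to the commit hash. A hedged sketch (the helper name `save_run` is mine; it assumes you run from inside the git repo and falls back to `NA` otherwise):

```r
save_run <- function(results, path = "results.rds") {
  # Record the current commit so the saved results can be tied to the code state.
  commit <- tryCatch(
    system2("git", c("rev-parse", "HEAD"), stdout = TRUE, stderr = FALSE),
    warning = \(w) NA_character_,
    error = \(e) NA_character_
  )
  saveRDS(list(results = results, commit = commit), path)
  invisible(path)
}
# Usage: save_run(results)
```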
From here you can use tidyr, kable, and ggplot to make tables and figures.
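For example, a hedged sketch of getting from the long summary to the metrics-by-estimator layout proposed earlier (the `tribble` is dummy data standing in for the real summary):

```r
library(dplyr)
library(tidyr)

# Dummy stand-in for the grouped summary printed above
summary_tbl <- tibble::tribble(
  ~DGP,     ~estimator, ~bias,  ~rmse,
  "simple", "tmle",     -0.007, 0.101,
  "simple", "aipw",     -0.087, 0.428
)

# One row per (DGP, metric), one column per estimator
wide <- summary_tbl |>
  pivot_longer(c(bias, rmse), names_to = "metric") |>
  pivot_wider(names_from = estimator, values_from = value)
wide
```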
In your initial TMLE simulation, you find that the ML plug-in estimator outperforms TMLE and AIPW in terms of MSE. Debiasing reduces bias, but at the cost of higher variance.
Like AIPW, TMLE improves on the MSE of plug-in estimators at moderate sample sizes when overlap is reasonable and the initial estimates are biased, and it allows for good coverage of Wald intervals based on IF SEs.
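The bias-variance tradeoff of debiasing can be seen in a toy base-R simulation: adding a noisy, mean-zero-in-expectation correction term to a biased plug-in trades bias for variance (all numbers here are made up for illustration, not derived from any estimator):

```r
set.seed(1)
B <- 2000; n <- 200
sims <- replicate(B, {
  y <- rnorm(n, mean = 1)
  plugin <- mean(y) + 0.2                  # plug-in with a 0.2 smoothing bias
  correction <- -0.2 + rnorm(1, sd = 0.3)  # noisy estimate of the bias term
  c(plugin = plugin, debiased = plugin + correction)
})
rowMeans(sims) - 1   # bias: about 0.2 for the plug-in, about 0 after debiasing
apply(sims, 1, var)  # variance: the debiased estimator's is much larger
```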