A purely descriptive analysis is pretty useless because the estimate is devoid of any formal interpretation. By itself, it only tells us something about our data, but nothing about the real world.
Of course, people tend to interpret descriptive analyses all the time. For example, let's say the average daily temperature in San Francisco last year was 18°C (i.e. that was the estimate we computed in our descriptive analysis). You might be tempted to say that it's 18°C on a "usual" day in the city. But in what sense can you make that a rigorous, formal statement? What do you really mean by a "usual" day? What are you assuming to make that conclusion? How sure are you?
In order to answer these questions we need a language for describing phenomena in the real world that are complex enough to appear somewhat random. We use mathematical probability for this purpose. In the language of probability, we imagine that the real world (or the part of it that we're interested in) can be fully described by a probability distribution. One way to think of a probability distribution is as a "machine" that generates data. In fact, you'll often see distributions described as data-generating processes. Different distributions "generate" data with different properties. We don't know what's going on in the real world, so instead of trying to understand all of the gory complexity, we hypothesize that the data we observe is generated by some probability distribution.
Of course we don't know exactly what the distribution is, so we need to lay out a set of possibilities. Each of these is, in a sense, a possible "universe" that our data could have come from. We call this set of possibilities the statistical model. The definition of a statistical model explicitly specifies our assumptions about what we believe is possible and what is impossible.
We also use this formalism to clearly define the quantity that we're after in the analysis. What is it that we are trying to "estimate" in the first place? With a statistical model in place, we can construct an estimand: a mapping that assigns to each possible probability distribution the true value of the answer to our question. Thus if we knew what probability distribution generated our data, we could know the answer to our question without looking at any data at all. Being able to mathematically define a statistical estimand is exactly what we mean by having a clearly defined question.
Lastly, we can use the language of probability to enable inference, which is a formalized quantification of uncertainty. Having a probability distribution lets us imagine what would happen if we could repeat our entire experiment over and over again. Specifically, we could figure out how much the estimate varies due only to random chance and how far off from the truth we are on average.
All of this is the "imaginary world" that we have to create for ourselves if we want to give a meaningful interpretation to our analyses. Without a clear definition of the statistical model we're dealing with, we can't have a clear definition of our scientific question, i.e. the estimand. And without some notion of what kind of randomness we're dealing with in our data, we can't create any meaningful notion of uncertainty in our estimate of the estimand.
An inferential analysis therefore adds the following components to those already present in a descriptive analysis (data, estimator, estimate):
Statistical Model: a set of possible "worlds" from which our data may have arisen, one of which is our own reality
Estimand: the property of the real world that we are trying to determine
Putting all these components together lets us perform valid inference: saying something meaningful about the real world and assigning a notion of uncertainty to our assessment.
Statistical Model
In an inferential analysis, we hypothesize that our data is generated from some probability distribution $P$. Of course, we don't know what the probability distribution actually is. If we did, that would be like saying we already know everything there is to know about the world, in which case there's no point gathering data or doing science because we already know the answer to all our questions.
Since we don't know $P$, we have to consider a large number of possibilities that align with what we know to be true. We therefore build up a set of possible distributions, which we call the statistical model $\mathcal{M}$. The statistical model represents the "multiverse" of possibilities. Maybe in one of these worlds, San Francisco is a hot, humid place, and in another, it's a cold tundra. Or maybe we already know that it's not a tundra, but nothing else. The point is that we have a set of possibilities. This set certainly could be discrete, but in most cases it's taken to be continuously infinite to accommodate all the possible realities that we can't rule out.
Here we see how the statistical model $\mathcal{M}$, estimand $\psi$, and data are related.
We often talk about models that are "larger" or "smaller" than other models. By this we usually mean that one statistical model is a superset or subset of another, or that one includes a greater or lesser diversity of candidate distributions. If a model is "small", it necessarily imposes more assumptions because we are a-priori ruling out a larger number of distributions. For example, a model where we assume a variable to have a normal distribution is necessarily smaller (more restrictive) than one where, all else equal, we do not put any assumption on the distribution of that variable. On the other hand, "large" models require few assumptions because very little is ruled out a-priori: we instead rely more on the data to tell us which world we are most likely to be in.
Measure-Theoretic Probability and Notation
Here we take $P$ to be the probability measure of our observed data. It maps subsets of $\Omega$ (the space our data take values in) to numbers in $[0, 1]$ and obeys certain rules like additivity over disjoint sets.
The (Lebesgue) integral of a function $f$ over a set $\Omega$ with respect to a probability measure $P$ is written as $\int_\Omega f \, dP$, which is approximately the same as saying "break up the range of $f$ into small intervals $[y_i, y_{i+1})$ and compute the finite sum $\sum_i y_i \, P\big(f^{-1}([y_i, y_{i+1}))\big)$". Here $f^{-1}([y_i, y_{i+1}))$ means "the set of points in $\Omega$ that end up going to the interval $[y_i, y_{i+1})$ under $f$". The role of $P$ here is to assign some notion of "length" or "weight" (or "probability"!) to the sets $f^{-1}([y_i, y_{i+1}))$.
If we are integrating with respect to a measure $P$ that has some density $p$ (with respect to the "Lebesgue measure") then we can write $\int_\Omega f \, dP = \int_\Omega f(z) \, p(z) \, dz$. This latter form should look familiar from your introductory probability classes.
Obviously the value of the integral depends on what measure we integrate with respect to. We also call the integral of a function w.r.t. a probability measure the expectation of that function, denoted $E_P[f]$ or even just $Pf$. This latter is technically an abuse of notation since we're using $P$ both for the measure and as an operator that maps functions to numbers. However, it a) is very common, b) minimizes visual clutter, and c) makes it clear what underlying measure we're integrating with. If we use densities, we can write the expected value of some random variable $Z$ with distribution $P$ as $E_P[Z] = \int z \, p(z) \, dz$. Again this form should look familiar to you.
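To make this notation concrete, here's a minimal Python sketch that computes the same expectation three ways. The choices of $P$ (a standard normal) and $f(z) = z^2$ are arbitrary illustrations; any distribution and function would do:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# P is a standard normal and f(z) = z^2, so E_P[f] = Var(Z) = 1 exactly.
f = lambda z: z ** 2

# 1) Monte Carlo: approximate the integral with a sample average.
z = rng.normal(size=1_000_000)
monte_carlo = f(z).mean()

# 2) Density form: a Riemann sum approximating the integral of f(z) p(z) dz.
grid = np.linspace(-10, 10, 100_001)
dz = grid[1] - grid[0]
riemann = np.sum(f(grid) * stats.norm.pdf(grid)) * dz

# 3) Lebesgue form: partition the *range* of f into intervals [y_i, y_{i+1})
#    and weight each level y_i by P(f^{-1}([y_i, y_{i+1}))).
levels = np.linspace(0, 100, 10_001)
lebesgue = 0.0
for lo, hi in zip(levels[:-1], levels[1:]):
    # f^{-1}([lo, hi)) = {z : lo <= z^2 < hi} = {z : sqrt(lo) <= |z| < sqrt(hi)}
    prob = 2 * (stats.norm.cdf(np.sqrt(hi)) - stats.norm.cdf(np.sqrt(lo)))
    lebesgue += lo * prob

print(monte_carlo, riemann, lebesgue)  # all approximately 1
```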
Parametric, semiparametric, and nonparametric models
A parametric model is one where each probability distribution can be uniquely described by a finite-dimensional set of parameters. For example, if we had a dataset $Z_1, \dots, Z_n$ where each $Z_i$ takes scalar values, we might suppose that each observation is an identically distributed normal random variable: $Z_i \sim N(\mu, \sigma^2)$. That's an assumption, of course, and it might not be realistic or good because it rules out a lot of other possible distributions. Nonetheless, it's one way we could define a statistical model.
Since $Z$ is normal in this model, we can describe its full distribution with just a mean $\mu$ and a standard deviation $\sigma$. Knowing only these two numbers tells us everything about the distribution. We say that this model is parametrized by these two numbers. An example of a slightly more complicated parametric model is a normal-linear model where each measurement in the data takes the form of a tuple $Z_i = (X_i, Y_i)$ and we assume $Y_i = \beta X_i + \epsilon_i$ with $\epsilon_i \sim N(\mu, \sigma^2)$. Now we have three parameters: the slope $\beta$ as well as the mean $\mu$ and the standard deviation $\sigma$ of the normally-distributed random noise. Again, this implies a strict set of assumptions which may not be true: namely that the true relationship between $X$ and the conditional mean of $Y$ is linear and that the conditional distribution of $Y$ given $X$ is normal with a standard deviation that is fixed and doesn't depend on $X$.
In general, having a parametric model means that we can write the density of the data as a function of some finite-dimensional vector of parameters (e.g. $\theta = [\beta, \mu, \sigma]$). We might write the density like $p_\theta(z)$. Each value of $\theta$ in the domain $\Theta$ implies one density in $\mathcal{M}$ and each density in the model has one unique value of the parameters that characterizes it.
Parametric models are popular because they are easy to work with: finding the distribution most likely to have generated the data reduces to a finite-dimensional optimization problem in the model parameters. However, they are usually also completely artificial! Do you really believe that the "true" relationship between the dose you take of a drug and the toxicity is exactly a straight line? Any parametric model with a few parameters is necessarily a very "small" model that rules out the vast, vast majority of possibilities. Depending on the situation, basing statistical inference on a parametric model can give you completely meaningless inference: biased estimates and invalid confidence intervals and p-values. We'll have more to say about that later.
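To see how fitting a parametric model reduces to a finite-dimensional optimization, here's a short sketch using the normal-linear model from above. With normal errors, maximizing the likelihood over $(\mu, \beta)$ is equivalent to least squares with an intercept; the data-generating values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from the normal-linear model: Y = beta*X + eps, eps ~ N(mu, sigma^2).
beta_true, mu_true, sigma_true = 2.0, 0.5, 1.0
n = 10_000
x = rng.uniform(-1, 1, size=n)
y = beta_true * x + rng.normal(mu_true, sigma_true, size=n)

# Maximum likelihood here is a 3-dimensional problem: for normal errors the
# MLE of (mu, beta) is ordinary least squares with an intercept, and the
# MLE of sigma is the standard deviation of the residuals.
design = np.column_stack([np.ones(n), x])        # columns: intercept, slope
mu_hat, beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
sigma_hat = np.std(y - (mu_hat + beta_hat * x))

print(beta_hat, mu_hat, sigma_hat)  # close to 2.0, 0.5, 1.0
```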
Semiparametric and nonparametric are both terms used to describe models that cannot be boiled down to a finite-dimensional set of parameters.
Empirical distributions
We use the symbol $P_n$ to denote an empirical distribution. This measure assigns mass $1/n$ at each of the points taken by the random variables $Z_1, \dots, Z_n$. Mathematically, if we let $\delta_{Z_i}(S)$ be the measure that returns 1 if $Z_i \in S$ and 0 otherwise, then we can define $P_n(S) = \frac{1}{n} \sum_{i=1}^n \delta_{Z_i}(S)$. This is just the fraction of samples that happen to have fallen into the set $S$.
What you have to remember about the empirical measure, though, is that it's random. In other words, $P_n$ evaluated at any fixed set $S$ is itself a random variable because the data themselves are taken to be random variables (with their own underlying distribution $P$) in an inferential analysis.
$P_n$ evaluated at some set $S$ is a random variable because the data that define it are random. $P$ evaluated on some set $S$ is a fixed number. As $n$ increases, the empirical distribution $P_n(S)$ tends to concentrate around the value that $P(S)$ takes.
Or you can think of $P_n$ as a random measure that takes values in the space of probability measures. In this case $P_n$ has a distribution over the model space itself. Once you fix a particular sample drawn from $P$, then the empirical distribution takes a single point value in that space. The empirical measure $P_n$ (for fixed $n$) is also one element of a sequence of random measures $P_1, P_2, \dots$ that in some sense gets closer and closer to $P$.
$P_n$ as a "density" over the model space with a typical realization shown. Note that this picture is just one possible example. In general, $P_n$ may take values that are actually outside of $\mathcal{M}$!
Any way you look at it, this is an example of a random "thing" that takes values in a set that isn't $\mathbb{R}$. Note that the empirical measure depends on the underlying distribution $P$. So if we had a different underlying distribution, say $Q$, we might denote an empirical measure based on $n$ IID samples from $Q$ as $Q_n$.
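Here's a small simulation that makes the randomness of $P_n$ tangible. We fix $P$ to be a standard normal (an arbitrary choice) and the set $S = [0, 1)$, and look at several independent realizations of $P_n(S)$ at each sample size:

```python
import numpy as np

rng = np.random.default_rng(2)

# P is a standard normal; S is the interval [0, 1).
# The fixed truth is P(S) = Phi(1) - Phi(0) ≈ 0.3413.
def P_n_of_S(n):
    """One realization of the empirical measure evaluated at S = [0, 1)."""
    z = rng.normal(size=n)
    return np.mean((z >= 0) & (z < 1))

# P_n(S) is a random variable: repeat the experiment and it changes.
for n in [10, 100, 10_000]:
    draws = [P_n_of_S(n) for _ in range(5)]
    print(n, np.round(draws, 3))  # spread shrinks toward ~0.341 as n grows
```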
Independent and Identically Distributed
In this book, we will only consider statistical models where each observation $Z_i$ is assumed to be independent and identically distributed (IID). In other words, we assume $P(Z_1, \dots, Z_n) = P(Z_1) \times \dots \times P(Z_n)$ with each $Z_i \sim P$. Analytically, this is extremely helpful because we only need to describe the distribution of a single generic observation in order to describe the distribution of our whole dataset. Therefore when we talk about the statistical model, we're talking about the space of distributions of a single observation $Z$ with the implication that the full data distribution is the $n$-fold product $P \times \dots \times P$.
This is an important assumption and limitation on the statistical model. A model where the data are IID is less generic (i.e. "smaller") than a model where we don't make that assumption. The reason we limit our focus is that the IID assumption is easily justifiable in many important problems. For example, in an observational study on cancer it's reasonable to presume that each subject is a sample from some eligible population (i.e. subjects are identically distributed) and that what happens to one subject doesn't affect what happens to another.
Nonetheless, there are many interesting and important problems where the IID assumption is untenable. For example, any large study of a highly infectious disease must take dependencies between subjects into account because any one person getting sick increases the chance that others get the disease. There is a large literature that deals with inference where observations cannot be considered independent. One useful trick is to clump observations into clusters which can themselves be treated as independent (e.g. cluster randomized trials), but this is not always possible. In general, however, it is important to understand the well-developed techniques for performing inference with IID data before you try to tackle more challenging problems that will require you to extend and build on these strategies.
Structural Equation Model
Each "point" in a statistical model is a probability distribution . As we move from one distribution to another, say , what exactly changes? How can we break up the model into pieces that allow us to reason more clearly about specific relationships?
A (nonparametric) structural equation model presents a solution to that problem. Let our observation be a vector of random variables . Recall that we can always factor an arbitrary joint distribution into a product of conditional distributions . It therefore suffices to describe each of the conditional distributions and the marginal distribution of (we can also, of course, reorder the indexing in any way that is convenient to minimize notational and analytical burden). That's the entire idea of a structural equation model. We can write any joint distribution of as:
Where are exogenous variables (basically a fancy word for "random" or "external" noise), none of which depend in any way on the observables , and are fixed (nonrandom) functions. We have therefore described the joint distribution of with the joint distribution of and the functions . Once these are specified, everything else follows. This way of specifying a distribution is what we call a structural equation model. You might also think of this as "parametrizing" the distribution with the random variables and the functions . The space of possible and define the statistical model, and any one choice of these identifies one particular data-generating process within the model.
The structural equation model is extremely generic because we can describe any distribution in this way. If we want to impose additional assumptions, it's easy to do so. For example, if we want to impose conditional independence between and given , we could specify , and . As another example, if we want to specify that has conditional mean given by and has normal conditional distributions with fixed variance , we can impose and . The possibilities are literally endless.
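As an illustration, here's a sketch that simulates data from one particular structural equation model satisfying both constraints above: $Z_2$ is normal-linear in $Z_1$, and $Z_3$ depends on $Z_1$ only through $Z_2$. The specific choices of $f_1$, $f_3$, and the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta, sigma = 1.5, 0.7

# Exogenous noise: mutually independent, external to the system.
u1 = rng.uniform(size=n)
u2 = rng.normal(0.0, sigma, size=n)   # U_2 ~ N(0, sigma^2)
u3 = rng.normal(size=n)

# Structural equations, evaluated in order:
z1 = u1                                # Z1 = f1(U1); here f1 is the identity
z2 = beta * z1 + u2                    # Z2 = beta*Z1 + U2: normal-linear
z3 = np.sin(z2) + u3                   # Z3 = f3(Z2, U3): Z1 is dropped, so
                                       # Z3 is independent of Z1 given Z2

# Z1 and Z3 are still marginally dependent (through Z2):
print(np.corrcoef(z1, z3)[0, 1])
```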
It can also be useful to draw a picture that specifies the conditional independence relations present in any given structural equation model. This kind of picture is often called a directed acyclic graph (DAG) or graphical model. To draw a DAG for a model, you first draw your random variables as nodes. For any given variable, you then look at its corresponding structural equation and draw an arrow into that variable's node from every variable that appears on the right-hand side of the equation. Alternatively, you can have a DAG for a model and then translate it to a set of structural equations by following the reverse process. DAGs are necessarily a bit less expressive than the full algebraic structural equations, but there are many properties of complex joint distributions that can be more easily deduced by looking at a DAG than by reading through equations.
An example DAG for a model with three observed variables.
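Reading the arrows off the structural equations is mechanical enough to put in a few lines of code. Here's a small sketch using the three-variable example from above; the dictionary encoding is just one convenient representation:

```python
# Each structural equation is summarized by the observed variables on its
# right-hand side; the exogenous U's are left implicit, as is conventional.
parents = {
    "Z1": [],            # Z1 = f1(U1)
    "Z2": ["Z1"],        # Z2 = f2(Z1, U2)
    "Z3": ["Z2"],        # Z3 = f3(Z2, U3); Z1 absent, so no arrow Z1 -> Z3
}

# An arrow goes into each variable from every parent in its equation.
edges = [(p, child) for child, ps in parents.items() for p in ps]
print(edges)  # [('Z1', 'Z2'), ('Z2', 'Z3')]
```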
Example: Observational Study
To fix ideas we'll write up a structural equation model for a common use case: modeling an observational study. In our formal statistical model of an observational study, we imagine that for each subject we observe a vector of pre-treatment covariates $X$ (e.g. their age, sex, history of disease), an indicator $A$ of whether they received an active treatment ($A = 1$) or not ($A = 0$), and a scalar $Y$ that represents their outcome some time later (e.g. survival, score on a test, disease state). We'll factor the joint distribution as $P(X, A, Y) = P(Y \mid A, X) P(A \mid X) P(X)$ since this reflects the intuitive time ordering of the variables: the covariates are measured, then some treatment is given, and then finally the outcome is observed. The treatment that is received may depend stochastically on the covariates (e.g. sicker people may be more likely to receive treatment) and the outcome almost certainly depends stochastically on both the covariates and the treatment. We could also have ordered the factorization differently, but this factorization reflects our understanding of the world and gives a more intuitive meaning to each factor. The corresponding structural equations are:

$$
\begin{aligned}
X &= U_X \\
A &= f_A(X, U_A) \\
Y &= f_Y(X, A, U_Y)
\end{aligned}
$$

Notice that the only restriction we've placed on this model besides enumerating the variables is that the function $f_A$ return either a 0 or a 1 (enforcing the fact that the treatment is binary, which we know is the case). We've omitted $f_X$ since we can always write $X = U_X$ for an arbitrary exogenous variable $U_X$, and so this is not an assumption. Thus for all intents and purposes we have not assumed anything that we don't already know is true. We don't know that the relationship between $X$ and $Y$ is linear, or that the conditional distribution of $Y$ is normal, etc. If we don't know those things to be true, why would we impose them on our statistical model?
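Here's one hypothetical data-generating process that lives inside this observational-study model. The model itself only guarantees that $f_A$ is binary; every other functional form below is an arbitrary choice made for the sake of simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

x = rng.normal(size=n)                      # X = U_X (say, a severity score)
p_treat = 1 / (1 + np.exp(-x))              # treatment probability rises with X
a = rng.binomial(1, p_treat)                # A = f_A(X, U_A), binary as required
y = x + 2 * a + rng.normal(size=n)          # Y = f_Y(X, A, U_Y)

# Treatment depends on covariates, so naive group means are confounded:
print(y[a == 1].mean() - y[a == 0].mean())  # noticeably above 2, biased by X
```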
Example: Simple Randomized Trial
Another example we can give is a simple randomized trial. A simple randomized trial is precisely the same as an observational study, except that the experimenter has precise control over who gets the treatment and who doesn't. In other words, we know what $f_A$ is. In the simplest case, the experimenter assigns treatment to each subject on the basis of a coin flip. We can therefore write:

$$
\begin{aligned}
X &= U_X \\
A &= U_A, \quad U_A \sim \text{Bernoulli}(1/2) \\
Y &= f_Y(X, A, U_Y)
\end{aligned}
$$

Notice that we've imposed the assumption that we know exactly what $U_A$ is (a coin flip) and what $f_A$ is (the identity). Controlling the distribution of $A$ in this way eliminates all of $A$'s incoming arrows. It therefore breaks all dependence between $X$ and $A$: in a simple randomized trial the covariates have no role in what treatment someone gets. On the other hand, we still have not made any unfounded assumptions about $f_Y$ because we don't know anything precise about how the covariates and treatment relate to the outcome. This is easy to see in the DAG, where it's now impossible to find a path from a common ancestor of both $A$ and $Y$ that doesn't pass through $A$ (because $A$ has no ancestors!).
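And here's the same hypothetical world as in the observational-study sketch, with only the treatment equation changed to a coin flip. Note that $f_Y$, which we never claimed to know, is untouched:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

x = rng.normal(size=n)                      # X = U_X, as before
a = rng.binomial(1, 0.5, size=n)            # A = U_A ~ Bernoulli(1/2): coin flip
y = x + 2 * a + rng.normal(size=n)          # f_Y unchanged from before

print(np.corrcoef(x, a)[0, 1])              # ≈ 0: randomization breaks X-A dependence
print(y[a == 1].mean() - y[a == 0].mean())  # ≈ 2: no more confounding by X
```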
Estimand
Now that we've represented the space of possible worlds with a space of probability distributions, we have to represent our scientific question in the same terms. For example, let's say we're interested in what the "usual" temperature is on a day in San Francisco. A sensible way to formalize that would be to ask what the expected value is of a variable $Z$ (representing temperature) with some given distribution $P$: in other words, $E_P[Z]$. Another question of interest might be: what is the average outcome (e.g. disease-free survival time) for a group of patients who were all exposed to a particular drug? In this case, we might represent each subject's outcome as a variable $Y$ and their treatment status as $A$ (0 corresponding to no drug, 1 corresponding to drug) and the quantity of interest might be the conditional expectation $E_P[Y \mid A = 1]$.
We refer to the quantity of interest as the statistical estimand or statistical target parameter. Formally, an estimand is a mapping (function) $\psi: \mathcal{M} \to \mathbb{R}$ from the statistical model to the domain of the parameter (usually the real numbers). In other words, each distribution in the model has its own (not necessarily unique) value of the estimand. This formalizes the fact that the answer to our scientific question is different in different "universes". We use the notation $\psi(P)$ to denote the value of the estimand if the true data-generating distribution is $P$. For example, if we had data $(A_i, Y_i)$ and the estimand were the conditional mean of $Y$ given $A = 1$, then we would have $\psi(P) = E_P[Y \mid A = 1]$. Notice that the argument $P$ is what we're taking the expectation with respect to. We can write the expectation more explicitly either in the measure-theoretic notation $\int y \, dP(y \mid A = 1)$ or, if we have that the conditional density of $Y$ under $P$ is $p(y \mid a)$, we'd write $\int y \, p(y \mid 1) \, dy$ (as opposed to, e.g. $\int y \, q(y \mid 1) \, dy$ for a different distribution $Q$). Either way, the point is to demonstrate why $P$ is itself the argument here: the estimand maps each distribution to whatever the "true" answer to our question would be if we were in the universe where our data were generated according to that distribution. That's why, if we knew $P$, we wouldn't need any data at all to answer our question.
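The following sketch makes the "mapping from distributions to numbers" idea concrete. Each "universe" is represented by a sampler we can draw from at will, and $\psi(P) = E_P[Y \mid A = 1]$ is evaluated by Monte Carlo, to arbitrary precision, without any observed data. The two distributions are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

def psi(distribution, n=1_000_000):
    """Monte Carlo evaluation of the estimand psi(P) = E_P[Y | A = 1].

    `distribution` stands in for a known P: it returns draws of (A, Y).
    No observed data is involved: if we know P, we know psi(P)."""
    a, y = distribution(n)
    return y[a == 1].mean()

# Two hypothetical "universes" our data could have come from.
def P(n):
    a = rng.binomial(1, 0.5, size=n)
    return a, 1.0 + 2.0 * a + rng.normal(size=n)

def Q(n):
    a = rng.binomial(1, 0.5, size=n)
    return a, 1.0 + 0.5 * a + rng.normal(size=n)

print(psi(P), psi(Q))  # ≈ 3.0 and ≈ 1.5: same question, different answers
```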
“Estimand” vs “target parameter”?
These two terms are completely synonymous in the context of modern causal inference.
In traditional (parametric) statistics, the statistical distributions could always be described with a finite number of parameters. These parameters (or simple functions of them) were always the objects of interest, i.e. the estimands. As statistics started moving into nonparametrics, the term “parameter” stuck around but acquired a new, generalized meaning as a generic attribute of a distribution. That was then formalized into the definition we have now, which is that a parameter is a mapping from the model space into, say, the real numbers. The point is that parameters are now defined in a way that divorces them from models, whereas it used to be that parameters really only made sense within their own models.
We think this shift is fundamental enough to warrant the use of a new term (“estimand”), especially since most people still carry around the more intuitive but limited perspective they learned in their first statistics class. Moreover, “estimand” makes it clear that this is the aspect of the distribution that you actually care about. “Parameter” is agnostic to the analyst, which is why you’ll see “target parameter” used to impose the required normativity.
I (Alejandro) like “estimand” because it’s shorter and makes for a cleaner mental break with the world of purely parametric statistics, but you’ll see “parameter” or “target parameter” used more frequently in the literature. Throughout this book we’ll use these terms interchangeably so that you get used to seeing both.
The term "estimand" is used in the literature to refer both to this mapping and sometimes also to the actual value that the mapping obtains at the truth (we also sometimes abuse notation and write to mean the value and not the function when it's obvious what is intended). If you like, you can think of the function as the "estimandor" and the value as the "estimand" (in analogy to an "estimator" and "estimate"). Just note that these aren't terms you'll find anywhere else. Either way, you should notice the parallel between an estimator and estimand(or): both are known functions that map distributions into real numbers (that's part of why it's nice to be able to write the data as the empirical distribution - we get a nice symmetry). Then there's also the parallel between estimate and estimand: both are real numbers, the former being a visible guess of the latter, which is unknown.
Inference
With these pieces in place we can now discuss inference, which is the process of assigning a formal notion of uncertainty to our estimate.
Imagine that we knew, with certainty, that our data were generated by taking $n$ random draws from a particular probability distribution $P$. Since the estimator $\hat\psi$ is a deterministic function of the data, we can directly compute the distribution of the estimate $\hat\psi(P_n)$. It's important to understand that the estimate is itself a random variable, since the data are themselves random. The distribution of $\hat\psi(P_n)$ is called the sampling distribution of the estimate (or of the estimator). The sampling distribution is what we would get if we repeated our entire experiment under identical conditions (i.e. sampling $n$ observations from $P$) an infinite number of times and made a histogram of all the estimates we got each time. In reality, we only ever see a single realization of our experiment and a single realized estimate. We therefore need to use probability to imagine what would have happened if we repeated the experiment many, many times.
We never know $P$, so the exact sampling distribution of our estimator is generally unknown. Most of the challenge of statistics is figuring out how to say something about the sampling distribution of an estimate despite not knowing everything about $P$. Undoubtedly the most useful tool we have for this task is the central limit theorem, which says that the distribution of an average of $n$ IID variables starts looking more and more like a normal (specifically $N(E[Z], \mathrm{Var}(Z)/n)$) as the number of observations gets large. If our estimator takes the form of a sample average of some variable $Z$, or can be shown to behave like a sample average as the number of observations grows, then we can apply the central limit theorem to understand its sampling distribution. All we have to do is estimate $E[Z]$ and $\mathrm{Var}(Z)$ consistently.
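A quick simulation shows the central limit theorem doing exactly what's claimed. We take the estimator to be a sample mean of exponential (hence decidedly non-normal) variables, replay the whole experiment many times, and check the resulting sampling distribution against $N(E[Z], \mathrm{Var}(Z)/n)$; the distribution and sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)

# Z ~ Exponential(1), so E[Z] = 1 and Var(Z) = 1. The estimator is the
# sample mean of n draws; each row below is one full "experiment".
n, n_experiments = 500, 20_000
estimates = rng.exponential(1.0, size=(n_experiments, n)).mean(axis=1)

# CLT: the sampling distribution should be close to N(1, 1/n).
print(estimates.mean())   # ≈ 1.0
print(estimates.std())    # ≈ 1/sqrt(500) ≈ 0.045
print(np.mean(np.abs(estimates - 1.0) <= 1.96 / np.sqrt(n)))  # ≈ 0.95
```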
Knowing the sampling distribution of our estimate is the key to quantifying the uncertainty we have in our estimate. While we could calculate any property of the sampling distribution, the two things we're usually interested in are the bias and variance of our estimate.
Bias is the systematic error of our estimator: $E[\hat\psi(P_n)] - \psi(P)$. This is the amount by which the center of mass of our sampling distribution for the estimate is shifted away from the true estimand $\psi(P)$. We generally look for estimators that are unbiased, or at a minimum that become unbiased as the number of observations increases (are asymptotically unbiased or consistent). The variance of our estimator is nothing other than $E\big[(\hat\psi(P_n) - E[\hat\psi(P_n)])^2\big]$. It is a measure of the "spread" of the estimate, or how much we'd expect our answer to vary just by random chance. Having an estimate of the sampling variance is key to constructing confidence intervals and p-values, which are the usual ways to quantify uncertainty.
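Continuing the simulation above, we can approximate the bias and variance of the sampling distribution directly, and check that the usual normal-approximation confidence interval behaves as advertised. The exponential distribution and sample size are again arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(8)

n, n_experiments = 200, 10_000
truth = 1.0                                   # the estimand: E[Z] for Z ~ Exp(1)

data = rng.exponential(truth, size=(n_experiments, n))
estimates = data.mean(axis=1)

# Bias and variance of the estimator, approximated across repetitions.
print("bias:    ", estimates.mean() - truth)  # ≈ 0: the sample mean is unbiased
print("variance:", estimates.var())           # ≈ Var(Z)/n = 1/200 = 0.005

# Each experiment builds a 95% CI from its own variance estimate; the
# intervals should cover the truth in roughly 95% of repetitions.
se_hat = data.std(axis=1, ddof=1) / np.sqrt(n)
covered = np.abs(estimates - truth) <= 1.96 * se_hat
print("coverage:", covered.mean())            # ≈ 0.95
```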
One of the most important things to understand about inference is that the properties of any given estimator depend on the statistical model the estimator is used under. Under certain assumptions, an estimator might be shown to be unbiased and attain a particular sampling variance. But if those assumptions are violated, everything might be off the table. It is therefore of great interest to construct estimators that have nice properties under very general conditions so that our inference is not resting on unrealistic assumptions. In the next chapter we'll take you through a detailed example and you'll see what we mean.
Summary
We can sum all of this up in one picture which neatly captures the pieces of an inferential analysis. The part inside the red rectangle is just a descriptive analysis. It's all you ever get to see in real life. But if you want to give your estimate a meaningful interpretation and measure of uncertainty, you need an inferential analysis. Inference requires you to imagine a space of possible distributions that could have generated your data (the model) and to identify a transformation that summarizes some characteristic of the distribution that you're interested in (the estimand). Finally, we derive how our estimate would vary across an infinite number of hypothetical repetitions of our whole experiment, drawing new data each time from the true distribution. From this we can establish, for example, that our estimate is right on average (has little or no bias) and is unlikely to vary more than a certain amount by random chance (has a particular variance). That's inference.