1.2 Inferential Analysis
A purely descriptive analysis is pretty useless because the estimate is devoid of any formal interpretation. By itself, it only tells us something about our data, but nothing about the real world.
Of course, people tend to interpret descriptive analyses all the time. For example, let's say the average daily temperature in San Francisco last year was 18°C (i.e. that was the estimate we computed in our descriptive analysis). You might be tempted to say that it's 18°C on a "usual" day in the city. But in what sense can you make that a rigorous, formal statement? What do you really mean by a "usual" day? What are you assuming to make that conclusion? How sure are you?
In order to answer these questions we need a language to describe phenomena in the real world which are complex enough that they appear somewhat random. We use mathematical probability for this purpose. In the language of probability, we imagine that the real world (or the part of it that we're interested in) can be fully described by a probability distribution. One way to think of a probability distribution is as a "machine" that generates data. In fact, you'll often see distributions described as data-generating processes. Different distributions "generate" data with different properties. We don't know what's going on in the real world, so instead of trying to understand all of the gory complexity, we instead hypothesize that the data we observe is generated by some probability distribution.
Of course we don't know exactly what the distribution is, so we need to lay out a set of possibilities. Each of these is, in a sense, a possible "universe" that our data could have come from. We call this set of possibilities the statistical model. The definition of a statistical model explicitly specifies our assumptions about what we believe is possible and what is impossible.
We also use this formalism to clearly define the quantity that we're after in the analysis. What is it that we are trying to "estimate" in the first place? With a statistical model in place, we can construct an estimand: a mapping that assigns to each possible probability distribution the true value of the answer to our question. Thus if we knew what probability distribution generated our data, we could know the answer to our question without looking at any data at all. Being able to mathematically define a statistical estimand is exactly what we mean by having a clearly defined question.
Lastly, we can use the language of probability to enable inference, which is a formalized quantification of uncertainty. Having a probability distribution lets us imagine what would happen if we could repeat our entire experiment over and over again. Specifically, we could figure out how the estimate varies due only to random chance and how far off from the truth we are on average.
All of this is the "imaginary world" that we have to create for ourselves if we want to give a meaningful interpretation to our analyses. Without a clear definition of the statistical model we're dealing with, we can't have a clear definition of our scientific question, i.e. the estimand. And without some notion of what kind of randomness we're dealing with in our data, we can't create any meaningful notion of uncertainty in our estimate of the estimand.
An inferential analysis therefore adds the following components in addition to those present in the descriptive analysis (data, estimator, estimate):
- Statistical Model: a set of possible "worlds" from which our data may have arisen, one of which is our own reality
- Estimand: the property of the real world that we are trying to determine
Putting all these components together lets us perform valid inference: saying something meaningful about the real world and assigning a notion of uncertainty to our assessment.
Statistical Model
In an inferential analysis, we hypothesize that our data is generated from some probability distribution $P$. Of course, we don't know what the probability distribution $P$ actually is. If we did, that would be like saying we already know everything there is to know about the world, in which case there's no point gathering data or doing science because we already know the answer to all our questions.
Since we don't know $P$, we have to consider a large number of possibilities that align with what we know to be true. We therefore build up a set of possible distributions $\{P_1, P_2, P_3, \dots\}$, which we call the statistical model $\mathcal M$. The statistical model represents the "multiverse" of possibilities. Maybe in one of these worlds, San Francisco is a hot, humid place, and in another, it's a cold tundra. Or maybe we already know that it's not a tundra, but nothing else. The point is that we have a set of possibilities. This set certainly could be discrete, but in most cases it's taken to be continuously infinite to accommodate all the possible realities that we can't rule out.
Here we see how the statistical model $\mathcal M$, estimand $\psi$, and data $\mathbb P_n$ are related.
We often talk about models that are "larger" or "smaller" than other models. By this we usually mean that one statistical model is a superset or subset of another, or that one includes a greater or lesser diversity of candidate distributions. If a model is "small", it necessarily imposes more assumptions because we are ruling out a larger number of distributions a priori. For example, a model where we assume a variable to have a normal distribution is necessarily smaller (more restrictive) than one where, all else equal, we do not put any assumption on the distribution of that variable. On the other hand, "large" models require few assumptions because very little is ruled out a priori: we instead rely more on the data to tell us which world we are most likely to be in.
Measure-Theoretic Probability and Notation
Here we take $P$ to be the probability measure of our observed data. It maps subsets of $\mathcal Z$ (the space $Z$ takes values in) to numbers in $[0,1]$ and obeys certain rules like additivity over disjoint sets.
The (Lebesgue) integral of a function over a set $A$ with respect to a probability measure is written as $\int_A f \, dP$, which is approximately the same as saying "break $\mathbb R$ up into intervals $[a_i, b_i]$ and compute the finite sum $\sum_i f(a_i) P(f^{-1}[a_i, b_i])$". Here $f^{-1}[a_i, b_i]$ means "the set of points in $\mathcal Z$ that end up going to the interval $[a_i, b_i]$ under $f$". The role of $P$ here is to assign some notion of "length" or "weight" (or "probability"!) to the sets $f^{-1}[a_i, b_i]$.
If we are integrating with respect to a measure $P$ that has some density $p$ (with respect to the "Lebesgue measure") then we can write $\int_A f \, dP = \int_A f(z) p(z) \, dz$. This latter form should look familiar from your introductory probability classes.
Obviously the value of the integral depends on what measure we integrate with respect to. We also call the integral of a function w.r.t. a probability measure the expectation of that function, denoted $E[f]$ or even just $Pf$. The latter is technically an abuse of notation since we're using $P$ both for the measure and as an operator that maps functions to numbers. However, it's (a) very common, (b) minimizes visual clutter, and (c) makes it clear which underlying measure we're integrating with respect to. If we use densities, we can write the expected value of some random variable $Z$ with distribution $P$ as $PZ = \int z p(z) \, dz$. Again, this form should look familiar to you.
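To make the two notations concrete, here's a minimal numerical sketch (with an arbitrary normal distribution standing in for $P$) checking that the density-weighted integral $\int z \, p(z) \, dz$ and a Monte Carlo average of draws from $P$ agree:

```python
import numpy as np
from scipy.stats import norm

# Expectation of Z ~ N(1, 2^2) computed two ways.
mu, sigma = 1.0, 2.0

# 1) Riemann sum approximation of the integral  E[Z] = ∫ z p(z) dz
z = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 100_001)
p = norm.pdf(z, loc=mu, scale=sigma)   # density of P w.r.t. Lebesgue measure
integral = np.sum(z * p) * (z[1] - z[0])

# 2) Monte Carlo: draw from P and average ("P Z" in operator notation)
rng = np.random.default_rng(0)
monte_carlo = rng.normal(mu, sigma, size=1_000_000).mean()

print(integral, monte_carlo)  # both close to the true expectation, 1.0
```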
Parametric, semiparametric, and nonparametric models
A parametric model is one where each probability distribution $P \in \mathcal M$ can be uniquely described with a finite-dimensional set of parameters. For example, if we had a dataset $Z_{1 \dots n}$ where each $Z_i$ takes scalar values, we might suppose that each observation $Z_i$ is an identically distributed normal random variable. That's an assumption, of course, and it might not be realistic or good because it rules out a lot of other possible distributions. Nonetheless, it's one way we could define a statistical model.
Since $Z_i$ is normal in this model, we can describe its full distribution with just a mean and a standard deviation. Knowing only these two numbers tells us everything about the distribution. We say that this model is parametrized by these two numbers. An example of a slightly more complicated parametric model is a normal-linear model where each measurement in the data takes the form of a tuple $(Y, X)$ and we assume $Y = X\beta + \mathcal N(\mu, \sigma^2)$. Now we have three parameters: the slope $\beta$ as well as the mean $\mu$ and the standard deviation $\sigma$ of the normally-distributed random noise. Again, this implies a strict set of assumptions which may not be true: namely that the true relationship between $X$ and the conditional mean of $Y$ is linear and that the conditional distribution of $Y$ given $X$ is normal with a standard deviation that is fixed and doesn't depend on $X$.
In general, having a parametric model means that we can write the density of the data $Z$ as a function of some finite-dimensional vector of $k$ parameters $\theta$ (e.g. $\theta = [\beta, \mu, \sigma]$). We might write the density like $p_\theta(Z)$. Each value of $\theta$ in the domain $\Theta$ implies one density $p$ in $\mathcal M$ and each density in the model has one unique value of the parameters that characterizes it.
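As a sketch of what "a density indexed by a finite-dimensional $\theta$" looks like in code, here is the log-density of the conditional part of the normal-linear model above, with $\theta = [\beta, \mu, \sigma]$ (the function name is ours, purely for illustration):

```python
import numpy as np

def normal_linear_log_density(y, x, theta):
    """Log of the conditional density p_theta(y | x) under the model
    Y = X*beta + N(mu, sigma^2), with theta = [beta, mu, sigma]."""
    beta, mu, sigma = theta
    resid = y - (x * beta + mu)
    return -0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2)

# Each value of theta in Theta picks out exactly one distribution in the model:
print(normal_linear_log_density(y=1.3, x=0.5, theta=[2.0, 0.0, 1.0]))
```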
Parametric models are popular because they are easy to work with: finding the distribution most likely to have generated the data reduces to a finite-dimensional optimization problem in the model parameters. However, they are usually also completely artificial! Do you really believe that the "true" relationship between the dose of a drug you take and its toxicity is exactly a straight line? Any parametric model with a few parameters is necessarily a very "small" model that rules out the vast, vast majority of possibilities. Depending on the situation, basing statistical inference on a parametric model can give you completely meaningless results: biased estimates and invalid confidence intervals and p-values. We'll have more to say about that later.
Semiparametric and nonparametric are both terms used to describe models that cannot be boiled down to a finite-dimensional set of parameters. Roughly, "semiparametric" usually refers to a model with a finite-dimensional component of interest alongside an infinite-dimensional component, while "nonparametric" refers to a model that places essentially no restrictions on the distribution.
Empirical distributions
We use the symbol $\mathbb P_n$ to denote an empirical distribution. This measure assigns mass $1/n$ at each of the points taken by the random variables $Z_{1, \dots, n} \overset{IID}{\sim} P$. Mathematically, if we let $\delta_a(A)$ be the measure that returns 1 if $a \in A$ and 0 otherwise, then we can define $\mathbb P_n(Z \in A) = \frac{1}{n}\sum_i^n \delta_{Z_i}(A)$. This is just the fraction of samples $Z_i$ that happen to have fallen into the set $A$.
What you have to remember about the empirical measure, though, is that it's random. In other words, $\mathbb P_n(A)$ evaluated at any fixed set $A$ is itself a random variable because the data themselves are taken to be random variables (with their own underlying distribution) in an inferential analysis.
$\mathbb P_n$ evaluated at some set is a random variable because the data that define it are random. $P$ evaluated on some set is a fixed number. As $n$ increases, the empirical distribution tends to concentrate around the value that $P$ takes.
Or you can think of $\mathbb P_n$ as a random measure that takes values in the space of probability measures. In this case $\mathbb P_n$ has a distribution over the model space itself. Once you fix a particular sample $Z_{1, \dots, n}$ from $P$, then the empirical distribution takes a single point value in $\mathcal M$. The empirical measure $\mathbb P_n$ (for fixed $n$) is also one element of a sequence of random measures that in some sense gets closer and closer to $P$.
$\mathbb P_n$ as a "density" over the model space $\mathcal M$ with a typical realization shown. Note that this picture is just one possible example. In general, $\mathbb P_n$ may take values that are actually outside of $\mathcal M$!
Any way you look at it, this is an example of a random "thing" that takes values in a set that isn't $\mathbb R^p$. Note that the empirical measure depends on the underlying distribution $P$. So if we had a different underlying distribution, say $Q$, we might denote an empirical measure based on $n$ IID samples from $Q$ as $\mathbb Q_n$.
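A small simulation makes both points at once: $\mathbb P_n(A)$ varies from sample to sample, and it concentrates around the fixed number $P(A)$ as $n$ grows. Here we take $P = N(0,1)$ and $A = [0,1]$, both arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
A = (0.0, 1.0)                            # the fixed set A = [0, 1]
P_A = norm.cdf(A[1]) - norm.cdf(A[0])     # P(A): a fixed number

def empirical_measure(n):
    """One realization of P_n(A): the fraction of n draws landing in A."""
    Z = rng.standard_normal(n)
    return np.mean((Z >= A[0]) & (Z <= A[1]))

for n in [10, 100, 10_000]:
    draws = [empirical_measure(n) for _ in range(5)]  # P_n(A) is random
    print(n, np.round(draws, 3), "vs P(A) =", round(P_A, 3))
```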
Independent and Identically Distributed
In this book, we will only consider statistical models where each observation $Z_i$ is assumed to be independent and identically distributed (IID). In other words, we assume $P(Z_1, \dots, Z_n) = P(Z_1) \times \cdots \times P(Z_n) = P(Z)^n$. Analytically, this is extremely helpful because we only need to describe the distribution of a single generic observation $Z$ in order to describe the distribution of our whole dataset. Therefore when we talk about the statistical model, we're talking about the space of distributions $P(Z)$ of a single observation, with the implication that the data distribution is $P(Z)^n$.
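In code, this factorization is why the log-likelihood of an IID dataset is just a sum of single-observation terms; a minimal sketch, using an assumed normal density for the generic observation:

```python
import numpy as np
from scipy.stats import norm

def iid_log_likelihood(data, mu, sigma):
    """log P(Z_1, ..., Z_n) = sum_i log P(Z_i) under the IID assumption."""
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=100)
print(iid_log_likelihood(data, mu=0.0, sigma=1.0))
```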
This is an important assumption and limitation on the statistical model. A model where the data are IID is less generic (i.e. "smaller") than a model where we don't make that assumption. The reason we limit our focus is that the IID assumption is easily justifiable in many important problems. For example, in an observational study on cancer it's reasonable to presume that each subject is a sample from some eligible population (i.e. subjects are identically distributed) and that what happens to one subject doesn't affect what happens to another (i.e. subjects are independent).
Nonetheless, there are many interesting and important problems where the IID assumption is untenable. For example, any large study of a highly infectious disease must take dependencies between subjects into account because any one person getting sick increases the chance that others get the disease. There is a large literature that deals with inference where observations cannot be considered independent. One useful trick is to clump observations into clusters which can themselves be treated as independent (e.g. cluster randomized trials), but this is not always possible. In general, however, it is important to understand the well-developed techniques for performing inference with IID data before you try to tackle more challenging problems that will require you to extend and build on these strategies.
Structural Equation Model
Each "point" in a statistical model is a probability distribution PP. As we move from one distribution to another, say P~\tilde P, what exactly changes? How can we break up the model into pieces that allow us to reason more clearly about specific relationships?
A (nonparametric) structural equation model presents a solution to that problem. Let our observation ZZ be a vector of pp random variables [A1Ap][A_1 \dots A_p]. Recall that we can always factor an arbitrary joint distribution P(A1,A2,Ap)P(A_1, A_2, \dots A_p) into a product of conditional distributions P(ApAp1,A2,A1)×P(Ap1Ap2,A2,A1)×P(A2A1)P(A1)P(A_p | A_{p-1}, \dots A_2, A_1) \times P(A_{p-1} | A_{p-2}, \dots A_2, A_1) \times \cdots P(A_2 | A_1) P(A_1). It therefore suffices to describe each of the conditional distributions AkAk1A1A_k | A_{k-1} \dots A_1 and the marginal distribution of A1A_1 (we can also, of course, reorder the indexing 1jp1 \dots j \dots p in any way that is convenient to minimize notational and analytical burden). That's the entire idea of a structural equation model. We can write any joint distribution of A1ApA_1 \dots A_p as:
A1=f1(U1)A2=f2(A1,U2)A3=f3(A1,A2,U3)Ap=fp(Ap1,A1,Up)\begin{align*} A_1 &= f_1(U_1) \\ A_2 &= f_2(A_1, U_2) \\ A_3 &= f_3(A_1, A_2, U_3) \\ &\cdots \\ A_p &= f_p(A_{p-1}, \dots A_1, U_p) \end{align*}
Where U1UpU_1 \dots U_p are exogenous variables (basically a fancy word for "random" or "external" noise), none of which depend in any way on the observables A1,ApA_1, \dots A_p, and f1fpf_1 \dots f_p are fixed (nonrandom) functions. We have therefore described the joint distribution of Z=[A1,Ap]Z=[A_1, \dots A_p] with the joint distribution of [U1Up][U_1 \dots U_p] and the functions fjf_j. Once these are specified, everything else follows. This way of specifying a distribution is what we call a structural equation model. You might also think of this as "parametrizing" the distribution PP with the random variables UjU_j and the functions fjf_j. The space of possible UjU_j and fjf_j define the statistical model, and any one choice of these identifies one particular data-generating process within the model.
The structural equation model is extremely generic because we can describe any distribution in this way. If we want to impose additional assumptions, it's easy to do so. For example, if we want to impose conditional independence between $A_2$ and $A_3$ given $A_1$, we could specify $A_2 = f_2(A_1, U_2)$, $A_3 = f_3(A_1, U_3)$, and $U_2 \perp U_3$. As another example, if we want to specify that $A_2$ has conditional mean given by $\sin(A_1)$ and has normal conditional distributions with fixed variance $\sigma^2$, we can impose $f_2(A_1, U_2) = \sin(A_1) + U_2$ and $U_2 \sim N(0, \sigma^2)$. The possibilities are literally endless.
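To see these pieces in code, here's a minimal simulation sketch of the examples above; the marginal distribution of $A_1$ and the function $f_3$ are arbitrary choices made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 10_000, 0.5

# Exogenous noise: U2 and U3 are independent, giving A2 ⊥ A3 | A1.
U1 = rng.uniform(-np.pi, np.pi, n)
U2 = rng.normal(0.0, sigma, n)
U3 = rng.normal(0.0, 1.0, n)

A1 = U1                      # A1 = f1(U1): here just the identity
A2 = np.sin(A1) + U2         # f2 imposes E[A2 | A1] = sin(A1)
A3 = A1 + U3                 # f3 is an arbitrary illustrative choice

# The joint distribution of (A1, A2, A3) is fully determined by the U's and f's.
print(np.corrcoef(A2, A3)[0, 1])  # A2 and A3 are dependent, but only through A1
```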
It can also be useful to draw a picture that specifies the conditional independence relations present in any given structural equation model. This kind of picture is often called a directed acyclic graph (DAG) or graphical model. To draw a DAG for a model, you first draw your random variables as nodes. For any given variable $A_k$, you then look at its corresponding structural equation and draw an arrow into $A_k$ coming from any variable $A_j$ that appears on the right-hand side of the equation. Alternatively, you can have a DAG for a model and then translate it to a set of structural equations by following the reverse process. DAGs are necessarily a bit less expressive than the full algebraic structural equations, but there are many properties of complex joint distributions that can be more easily deduced by looking at a DAG than by reading through equations.
An example DAG for a model with three observed variables.
Example: Observational Study
To fix ideas we'll write up a structural equation model for a common use case: modeling an observational study. In our formal statistical model of an observational study, we imagine that for each subject $i$ we observe a vector $X_i$ of pre-treatment covariates (e.g. their age, sex, history of disease), an indicator $A_i$ of whether they received an active treatment ($A_i = 1$) or not ($A_i = 0$), and a scalar $Y_i$ that represents their outcome some time later (e.g. survival, score on a test, disease state). We'll factor the joint distribution as $P(Y|A,X)P(A|X)P(X)$ since this reflects the intuitive time ordering of the variables: the covariates $X$ are measured, then some treatment $A$ is given, and then finally the outcome $Y$ is observed. The treatment that is received may depend stochastically on the covariates (e.g. sicker people may be more likely to receive treatment) and the outcome almost certainly depends stochastically on both the covariates and the treatment. We could also have ordered the factorization differently, but this factorization reflects our understanding of the world and gives a more intuitive meaning to each factor.
$$\begin{align*} X &= U_X \\ A &= f_A(X, U_A) \\ Y &= f_Y(X, A, U_Y) \\ &\text{where } f_A \in \{0,1\} \end{align*}$$
Notice that the only restriction we've placed on this model besides enumerating the variables is that the function $f_A$ return either a 0 or a 1 (enforcing the fact that the treatment is binary, which we know is the case). We've omitted $f_X$ since we can always write $U_X = f_{X'}(U_{X'})$ for arbitrary $f_{X'}$ and $U_{X'}$, so this is not an assumption. Thus for all intents and purposes we have not assumed anything that we don't already know is true. We don't know that the relationship between $Y$ and $X$ is linear, or that the conditional distribution of $Y|X$ is normal, etc. If we don't know those things to be true, why would we impose them on our statistical model?
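As a concrete (and entirely hypothetical) instance of a data-generating process living inside this model, here's a simulation sketch; the particular $f_A$ and $f_Y$ below are illustrative choices, not assumptions the model makes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# X = U_X: covariate, e.g. age (no structural restriction at all)
X = rng.normal(50, 10, n)

# A = f_A(X, U_A) ∈ {0, 1}: older/sicker people more likely to be treated
p_treat = 1 / (1 + np.exp(-(X - 50) / 10))   # one possible P(A=1 | X)
A = rng.binomial(1, p_treat)                 # U_A hides in the coin flip

# Y = f_Y(X, A, U_Y): outcome depends on covariates and treatment
Y = 0.1 * X + 2.0 * A + rng.normal(0, 1, n)

# Treatment is confounded: treated and untreated groups differ in X
print(X[A == 1].mean(), X[A == 0].mean())
```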
Example: Simple Randomized Trial
Another example we can give is a simple randomized trial. A simple randomized trial is precisely the same as an observational study, except that the experimenter has precise control over who gets the treatment and who doesn't. In other words, we know what $P(A|X)$ is. In the simplest case, the experimenter assigns treatment to each subject on the basis of a coin flip. We can therefore write:
$$\begin{align*} X &= U_X \\ A &\sim \text{Bernoulli}(1/2) \\ Y &= f_Y(X, A, U_Y) \end{align*}$$
Notice that we've imposed the assumption that we know exactly what $U_A$ is (a coin flip) and what $f_A$ is (the identity). Controlling the distribution of $A$ in this way eliminates all arrows into $A$ in the DAG. It therefore breaks all dependence between $A$ and $X$: in a simple randomized trial the covariates have no role in what treatment someone gets. On the other hand, we still have not made any unfounded assumptions about $Y$ because we don't know anything precise about how the covariates and treatment relate to the outcome. This is easy to see in the DAG, where it's now impossible to find a common ancestor of both $Y$ and $A$ (because $A$ has no ancestors at all!).
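Adapting the hypothetical simulation above to the randomized-trial model takes a one-line change: treatment no longer looks at $X$ at all (the $f_Y$ here is again an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

X = rng.normal(50, 10, n)                    # X = U_X, same as before
A = rng.binomial(1, 0.5, n)                  # A ~ Bernoulli(1/2): ignores X entirely
Y = 0.1 * X + 2.0 * A + rng.normal(0, 1, n)  # same f_Y as before

# Randomization breaks the A–X dependence: the groups are comparable
print(X[A == 1].mean(), X[A == 0].mean())    # nearly equal
```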
Estimand
Now that we've represented the space of possible worlds with a space of probability distributions, we have to represent our scientific question in the same terms. For example, let's say we're interested in what the "usual" temperature is on a day in San Francisco. A sensible way to formalize that would be to ask what the expected value is of a variable $Z$ (representing temperature) with some given distribution: in other words, $E[Z]$. Another question of interest might be: what is the average outcome (e.g. disease-free survival time) for a group of patients who were all exposed to a particular drug? In this case, we might represent each subject's outcome as a variable $Y$ and their treatment status as $A \in \{0,1\}$ (0 corresponding to no drug, 1 corresponding to drug) and the quantity of interest might be the conditional expectation $E[Y|A=1]$.
We refer to the quantity of interest as the statistical estimand or statistical target parameter. Formally, an estimand is a mapping (function) from the statistical model $\mathcal M$ to the domain of the parameter (usually the real numbers). In other words, each distribution $P$ in the model has its own (not necessarily unique) value of the estimand. This formalizes the fact that the answer to our scientific question is different in different "universes". We use the notation $\psi(P)$ to denote the value of the estimand if the true data-generating distribution is $P$. For example, if we had data $Z = (Y, A)$ and the estimand were the conditional mean of $Y$ given $A = 1$, then we would have $\psi(P) = E_P[Y|A=1]$. Notice that the argument $P$ is what we're taking the expectation with respect to. We can write the expectation more explicitly either in the measure-theoretic notation $\psi(P) = \int_{\{A=1\}} Y \, dP \,/\, P(A=1)$ or, if we have that the density of $Y|A$ under $P$ is $p(y|a)$, we'd write $\psi(P) = \int y \, p(y|1) \, dy$ (as opposed to, e.g., $\psi(\tilde P) = \int y \, \tilde p(y|1) \, dy$ for a different distribution $\tilde P$). Either way, the point is to demonstrate why $P$ is itself the argument here: the estimand maps each distribution to whatever the "true" answer to our question would be if we were in the universe where our data were generated according to that distribution. That's why, if we knew $P$, we wouldn't need any data at all to answer our question.
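This symmetry is easy to express in code: the estimator applies to the empirical distribution $\mathbb P_n$ the same kind of mapping that the estimand applies to $P$. A minimal sketch, with a simulated dataset in which the true value $\psi(P) = E_P[Y|A=1]$ is known to be 2:

```python
import numpy as np

def psi_hat(Y, A):
    """Plug-in estimator of psi(P) = E_P[Y | A = 1]:
    apply the analogous mapping to the empirical distribution P_n."""
    return Y[A == 1].mean()

rng = np.random.default_rng(0)
A = rng.binomial(1, 0.5, 10_000)
Y = 2.0 * A + rng.normal(0, 1, 10_000)  # true psi(P) = E[Y | A=1] = 2

print(psi_hat(Y, A))  # an estimate: one realization of a random variable
```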
“Estimand” vs “target parameter”?
These two terms are completely synonymous in the context of modern causal inference.
In traditional (parametric) statistics, the statistical distributions could always be described with a finite number of parameters. These parameters (or simple functions of them) were always the objects of interest, i.e. the estimands. As statistics started moving into nonparametrics, the term “parameter” stuck around but acquired a new, generalized meaning as a generic attribute of a distribution. That was then formalized into the definition we have now, which is that a parameter is a mapping from the model space into, say, the real numbers. The point is that parameters are now defined in a way that divorces them from any particular parametrization, whereas it used to be that parameters really only made sense within their own models.
We think this shift is fundamental enough to warrant the use of a new term (“estimand”), especially since most people still carry around the more intuitive but limited perspective they learned in their first statistics class. Moreover, “estimand” makes it clear that this is the aspect of the distribution that you actually care about. “Parameter” is agnostic to the analyst, which is why you’ll see “target parameter” used to impose the required normativity.
I (Alejandro) like “estimand” because it’s shorter and makes for a cleaner mental break with the world of purely parametric statistics, but you’ll see “parameter” or “target parameter” used more frequently in the literature. Throughout this book we’ll use these terms interchangeably so that you get used to seeing both.
The term "estimand" is used in the literature to refer both to this mapping ψ\psi and sometimes also to the actual value ψ(P)\psi(P) that the mapping obtains at the truth PP (we also sometimes abuse notation and write ψ\psi to mean the value ψ(P)\psi(P) and not the function ψ\psi when it's obvious what is intended). If you like, you can think of the function ψ\psi as the "estimandor" and the value ψ(P)\psi(P) as the "estimand" (in analogy to an "estimator" and "estimate"). Just note that these aren't terms you'll find anywhere else. Either way, you should notice the parallel between an estimator and estimand(or): both are known functions that map distributions into real numbers (that's part of why it's nice to be able to write the data as the empirical distribution Pn\mathbb P_n- we get a nice symmetry). Then there's also the parallel between estimate and estimand: both are real numbers, the former being a visible guess of the latter, which is unknown.
Inference
With these pieces in place we can now discuss inference, which is the process of assigning a formal notion of uncertainty to our estimate.
Imagine that we knew, with certainty, that our data $\mathbb P_n$ were generated by taking $n$ random draws from a particular probability distribution $P$. Since the estimator $\hat\psi$ is a deterministic function of the data, we could directly compute the distribution of the estimate $\hat\psi(\mathbb P_n)$. It's important to understand that the estimate is itself a random variable, since the data are themselves random. The distribution of $\hat\psi(\mathbb P_n)$ is called the sampling distribution of the estimate (or of the estimator). The sampling distribution is what we would get if we repeated our entire experiment under identical conditions (i.e. sampling $n$ observations from $P$) an infinite number of times and made a histogram of all the estimates $\hat\psi(\mathbb P_n)$ we got each time. In reality, we only ever see a single realization of our experiment and a single realized estimate. We therefore need to use probability to imagine what would have happened if we repeated the experiment many, many times.
We never know $P$, so the exact sampling distribution of our estimator is generally unknown. Most of the challenge of statistics is figuring out how to say something about the sampling distribution of an estimate despite not knowing everything about $P$. Undoubtedly the most useful tool we have for this task is the central limit theorem, which says that the distribution of an average of $n$ IID variables $Z$ starts looking more and more like a normal (specifically $\mathcal N(E[Z], V[Z]/n)$) as the number of observations gets large. If our estimator takes the form of a sample average of some variable $Z$, or can be shown to behave like a sample average as the number of observations grows, then we can apply the central limit theorem to understand its sampling distribution. All we have to do is estimate $E[Z]$ and $V[Z]$ consistently.
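A simulation lets us do what reality forbids: repeat the whole experiment many times and inspect the sampling distribution directly. A sketch, taking $P$ to be a (decidedly non-normal) exponential distribution and the estimator to be the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_reps = 200, 20_000

# Repeat the "experiment" (n IID draws from P) many times
estimates = rng.exponential(scale=1.0, size=(n_reps, n)).mean(axis=1)

# CLT: the sampling distribution is approximately N(E[Z], V[Z]/n)
print("mean of estimates:", estimates.mean())      # ≈ E[Z] = 1
print("variance of estimates:", estimates.var())   # ≈ V[Z]/n = 1/200
```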
Knowing the sampling distribution of our estimate is the key to quantifying the uncertainty we have in it. While we could calculate any property of the sampling distribution, the two things we're usually interested in are the bias and the variance of our estimate.
Bias is the systematic error of our estimator: $E[\hat\psi] - \psi$. This is the amount by which the center of mass of the sampling distribution of our estimate $\hat\psi$ is shifted away from the true estimand $\psi$. We generally look for estimators that are unbiased, or at a minimum that become unbiased as the number of observations increases (i.e. are asymptotically unbiased or consistent). The variance of our estimator is nothing other than $V[\hat\psi]$. It is a measure of the "spread" of the estimate, or how much we'd expect our answer to vary just by random chance. Having an estimate of the sampling variance is key to constructing confidence intervals and p-values, which are the usual ways to quantify uncertainty.
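We can approximate the bias and variance of an estimator by Monte Carlo in the same way. Here's a sketch contrasting the variance estimator that divides by $n$ (biased in finite samples, though consistent) with the one that divides by $n - 1$ (unbiased); the normal data-generating distribution is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_reps, true_var = 20, 100_000, 4.0

data = rng.normal(0.0, np.sqrt(true_var), size=(n_reps, n))
biased = data.var(axis=1, ddof=0)     # divide by n
unbiased = data.var(axis=1, ddof=1)   # divide by n - 1

print("bias (ddof=0):", biased.mean() - true_var)    # ≈ -true_var / n = -0.2
print("bias (ddof=1):", unbiased.mean() - true_var)  # ≈ 0
print("sampling variance of the estimator:", unbiased.var())
```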
One of the most important things to understand about inference is that the properties of any given estimator depend on the statistical model under which the estimator is used. Under certain assumptions, an estimator might be shown to be unbiased and attain a particular sampling variance. But if those assumptions are violated, all of those guarantees may be lost. It is therefore of great interest to construct estimators that have nice properties under very general conditions so that our inference is not resting on unrealistic assumptions. In the next chapter we'll take you through a detailed example and you'll see what we mean.
Summary
We can sum all of this up in one picture which neatly captures the pieces of an inferential analysis. The part inside the red rectangle is just a descriptive analysis. It's all you ever get to see in real life. But if you want to give your estimate a meaningful interpretation and measure of uncertainty, you need an inferential analysis. Inference requires you to imagine a space of possible distributions that could have generated your data (the model) and to identify a transformation that summarizes some characteristic of the distribution that you're interested in (the estimand). Finally, we derive how our estimate would vary across an infinite number of hypothetical repetitions of our whole experiment, drawing new data each time from the true distribution. From this we can establish, for example, that our estimate is right on average (has little or no bias) and is unlikely to vary more than a certain amount by random chance (has a particular variance). That's inference.