1.1 Descriptive Analysis

1.1 Descriptive Analysis

Table of Contents

Data

Estimator

Estimate

A descriptive analysis is what we call any transformation of observed data into a summary number (or numbers) that doesn't require an "imaginary world" to interpret. Maybe that sounds a little abstract but hopefully when we contrast it with an inferential analysis later you'll see what we mean.

A descriptive analysis only has three components:

Data: some measurements about the real world (e.g. a spreadsheet with rows representing subjects in a clinical trial and columns representing their measured characteristics)

Estimator: a fixed algorithm or process or function that takes data and returns a number (e.g. a function like mean() in R or python that takes data as an input and returns one number)

Estimate: the summary number that is returned (e.g. the mean age of the subjects in our trial dataset)

As another example, consider obtaining a dataset that contains the daily high temperature in San Francisco for every day in 2021 (Z1,Z2…Z365Z_1, Z_2 \dots Z_{365}Z1​,Z2​…Z365​﻿) and taking the average of all those temperatures: Zˉ=1n∑in=365Zi\bar Z = \frac{1}{n}\sum_i^{n=365} Z_iZˉ=n1​∑in=365​Zi​﻿. Here the data are the measurements Z1,…365Z_{1,\dots 365}Z1,…365​﻿, the estimator is the function ψ^(Z1,…n)=1n∑inZi\hat\psi(Z_{1,\dots n}) = \frac{1}{n}\sum_i^n Z_iψ^​(Z1,…n​)=n1​∑in​Zi​﻿, and the estimate is the average Zˉ\bar ZZˉ﻿. At this point there isn't any notion of what it is we are "estimating". The estimate Zˉ\bar ZZˉ﻿ is just a transformation of the data we observe, we haven't connected it in any way to some hidden property of the real world.

Data

In general each data point ZiZ_iZi​﻿ may itself be a vector of variables instead of a scalar. A running example that we'll use throughout is the case where we're observing a sample of nnn﻿ individuals, each of whom has a vector of baseline covariates XiX_iXi​﻿, receives a treatment AiA_iAi​﻿, and then experiences some outcome YiY_iYi​﻿. For example, we might imagine that we're looking at a group of people who all presented to the emergency room with difficulty breathing. The baseline covariates might include a person's age, sex, and weight, the treatment might be whether or not that person was admitted to the hospital, and the outcome could be whether or not the person recovered from their illness within the week. In this case, the data are observations of the tuple Zi=(Xi,Ai,Yi)Z_i = (X_i, A_i, Y_i)Zi​=(Xi​,Ai​,Yi​)﻿. 

Instead of writing out Z1,…nZ_{1,\dots n}Z1,…n​﻿ to denote the full set of data, we'll often use the symbol Pn(Z)\mathbb P_n(Z)Pn​(Z)﻿ or more often just Pn\mathbb P_nPn​﻿ for short. This symbol represents a probability distribution that places mass 1/n1/n1/n﻿ at each observed ZiZ_iZi​﻿. As long as the order of the observations doesn't matter, for any given dataset Z1,…nZ_{1, \dots n}Z1,…n​﻿ there is exactly one empirical distribution Pn\mathbb P_nPn​﻿ that represents it, and vice versa, so we capture everything we need with this formalism. Writing Pn\mathbb P_nPn​﻿ is easier than Z1,…nZ_{1,\dots n}Z1,…n​﻿ and it turns that thinking of the data as an empirical distribution also has conceptual benefits that we will exploit later.

Estimator

An estimator can be any function of the full dataset. In the example above, we returned an average, but we could compute any kind of summary and call it an estimator. For example, we could compute a max, or a range, or a median, or even do something much more complex. For example, if we have data Zi=(Xi,Ai,Yi)Z_i = (X_i, A_i, Y_i)Zi​=(Xi​,Ai​,Yi​)﻿ we might do a linear regression of the YYY﻿ values onto the matrix of covariates [A,X][A,X][A,X]﻿ and extract the coefficient for AAA﻿. 

In general we use the notation ψ^\hat\psiψ^​﻿ to denote a function that ingests a dataset and returns a number. We can also just as easily define estimators that return multiple numbers at once, or that return a function, or anything else. These are no different, really, so throughout this book we'll stick to estimators that return real numbers for simplicity.

Estimate

The estimate is just whatever the estimator returns! It may or may not have a meaningful interpretation, and we'll get into that shortly. The estimate is denoted ψ^(Pn)\hat\psi(\mathbb P_n)ψ^​(Pn​)﻿. Sometimes, in an abuse of notation, we'll just say ψ^\hat\psiψ^​﻿ when really we mean ψ^(Pn)\hat\psi(\mathbb P_n)ψ^​(Pn​)﻿. It should be clear from context what is being referred to (the estimator ψ^\hat\psiψ^​﻿ or the estimate ψ^(Pn)\hat\psi(\mathbb P_n)ψ^​(Pn​)﻿).

⬅️

BACK: 1. A Roadmap for Statistics

➡️

NEXT: 1.2 Inferential Analysis