### load libraries
library(tidyverse)
library(tidybayes)
library(rstan)
library(skimr)

Background

In grad school I was trained in frequentist statistics – which is great. Frequentist statistics are very useful. I’m not a hater*.

*Side note: I absolutely am a hater of the dichotomous decision making imposed by null hypothesis significance testing (NHST) and “statistical significance” (see this great paper).

But, Bayesian statistics are also useful and I found myself wanting to learn more.

When I started learning Bayesian statistics on my own, the best book I came across (by far) was Richard McElreath’s Statistical Rethinking. Using very simple examples and applications in R code, he walks through the basics in a way that just made things click for me.

The purpose of this notebook is to provide a similar practical introduction to the basics of Bayesian statistics in R – using the same examples as in Statistical Rethinking – but without relying on McElreath’s rethinking R package. While rethinking is awesome, I personally prefer working more directly with rstan and using tidyverse friendly packages like tidybayes.

Thanks to Richard McElreath for the great book, and I hope this helps others understand the basics of Bayesian statistics in the same way it did for me when I was first learning.

Why Bayesian?

For me – and I think for most people – the most appealing thing about Bayesian statistics is that they allow for more intuitive statements to be made about the world. Frequentist stats can be hard to explain.

For example, in frequentist statistics we’re stuck in a world where we have to talk about \(p(data|hypothesis)\) – the probability of the data given the hypothesis. This paints us into an (often misinterpreted) corner when it comes to things like p-values and confidence intervals. For example:

  • p-values: The probability of observing this difference – or a greater difference – between two groups assuming the null hypothesis is true (i.e., we can’t make a more intuitive statement about the probability of a given hypothesis being true).
  • 95% confidence intervals: If we were to repeat our experiment many times, we’d expect our confidence interval to contain the true population difference between the two groups 95% of the time… as for our specific confidence interval, we just kinda hope it’s one of the 95% of CIs that does contain the true population difference, but we’re not sure (i.e., we can’t make a more intuitive statement about there being a 95% chance that the true difference between groups is somewhere between [lower 95% CI] and [upper 95% CI]). The quick simulation below makes this repeated-sampling idea concrete.
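Here’s a quick simulation sketch (my own illustration; the specific numbers are arbitrary): draw many samples from a population with a known mean, compute a 95% confidence interval from each, and count how often the intervals capture the truth.

# simulate many "experiments" and check how often the 95% CI captures the known true mean
set.seed(42)
true_mean <- 10

covered <- replicate(1000, {
    x  <- rnorm(30, mean = true_mean, sd = 2)  # one experiment: 30 observations
    ci <- t.test(x)$conf.int                   # its 95% confidence interval
    ci[1] <= true_mean & true_mean <= ci[2]    # did this interval capture the truth?
})

mean(covered)  # close to 0.95 across repeats, but any single interval either did or didn't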

Bayesian statistics, on the other hand, flip this on its head. We get to talk about \(p(hypothesis|data)\) – the probability of the hypothesis given the data:

  • Posterior: Bayesian statistics result in a posterior distribution representing the relative plausibilities of different parameter values of interest, conditional on the data (i.e., we can say directly how likely it is that a given hypothesis about a parameter value is true).

  • Credible Intervals: Given the posterior distribution of a parameter value of interest, we know the probability that it falls within any given interval (i.e., we can say directly that there’s a 95% chance that the parameter value of interest lies somewhere between [lower 95% CI] and [upper 95% CI]).

The rest of this notebook walks through the concepts of parameters/likelihood/priors and how to create posterior distributions from them – first using an intuitive method known as grid approximation and then building on that intuition to understand sampling methods like Markov Chain Monte Carlo (MCMC).

A simple motivating example

Let’s start with a simple example where there is only 1 unknown parameter value of interest: the proportion of the Earth that is covered in water.

Suppose we have a globe representing the Earth. We are curious how much of the surface of this globe is covered in water, so we adopt the following strategy: toss the globe in the air and catch it, then record whether the area under your index finger is water (W) or land (L). The first 9 samples look like this:

W L W W W L W L W (i.e., 6 out of 9 samples are water)
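As an aside, you could mimic this data-generating process in R. The true proportion of 0.7 below is just an arbitrary assumption for the sake of illustration – the whole point is that we don’t actually know it:

# simulate 9 globe tosses, pretending (for illustration only) that 70% of the surface is water
set.seed(1)
tosses <- sample(c("W", "L"), size = 9, replace = TRUE, prob = c(0.7, 0.3))
tosses
sum(tosses == "W")  # count of water observations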

What Bayesian statistics are going to allow us to do is calculate the posterior plausibility of different values of our unknown parameter of interest (i.e., the percent of the Earth that is covered in water), given this data we’ve collected along with any prior beliefs we might have about how much of the Earth is covered in water.

To accomplish this we’ll need a few things:

  • A likelihood function
  • The unknown parameter(s) we want to estimate
  • A prior distribution for each unknown parameter we want to estimate

Likelihood

Likelihood is a mathematical formula that specifies the plausibility of the data. It maps each possible parameter value onto the relative number of ways the data could occur, given that possibility.

In our globe tossing example, we need to say how likely that exact sample is, out of the universe of potential samples of the same length. In this case we can use the binomial distribution. For example, the relative number of ways to get six W’s, holding p at 0.50 and n at 9 is:

dbinom(6, size = 9, prob = 0.5)
## [1] 0.1640625
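For intuition, we can evaluate the likelihood of the same data at a few other candidate values of \(p\) (these three values are just arbitrary examples):

# likelihood of six W's in 9 tosses at a few candidate values of p
dbinom(6, size = 9, prob = c(0.3, 0.5, 0.7))  # roughly 0.021, 0.164, and 0.267

The data we observed are relatively more plausible under \(p = 0.7\) than under \(p = 0.3\) or \(p = 0.5\).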

But what is the most plausible value of \(p\), given our data and any prior beliefs? This is what we will soon find out.

Parameters

Parameters are simply quantities that we wish to estimate from the data. They represent the different conjectures for the causes/explanations of the data.

In the globe tossing example, both n (number of tosses) and w (count of water) are data, leaving p as the unknown parameter that we wish to estimate (the proportion of the Earth that is covered in water).

Prior

For every parameter we intend our Bayesian model to estimate, we must provide a prior – an initial plausibility assignment for each possible value of the parameter. More on priors below.

Posterior Distribution

Once we have the likelihood, which parameters are to be estimated, and a prior for each parameter, we can calculate the posterior – the relative plausibility of different parameter values, conditional on the data:

\[Posterior = \frac{Likelihood\;\times\;Prior}{Average\;Likelihood}\]

Note: “Average likelihood” (sometimes also called “evidence” or “probability of the data”) simply standardizes the posterior to ensure it sums (integrates) to 1.

However, knowing the math of Bayes theorem is often of little help because many models cannot be conditioned formally. Instead we use various numerical techniques to approximate the mathematics that follows from the definition of Bayes’ theorem. We’ll cover two such numerical techniques for computing posterior distributions here:

  • Grid approximation
  • Markov Chain Monte Carlo (MCMC)

Note: McElreath also nicely covers another approach – Quadratic approximation – in his book. Here I use grid approximation to build intuition and then jump straight to MCMC sampling, which is commonly used when there are many (many) unknown parameters to estimate instead of just one.

Grid approximation

While most parameters are continuous, capable of taking on infinite values, we can achieve excellent approximation of a continuous posterior distribution by considering only a finite “grid” of parameter values.

First we can define the number of approximations to use. The more we use, the more granular our posterior distribution will be.

# define the number of approximations to use
n_approx <- 1000

Next we can define the “grid” of possible values of our unknown parameter that we wish to explore. In this case, we’re interested in the plausibility of values ranging from 0 (0% of the Earth is covered in water) to 1 (100% of the Earth is covered in water).

# define grid of 1000 possible values of p ranging from 0 to 1
p_grid <- seq(from = 0, to = 1, length.out = n_approx)

Now, we might have some prior data or knowledge about how much of the Earth is covered in water, and we can specify a prior to reflect those beliefs. For example, we might use a “step prior” to say that we think there is zero probability that less than 50% of the Earth is covered in water. Here, though, we’ll use a uniform prior that assigns equal prior probability to all possible parameter values in our grid (even though this is not very reasonable: do we really think that 0% or 100% of the Earth covered by water is just as likely as, say, 50%?):

# define the prior
prior <- rep(1, n_approx)  # uniform prior
# prior <- ifelse(p_grid < 0.5, 0, 1)  # step prior, zero probability to values less than 0.5

Finally, we use the binomial distribution to calculate the likelihood of our data at each value in our “grid” which ranges from 0 to 1:

# compute the likelihood at each value in the grid
likelihood <- dbinom(6, size = 9, prob = p_grid)

And then we can compute our posterior using the formula above:

# compute the product of likelihood and prior
unstd_posterior <- likelihood * prior

# standardize the posterior, so it sums to 1
posterior <- unstd_posterior / sum(unstd_posterior)

# plot the posterior distribution
tibble(p_grid, posterior) %>%
    ggplot(aes(x = p_grid, y = posterior)) +
    geom_point(color = "skyblue") +
    labs(
        title = "Posterior distribution",
        x = "Possible parameter values",
        y = ""
    ) +
    theme_minimal()
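Finally, to connect this back to the credible intervals discussed earlier, we can draw samples from the grid in proportion to their posterior probability and summarize them directly. This is just a minimal sketch using base R (the sample size of 10,000 is arbitrary):

# draw samples of p from the grid, weighted by their posterior probability
set.seed(123)
posterior_samples <- sample(p_grid, size = 10000, replace = TRUE, prob = posterior)

# 95% credible interval: a 95% chance the proportion of water lies in this range, given the model and data
quantile(posterior_samples, probs = c(0.025, 0.975))

# posterior probability that more than half of the Earth is covered in water
mean(posterior_samples > 0.5)

These are exactly the kinds of direct probability statements that motivated the Bayesian approach in the first place.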