
Bayesian inference

Bayesian inference formalizes how to optimally update knowledge in light of observed data.

In math-speak, we can express this as

\pi(\theta \vert y) \propto \pi(y \vert \theta) \, \pi(\theta).

This one’s important, so let’s take this step by step:

  • the probability of a parameter configuration given the data, the posterior \pi(\theta \vert y)
  • is proportional to \propto
  • the probability of the data given the parameter configuration, the likelihood \pi(y \vert \theta)
  • multiplied by the probability of the parameter configuration, the prior \pi(\theta).

So much for the theory. Let’s implement this ourselves to get the hang of it!

Probabilistic coins

Let’s begin with the all-time classic of inferring the probability of a coin landing heads.

But first, let’s simulate some data.

true_theta <- 0.7
# simulate 100 coin flips: 1 = heads (probability 0.7), 0 = tails
y <- sample(c(0, 1), 100, replace = TRUE, prob = c(1 - true_theta, true_theta))

Likelihood function

For our coin toss, we need to implement a Bernoulli likelihood:

L(\theta \vert x_1, \ldots, x_n) = \theta^h (1-\theta)^{n-h} \quad \text{where } h = \sum_{i=1}^n x_i.

In practice, it is necessary to work with the log-likelihood to prevent numerical underflow: the product of many probabilities below one quickly becomes too small to represent in floating point.
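You can see the underflow for yourself: the product of a few thousand per-observation probabilities collapses to zero, while the sum of their logs remains perfectly representable (a small sketch; 0.7 and 5000 are arbitrary choices):

# each observation contributes a probability < 1; their product shrinks
# exponentially fast and eventually underflows to exactly 0
probs <- rep(0.7, 5000)
prod(probs)     # 0 -- underflow
sum(log(probs)) # about -1783.4 -- no problem at all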

With the log-likelihood in hand, we simply add the log prior to perform Bayesian inference as outlined above.

log_likelihood <- function(theta, y) {
  # h * log(theta) + (n - h) * log(1 - theta), with h the number of heads
  sum(y) * log(theta) + sum(y == 0) * log(1 - theta)
}
log_posterior <- function(theta, y, alpha, beta) {
  # log posterior = log likelihood + log prior (up to an additive constant);
  # dbeta(..., log = TRUE) is more stable than log(dbeta(...))
  log_likelihood(theta, y) + dbeta(theta, alpha, beta, log = TRUE)
}
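As a quick sanity check, we can evaluate the log posterior at a few values of theta; with a Beta(4, 4) prior (the one we will use below), the value closest to the true theta of 0.7 should score highest (exact numbers will vary with the simulated data):

# the log posterior should peak near the true theta of 0.7
sapply(c(0.3, 0.5, 0.7, 0.9), function(t) log_posterior(t, y, 4, 4))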

Posterior updating

Updating the posterior is straightforward for a problem as simple as this coin toss.

More complicated problems will force us to resort to a more computationally expensive technique, so-called Markov chain Monte Carlo (MCMC) sampling.

Analytical posterior

Given the conjugate prior

\text{prior} \sim \text{Beta}(\alpha, \beta)

the posterior distribution is

\pi(\theta \vert y) \sim \text{Beta}(\alpha + \text{heads}, \beta + \text{tails}).
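To see why, multiply the likelihood by the Beta prior density, which is proportional to \theta^{\alpha - 1} (1-\theta)^{\beta - 1}:

\pi(\theta \vert y) \propto \theta^{h} (1-\theta)^{n-h} \cdot \theta^{\alpha - 1} (1-\theta)^{\beta - 1} = \theta^{\alpha + h - 1} (1-\theta)^{\beta + (n-h) - 1},

which is exactly the kernel of a \text{Beta}(\alpha + h, \beta + n - h) distribution, with h heads and n - h tails.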

This is straightforward to code, too:

alpha <- 4 # prior pseudo-counts of heads
beta <- 4  # prior pseudo-counts of tails
analytic_posterior <- rbeta(1000, alpha + sum(y), beta + sum(y == 0))
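A quick look at the draws confirms they behave as expected (the exact numbers depend on the simulated data):

mean(analytic_posterior)                      # should be close to true_theta = 0.7
quantile(analytic_posterior, c(0.025, 0.975)) # a 95% credible interval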

Great! We’re done! 🎉 … Wait, why is there more text below?

The bad news is that analytical solutions as simple as that above are rarely applicable in real-world scenarios.

This mathematical dilemma meant that Bayesian inference was, until fairly recently, available only to those with very powerful computers. Nowadays, however, the average laptop can run the necessary calculations in reasonable time.

MCMC sampling

MCMC (Markov Chain Monte Carlo) sampling is a technique used to estimate probability distributions, particularly when direct calculation is difficult or impossible.

At its core, MCMC sampling works by ‘exploring’ the parameter space (the set of all possible parameter values) more or less randomly. Having found a ‘nice spot’ (parameters with high probability density, meaning they’re more likely given the observed data and prior beliefs), most samplers will be more likely to hang around and sample a little more. On the contrary, if the current spot is terrible, the sampler will quickly move away to try to sample elsewhere.

Each new sample depends only on the current sample, creating what is called a Markov chain¹. This simple rule allows us – after many, many samples – to learn about the posterior distribution without analytically solving it. Tada! We can now approximate complex probability distributions and make inferences about parameters in statistical models which we cannot solve analytically.

The simplest of these samplers is the so-called Metropolis-Hastings algorithm². It is fine if you simply copy this code chunk for the final visualization. For the curious, have a look and try to understand what it is doing – it is surprisingly straightforward!

calculate_accept_prob <- function(theta_proposed, theta_current, y, alpha, beta) {
  log_proposed <- log_posterior(theta_proposed, y, alpha, beta)
  log_current <- log_posterior(theta_current, y, alpha, beta)
  # the Beta proposal below is asymmetric, so we need the Hastings
  # correction: the ratio of proposal densities in both directions
  log_q_forward <- dbeta(theta_proposed, theta_current * 10, (1 - theta_current) * 10, log = TRUE)
  log_q_backward <- dbeta(theta_current, theta_proposed * 10, (1 - theta_proposed) * 10, log = TRUE)
  exp(log_proposed - log_current + log_q_backward - log_q_forward)
}
metropolis_hastings <- function(y, alpha, beta, n_iterations) {
  theta_current <- rbeta(1, alpha, beta) # start from a draw from the prior
  samples <- numeric(n_iterations)
  for (i in seq_len(n_iterations)) {
    # propose a new theta centered around the current one
    theta_proposed <- rbeta(1, theta_current * 10, (1 - theta_current) * 10)
    accept_prob <- calculate_accept_prob(theta_proposed, theta_current, y, alpha, beta)
    if (runif(1) < accept_prob) {
      theta_current <- theta_proposed # accept and move on
    }
    # on rejection, we stay put and record the current value again
    samples[i] <- theta_current
  }
  return(samples)
}
n_iterations <- 10000
samples <- metropolis_hastings(y, alpha, beta, n_iterations)
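One convention worth knowing: the first draws are often discarded as ‘burn-in’ while the chain moves away from its starting value (the cutoff of 1,000 draws below is an arbitrary choice). After that, the MCMC approximation should closely match the analytical posterior:

# discard the first 1000 draws as burn-in, then compare to the analytical result
posterior_samples <- samples[-(1:1000)]
mean(posterior_samples) # should be close to mean(analytic_posterior)
sd(posterior_samples)   # and of similar spread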

Visualization

Visualization is a cornerstone of the Bayesian workflow and a great aid to interpretation. Typically, we would not only plot prior and posterior predictions, but also check the MCMC sampling for issues. We will have a look at that in a separate tutorial.

library(ggplot2)
df <- data.frame(theta = samples)
ggplot(df, aes(x = theta)) +
  # MCMC samples as a density histogram
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
  # analytical posterior: Beta(alpha + heads, beta + tails)
  stat_function(
    aes(color = "Analytical posterior"),
    fun = dbeta, args = list(
      shape1 = alpha + sum(y),
      shape2 = beta + sum(y == 0)
    ),
    linewidth = 1
  ) +
  # prior: Beta(alpha, beta)
  stat_function(
    aes(color = "Prior"),
    fun = dbeta, args = list(shape1 = alpha, shape2 = beta),
    linewidth = 1
  ) +
  labs(x = "theta", y = "Density", title = "Posterior distribution of theta") +
  theme_minimal() +
  scale_color_manual(values = c(
    "Analytical posterior" = "red",
    "Prior" = "green"
  )) +
  guides(color = guide_legend(title = NULL))
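As a tiny preview of those sampling checks, a trace plot is the quickest way to spot obvious problems; a healthy chain looks like stationary noise without trends or long flat stretches:

# trace plot: sampled theta values in the order they were drawn
ggplot(data.frame(iteration = seq_along(samples), theta = samples),
       aes(x = iteration, y = theta)) +
  geom_line(color = "skyblue") +
  labs(x = "Iteration", y = "theta", title = "Trace plot of theta") +
  theme_minimal()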

Footnotes

  1. The Markov property dictates that each sample may only depend on the value of the previous sample.

  2. Nowadays, the preferred sampling algorithm is typically an extension of the Hamiltonian Monte Carlo algorithm (HMC) called No-U-Turn Sampler (NUTS).