Bayesian inference
Bayesian inference formalizes how to optimally update knowledge in light of observed data.
In math-speak, we can express this as

$$p(\theta \mid y) \propto p(y \mid \theta) \, p(\theta).$$
This one’s important, so let’s take this step by step:
- the probability of a parameter configuration given the data, the posterior $p(\theta \mid y)$,
- is proportional to
- the probability of the data given the parameter configuration, the likelihood $p(y \mid \theta)$,
- multiplied by the probability of the parameter configuration, the prior $p(\theta)$.
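To make this concrete, here is a minimal numerical sketch in Python (all numbers are made up for illustration): with only three candidate coin biases, multiplying likelihood and prior and then normalizing turns the proportionality into actual probabilities.

import numpy as np

thetas = np.array([0.3, 0.5, 0.7])   # three hypothetical coin biases
prior = np.array([0.25, 0.5, 0.25])  # prior beliefs over those biases

# Likelihood of one particular sequence of 8 heads and 2 tails
# (the binomial coefficient is omitted; it cancels when normalizing).
likelihood = thetas**8 * (1 - thetas)**2

unnormalized = likelihood * prior              # right-hand side of Bayes' theorem
posterior = unnormalized / unnormalized.sum()  # normalize to get probabilities
print(posterior)  # most of the mass now sits on theta = 0.7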
So much for the theory. Let's implement this ourselves to get the hang of it!
Probabilistic coins
Let’s begin with the all-time classic of inferring the probability of a coin landing heads.
Before all that, though, let's simulate some data.
true_theta <- 0.7
y <- sample(c(0, 1), 100, replace = TRUE, prob = c(1 - true_theta, true_theta))
import numpy as np
true_theta = 0.7
y = np.random.binomial(n=1, p=true_theta, size=100)
using Distributions
true_θ = 0.7
y = rand(Bernoulli(true_θ), 100)
Likelihood function
For our coin toss, we need to implement a Bernoulli likelihood:

$$p(y \mid \theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}.$$

In practice, it is necessary to work with the log likelihood,

$$\log p(y \mid \theta) = \sum_{i=1}^{n} \left[ y_i \log \theta + (1 - y_i) \log(1 - \theta) \right],$$

to prevent numerical underflow when working with very small probabilities.
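To see the underflow for yourself, here is a quick sketch in Python (the numbers are arbitrary): the product of a couple of thousand moderate probabilities is exactly 0.0 in double precision, while the sum of their logs stays perfectly manageable.

import numpy as np

p = np.full(2000, 0.5)    # 2,000 unremarkable probabilities
print(np.prod(p))         # 0.0 -- the product underflows in double precision
print(np.sum(np.log(p)))  # about -1386.29 -- the log likelihood stays finite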
With the log likelihood in hand, the (unnormalized) log posterior is simply the log likelihood plus the log prior, exactly as outlined above.
log_likelihood <- function(theta, y) {
  sum(y) * log(theta) + sum(y == 0) * log(1 - theta)
}

log_posterior <- function(theta, y, alpha, beta) {
  log_likelihood(theta, y) + dbeta(theta, alpha, beta, log = TRUE)
}
import numpy as np
def log_likelihood(theta, y):
    return np.sum(y) * np.log(theta) + np.sum(y == 0) * np.log(1.0 - theta)

def log_posterior(theta, y, prior):
    return log_likelihood(theta, y) + prior.logpdf(theta)
log_likelihood(θ, y) = sum(y) * log(θ) + sum(y .== 0) * log(1.0 - θ)
log_posterior(θ, y, prior) = log_likelihood(θ, y) + logpdf(prior, θ)
Posterior updating
Updating the posterior is straightforward for a problem as simple as this coin toss.
More complicated problems will force us to resort to more computationally expensive techniques, so-called Markov Chain Monte Carlo sampling.
Analytical posterior
Given the conjugate prior

$$\theta \sim \mathrm{Beta}(\alpha, \beta),$$

the posterior distribution is

$$\theta \mid y \sim \mathrm{Beta}\left(\alpha + \sum_{i} y_i, \; \beta + \sum_{i} (1 - y_i)\right).$$
This is straightforward to code, too:
alpha <- 4
beta <- 4
analytic_posterior <- rbeta(1000, alpha + sum(y), beta + sum(y == 0))
import numpy as np
from scipy import stats

alpha, beta = 4, 4
prior = stats.beta(alpha, beta)
analytic_posterior = stats.beta(alpha + np.sum(y), beta + np.sum(y == 0))
α, β = 4, 4
prior = Beta(α, β)
analytic_posterior = Beta(α + sum(y), β + sum(y .== 0))
Great! We’re done! 🎉 … Wait, why is there more text below?
The bad news is that analytical solutions as simple as the one above are rarely available in real-world scenarios.
This intractability meant that, until fairly recently, Bayesian inference was practical only for those with very powerful computers. Nowadays, however, the average laptop can run the necessary calculations in reasonable time.
MCMC sampling
MCMC (Markov Chain Monte Carlo) sampling is a technique used to estimate probability distributions, particularly when direct calculation is difficult or impossible.
At its core, MCMC sampling works by ‘exploring’ the parameter space (the set of all possible parameter values) more or less randomly. Having found a ‘nice spot’ (parameters with high probability density, meaning they’re more likely given the observed data and prior beliefs), most samplers will tend to hang around and sample a little more. Conversely, if the current spot is terrible, the sampler will quickly move away and try its luck elsewhere.
Each new sample depends only on the current sample, creating what is called a Markov chain¹. This simple rule allows us – after many, many samples – to learn about the posterior distribution without analytically solving it. Tada! We can now approximate complex probability distributions and make inferences about parameters in statistical models which we cannot solve analytically.
The simplest of these samplers is the so-called Metropolis-Hastings algorithm². It is fine if you simply copy this code chunk for the final visualization. For the curious, have a look and try to understand what it is doing – it is surprisingly straightforward!
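In symbols: given the current value $\theta$ and a proposal $\theta'$ drawn from a proposal distribution $q(\cdot \mid \theta)$, the proposal is accepted with probability

$$\alpha_{\text{accept}} = \min\left(1, \frac{p(\theta' \mid y)\, q(\theta \mid \theta')}{p(\theta \mid y)\, q(\theta' \mid \theta)}\right),$$

where the ratio of $q$ terms (the Hastings correction) compensates for asymmetric proposals such as the Beta proposal used below. The normalizing constant of the posterior cancels in the ratio, which is why the unnormalized log posterior from earlier is all we need.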
calculate_accept_prob <- function(theta_proposed, theta_current, y, alpha, beta) {
  log_proposed <- log_posterior(theta_proposed, y, alpha, beta)
  log_current <- log_posterior(theta_current, y, alpha, beta)
  # The Beta proposal is not symmetric, so we include the Hastings correction.
  log_q_forward <- dbeta(theta_proposed, theta_current * 10, (1 - theta_current) * 10, log = TRUE)
  log_q_backward <- dbeta(theta_current, theta_proposed * 10, (1 - theta_proposed) * 10, log = TRUE)
  exp(log_proposed - log_current + log_q_backward - log_q_forward)
}
metropolis_hastings <- function(y, alpha, beta, n_iterations) {
  theta_current <- rbeta(1, alpha, beta)  # initialize from the prior
  samples <- numeric(n_iterations)

  for (i in seq_len(n_iterations)) {
    theta_proposed <- rbeta(1, theta_current * 10, (1 - theta_current) * 10)

    accept_prob <- calculate_accept_prob(theta_proposed, theta_current, y, alpha, beta)

    if (runif(1) < accept_prob) {
      theta_current <- theta_proposed
    }

    samples[i] <- theta_current
  }

  samples
}

n_iterations <- 10000
samples <- metropolis_hastings(y, alpha, beta, n_iterations)
import numpy as np
from scipy import stats
def calculate_accept_prob(theta_proposed, theta_current, y, prior):
    log_proposed = log_posterior(theta_proposed, y, prior)
    log_current = log_posterior(theta_current, y, prior)
    # The Beta proposal is not symmetric, so we include the Hastings correction.
    log_q_forward = stats.beta(theta_current * 10, (1 - theta_current) * 10).logpdf(theta_proposed)
    log_q_backward = stats.beta(theta_proposed * 10, (1 - theta_proposed) * 10).logpdf(theta_current)
    return np.exp(log_proposed - log_current + log_q_backward - log_q_forward)
def metropolis_hastings(y, prior, n_iterations):
    theta_current = prior.rvs()  # initialize from the prior
    samples = np.zeros(n_iterations)

    for i in range(n_iterations):
        theta_proposed = stats.beta(theta_current * 10, (1 - theta_current) * 10).rvs()

        accept_prob = calculate_accept_prob(theta_proposed, theta_current, y, prior)

        if np.random.random() < accept_prob:
            theta_current = theta_proposed

        samples[i] = theta_current

    return samples

n_iterations = 10_000
samples = metropolis_hastings(y, prior, n_iterations)
function calculate_accept_prob(θ_proposed, θ_current, y, prior)
    log_proposed = log_posterior(θ_proposed, y, prior)
    log_current = log_posterior(θ_current, y, prior)
    # The Beta proposal is not symmetric, so we include the Hastings correction.
    log_q_forward = logpdf(Beta(θ_current * 10, (1 - θ_current) * 10), θ_proposed)
    log_q_backward = logpdf(Beta(θ_proposed * 10, (1 - θ_proposed) * 10), θ_current)
    return exp(log_proposed - log_current + log_q_backward - log_q_forward)
end
function metropolis_hastings(y, prior, n_iterations)
    θ_current = rand(prior)  # initialize from the prior
    samples = zeros(n_iterations)

    for i in 1:n_iterations
        θ_proposed = rand(Beta(θ_current * 10, (1 - θ_current) * 10))

        accept_prob = calculate_accept_prob(θ_proposed, θ_current, y, prior)

        if rand() < accept_prob
            θ_current = θ_proposed
        end

        samples[i] = θ_current
    end

    return samples
end

n_iterations = 10_000
samples = metropolis_hastings(y, prior, n_iterations)
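One detail glossed over above: the first draws still ‘remember’ the arbitrary starting value, so in practice it is common to discard an initial ‘burn-in’ (or warm-up) portion of the chain before using the samples. A minimal sketch in Python, assuming the samples array from above and an arbitrary cutoff of 1,000 draws:

samples = samples[1000:]  # discard burn-in; the cutoff is a judgment call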
Visualization
Visualization is a cornerstone of the Bayesian workflow and a great aid to interpretation. Typically, we would not only plot prior and posterior predictions, but also check the MCMC sampling for issues. We will have a look at that in a separate tutorial.
library(ggplot2)
df <- data.frame(theta = samples)
ggplot(df, aes(x = theta)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30, fill = "skyblue", color = "black"
  ) +
  stat_function(
    aes(color = "Analytical Posterior"),
    fun = dbeta,
    args = list(
      shape1 = alpha + sum(y),
      shape2 = beta + sum(y == 0)
    ),
    linewidth = 1
  ) +
  stat_function(
    aes(color = "Prior"),
    fun = dbeta,
    args = list(shape1 = alpha, shape2 = beta),
    linewidth = 1
  ) +
  labs(x = "theta", y = "Density", title = "Posterior Distribution of theta") +
  theme_minimal() +
  scale_color_manual(values = c(
    "Analytical Posterior" = "red",
    "Prior" = "green"
  )) +
  guides(color = guide_legend(title = NULL))
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.histplot(samples, stat="density", label="Posterior", kde=False, color="skyblue", bins=30)
theta = np.linspace(0, 1, 100)
plt.plot(theta, analytic_posterior.pdf(theta), label="Analytic Posterior", linewidth=2)
plt.plot(theta, prior.pdf(theta), label="Prior", linewidth=2)
plt.xlabel("Theta")plt.ylabel("Density")plt.title("Posterior Distribution of Theta")
plt.legend()
plt.show()
using StatsPlots
histogram(samples, normalize=:pdf, label="Posterior")
plot!(analytic_posterior, label="Analytical Posterior", linewidth=2)
plot!(prior, label="Prior", linewidth=2)
xlabel!("θ")
ylabel!("Density")
title!("Posterior Distribution of θ")
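As a small preview of those sampler checks, here is a minimal sketch in Python (assuming the samples array from the Metropolis-Hastings run above): a trace plot of a healthy chain should look like stationary noise, and the acceptance rate gives a first hint of whether the proposal distribution is reasonable.

import matplotlib.pyplot as plt
import numpy as np

# With a continuous proposal, a repeated value means a rejected proposal,
# so the fraction of iterations where the chain moved estimates the acceptance rate.
acceptance_rate = np.mean(np.diff(samples) != 0)
print(f"Acceptance rate: {acceptance_rate:.2f}")

plt.plot(samples)
plt.xlabel("Iteration")
plt.ylabel("theta")
plt.title("Trace plot")
plt.show()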
Footnotes

1. The Markov property dictates that each sample may only depend on the value of the previous sample.
2. Nowadays, the preferred sampling algorithm is typically an extension of the Hamiltonian Monte Carlo (HMC) algorithm called the No-U-Turn Sampler (NUTS).