Bayesian inference
Bayesian inference formalizes how to optimally update knowledge in light of observed data.
In math-speak, we can express this as

$$p(\theta \mid y) \propto p(y \mid \theta) \, p(\theta).$$
This one’s important, so let’s take this step by step:
- the probability of a parameter configuration given the data, the posterior $p(\theta \mid y)$,
- is proportional to
- the probability of the data given the parameter configuration, the likelihood $p(y \mid \theta)$,
- multiplied by the probability of the parameter configuration, the prior $p(\theta)$.
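To make this concrete, here is a minimal numerical sketch in Python (all numbers are made up for illustration): with only three candidate coin biases, multiplying likelihood and prior and then normalizing turns the proportionality into actual probabilities.

import numpy as np

thetas = np.array([0.3, 0.5, 0.7])   # three hypothetical coin biases
prior = np.array([0.25, 0.5, 0.25])  # prior beliefs over those biases

# Likelihood of one particular sequence of 8 heads and 2 tails
# (the binomial coefficient is omitted; it cancels when normalizing).
likelihood = thetas**8 * (1 - thetas)**2

unnormalized = likelihood * prior              # right-hand side of Bayes' theorem
posterior = unnormalized / unnormalized.sum()  # normalize to get probabilities
print(posterior)  # most of the mass now sits on theta = 0.7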
So much for the theory. Let's implement this ourselves to get the hang of it!
Probabilistic coins
Let’s begin with the all-time classic of inferring the probability of a coin landing heads.
Before all that, though, let's simulate some data.
true_theta <- 0.7
y <- sample(c(0, 1), 100, replace = TRUE, prob = c(1 - true_theta, true_theta))
import numpy as np
true_theta = 0.7
y = np.random.binomial(n=1, p=true_theta, size=100)
using Distributions
true_θ = 0.7
y = rand(Bernoulli(true_θ), 100)
Likelihood function
For our coin toss, we need to implement a Bernoulli likelihood:

$$p(y \mid \theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}.$$

In practice, it is necessary to work with the log likelihood,

$$\log p(y \mid \theta) = \sum_{i=1}^{n} \left[ y_i \log \theta + (1 - y_i) \log(1 - \theta) \right],$$

to prevent numerical underflow when working with very small probabilities.
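To see the underflow for yourself, here is a quick sketch in Python (the numbers are arbitrary): the product of a couple of thousand moderate probabilities is exactly 0.0 in double precision, while the sum of their logs stays perfectly manageable.

import numpy as np

p = np.full(2000, 0.5)    # 2,000 unremarkable probabilities
print(np.prod(p))         # 0.0 -- the product underflows in double precision
print(np.sum(np.log(p)))  # about -1386.29 -- the log likelihood stays finite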
With the log likelihood in hand, the (unnormalized) log posterior is simply the log likelihood plus the log prior, exactly as outlined above.
log_likelihood <- function(theta, y) {
  sum(y) * log(theta) + sum(y == 0) * log(1 - theta)
}

log_posterior <- function(theta, y, alpha, beta) {
  log_likelihood(theta, y) + dbeta(theta, alpha, beta, log = TRUE)
}
import numpy as np
def log_likelihood(theta, y):
    return np.sum(y) * np.log(theta) + np.sum(y == 0) * np.log(1.0 - theta)

def log_posterior(theta, y, prior):
    return log_likelihood(theta, y) + prior.logpdf(theta)
log_likelihood(θ, y) = sum(y) * log(θ) + sum(y .== 0) * log(1.0 - θ)
log_posterior(θ, y, prior) = log_likelihood(θ, y) + logpdf(prior, θ)
Posterior updating
Updating the posterior is straightforward for a problem as simple as this coin toss.
More complicated problems will force us to resort to more computationally expensive techniques, so-called Markov Chain Monte Carlo sampling.
Analytical posterior
Given the conjugate prior

$$\theta \sim \mathrm{Beta}(\alpha, \beta),$$

the posterior distribution is

$$\theta \mid y \sim \mathrm{Beta}\left(\alpha + \sum_{i} y_i, \; \beta + \sum_{i} (1 - y_i)\right).$$
This is straightforward to code, too:
alpha <- 4
beta <- 4
analytic_posterior <- rbeta(1000, alpha + sum(y), beta + sum(y == 0))
import numpy as np
from scipy import stats

alpha, beta = 4, 4
prior = stats.beta(alpha, beta)
analytic_posterior = stats.beta(alpha + np.sum(y), beta + np.sum(y == 0))
α, β = 4, 4
prior = Beta(α, β)
analytic_posterior = Beta(α + sum(y), β + sum(y .== 0))
Great! We’re done! 🎉 … Wait, why is there more text below?
The bad news is that analytical solutions as simple as the one above are rarely available in real-world scenarios.
This intractability meant that, until fairly recently, Bayesian inference was practical only for those with very powerful computers. Nowadays, however, the average laptop can run the necessary calculations in reasonable time.
MCMC sampling
MCMC (Markov Chain Monte Carlo) sampling is a technique used to estimate probability distributions, particularly when direct calculation is difficult or impossible.
At its core, MCMC sampling works by ‘exploring’ the parameter space (the set of all possible parameter values) more or less randomly. Having found a ‘nice spot’ (parameters with high probability density, meaning they’re more likely given the observed data and prior beliefs), most samplers will tend to hang around and sample a little more. Conversely, if the current spot is terrible, the sampler will quickly move away and try its luck elsewhere.
Each new sample depends only on the current sample, creating what is called a Markov chain¹. This simple rule allows us – after many, many samples – to learn about the posterior distribution without analytically solving it. Tada! We can now approximate complex probability distributions and make inferences about parameters in statistical models which we cannot solve analytically.
The simplest of these samplers is the so-called Metropolis-Hastings algorithm². It is fine if you simply copy this code chunk for the final visualization. For the curious, have a look and try to understand what it is doing – it is surprisingly straightforward!
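In symbols: given the current value $\theta$ and a proposal $\theta'$ drawn from a proposal distribution $q(\cdot \mid \theta)$, the proposal is accepted with probability

$$\alpha_{\text{accept}} = \min\left(1, \frac{p(\theta' \mid y)\, q(\theta \mid \theta')}{p(\theta \mid y)\, q(\theta' \mid \theta)}\right),$$

where the ratio of $q$ terms (the Hastings correction) compensates for asymmetric proposals such as the Beta proposal used below. The normalizing constant of the posterior cancels in the ratio, which is why the unnormalized log posterior from earlier is all we need.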
calculate_accept_prob <- function(theta_proposed, theta_current, y, alpha, beta) {
  log_proposed <- log_posterior(theta_proposed, y, alpha, beta)
  log_current <- log_posterior(theta_current, y, alpha, beta)
  # The Beta proposal is not symmetric, so we include the Hastings correction.
  log_q_forward <- dbeta(theta_proposed, theta_current * 10, (1 - theta_current) * 10, log = TRUE)
  log_q_backward <- dbeta(theta_current, theta_proposed * 10, (1 - theta_proposed) * 10, log = TRUE)
  exp(log_proposed - log_current + log_q_backward - log_q_forward)
}
metropolis_hastings <- function(y, alpha, beta, n_iterations) {
  theta_current <- rbeta(1, alpha, beta)  # initialize from the prior
  samples <- numeric(n_iterations)

  for (i in seq_len(n_iterations)) {
    theta_proposed <- rbeta(1, theta_current * 10, (1 - theta_current) * 10)

    accept_prob <- calculate_accept_prob(theta_proposed, theta_current, y, alpha, beta)

    if (runif(1) < accept_prob) {
      theta_current <- theta_proposed
    }

    samples[i] <- theta_current
  }

  samples
}

n_iterations <- 10000
samples <- metropolis_hastings(y, alpha, beta, n_iterations)
import numpy as np
from scipy import stats
def calculate_accept_prob(theta_proposed, theta_current, y, prior):
    log_proposed = log_posterior(theta_proposed, y, prior)
    log_current = log_posterior(theta_current, y, prior)
    # The Beta proposal is not symmetric, so we include the Hastings correction.
    log_q_forward = stats.beta(theta_current * 10, (1 - theta_current) * 10).logpdf(theta_proposed)
    log_q_backward = stats.beta(theta_proposed * 10, (1 - theta_proposed) * 10).logpdf(theta_current)
    return np.exp(log_proposed - log_current + log_q_backward - log_q_forward)
def metropolis_hastings(y, prior, n_iterations):
    theta_current = prior.rvs()  # initialize from the prior
    samples = np.zeros(n_iterations)

    for i in range(n_iterations):
        theta_proposed = stats.beta(theta_current * 10, (1 - theta_current) * 10).rvs()

        accept_prob = calculate_accept_prob(theta_proposed, theta_current, y, prior)

        if np.random.random() < accept_prob:
            theta_current = theta_proposed

        samples[i] = theta_current

    return samples

n_iterations = 10_000
samples = metropolis_hastings(y, prior, n_iterations)
function calculate_accept_prob(θ_proposed, θ_current, y, prior)
    log_proposed = log_posterior(θ_proposed, y, prior)
    log_current = log_posterior(θ_current, y, prior)
    # The Beta proposal is not symmetric, so we include the Hastings correction.
    log_q_forward = logpdf(Beta(θ_current * 10, (1 - θ_current) * 10), θ_proposed)
    log_q_backward = logpdf(Beta(θ_proposed * 10, (1 - θ_proposed) * 10), θ_current)
    return exp(log_proposed - log_current + log_q_backward - log_q_forward)
end
function metropolis_hastings(y, prior, n_iterations)
    θ_current = rand(prior)  # initialize from the prior
    samples = zeros(n_iterations)

    for i in 1:n_iterations
        θ_proposed = rand(Beta(θ_current * 10, (1 - θ_current) * 10))

        accept_prob = calculate_accept_prob(θ_proposed, θ_current, y, prior)

        if rand() < accept_prob
            θ_current = θ_proposed
        end

        samples[i] = θ_current
    end

    return samples
end

n_iterations = 10_000
samples = metropolis_hastings(y, prior, n_iterations)
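One detail glossed over above: the first draws still ‘remember’ the arbitrary starting value, so in practice it is common to discard an initial ‘burn-in’ (or warm-up) portion of the chain before using the samples. A minimal sketch in Python, assuming the samples array from above and an arbitrary cutoff of 1,000 draws:

samples = samples[1000:]  # discard burn-in; the cutoff is a judgment call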
Visualization
Visualization is a cornerstone of the Bayesian workflow and a great aid to interpretation. Typically, we would not only plot prior and posterior predictions, but also check the MCMC sampling for issues. We will have a look at that in a separate tutorial.
library(ggplot2)
df <- data.frame(theta = samples)
ggplot(df, aes(x = theta)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30, fill = "skyblue", color = "black"
  ) +
  stat_function(
    aes(color = "Analytical Posterior"),
    fun = dbeta,
    args = list(
      shape1 = alpha + sum(y),
      shape2 = beta + sum(y == 0)
    ),
    linewidth = 1
  ) +
  stat_function(
    aes(color = "Prior"),
    fun = dbeta,
    args = list(shape1 = alpha, shape2 = beta),
    linewidth = 1
  ) +
  labs(x = "theta", y = "Density", title = "Posterior Distribution of theta") +
  theme_minimal() +
  scale_color_manual(values = c(
    "Analytical Posterior" = "red",
    "Prior" = "green"
  )) +
  guides(color = guide_legend(title = NULL))
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.histplot(samples, stat="density", label="Posterior", kde=False, color="skyblue", bins=30)
theta = np.linspace(0, 1, 100)
plt.plot(theta, analytic_posterior.pdf(theta), label="Analytic Posterior", linewidth=2)
plt.plot(theta, prior.pdf(theta), label="Prior", linewidth=2)
plt.xlabel("Theta")plt.ylabel("Density")plt.title("Posterior Distribution of Theta")
plt.legend()
plt.show()
using StatsPlots
histogram(samples, normalize=:pdf, label="Posterior")
plot!(analytic_posterior, label="Analytical Posterior", linewidth=2)
plot!(prior, label="Prior", linewidth=2)
xlabel!("θ")
ylabel!("Density")
title!("Posterior Distribution of θ")
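As a small preview of those sampler checks, here is a minimal sketch in Python (assuming the samples array from the Metropolis-Hastings run above): a trace plot of a healthy chain should look like stationary noise, and the acceptance rate gives a first hint of whether the proposal distribution is reasonable.

import matplotlib.pyplot as plt
import numpy as np

# With a continuous proposal, a repeated value means a rejected proposal,
# so the fraction of iterations where the chain moved estimates the acceptance rate.
acceptance_rate = np.mean(np.diff(samples) != 0)
print(f"Acceptance rate: {acceptance_rate:.2f}")

plt.plot(samples)
plt.xlabel("Iteration")
plt.ylabel("theta")
plt.title("Trace plot")
plt.show()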
Footnotes

1. The Markov property dictates that each sample may only depend on the value of the previous sample.
2. Nowadays, the preferred sampling algorithm is typically an extension of the Hamiltonian Monte Carlo (HMC) algorithm called the No-U-Turn Sampler (NUTS).