To start the year off with something easy to digest, I found this post I prepared maybe ~4 years ago but never finished. The main focus is to walk through, as simply as possible, the basic idea of Bayesian parameter estimation.
Assume you want to know the most likely values for the slope and intercept of a line, $y=mx+b$, trending through some data. How would you assign a probability to those parameters? And how would you update it if you get new data that the line should trend through?
There are two ways to think about this. The first is to extract from the data the underlying statistics for all possible slopes and intercepts that could make up lines trending with the data. We could then find the slope and intercept that are, on average, "closest" to all data points and take these as our estimates. Furthermore, we can quantify the average disagreement between the predicted value $y'=mx+b$ and the actual data set, $\{x,y\}$. What I described is just linear regression, nothing exciting at all, but the key point is that I didn't use probabilities; I used the statistics of the data. What if I use probabilities instead, that is, what if I ask for the probability of $m$ and $b$ given the data $\{x,y\}$? Phrased in mathematical terms:
$$P(m,b \vert x,y)$$
which is called the joint conditional probability distribution for $m$ and $b$. It simply states the probability of both $m$ and $b$ given the observed data set. The useful thing about probabilities is that you can rewrite them in terms of conditional and marginal probabilities. For example, applying Bayes' theorem, I can write
$$ \begin{equation} P(m,b \vert x,y) = \frac{P(y \vert x,m,b)P(m,b)}{P(y \vert x)}. \label{eq:posterior} \end{equation} $$
This provides the posterior distribution for the slope and intercept given the data set. The term $P(y \vert x)$ is the marginal likelihood, which acts as a normalizing constant. First, consider the likelihood, $P(y \vert x,m,b)$, which says something about how reasonable $y$ is given the parameters and $x$. I'm going to assume a normal distribution over each value $y_i$¹, where $i$ indexes the data points. This means we can write the joint probability over all $y$ as a product, $\prod_i P(y_i \vert x_i,m,b)$, with the mean of each normal distribution given by our line. Thus what naturally comes out is the following:
$$ \begin{equation} P(y \vert x,m,b) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{\sum_i{(y_i - (mx_i + b))^2}}{2\sigma^2}\right) \label{eq:likelihood} \end{equation} $$
where $N$ is the number of data points. So this is the likelihood: given $x$, $m$, and $b$, it tells us the probability of observing $y$. As you can see from the numerator in the exponent of eq. \ref{eq:likelihood}, it is dictated by the least-squares objective (i.e., the sum of squared residuals). Here I just gave the likelihood probability density function without showing how it comes about. In essence, it follows from the affine transformation property of the normal distribution² and the assumption that the random noise in each observed $y_i$ is drawn from a zero-mean normal, i.e., $\mathscr{N}(0,\sigma^2)$.
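For completeness, here is a brief sketch of where eq. \ref{eq:likelihood} comes from, assuming the noise terms $\epsilon_i$ are independent across data points:
$$ y_i = mx_i + b + \epsilon_i, \quad \epsilon_i \sim \mathscr{N}(0,\sigma^2) \quad\Rightarrow\quad y_i \vert x_i,m,b \sim \mathscr{N}(mx_i + b,\,\sigma^2), $$
$$ P(y \vert x,m,b) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - (mx_i + b))^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{\sum_i(y_i - (mx_i + b))^2}{2\sigma^2}\right). $$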
Now, what about the other probabilities in eq. \ref{eq:posterior}, mainly the joint prior probability for $m$ and $b$? First, assume these are independent random variables, so the joint prior can be written as a product. Second, if we assume a uniform prior for both, then the prior just becomes a constant scaling factor, which we can ignore. Therefore the posterior is simply proportional to the likelihood, $P(m,b \vert x,y) \propto P(y \vert x,m,b)$, and it gives our updated belief about the parameters $m$ and $b$ after observing the data $y$.
In practice, it is easier to work with the log-likelihood (i.e., the natural log of eq. \ref{eq:likelihood}) and maximize that [1]. How do you maximize it? Conveniently, we can take the derivatives of the log-likelihood with respect to the parameters, set them equal to zero, and solve for the values. This is the method of maximum likelihood estimation, and the resulting $m$ and $b$ can be recast in terms of the sample means, variance, and covariance of the data.
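As a minimal sketch (the function name is my own), here is what those closed-form estimates look like in code:
import numpy as np
def mle_line_fit(x, y):
    # Closed-form maximum likelihood estimates for y = m*x + b
    # under i.i.d. zero-mean Gaussian noise (equivalent to least squares).
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sample covariance of (x, y) divided by sample variance of x
    m_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: places the fitted line through the point of means
    b_hat = y_bar - m_hat * x_bar
    return m_hat, b_hat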
Quantifying Uncertainty in Parameters
In Bayesian parameter estimation, the posterior distribution not only provides the most likely values for the parameters $m$ and $b$ but also quantifies the uncertainty around these estimates. This is a key advantage of the Bayesian approach. We can make statements related to:
- Credible Intervals: A credible interval is a range of values within which the parameter lies with a certain probability. For example, a 95% credible interval for $m$ indicates a 95% probability that the true value of $m$ lies within this interval.
- Posterior Variance: The variance of the posterior distribution provides a measure of uncertainty. A smaller variance indicates more certainty about the parameter estimates, while a larger variance indicates greater uncertainty.
For complex models, we often have to use sampling methods like Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. These methods allow us to estimate the mean, variance, and credible intervals of the parameters. Much of modern Bayesian inference is concerned with how to sample from, or otherwise approximate, the posterior efficiently [2].
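As a rough sketch of what that looks like for our line model, here is a toy random-walk Metropolis sampler (the function names, step size, and chain length are choices of my own, not a tuned implementation). It uses the same fake data as the numerical example below and reads the posterior mean, variance, and a 95% credible interval directly off the samples:
import numpy as np
def log_posterior(m, b, x, y, sigma=1.0):
    # Log of the likelihood above, up to a constant; flat priors on m and b
    return -0.5 * np.sum((y - (m * x + b)) ** 2) / sigma**2
def metropolis(x, y, n_steps=20000, step=0.02, seed=0):
    rng = np.random.default_rng(seed)
    m, b = np.polyfit(x, y, 1)           # start at the least-squares fit
    logp = log_posterior(m, b, x, y)
    samples = np.empty((n_steps, 2))
    for i in range(n_steps):
        m_new = m + step * rng.normal()  # symmetric random-walk proposal
        b_new = b + step * rng.normal()
        logp_new = log_posterior(m_new, b_new, x, y)
        if np.log(rng.uniform()) < logp_new - logp:   # Metropolis accept/reject
            m, b, logp = m_new, b_new, logp_new
        samples[i] = m, b
    return samples[n_steps // 4:]        # discard the first quarter as burn-in
# Same fake data as in the numerical example below
np.random.seed(42)
x = np.linspace(0, 10, 200)
y = 2.5 * x + 1.0 + np.random.normal(0, 1.0, size=len(x))
samples = metropolis(x, y)
print("posterior mean (m, b):", samples.mean(axis=0))
print("posterior variance (m, b):", samples.var(axis=0))
print("95% credible interval for m:", np.percentile(samples[:, 0], [2.5, 97.5]))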
By analyzing the posterior distribution, we gain insights into both the most likely parameter values and the uncertainty associated with these estimates, enabling more informed decision-making based on the data.
💻 Basic Numerical Example
Here is some fake linear data with noise; we want to find the most likely values for the slope and intercept by evaluating the (unnormalized) posterior on a grid of candidate values.
import numpy as np
import matplotlib.pyplot as plt
# Fake data: a line y = 2.5x + 1 with additive Gaussian noise
np.random.seed(42)
x = np.linspace(0, 10, 200)
y = 2.5 * x + 1.0
y += np.random.normal(0, 1.0, size=len(x))
# Evaluate the log-posterior (flat priors, sigma = 1) on a grid of (m, b)
m_vals = np.linspace(2.45, 2.6, 100)
b_vals = np.linspace(0.4, 1.2, 100)
M, B = np.meshgrid(m_vals, b_vals)
log_post = np.array(
    [
        [-0.5 * np.sum((y - (m * x + b)) ** 2) for m in m_vals]
        for b in b_vals
    ]
)
# Exponentiate after subtracting the maximum to avoid underflow, then normalize
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()
# Plot the posterior surface with the true parameters marked
plt.figure(figsize=(8, 6))
plt.contourf(M, B, posterior, levels=30, cmap='viridis')
plt.colorbar(label='Density', format='%.0e')
plt.scatter(2.5, 1.0, color='red', marker='x', label='True (m, b)')
plt.xlabel('Slope (m)')
plt.ylabel('Intercept (b)')
plt.legend()
plt.show()
Posterior distribution over slope and intercept.
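As a quick sanity check (a small addition using the arrays defined above), the maximum a posteriori values can be read straight off the grid:
i, j = np.unravel_index(np.argmax(posterior), posterior.shape)
print("MAP estimate (m, b):", m_vals[j], b_vals[i])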
Summarizing
Parameter estimation can be approached from both frequentist and Bayesian perspectives. While linear regression provides a straightforward method to estimate parameters using data statistics, Bayesian methods offer a probabilistic framework that incorporates prior knowledge and updates beliefs with new data. Understanding these methods enhances our ability to make informed predictions and decisions based on data.
Footnotes
1. For independent outcomes, you can always write the joint probability as a product. ↩
2. A multivariate normal distribution stays normal under an affine transformation (a linear map plus a shift), with the mean and covariance changing in a predictable way. ↩
References
[1] R. Garnett, Bayesian Optimization, 2023. URL.
[2] G.M. Martin, D.T. Frazier, C.P. Robert, Computing Bayes: Bayesian Computation from 1763 to the 21st Century, 2020. DOI.