Center for Quantitative Genetics and Genomics
Aarhus University
\[
y = X\beta + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_n)
\]
- \(y\): outcomes
- \(X\): design matrix
- \(\beta\): regression coefficients
- \(e\): residuals
- \(\sigma^2\): residual variance
Regression effects: \[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]
Residual variance: \[ \hat{\sigma}^2 = \frac{1}{n-p}\sum_i (y_i - x_i^\top \hat{\beta})^2 \]
Inference via standard errors and \(t\)-tests, confidence intervals, and prediction intervals.
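As a small illustration (not part of the slides), these OLS quantities can be computed directly with NumPy; the simulated data and dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS coefficients: (X'X)^{-1} X'y, computed via a least-squares solve for stability
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased residual variance: SSE / (n - p)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
```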
Bayesian linear regression starts with the same model structure as classical linear regression.
\[ y = X\beta + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_n) \]
Because the residuals are Gaussian, \( e \sim \mathcal{N}(0, \sigma^2 I_n) \), the marginal distribution of \(y\) is:
\[ y \sim \mathcal{N}(X\beta, \sigma^2 I_n) \]
This defines the likelihood, the probability of the observed data given the parameters \(\beta\) and \(\sigma^2\):
\[ p(y \mid X, \beta, \sigma^2) = \mathcal{N}(X\beta, \sigma^2 I_n) \]
In Bayesian linear regression, we specify prior distributions that express our beliefs about parameters before seeing the data.
A common conjugate prior for the regression coefficients is:
\[ \beta \mid \sigma_b^2 \sim \mathcal{N}(0, \sigma_b^2 I_p) \]
This reflects the belief that most effect sizes are small and centered near zero — consistent with the polygenic assumption in genetics.
The parameter \(\sigma_b^2\) acts as a shrinkage (regularization) parameter:
it controls how strongly the coefficients are pulled toward zero and is often treated as an unknown hyperparameter estimated from the data.
We also place priors on the variance components to complete the hierarchical model.
\[ \sigma_b^2 \mid S_b, v_b \sim S_b \, \chi^{-2}(v_b), \quad \sigma^2 \mid S, v \sim S \, \chi^{-2}(v) \]
Here, \(S_b, v_b\) and \(S, v\) are the prior scale and degrees-of-freedom hyperparameters for \(\sigma_b^2\) and \(\sigma^2\), respectively.
Conjugate priors keep posteriors in the same family
(e.g., scaled inverse-chi-squared), allowing closed-form Gibbs updates.
They also serve as regularizers, shrinking the variance estimates toward the prior scales when the data are weakly informative.
Thus, conjugate priors make Bayesian linear regression efficient and stable.
In Bayesian analysis, we combine the likelihood and priors using Bayes’ rule to obtain the joint posterior:
\[ p(\beta, \sigma_b^2, \sigma^2 \mid y) \propto p(y \mid \beta, \sigma^2)\; p(\beta \mid \sigma_b^2)\; p(\sigma_b^2)\; p(\sigma^2) \]
This posterior captures all updated knowledge about the unknown parameters after observing the data.
It forms the basis for computing posterior means, credible intervals, and predictions.
With conjugate priors, each parameter’s full conditional distribution has a closed-form solution.
This makes Gibbs sampling a natural and efficient inference method.
Gibbs sampling thus provides an easy way to approximate the full posterior in Bayesian linear regression.
Given \(\sigma^2\), \(\sigma_b^2\), and the data \(y\), the regression coefficients have a multivariate normal conditional posterior:
\[ \beta \mid \sigma^2, \sigma_b^2, y \sim \mathcal{N}(\mu_\beta, \Sigma_\beta) \]
where
\[ \Sigma_\beta = \left( \frac{X^\top X}{\sigma^2} + \frac{I}{\sigma_b^2} \right)^{-1}, \quad \mu_\beta = \Sigma_\beta \frac{X^\top y}{\sigma^2} \]
This distribution represents our updated belief about \(\beta\)
after observing the data, while holding \(\sigma_b^2\) and \(\sigma^2\) fixed.
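A minimal sketch of this draw in NumPy, assuming `X`, `y`, `sigma2`, and `sigma_b2` are already defined (for example from the simulation above):

```python
import numpy as np

def draw_beta(X, y, sigma2, sigma_b2, rng):
    """Draw beta from N(mu_beta, Sigma_beta) given sigma2 and sigma_b2."""
    p = X.shape[1]
    # Posterior precision: X'X / sigma2 + I / sigma_b2
    precision = X.T @ X / sigma2 + np.eye(p) / sigma_b2
    Sigma_beta = np.linalg.inv(precision)
    mu_beta = Sigma_beta @ (X.T @ y / sigma2)
    return rng.multivariate_normal(mu_beta, Sigma_beta)
```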
In classical regression, the OLS estimator is
\[ \hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y, \quad y \sim \mathcal{N}(X\beta, \sigma^2 I) \]
The estimate of \(\beta\) does not depend on \(\sigma^2\),
since \(\sigma^2\) scales the likelihood but does not change where it is maximized.
In Bayesian regression, \(\sigma^2\) appears explicitly in the posterior:
\[ \Sigma_\beta = \left( \frac{X^\top X}{\sigma^2} + \frac{I}{\sigma_b^2} \right)^{-1}, \quad \mu_\beta = \Sigma_\beta \frac{X^\top y}{\sigma^2} \]
The term \(\frac{I}{\sigma_b^2}\) introduces shrinkage, regularizing estimates and stabilizing inference, especially when \(p > n\) or predictors are highly correlated.
Thus, the Bayesian posterior mean is a regularized, uncertainty-aware generalization of OLS.
Instead of sampling \(\beta\) jointly, we can update each coefficient \(\beta_j\) one at a time, holding all others fixed. This is efficient for large \(p\) and for spike-and-slab models.
Let \(X_j\) be the \(j\)th column of \(X\) and define the partial residual:
\[ r_j = y - X_{-j} \beta_{-j} \]
Then the conditional posterior for \(\beta_j\) is univariate normal:
\[ \beta_j \mid D \sim \mathcal{N} \!\left( \frac{X_j^\top r_j}{X_j^\top X_j + \sigma^2 / \sigma_b^2},\; \frac{\sigma^2}{X_j^\top X_j + \sigma^2 / \sigma_b^2} \right) \]
This corresponds to a regularized least-squares update. Residual updates avoid matrix inversion, scale to high dimensions, and extend naturally to sparse (spike-and-slab) models.
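A sketch of this single-site update, looping over coefficients and maintaining the residual so that no matrix inversion is needed (the function name and structure are illustrative, not a reference implementation):

```python
import numpy as np

def draw_beta_single_site(X, y, beta, sigma2, sigma_b2, rng):
    """Update each beta_j in turn from its univariate normal conditional."""
    r = y - X @ beta                      # full residual
    for j in range(X.shape[1]):
        r += X[:, j] * beta[j]            # partial residual r_j = y - X_{-j} beta_{-j}
        denom = X[:, j] @ X[:, j] + sigma2 / sigma_b2
        mean = X[:, j] @ r / denom
        var = sigma2 / denom
        beta[j] = rng.normal(mean, np.sqrt(var))
        r -= X[:, j] * beta[j]            # restore full residual with the new beta_j
    return beta
```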
The conditional distribution of the prior variance \(\sigma_b^2\), given \(\beta\) and the hyperparameters, is a scaled inverse-chi-squared:
\[ \sigma_b^2 \mid \beta \sim \tilde{S}_b \, \chi^{-2}(\tilde{v}_b) \]
where
\[ \tilde{v}_b = v_b + p, \quad \tilde{S}_b = \frac{\beta^\top \beta + v_b S_b}{\tilde{v}_b} \]
At each Gibbs iteration, \(\sigma_b^2\) is sampled directly given \(\beta\). This update reflects our revised belief about the variability of effect sizes after observing the current posterior draw of \(\beta\).
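A sketch of this draw, using the fact that a scaled inverse-\(\chi^2\) variate can be generated as \(\tilde{v}_b \tilde{S}_b / \chi^2_{\tilde{v}_b}\):

```python
import numpy as np

def draw_sigma_b2(beta, S_b, v_b, rng):
    """Draw sigma_b^2 from its scaled inverse-chi-squared conditional."""
    v_tilde = v_b + beta.size
    # v_tilde * S_tilde = beta'beta + v_b * S_b, so divide by a chi-squared draw
    return (beta @ beta + v_b * S_b) / rng.chisquare(v_tilde)
```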
The conditional distribution of the residual variance \(\sigma^2\),
given \(\beta\) and the data, is also scaled inverse-chi-squared:
\[ \sigma^2 \mid \beta, y \sim \tilde{S} \, \chi^{-2}(\tilde{v}) \]
where
\[ \tilde{v} = v + n, \quad \tilde{S} = \frac{(y - X\beta)^\top (y - X\beta) + v S}{\tilde{v}} \]
At each Gibbs iteration, \(\sigma^2\) is sampled directly given \(\beta\).
This captures our updated belief about the residual variability
after accounting for the current linear predictor \(X\beta\).
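The corresponding sketch for the residual variance:

```python
import numpy as np

def draw_sigma2(y, X, beta, S, v, rng):
    """Draw sigma^2 from its scaled inverse-chi-squared conditional."""
    resid = y - X @ beta
    v_tilde = v + y.size
    # v_tilde * S_tilde = SSE + v * S, so divide by a chi-squared draw
    return (resid @ resid + v * S) / rng.chisquare(v_tilde)
```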
Bayesian inference often involves complex posteriors that lack closed-form solutions. To approximate these, we use Markov Chain Monte Carlo (MCMC) methods.
MCMC builds a Markov chain whose stationary distribution is the target posterior. Once the chain has converged, its samples can be used to estimate posterior means, credible intervals, and predictive distributions.
Among MCMC algorithms, the Gibbs sampler is especially useful when all full conditional distributions are available in closed form.
For Bayesian linear regression with conjugate priors, the joint posterior is:
\[ p(\beta, \sigma_b^2 , \sigma^2 \mid y) \propto p(y \mid \beta, \sigma^2)\; p(\beta \mid \sigma_b^2)\; p(\sigma_b^2)\; p(\sigma^2) \]
We iteratively draw from the following full conditionals:
- \(\beta \mid \sigma^2, \sigma_b^2, y\) (multivariate Normal)
- \(\sigma_b^2 \mid \beta\) (scaled inverse-\(\chi^2\))
- \(\sigma^2 \mid \beta, y\) (scaled inverse-\(\chi^2\))
Each step updates one parameter given the latest values of the others. Repeating this sequence yields samples from the joint posterior \(p(\beta, \sigma_b^2, \sigma^2 \mid y)\).
Because each conditional is standard (Normal or scaled inverse-\(\chi^2\)), Gibbs sampling is both efficient and easy to implement.
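Putting the conditional draws together gives a compact Gibbs sampler. This sketch reuses the helper functions defined above; the hyperparameter values are arbitrary defaults, not recommendations.

```python
import numpy as np

def gibbs_blr(X, y, n_iter=2000, S_b=0.1, v_b=4, S=1.0, v=4, seed=0):
    """Gibbs sampler for Bayesian linear regression with conjugate priors."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, sigma2, sigma_b2 = np.zeros(p), 1.0, 1.0
    draws = {"beta": np.empty((n_iter, p)),
             "sigma2": np.empty(n_iter),
             "sigma_b2": np.empty(n_iter)}
    for t in range(n_iter):
        beta = draw_beta_single_site(X, y, beta, sigma2, sigma_b2, rng)
        sigma_b2 = draw_sigma_b2(beta, S_b, v_b, rng)
        sigma2 = draw_sigma2(y, X, beta, S, v, rng)
        draws["beta"][t] = beta
        draws["sigma2"][t] = sigma2
        draws["sigma_b2"][t] = sigma_b2
    return draws

# Example usage with the simulated X, y from earlier:
# draws = gibbs_blr(X, y)
```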
After running the Gibbs sampler, we obtain posterior draws \(\{\theta^{(t)}\}_{t=1}^T\) for parameters such as \(\beta_j\), \(\sigma^2\), or \(\sigma_b^2\).
We summarize the posterior distribution via:
- posterior means or medians (point estimates)
- posterior standard deviations
- credible intervals (e.g., the 2.5% and 97.5% quantiles)
These summaries describe the most probable values of \(\theta\)
and their uncertainty after combining data and prior beliefs.
Bayesian inference provides full posterior distributions, not just point estimates. Uncertainty is quantified directly from the posterior samples, for example via credible intervals formed from posterior quantiles.
The width of the credible interval reflects this uncertainty. Parameters with broader posteriors are estimated with less precision, and the degree of uncertainty depends on both the data and the prior.
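For example, using the `draws` dictionary from the Gibbs sketch above (with an illustrative burn-in of 500 iterations):

```python
import numpy as np

burn = 500
beta_draws = draws["beta"][burn:]          # posterior draws after burn-in

post_mean = beta_draws.mean(axis=0)        # posterior means
post_sd = beta_draws.std(axis=0)           # posterior standard deviations
ci_95 = np.percentile(beta_draws, [2.5, 97.5], axis=0)  # 95% credible intervals
```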
Given a new observation \(x_{\text{new}}\), we can predict using posterior draws:
\[ y_{\text{new}}^{(t)} = x_{\text{new}}^\top \beta^{(t)} + e^{(t)}, \quad e^{(t)} \sim \mathcal{N}\!\left(0, \sigma^{2(t)}\right) \]
The resulting samples \(\{y_{\text{new}}^{(t)}\}\) form a posterior predictive distribution, from which we can derive predictive intervals and evaluate predictive accuracy.
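A sketch of posterior predictive sampling, reusing `beta_draws`, `draws`, and `burn` from the snippets above; `x_new` is a placeholder covariate vector chosen only for illustration.

```python
import numpy as np

x_new = np.zeros(beta_draws.shape[1])      # hypothetical new covariate vector
sigma2_draws = draws["sigma2"][burn:]

rng = np.random.default_rng(2)
# One predictive draw per retained posterior sample
y_new = beta_draws @ x_new + rng.normal(scale=np.sqrt(sigma2_draws))
pred_interval = np.percentile(y_new, [2.5, 97.5])   # 95% predictive interval
```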
Posterior samples enable rich model diagnostics and hypothesis testing:
Posterior probability of an event, for example the probability that \(\beta_j\) is nonzero (informative when the prior places mass at exactly zero, as in the spike-and-slab models introduced later):
\[
\Pr(\beta_j \ne 0 \mid y)
\approx \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\!\left(\beta_j^{(t)} \ne 0\right)
\]
Posterior predictive checks
Simulate new datasets using posterior draws and compare them to the observed data to assess model fit.
Model comparison
Bayes factors and marginal likelihoods can be approximated to formally test or compare competing models.
These tools extend Bayesian inference beyond estimation to model validation, uncertainty quantification, and decision-making.
Before interpreting MCMC results, we must check that the Gibbs sampler has converged to the target posterior distribution.
Convergence diagnostics assess whether the Markov chain has reached its stationary distribution and is producing valid samples.
Two basic strategies are discarding an initial burn-in period and thinning the chain to reduce autocorrelation.
These steps improve sample quality and ensure reliable posterior summaries.
A simple yet powerful diagnostic is the trace plot,
showing sampled parameter values \(\theta^{(t)}\) over iterations \(t\).
Trace plots help detect:
- Lack of stationarity (upward/downward trends)
- Poor mixing or multimodality
- Burn-in issues
Visual inspection is often the first step in assessing convergence.
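For instance, trace plots of the variance components from the Gibbs sketch above could be drawn with matplotlib:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(8, 4), sharex=True)
axes[0].plot(draws["sigma2"])
axes[0].set_ylabel(r"$\sigma^2$")
axes[1].plot(draws["sigma_b2"])
axes[1].set_ylabel(r"$\sigma_b^2$")
axes[1].set_xlabel("iteration")
fig.tight_layout()
plt.show()
```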
Samples from a Gibbs sampler are correlated, especially for tightly coupled parameters.
The autocorrelation function (ACF) quantifies dependence across lags \(k\):
\[ \hat{\rho}_k = \frac{\sum_{t=1}^{T-k} (\theta^{(t)} - \bar{\theta})(\theta^{(t+k)} - \bar{\theta})} {\sum_{t=1}^{T} (\theta^{(t)} - \bar{\theta})^2} \]
Reducing autocorrelation may require more iterations,
reparameterization, or thinning the chain.
Autocorrelation reduces the number of independent samples obtained.
The effective sample size (ESS) adjusts for this:
\[ \text{ESS}(\theta) = \frac{T}{1 + 2 \sum_{k=1}^{K} \hat{\rho}_k} \]
ESS provides a quantitative measure of sampling efficiency
and helps determine whether more iterations are needed.
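A simple sketch of these two quantities for a single 1-D chain, truncating the autocorrelation sum at a fixed maximum lag as in the formula above:

```python
import numpy as np

def autocorr(theta, k):
    """Lag-k sample autocorrelation of a 1-D chain."""
    d = theta - theta.mean()
    return (d[:-k] @ d[k:]) / (d @ d)

def effective_sample_size(theta, max_lag=100):
    """ESS = T / (1 + 2 * sum of autocorrelations up to max_lag)."""
    rho = [autocorr(theta, k) for k in range(1, max_lag + 1)]
    return theta.size / (1.0 + 2.0 * sum(rho))
```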
When running multiple chains, the Gelman–Rubin statistic compares between-chain and within-chain variability.
For \(m\) chains with \(T\) iterations each:
\[ W = \frac{1}{m} \sum_{i=1}^{m} s_i^2, \quad B = \frac{T}{m-1} \sum_{i=1}^{m} (\bar{\theta}_i - \bar{\theta})^2 \]
The potential scale reduction factor:
\[ \hat{R} = \sqrt{ \frac{\hat{V}}{W} }, \quad \hat{V} = \frac{T-1}{T} W + \frac{1}{T} B \]
Values of \(\hat{R}\) close to 1 indicate that the chains have converged to a common distribution.
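A sketch of \(\hat{R}\) for an array of chains, following the formulas above:

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for an (m, T) array of m chains with T draws each."""
    m, T = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    B = T * chain_means.var(ddof=1)           # between-chain variance
    V_hat = (T - 1) / T * W + B / T
    return np.sqrt(V_hat / W)
```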
The Geweke test checks whether early and late portions of a single chain have the same mean, indicating stationarity.
\[ Z = \frac{\bar{\theta}_A - \bar{\theta}_B} {\sqrt{\text{Var}(\bar{\theta}_A) + \text{Var}(\bar{\theta}_B)}} \]
Typically, segment A is taken as the first 10% of the chain and segment B as the last 50%.
Under convergence, \(Z \sim \mathcal{N}(0,1)\).
These diagnostics ensure that posterior summaries reflect the true target distribution.
As in classical BLR, the outcome is modeled as:
\[ y = Xb + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_n) \]
where \(y\) is the \(n \times 1\) response, \(X\) the design matrix, \(b\) the regression coefficients, and \(\sigma^2\) the residual variance.
This defines the likelihood:
\[ y \mid b, \sigma^2 \sim \mathcal{N}(Xb, \sigma^2 I_n) \]
The goal is to estimate \(b\) and identify which predictors truly contribute to \(y\).
In standard Bayesian linear regression:
\[ \beta_j \sim \mathcal{N}(0, \sigma_b^2) \]
This Gaussian (shrinkage) prior assumes all predictors have small effects, but it does not allow exact zeros — limiting variable selection.
The spike-and-slab prior addresses this by mixing two components: a point mass at zero (the spike) and a Gaussian distribution (the slab).
This yields sparse, interpretable models that select relevant variables.
Each regression effect is drawn from a two-component mixture:
\[ p(b_i \mid \sigma_b^2, \pi) = \pi\, \mathcal{N}(0, \sigma_b^2) + (1-\pi)\, \delta_0 \]
where:
- \(\pi\) is the prior probability that a predictor is included,
- \(\mathcal{N}(0, \sigma_b^2)\) is the slab, allowing nonzero effects,
- \(\delta_0\) is a point mass at zero (the spike).
Thus, with probability \(\pi\) a predictor is active (slab), and with probability \(1-\pi\) it is excluded (spike).
This hierarchical mixture prior provides several benefits: exact zeros for irrelevant predictors, automatic variable selection via the inclusion indicators, and posterior inclusion probabilities that quantify selection uncertainty.
Hence, spike-and-slab models combine variable selection with Bayesian uncertainty quantification.
We express each effect as:
\[ b_i = \alpha_i \, \delta_i \]
where:
\[ \alpha_i \mid \sigma_b^2 \sim \mathcal{N}(0, \sigma_b^2), \quad \delta_i \mid \pi \sim \text{Bernoulli}(\pi) \]
Marginalizing over \(\delta_i\) yields the spike-and-slab mixture prior above.
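A small sketch of drawing effects from this prior under the indicator parameterization; the values of \(p\), \(\pi\), and \(\sigma_b^2\) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
p, pi, sigma_b2 = 1000, 0.05, 0.1                    # illustrative values

alpha = rng.normal(0.0, np.sqrt(sigma_b2), size=p)   # slab effects
delta = rng.binomial(1, pi, size=p)                  # inclusion indicators
b = alpha * delta                                    # b_i = alpha_i * delta_i: mostly exact zeros
```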
The overall sparsity level is controlled by \(\pi\), assigned a Beta prior:
\[ \pi \sim \text{Beta}(\alpha, \beta) \]
This prior lets the data determine the degree of sparsity.
Variance parameters use scaled inverse-chi-squared priors:
\[ \sigma_b^2 \sim S_b \chi^{-2}(v_b), \quad \sigma^2 \sim S \chi^{-2}(v) \]
These are conjugate, providing closed-form conditional updates. Hyperparameters \((S_b, v_b)\) and \((S, v)\) encode prior beliefs about effect size variability and residual noise.
Combining the likelihood and priors, the joint posterior is:
\[ p(\mu, \alpha, \delta, \pi, \sigma_b^2, \sigma^2 \mid y) \propto p(y \mid \mu, \alpha, \delta, \sigma^2)\, p(\alpha \mid \sigma_b^2)\, p(\delta \mid \pi)\, p(\pi)\, p(\sigma_b^2)\, p(\sigma^2) \]
This captures our updated beliefs about effects, inclusion indicators, and variance components.
Inference proceeds via Gibbs sampling, cycling through these conditional updates:
- \(\mu \mid D\)
- \(\alpha_i \mid D\)
- \(\delta_i \mid D\)
- \(\pi \mid \delta\)
- \(\sigma_b^2 \mid \alpha\)
- \(\sigma^2 \mid D\)
Here, \(D\) denotes the data and all other current parameter values.
Each conditional follows a standard distribution (Normal, Bernoulli, Beta, scaled-\(\chi^{-2}\)).
Iterating these updates generates samples from the joint posterior.
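The slides do not spell out the single-site update for \((\delta_j, b_j)\); one common choice (a BayesC-style update that integrates out \(\alpha_j\) when sampling \(\delta_j\)) is sketched below, assuming the partial residual \(r_j\) is maintained as in the single-site update earlier.

```python
import numpy as np

def draw_effect_spike_slab(xj, r_j, pi, sigma2, sigma_b2, rng):
    """One common (BayesC-style) single-site update for (delta_j, b_j).

    xj is the j-th column of X and r_j the partial residual y - X_{-j} b_{-j}.
    delta_j is drawn with alpha_j integrated out, then b_j given delta_j.
    """
    xtx = xj @ xj
    xtr = xj @ r_j
    # Marginal variance of xj'r_j under the slab (delta_j = 1) and the spike (delta_j = 0)
    v1 = sigma_b2 * xtx**2 + sigma2 * xtx
    v0 = sigma2 * xtx
    # log odds of inclusion = log prior odds + log likelihood ratio
    log_odds = (np.log(pi) - np.log(1 - pi)
                + 0.5 * (np.log(v0) - np.log(v1))
                + 0.5 * xtr**2 * (1 / v0 - 1 / v1))
    prob_incl = 1.0 / (1.0 + np.exp(-log_odds))
    if rng.uniform() < prob_incl:
        denom = xtx + sigma2 / sigma_b2
        return rng.normal(xtr / denom, np.sqrt(sigma2 / denom))  # slab: nonzero draw
    return 0.0                                                   # spike: exact zero
```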
The posterior inclusion probability (PIP) measures how likely each predictor is truly associated with \(y\):
\[ \widehat{\Pr}(\delta_i = 1 \mid y) = \frac{1}{T} \sum_{t=1}^{T} \delta_i^{(t)} \]
PIPs summarize variable relevance and drive Bayesian feature selection.
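A sketch of computing PIPs, assuming the sampled indicators have been stored in a hypothetical array `delta_draws` of shape \((T, p)\):

```python
import numpy as np

def posterior_inclusion_prob(delta_draws):
    """PIP for each predictor: the proportion of draws in which it is included."""
    return delta_draws.mean(axis=0)

# Example: flag predictors with PIP above a chosen threshold (e.g., 0.9)
# selected = np.where(posterior_inclusion_prob(delta_draws) > 0.9)[0]
```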
Bayesian Linear Regression combines likelihood and prior to form the posterior, enabling principled modeling, regularization, and uncertainty quantification.
We have now seen the basic framework of Bayesian Linear Regression (BLR)
and will illustrate how it provides a unified approach for analyzing genetic and genomic data.
These examples show how BLR connects statistical modeling with biological interpretation in quantitative genetics.