Introduction to Bayesian Linear Regression, Posterior Inference, and MCMC

Peter Sørensen

Center for Quantitative Genetics and Genomics

Aarhus University

Overview

  • Classical Linear Regression
    • Model, inference, and limitations
  • Bayesian Linear Regression
    • Motivation, priors, and posteriors
    • Conditional posteriors and inference
  • Computation and Applications
    • MCMC and Gibbs sampling
    • Diagnostics and R implementation

Introduction

  • Bayesian Linear Regression (BLR) extends classical regression by incorporating prior information and producing posterior distributions over model parameters.
  • Advantages:
    • Handles high-dimensional and small-sample problems.
    • Provides full uncertainty quantification.
    • Enables regularization and integration of prior biological knowledge.

Applications in Genomics

  • Bayesian Linear Regression (BLR) is widely applied in quantitative genetics and genomics.
  • Common use cases:
    • Genome-Wide Association Studies (GWAS) and fine-mapping of causal variants.
    • Genetic prediction and heritability estimation.
    • Pathway and gene-set enrichment analyses.
    • Integrative multi-omics modeling (genome, transcriptome, epigenome).

Classical Linear Regression

Model

\[ y = X\beta + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_n) \]

  • \(y\): outcomes
  • \(X\): design matrix
  • \(\beta\): coefficients
  • \(e\): residuals
  • \(\sigma^2\): residual variance

Estimation

Regression effects: \[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]

Residual variance: \[ \hat{\sigma}^2 = \frac{1}{n-p}\sum_i (y_i - x_i^\top \hat{\beta})^2 \]

Inference via standard errors and \(t\)-tests, confidence intervals, and prediction intervals.
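
As a quick illustration, the following R sketch computes \(\hat{\beta}\) and \(\hat{\sigma}^2\) on simulated data and checks them against lm(); the data dimensions and true coefficients are arbitrary choices for the example.

```r
## OLS estimates and residual variance on simulated data (toy dimensions)
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)              # design matrix
beta_true <- c(1, -0.5, 0, 0, 0.25)          # arbitrary true effects
y <- drop(X %*% beta_true + rnorm(n))        # outcomes

beta_hat <- drop(solve(crossprod(X)) %*% crossprod(X, y))   # (X'X)^{-1} X'y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - p)         # residual variance

## Cross-check with lm() (no intercept, since X has no column of ones)
fit <- lm(y ~ X - 1)
coef(fit)
summary(fit)$sigma^2
```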

Limitations

  • No explicit control over effect size distribution
  • Sensitive when collinearity is high
  • Not identifiable when \(p>n\)
  • Uncertainty largely asymptotic unless normality assumptions hold

Why Bayesian Linear Regression?

  • Combines likelihood and prior to form the posterior.
  • Priors express beliefs about effect sizes:
    • Normal → many small effects
    • Spike-and-slab → sparse effects
  • Acts as a regularizer:
    • Shrinks small/noisy effects toward \(0\)
    • Preserves large, important effects
  • Stable when \(p > n\) due to prior information.
  • Provides full posterior distributions for \(\beta\) and \(\sigma^2\).

Overview: Bayesian Linear Regression

  • Combines data and prior knowledge using Bayes’ rule.
  • Uses conjugate priors to yield closed-form full conditionals.
  • Employs Gibbs sampling to approximate the posterior distribution.
  • Estimates parameters, uncertainty, and predictions from posterior draws.

Bayesian Linear Regression with Gaussian Priors

Bayesian linear regression starts with the same model structure as classical linear regression.

\[ y = X\beta + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_n) \]

  • \(y\): \(n \times 1\) vector of observed outcomes
  • \(X\): \(n \times p\) design matrix of predictors
  • \(\beta\): \(p \times 1\) vector of unknown coefficients
  • \(e\): Gaussian noise with mean \(0\) and variance \(\sigma^2\)

Likelihood in Bayesian Linear Regression

Because the residuals are Gaussian,

\[ e \sim \mathcal{N}(0, \sigma^2 I_n) \]

the marginal distribution of \(y\) is:

\[ y \sim \mathcal{N}(X\beta, \sigma^2 I_n) \]

This defines the likelihood: the probability of the observed data given the parameters \(\beta\) and \(\sigma^2\):

\[ p(y \mid X, \beta, \sigma^2) = \mathcal{N}(y \mid X\beta, \sigma^2 I_n) \]

Introducing Priors

In Bayesian linear regression, we specify prior distributions that express our beliefs about parameters before seeing the data.

A common conjugate prior for the regression coefficients is:

\[ \beta \mid \sigma_b^2 \sim \mathcal{N}(0, \sigma_b^2 I_p) \]

This reflects the belief that most effect sizes are small and centered near zero — consistent with the polygenic assumption in genetics.

Role of the Prior Variance \(\sigma_b^2\)

The parameter \(\sigma_b^2\) acts as a shrinkage (regularization) parameter:

  • Small \(\sigma_b^2\) → stronger shrinkage toward zero.
  • Large \(\sigma_b^2\) → weaker shrinkage, allowing larger effects.

It controls the strength of regularization and is often treated as an unknown hyperparameter estimated from the data.

Priors on Variance Components

We also place priors on the variance components to complete the hierarchical model.

\[ \sigma_b^2 \mid S_b, v_b \sim S_b \, \chi^{-2}(v_b), \quad \sigma^2 \mid S, v \sim S \, \chi^{-2}(v) \]

Here:

  • \(S_b\) and \(v_b\) are user-defined hyperparameters that control the prior distribution on the variance of regression coefficients.
  • \(S\) and \(v\) are hyperparameters for the residual variance \(\sigma^2\).

Conjugate Priors and Regularization

Conjugate priors keep posteriors in the same family
(e.g., scaled inverse-chi-squared), allowing closed-form Gibbs updates.

They also serve as regularizers:

  • The prior on \(\beta\) shrinks small or noisy effects toward zero.
  • Priors on variance components prevent overfitting, especially when \(p > n\).

Thus, conjugate priors make Bayesian linear regression efficient and stable.

Posterior Distribution

In Bayesian analysis, we combine the likelihood and priors using Bayes’ rule to obtain the joint posterior:

\[ p(\beta, \sigma_b^2, \sigma^2 \mid y) \propto p(y \mid \beta, \sigma^2)\; p(\beta \mid \sigma_b^2)\; p(\sigma_b^2)\; p(\sigma^2) \]

This posterior captures all updated knowledge about the unknown parameters after observing the data.
It forms the basis for computing posterior means, credible intervals, and predictions.

Conjugacy and Gibbs Sampling

With conjugate priors, each parameter’s full conditional distribution has a closed-form solution.
This makes Gibbs sampling a natural and efficient inference method.

  • Parameters are updated one at a time, each from its conditional posterior.
  • The resulting Markov chain explores the joint posterior of
    \((\beta, \sigma_b^2, \sigma^2)\).

Gibbs sampling thus provides an easy way to approximate the full posterior in Bayesian linear regression.

Full Conditional for \(\beta\)

Given \(\sigma^2\), \(\sigma_b^2\), and the data \(y\), the regression coefficients have a multivariate normal conditional posterior:

\[ \beta \mid \sigma^2, \sigma_b^2, y \sim \mathcal{N}(\mu_\beta, \Sigma_\beta) \]

where

\[ \Sigma_\beta = \left( \frac{X^\top X}{\sigma^2} + \frac{I}{\sigma_b^2} \right)^{-1}, \quad \mu_\beta = \Sigma_\beta \frac{X^\top y}{\sigma^2} \]

This distribution represents our updated belief about \(\beta\)
after observing the data, while holding \(\sigma_b^2\) and \(\sigma^2\) fixed.
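
A minimal R sketch of this draw, assuming `X`, `y`, the current residual variance `sigma2`, and prior variance `sigma2_b` are available (names are illustrative):

```r
## One joint draw of beta from N(mu_beta, Sigma_beta)
draw_beta_joint <- function(X, y, sigma2, sigma2_b) {
  p <- ncol(X)
  prec <- crossprod(X) / sigma2 + diag(p) / sigma2_b   # posterior precision
  Sigma_beta <- solve(prec)                            # posterior covariance
  mu_beta <- Sigma_beta %*% crossprod(X, y) / sigma2   # posterior mean
  drop(mu_beta + t(chol(Sigma_beta)) %*% rnorm(p))     # multivariate normal draw
}
```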

Comparison to Classical OLS

In classical regression, the OLS estimator is

\[ \hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y, \quad y \sim \mathcal{N}(X\beta, \sigma^2 I) \]

The estimate of \(\beta\) is independent of \(\sigma^2\),
since \(\sigma^2\) only scales the likelihood, not its maximum.

In Bayesian regression, \(\sigma^2\) appears explicitly in the posterior:

\[ \Sigma_\beta = \left( \frac{X^\top X}{\sigma^2} + \frac{I}{\sigma_b^2} \right)^{-1}, \quad \mu_\beta = \Sigma_\beta \frac{X^\top y}{\sigma^2} \]

The term \(\frac{I}{\sigma_b^2}\) introduces shrinkage, regularizing estimates and stabilizing inference especially when \(p > n\) or predictors are highly correlated.

Thus, the Bayesian posterior mean is a regularized, uncertainty-aware generalization of OLS.

Full Conditional for \(\beta_j\)

Instead of sampling \(\beta\) jointly, we can update each coefficient \(\beta_j\) one at a time, holding all others fixed. This is efficient for large \(p\) and for spike-and-slab models.

Let \(X_j\) be the \(j\)th column of \(X\) and define the partial residual:

\[ r_j = y - X_{-j} \beta_{-j} \]

Then the conditional posterior for \(\beta_j\) is univariate normal:

\[ \beta_j \mid D \sim \mathcal{N} \!\left( \frac{X_j^\top r_j}{X_j^\top X_j + \sigma^2 / \sigma_b^2},\; \frac{\sigma^2}{X_j^\top X_j + \sigma^2 / \sigma_b^2} \right) \]

This corresponds to a regularized least-squares update. Residual updates avoid matrix inversion, scale to high dimensions, and extend naturally to sparse (spike-and-slab) models.
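
A sketch of the single-site update in R, assuming current values of `beta`, `sigma2`, and `sigma2_b` (names are illustrative):

```r
## Sample beta_j given all other coefficients via the partial residual
draw_beta_j <- function(j, X, y, beta, sigma2, sigma2_b) {
  r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]        # partial residual
  denom <- sum(X[, j]^2) + sigma2 / sigma2_b
  rnorm(1, mean = sum(X[, j] * r_j) / denom, sd = sqrt(sigma2 / denom))
}
```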

Full Conditional for \(\sigma_b^2\)

The conditional distribution of the prior variance \(\sigma_b^2\), given \(\beta\) and the hyperparameters, is a scaled inverse-chi-squared:

\[ \sigma_b^2 \mid \beta \sim \tilde{S}_b \, \chi^{-2}(\tilde{v}_b) \]

where

\[ \tilde{v}_b = v_b + p, \quad \tilde{S}_b = \frac{\beta^\top \beta + v_b S_b}{\tilde{v}_b} \]

At each Gibbs iteration, \(\sigma_b^2\) is sampled directly given \(\beta\). This update reflects our revised belief about the variability of effect sizes after observing the current posterior draw of \(\beta\).
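
In R, a scaled inverse-chi-squared draw can be obtained as \(\tilde{v}_b \tilde{S}_b / \chi^2_{\tilde{v}_b}\). A minimal sketch, assuming the current `beta` and hyperparameters `S_b`, `v_b`:

```r
## Draw sigma_b^2 from its scaled inverse-chi-squared full conditional
draw_sigma2_b <- function(beta, S_b, v_b) {
  v_tilde <- v_b + length(beta)
  (sum(beta^2) + v_b * S_b) / rchisq(1, df = v_tilde)  # = v_tilde * S_tilde / chi2
}
```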

Full Conditional for \(\sigma^2\)

The conditional distribution of the residual variance \(\sigma^2\),
given \(\beta\) and the data, is also scaled inverse-chi-squared:

\[ \sigma^2 \mid \beta, y \sim \tilde{S} \, \chi^{-2}(\tilde{v}) \]

where

\[ \tilde{v} = v + n, \quad \tilde{S} = \frac{(y - X\beta)^\top (y - X\beta) + v S}{\tilde{v}} \]

At each Gibbs iteration, \(\sigma^2\) is sampled directly given \(\beta\).
This captures our updated belief about the residual variability
after accounting for the current linear predictor \(X\beta\).
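
The analogous draw for the residual variance, assuming `X`, `y`, the current `beta`, and hyperparameters `S`, `v`:

```r
## Draw sigma^2 from its scaled inverse-chi-squared full conditional
draw_sigma2 <- function(X, y, beta, S, v) {
  e <- y - X %*% beta                                  # residuals at current beta
  (sum(e^2) + v * S) / rchisq(1, df = v + length(y))   # = v_tilde * S_tilde / chi2
}
```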

Gibbs Sampling: Motivation

Bayesian inference often involves complex posteriors that lack closed-form solutions. To approximate these, we use Markov Chain Monte Carlo (MCMC) methods.

MCMC builds a Markov chain whose stationary distribution is the target posterior. Once the chain has converged, its samples can be used to estimate:

  • Posterior means, variances, and credible intervals
  • Predictive distributions
  • Other functions of interest

Among MCMC algorithms, the Gibbs sampler is especially useful when all full conditional distributions are available in closed form.

Gibbs Sampling: The Algorithm

For Bayesian linear regression with conjugate priors, the joint posterior is:

\[ p(\beta, \sigma_b^2 , \sigma^2 \mid y) \propto p(y \mid \beta, \sigma^2)\; p(\beta \mid \sigma_b^2)\; p(\sigma_b^2)\; p(\sigma^2) \]

We iteratively draw from the following full conditionals:

  1. Sample \(\beta \mid \sigma_b^2, \sigma^2, y\)
  2. Sample \(\sigma_b^2 \mid \beta\)
  3. Sample \(\sigma^2 \mid \beta, y\)

Each step updates one parameter given the latest values of the others. Repeating this sequence yields samples from the joint posterior \(p(\beta, \sigma_b^2, \sigma^2 \mid y)\).

Because each conditional is standard (Normal or scaled inverse-\(\chi^2\)), Gibbs sampling is both efficient and easy to implement.
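
Below is a minimal R sketch of the full sampler on simulated data. The hyperparameter values, starting values, and number of iterations are illustrative choices, not recommendations.

```r
## Gibbs sampler for Bayesian linear regression with Gaussian priors (sketch)
set.seed(2)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rnorm(3), rep(0, p - 3))
y <- drop(X %*% beta_true + rnorm(n))

n_iter <- 5000
S_b <- 0.01; v_b <- 4; S <- 1; v <- 4                 # hyperparameters (illustrative)
beta <- rep(0, p); sigma2 <- 1; sigma2_b <- 0.01      # starting values

draws <- list(beta = matrix(NA, n_iter, p),
              sigma2 = numeric(n_iter),
              sigma2_b = numeric(n_iter))

XtX <- crossprod(X); Xty <- crossprod(X, y)

for (it in 1:n_iter) {
  ## 1. beta | sigma2_b, sigma2, y  (multivariate normal)
  Sigma_beta <- solve(XtX / sigma2 + diag(p) / sigma2_b)
  mu_beta <- Sigma_beta %*% Xty / sigma2
  beta <- drop(mu_beta + t(chol(Sigma_beta)) %*% rnorm(p))

  ## 2. sigma2_b | beta  (scaled inverse-chi-squared)
  sigma2_b <- (sum(beta^2) + v_b * S_b) / rchisq(1, df = v_b + p)

  ## 3. sigma2 | beta, y  (scaled inverse-chi-squared)
  e <- y - X %*% beta
  sigma2 <- (sum(e^2) + v * S) / rchisq(1, df = v + n)

  draws$beta[it, ] <- beta
  draws$sigma2[it] <- sigma2
  draws$sigma2_b[it] <- sigma2_b
}
```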

Posterior Summaries

After running the Gibbs sampler, we obtain posterior draws \(\{\theta^{(t)}\}_{t=1}^T\) for parameters such as \(\beta_j\), \(\sigma^2\), or \(\sigma_b^2\).

We summarize the posterior distribution via:

  • Posterior mean
    \[ \mathbb{E}[\theta \mid y] \approx \frac{1}{T} \sum_{t=1}^{T} \theta^{(t)} \]
  • Posterior median: the median of \(\theta^{(t)}\)
  • Credible interval (95%): the interval between the 2.5% and 97.5% empirical quantiles of the draws,
    \[ [\theta]_{0.025},\; [\theta]_{0.975} \]

These summaries describe the most probable values of \(\theta\)
and their uncertainty after combining data and prior beliefs.
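
A sketch of these summaries in R, assuming the `draws` list and `n_iter` from the Gibbs sketch above and discarding the first 1000 iterations as burn-in:

```r
keep <- 1001:n_iter
beta_draws <- draws$beta[keep, ]

post_mean   <- colMeans(beta_draws)                    # posterior means
post_median <- apply(beta_draws, 2, median)            # posterior medians
ci_95 <- apply(beta_draws, 2, quantile,
               probs = c(0.025, 0.975))                # 95% credible intervals
```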

Estimating Uncertainty

Bayesian inference provides full posterior distributions, not just point estimates. Uncertainty is quantified directly from the posterior samples:

  • Posterior standard deviation
    \[ \mathrm{SD}(\theta \mid y) \approx \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} (\theta^{(t)} - \bar{\theta})^2} \]

The width of the credible interval reflects this uncertainty. Parameters with broader posteriors are estimated with less precision, and the degree of uncertainty depends on both the data and the prior.
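
Continuing the sketch above, posterior standard deviations follow directly from the retained draws:

```r
post_sd <- apply(beta_draws, 2, sd)    # posterior SD per coefficient (beta_draws from above)
```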

Posterior Prediction

Given a new observation \(x_{\text{new}}\), we can predict using posterior draws:

  1. Compute predicted means for each sample: \[ \hat{y}_{\text{new}}^{(t)} = x_{\text{new}}^\top \beta^{(t)} \]
  2. Add residual uncertainty: \[ y_{\text{new}}^{(t)} \sim \mathcal{N}\!\left(x_{\text{new}}^\top \beta^{(t)},\; \sigma^{2(t)}\right) \]

The resulting samples \(\{y_{\text{new}}^{(t)}\}\) form a posterior predictive distribution, from which we can derive predictive intervals and evaluate predictive accuracy.
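
A sketch in R, assuming the `draws` list, `p`, and `keep` from the sketches above; `x_new` is a hypothetical new predictor vector:

```r
x_new <- rnorm(p)                                     # hypothetical new observation
mu_new <- draws$beta[keep, ] %*% x_new                # x_new' beta^(t) per kept draw
y_new  <- rnorm(length(keep), mean = mu_new,
                sd = sqrt(draws$sigma2[keep]))        # add residual uncertainty

quantile(y_new, c(0.025, 0.975))                      # 95% predictive interval
```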

Model Checking and Hypothesis Testing

Posterior samples enable rich model diagnostics and hypothesis testing:

  • Posterior probability of an event
    For example, the probability of a positive effect:
    \[ \Pr(\beta_j > 0 \mid y) \approx \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\!\left(\beta_j^{(t)} > 0\right) \]
    Under a continuous prior, \(\Pr(\beta_j \ne 0 \mid y) = 1\) by construction; exact-zero probabilities require a spike-and-slab prior (introduced later).

  • Posterior predictive checks
    Simulate new datasets using posterior draws and compare them to the observed data to assess model fit (see the sketch after this list).

  • Model comparison
    Bayes factors and marginal likelihoods can be approximated to formally test or compare competing models.

These tools extend Bayesian inference beyond estimation to model validation, uncertainty quantification, and decision-making.
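
A sketch of a posterior predictive check in R, assuming `X`, `y`, `n`, `draws`, and `keep` from the sketches above; here the check compares the variance of replicated data to the observed variance:

```r
## Posterior predictive check on var(y): replicate data from posterior draws
T_rep <- sapply(keep, function(it) {
  y_rep <- drop(X %*% draws$beta[it, ] + rnorm(n, sd = sqrt(draws$sigma2[it])))
  var(y_rep)
})
mean(T_rep >= var(y))          # posterior predictive p-value for the variance
```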

Convergence Diagnostics

Before interpreting MCMC results, we must check that the Gibbs sampler has converged to the target posterior distribution.

Convergence diagnostics assess whether the Markov chain has reached its stationary distribution and is producing valid samples.

Two basic strategies are:

  • Burn-in – Discard early iterations (e.g., first 1000) to remove dependence on starting values.
  • Thinning – Keep every \(k\)-th sample to reduce autocorrelation.

These steps improve sample quality and ensure reliable posterior summaries.
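
A sketch of both steps in R, assuming the `draws` list and `n_iter` from the Gibbs sketch above:

```r
burn_in <- 1000
thin    <- 5
keep    <- seq(burn_in + 1, n_iter, by = thin)        # drop burn-in, keep every 5th
sigma2_chain <- draws$sigma2[keep]
beta1_chain  <- draws$beta[keep, 1]
```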

Trace Plots

A simple yet powerful diagnostic is the trace plot,
showing sampled parameter values \(\theta^{(t)}\) over iterations \(t\).

  • A converged chain fluctuates around a stable mean — no trend or drift.
  • Multiple chains from different starting points should overlap and mix well.

Trace plots help detect:

  • Lack of stationarity (upward/downward trends)
  • Poor mixing or multimodality
  • Burn-in issues

Visual inspection is often the first step in assessing convergence.
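
A sketch of trace plots in base R graphics, assuming the `draws` list from the Gibbs sketch above:

```r
op <- par(mfrow = c(1, 2))
plot(draws$sigma2, type = "l", xlab = "Iteration",
     ylab = expression(sigma^2), main = "Residual variance")
plot(draws$beta[, 1], type = "l", xlab = "Iteration",
     ylab = expression(beta[1]), main = "First coefficient")
par(op)
```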

Autocorrelation

Samples from a Gibbs sampler are correlated, especially for tightly coupled parameters.
The autocorrelation function (ACF) quantifies dependence across lags \(k\):

\[ \hat{\rho}_k = \frac{\sum_{t=1}^{T-k} (\theta^{(t)} - \bar{\theta})(\theta^{(t+k)} - \bar{\theta})} {\sum_{t=1}^{T} (\theta^{(t)} - \bar{\theta})^2} \]

  • High \(\hat{\rho}_k\) → slow mixing and fewer effective samples
  • Low \(\hat{\rho}_k\) → better mixing and faster convergence

Reducing autocorrelation may require more iterations,
reparameterization, or thinning the chain.
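
In R, the sample ACF is available via `acf()`; a sketch, assuming `sigma2_chain` from the thinning sketch above:

```r
acf(sigma2_chain, lag.max = 50, main = "ACF of sigma^2 chain")
```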

Effective Sample Size (ESS)

Autocorrelation reduces the number of independent samples obtained.
The effective sample size (ESS) adjusts for this:

\[ \text{ESS}(\theta) = \frac{T}{1 + 2 \sum_{k=1}^{K} \hat{\rho}_k} \]

  • Small ESS → chain is highly correlated, less informative
  • Rule of thumb: \(\text{ESS} > 100\) per parameter for stable inference

ESS provides a quantitative measure of sampling efficiency
and helps determine whether more iterations are needed.
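
A simple ESS sketch that truncates the ACF sum at the first non-positive lag, assuming `sigma2_chain` exists; the `effectiveSize()` function in the coda package implements a more careful estimator.

```r
ess <- function(x, max_lag = 100) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]   # drop lag 0
  K <- which(rho <= 0)[1] - 1                              # truncate at first <= 0
  if (is.na(K)) K <- max_lag
  length(x) / (1 + 2 * sum(rho[seq_len(K)]))
}
ess(sigma2_chain)
```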

Gelman–Rubin Diagnostic (\(\hat{R}\))

When running multiple chains, the Gelman–Rubin statistic compares between-chain and within-chain variability.

For \(m\) chains with \(T\) iterations each:

\[ W = \frac{1}{m} \sum_{i=1}^{m} s_i^2, \quad B = \frac{T}{m-1} \sum_{i=1}^{m} (\bar{\theta}_i - \bar{\theta})^2 \]

The potential scale reduction factor:

\[ \hat{R} = \sqrt{ \frac{\hat{V}}{W} }, \quad \hat{V} = \frac{T-1}{T} W + \frac{1}{T} B \]

  • \(\hat{R} \approx 1\) → convergence achieved
  • \(\hat{R} > 1.1\) → chains have not converged
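
A sketch of \(\hat{R}\) for chains stored as columns of a matrix (one column per chain, equal lengths); `gelman.diag()` in the coda package is the standard implementation.

```r
rhat <- function(chains) {               # chains: T x m matrix (iterations x chains)
  n_it <- nrow(chains); m <- ncol(chains)
  W <- mean(apply(chains, 2, var))                                # within-chain
  chain_means <- colMeans(chains)
  B <- n_it / (m - 1) * sum((chain_means - mean(chain_means))^2)  # between-chain
  V_hat <- (n_it - 1) / n_it * W + B / n_it
  sqrt(V_hat / W)
}
```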

Geweke Diagnostic

The Geweke test checks whether early and late portions of a single chain have the same mean, indicating stationarity.

\[ Z = \frac{\bar{\theta}_A - \bar{\theta}_B} {\sqrt{\text{Var}(\bar{\theta}_A) + \text{Var}(\bar{\theta}_B)}} \]

Typically:

  • Segment A = first 10% of the chain
  • Segment B = last 50% of the chain

Under convergence, \(Z \sim \mathcal{N}(0,1)\).

  • \(|Z| \le 2\) → chain likely stationary
  • \(|Z| > 2\) → potential non-convergence

These diagnostics ensure that posterior summaries reflect the true target distribution.
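
A simplified Geweke sketch that uses plain segment variances of the mean (ignoring autocorrelation within segments), assuming `chain` is a vector of draws; `geweke.diag()` in the coda package uses spectral variance estimates and is preferred in practice.

```r
geweke_z <- function(chain, frac_a = 0.1, frac_b = 0.5) {
  n <- length(chain)
  A <- chain[1:floor(frac_a * n)]                          # early segment
  B <- chain[(n - floor(frac_b * n) + 1):n]                # late segment
  (mean(A) - mean(B)) / sqrt(var(A) / length(A) + var(B) / length(B))
}
```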

Spike-and-Slab Bayesian Linear Regression

As in classical BLR, the outcome is modeled as:

\[ y = Xb + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_n) \]

where \(y\) is the \(n \times 1\) response, \(X\) the design matrix, \(b\) the regression coefficients, and \(\sigma^2\) the residual variance.

This defines the likelihood:

\[ y \mid b, \sigma^2 \sim \mathcal{N}(Xb, \sigma^2 I_n) \]

The goal is to estimate \(b\) and identify which predictors truly contribute to \(y\).

Motivation for the Spike-and-Slab Prior

In standard Bayesian linear regression:

\[ \beta_j \sim \mathcal{N}(0, \sigma_b^2) \]

This Gaussian (shrinkage) prior assumes all predictors have small effects, but it does not allow exact zeros — limiting variable selection.

The spike-and-slab prior addresses this by mixing two components:

  • A spike at zero → excluded predictors
  • A slab (wide normal) → active predictors

This yields sparse, interpretable models that select relevant variables.

The Spike-and-Slab Mixture Prior

Each regression effect is drawn from a two-component mixture:

\[ p(b_i \mid \sigma_b^2, \pi) = \pi\, \mathcal{N}(0, \sigma_b^2) + (1-\pi)\, \delta_0 \]

where:

  • \(\pi\) = prior probability that \(b_i\) is non-zero
  • \(\delta_0\) = point mass at zero

Thus, with probability \(\pi\) a predictor is active (slab), and with probability \(1-\pi\) it is excluded (spike).

Advantages of Spike-and-Slab Priors

This hierarchical mixture prior provides several benefits:

  • Sparsity — allows exact zeros for irrelevant predictors
  • Interpretability — binary indicators give posterior inclusion probabilities (PIPs)
  • Adaptivity — the inclusion probability \(\pi\) is learned from the data
  • Balance — captures both strong signals (detection) and small effects (prediction)

Hence, spike-and-slab models combine variable selection with Bayesian uncertainty quantification.

Hierarchical Representation

We express each effect as:

\[ b_i = \alpha_i \, \delta_i \]

where:

\[ \alpha_i \mid \sigma_b^2 \sim \mathcal{N}(0, \sigma_b^2), \quad \delta_i \mid \pi \sim \text{Bernoulli}(\pi) \]

  • \(\alpha_i\): effect size when predictor is included
  • \(\delta_i\): binary inclusion indicator (0 or 1)

Marginalizing over \(\delta_i\) yields the spike-and-slab mixture prior above.
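
A sketch simulating effects from this hierarchical prior in R, with illustrative values for \(\pi\) and \(\sigma_b^2\):

```r
set.seed(3)
p <- 1000
pi0 <- 0.1; sigma2_b <- 0.5                # illustrative sparsity and slab variance
delta <- rbinom(p, 1, pi0)                 # inclusion indicators
alpha <- rnorm(p, 0, sqrt(sigma2_b))       # effect sizes if included
b <- alpha * delta                         # spike-and-slab effects (exact zeros)
mean(b == 0)                               # roughly 1 - pi0
```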

Prior for the Inclusion Probability \(\pi\)

The overall sparsity level is controlled by \(\pi\), assigned a Beta prior:

\[ \pi \sim \text{Beta}(\alpha, \beta) \]

  • Small \(\alpha\), large \(\beta\) → favor sparser models
  • \(\alpha = \beta = 1\) → uniform prior
  • Larger \(\alpha\) → denser models

This prior lets the data determine the degree of sparsity.

Priors for Variance Components

Variance parameters use scaled inverse-chi-squared priors:

\[ \sigma_b^2 \sim S_b \chi^{-2}(v_b), \quad \sigma^2 \sim S \chi^{-2}(v) \]

These are conjugate, providing closed-form conditional updates. Hyperparameters \((S_b, v_b)\) and \((S, v)\) encode prior beliefs about effect size variability and residual noise.

Joint Posterior Structure

Combining the likelihood and priors, the joint posterior is:

\[ p(\mu, \alpha, \delta, \pi, \sigma_b^2, \sigma^2 \mid y) \propto p(y \mid \mu, \alpha, \delta, \sigma^2)\, p(\alpha \mid \sigma_b^2)\, p(\delta \mid \pi)\, p(\pi)\, p(\sigma_b^2)\, p(\sigma^2) \]

This captures our updated beliefs about the intercept \(\mu\), the effects, inclusion indicators, and variance components.

Gibbs Sampling for Spike-and-Slab BLR

Inference proceeds via Gibbs sampling, cycling through these conditional updates:

  1. \(\alpha \mid D\)
  2. \(\delta \mid D\)
  3. \(\pi \mid D\)
  4. \(\sigma_b^2 \mid D\)
  5. \(\sigma^2 \mid D\)

Here, \(D\) denotes the data and all other current parameter values.
Each conditional follows a standard distribution (Normal, Bernoulli, Beta, scaled-\(\chi^{-2}\)).
Iterating these updates generates samples from the joint posterior.
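
As a small illustration of step 3, the update for \(\pi\): with a \(\text{Beta}(a_0, b_0)\) prior and current indicators \(\delta\), the full conditional is \(\text{Beta}(a_0 + \sum_i \delta_i,\; b_0 + p - \sum_i \delta_i)\). The sketch below assumes hyperparameter names `a0`, `b0`.

```r
## Draw pi from its Beta full conditional given the inclusion indicators
draw_pi <- function(delta, a0 = 1, b0 = 1) {
  rbeta(1, a0 + sum(delta), b0 + length(delta) - sum(delta))
}
```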

Posterior Inclusion Probabilities

The posterior inclusion probability (PIP) measures how likely each predictor is truly associated with \(y\):

\[ \widehat{\Pr}(\delta_i = 1 \mid y) = \frac{1}{T} \sum_{t=1}^{T} \delta_i^{(t)} \]

  • High PIP → predictor is likely important
  • Low PIP → predictor likely irrelevant

PIPs summarize variable relevance and drive Bayesian feature selection.
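
Computing PIPs from stored indicator draws, assuming `delta_draws` is a \(T \times p\) matrix of sampled \(\delta\) values (0/1):

```r
pip <- colMeans(delta_draws)               # posterior inclusion probabilities
head(sort(pip, decreasing = TRUE), 10)     # predictors with highest PIP
```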

Summary of Bayesian Linear Regression

Bayesian Linear Regression combines likelihood and prior to form the posterior, enabling principled modeling, regularization, and uncertainty quantification.

  • Inference via MCMC (often Gibbs sampling) with posterior draws for means, credible intervals, and predictions.
  • Spike-and-slab priors enable sparsity and variable selection, assigning exact zeros to irrelevant predictors and identifying key variables via posterior inclusion probabilities (PIPs).
  • Conjugate and mixture priors yield efficient and robust inference, even when \(p > n\).
  • With proper convergence checks, Bayesian models provide stable and reliable inference across a wide range of data settings.

Applications in Genomics

We have now seen the basic framework of Bayesian Linear Regression (BLR)
and will illustrate how it provides a unified approach for analyzing genetic and genomic data.

  • Genome-Wide Association Studies (GWAS) and fine-mapping of causal variants.
  • Genetic prediction and heritability estimation.
  • Pathway and gene-set enrichment analyses.

These examples show how BLR connects statistical modeling with biological interpretation in quantitative genetics.