Introduction to Bayesian Linear Regression Models used in Gene Set Analyses

Peter Sørensen

Center for Quantitative Genetics and Genomics, Aarhus University, Denmark

Bayesian Linear Regression Models

Bayesian Linear Regression Models provide a flexible statistical framework for modeling complex biological and healthcare data. They support key applications such as:

  • Genome-wide association studies (GWAS) and fine-mapping of causal variants
  • Polygenic risk scoring (PRS) for predicting complex traits and disease risk
  • Gene and pathway enrichment analyses to test biological hypotheses
  • Integrative multi-omics modeling across the genome, transcriptome, epigenome, and proteome
  • Applications to registry-based healthcare data, enabling population-level risk prediction and disease modeling

The Bayesian Linear Regression Model

The Bayesian Linear Regression (BLR) model takes the form:

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}) \]

  • \(\mathbf{Y}\) represents the observed outcomes or association measures corresponding to the features in \(\mathbf{X}\).
  • \(\mathbf{X}\) represents molecular or genomic predictors (e.g., genotypes, gene scores, annotations, pathway indicators).
  • \(\boldsymbol{\beta}\) represents effect sizes quantifying how features in \(\mathbf{X}\) explain variation in \(\mathbf{Y}\).
  • \(\boldsymbol{\varepsilon}\) represents residual noise capturing unexplained variation.

In the Bayesian formulation, each \(\beta_j\) is assigned a prior distribution reflecting beliefs about effect-size magnitude or sparsity; these priors also determine how information is shared across features or biological layers.
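As a minimal illustration, the Python sketch below simulates data from this model; the dimensions and variance values are arbitrary assumptions chosen for demonstration, not taken from the text.

```python
# Minimal simulation of the BLR data-generating model (illustrative values).
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 50                                 # observations, features
sigma2, sigma2_b = 1.0, 0.05                   # residual and effect variances

X = rng.standard_normal((n, p))                # molecular/genomic predictors
beta = rng.normal(0.0, np.sqrt(sigma2_b), p)   # effects drawn from the prior
eps = rng.normal(0.0, np.sqrt(sigma2), n)      # residual noise
y = X @ beta + eps                             # Y = X beta + eps
```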

Why Bayesian Linear Regression Models?

Regression effects can be estimated in many ways, but we focus on a Bayesian hierarchical framework because it:

  • Combines data and prior knowledge to improve inference
  • Provides a natural way to regularize and handle noisy or high-dimensional data
  • Enables flexible modeling of diverse effect patterns:
    • Many small vs. few large effects
    • Structured effects (e.g., by pathway, gene set, or omic layer)
  • Returns uncertainty estimates for all parameters → improving interpretability and model comparison

Through their hierarchical structure, BLR models naturally integrate multiple biological layers — linking genomic, transcriptomic, and other molecular data.

Hyperpriors in Bayesian Modeling

The BLR model extends the linear framework by introducing hierarchical priors on parameters. Suppose we model regression coefficients \(b_j\) with a normal prior:

\[ b_j \sim \mathcal{N}(0, \sigma_b^2) \]

Here, \(\sigma_b^2\) (the variance of the prior) controls how large the effects \(b_j\) are expected to be.

Instead of fixing \(\sigma_b^2\), we treat it as unknown and assign it its own prior — the hyperprior:

\[ \sigma_b^2 \sim \text{Inv-}\chi^2(\nu, S^2) \]

In practice, this parameter is learned from the data during estimation rather than fixed in advance.
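A scaled inverse-χ² draw can be obtained from the identity \(\nu S^2 / X \sim \text{Inv-}\chi^2(\nu, S^2)\) for \(X \sim \chi^2_\nu\). A minimal sketch, with assumed hyperparameter values:

```python
# Sampling sigma_b^2 ~ Inv-chi^2(nu, S^2) via nu*S^2 / chi^2_nu draws
# (equivalently InvGamma(nu/2, nu*S^2/2)). Hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(2)
nu, S2 = 4.0, 0.01

draws = nu * S2 / rng.chisquare(nu, size=100_000)
print(draws.mean())    # close to nu*S2/(nu - 2) = 0.02 for nu = 4
```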

Hierarchical Structure

The model has three levels:

| Level | Description | Example |
|---|---|---|
| 1 | Describes how data are generated given parameters | \(y \sim \mathcal{N}(Xb, \sigma^2 I)\) |
| 2 | Describes our beliefs about the parameters before seeing data | \(b_j \sim \mathcal{N}(0, \sigma_b^2)\) |
| 3 | Describes uncertainty about the prior’s parameters | \(\sigma_b^2 \sim \text{Inv-}\chi^2(\nu, S^2)\) |

This hierarchical structure allows the model to learn how strongly to shrink effect estimates from the data, while accounting for uncertainty in prior parameters and automatically regularizing effect sizes.

This single-component prior is simple and robust, but may not capture diverse effect-size distributions.

Adapting to Complex Biological Architectures

Complex traits arise from heterogeneous effect-size distributions — some features have large effects, many have small, and others are likely null. To capture this diversity, the BLR framework can be extended in two ways:

  • Data-driven grouping of molecular features:
    The model learns effect-size classes from the data using a mixture of variances \(\{\tau_k^2\}\) with probabilities \(\{\pi_k\}\).

  • Biologically informed grouping of molecular features:
    The model uses prior biological knowledge to assign features to groups a priori, each with its own variance \(\tau_g^2\) capturing within-group variability.

Both approaches enable the model to adapt to complex genetic and molecular architectures and share information across related features or omic layers.

Multiple-Component BLR

A more flexible formulation assumes that effects come from a mixture of normal distributions:

\[ \beta_j \mid d_j = k \sim \mathcal{N}(0, \tau_k^2), \qquad P(d_j = k) = \pi_k \]

  • \(d_j\) is a latent indicator variable assigning effect \(j\) to component \(k\).
  • Each component \(k\) has its own variance \(\tau_k^2\), controlling expected effect size.
  • \(\pi_k\) represents the probability of membership in each component.

This structure allows the model to capture both large and small effects, including potential nulls.
In practice, both \(\pi_k\) and \(\tau_k^2\) are learned from the data, making the model data-driven and adaptive, though not necessarily biologically informed.
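A small simulation, with assumed component probabilities and variances, makes the mixture concrete:

```python
# Simulating effects from the mixture prior beta_j | d_j = k ~ N(0, tau_k^2),
# P(d_j = k) = pi_k. Component settings below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
pi = np.array([0.90, 0.08, 0.02])          # null, small, large components
tau2 = np.array([0.0, 0.01, 0.1])          # component variances

p = 10_000
d = rng.choice(3, size=p, p=pi)            # latent indicators d_j
beta = rng.normal(0.0, np.sqrt(tau2[d]))   # effects given membership
print(np.bincount(d) / p)                  # realized component proportions
```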

Hierarchical (Group-Structured) BLR

In this biologically informed extension, we assign features to predefined groups (e.g., genes, pathways, or protein complexes) a priori and model group-specific variances:

\[ \beta_j \sim \mathcal{N}\!\big(0, \tau_{g(j)}^2\big), \qquad \tau_g^2 \sim p(\tau_g^2) \]

  • Each feature \(j\) belongs to a group \(g(j)\) defined before analysis.
  • Each group \(g\) has its own variance \(\tau_g^2\), controlling effect size variability within that group.

This approach allows the model to share information among related features and test enrichment across biological groups.
In practice, the group-level variances \(\tau_g^2\) are learned from the data, enabling the model to adapt shrinkage across biological structures.
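Under a scaled inverse-χ² hyperprior, the full conditional of \(\tau_g^2\) given the current effects in group \(g\) is again scaled inverse-χ², giving a simple Gibbs update. A sketch under that assumed prior:

```python
# One conjugate update of a group variance: with beta_j ~ N(0, tau_g^2) for the
# p_g features in group g and tau_g^2 ~ Inv-chi^2(nu, S^2), the full conditional
# is Inv-chi^2(nu + p_g, (nu*S^2 + sum_j beta_j^2) / (nu + p_g)).
import numpy as np

def sample_group_variance(beta_g, nu=4.0, S2=0.01, rng=None):
    """One draw of tau_g^2 given the current effects beta_g in group g."""
    if rng is None:
        rng = np.random.default_rng()
    scale = nu * S2 + np.sum(beta_g ** 2)
    return scale / rng.chisquare(nu + beta_g.size)   # scaled Inv-chi^2 draw

rng = np.random.default_rng(4)
beta_g = rng.normal(0.0, 0.1, 20)                    # illustrative group of 20
print(sample_group_variance(beta_g, rng=rng))
```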

Indicator Variables and Posterior Inclusion Probabilities

In Bayesian variable selection, each feature \(j\) is assigned an indicator variable:

\[ \delta_j = \begin{cases} 1, & \text{if feature $j$ has a non-zero effect} \\ 0, & \text{if feature $j$ has no effect.} \end{cases} \]

The model can be written as: \[ \beta_j = \alpha_j \, \delta_j, \qquad \alpha_j \sim \mathcal{N}(0, \tau^2), \quad \delta_j \sim \text{Bernoulli}(\pi) \]

where \(\pi\) is the prior inclusion probability.

After inference, we estimate \(\text{PIP}_j = P(\delta_j = 1 \mid \text{data})\) — the posterior inclusion probability for feature \(j\).
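In the simplified case of a single feature with known sampling variance (ignoring correlation among features, which real analyses must handle), the PIP has a closed form via Bayes' rule. A sketch with assumed numeric values:

```python
# Closed-form PIP for one feature: under inclusion the observed estimate bhat
# is N(0, s2 + tau2); under exclusion it is N(0, s2). All values illustrative.
import numpy as np
from scipy.stats import norm

def single_feature_pip(bhat, s2, tau2, pi):
    lik_in = norm.pdf(bhat, 0.0, np.sqrt(s2 + tau2))    # feature included
    lik_out = norm.pdf(bhat, 0.0, np.sqrt(s2))          # feature excluded
    return pi * lik_in / (pi * lik_in + (1 - pi) * lik_out)

print(single_feature_pip(bhat=0.08, s2=0.0004, tau2=0.01, pi=0.05))  # ~0.96
```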

Understanding the Outputs of Bayesian Linear Regression (BLR)

The BLR model produces posterior summaries that describe the evidence, magnitude, and uncertainty of feature effects.

Key outputs:

  1. Posterior means of \(\boldsymbol{\beta}\) (effect sizes)
  2. Posterior inclusion probabilities (PIPs)
  3. Variance component estimates (\(\sigma^2\), \(\tau^2\), …)

Each plays a distinct and complementary role in interpretation.

Posterior Means of Effect Sizes (\(\boldsymbol{\beta}\))

  • Represent the estimated effect of each feature (gene, SNP, or set)
  • Computed as posterior averages:
    \[ \hat{\beta}_j = \mathbb{E}[\beta_j \mid \mathbf{Y}, \mathbf{X}] \]
  • Interpretation:
    • Magnitude → strength of association
    • Sign → direction of effect (positive or negative)
    • Shrinkage → smaller estimates for weak or uncertain effects
  • These are directly analogous to regression coefficients, but account for prior information and uncertainty.

Posterior Inclusion Probabilities (PIPs)

  • Represent the probability that feature \(j\) has a nonzero effect: \[ \text{PIP}_j = \Pr(\beta_j \neq 0 \mid \mathbf{Y}, \mathbf{X}) \]

  • Quantifies evidence of inclusion in the model:

    • High PIP → strong support for inclusion
    • Low PIP → likely irrelevant or redundant feature
  • Interpretation:

    • PIPs act as Bayesian significance measures
    • Useful for ranking, fine-mapping, and feature prioritization

In pathway analyses (e.g., Bayesian MAGMA), PIPs correspond to gene- or set-level importance scores.

Variance Component Estimates (\(\tau^2\), \(\sigma^2\))

Variance components describe uncertainty and heterogeneity in effect sizes:

\[ \beta_j \sim \mathcal{N}(0, \tau^2) \qquad \text{and} \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \]

\(\tau^2\) is the variance component for the effect sizes:
  • Small \(\tau^2\): most effects are close to zero
  • Large \(\tau^2\): more large-effect features expected

\(\sigma^2\) is the variance component for the residuals:
  • Captures unexplained variation after accounting for \(\mathbf{X}\)

Together, these components describe the genetic architecture, enabling heritability estimation, uncertainty quantification, and enrichment analysis across sets or traits.
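As a back-of-envelope example, assuming standardized and uncorrelated features (an assumption, not part of the model above), the variance contributed by \(\mathbf{X}\boldsymbol{\beta}\) is approximately \(p\,\tau^2\), so the proportion of variance explained can be sketched as:

```python
# Proportion of variance explained under standardized, uncorrelated predictors
# (assumed): Var(X beta) ~ p * tau2. All values illustrative.
p, tau2, sigma2 = 1000, 1e-4, 0.9
pve = p * tau2 / (p * tau2 + sigma2)
print(f"PVE = {pve:.2f}")    # 0.10 with these values
```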

Interpreting the Outputs Together

| Quantity | Interpretation | Typical use |
|---|---|---|
| \(\hat{\beta}_j\) | Direction and magnitude of effect | Effect estimation, prediction |
| \(\text{PIP}_j\) | Probability feature is truly associated | Fine-mapping, feature ranking |
| \(\tau^2\), \(\sigma^2\) | Variance in effect sizes and residuals | Genetic architecture, model fit |

These summaries are synergistic:

  • \(\boldsymbol{\beta}\) tells how much
  • PIP tells how confident
  • Variance components tell how complex

Summary of BLR Model Structures

| Model Type | Prior Structure | Biological Interpretation |
|---|---|---|
| Single-component BLR | One global variance \(\tau^2\) | All features (across layers) share the same level of shrinkage — equal contribution assumption |
| Multiple-component BLR | Mixture of variances \(\{\tau_k^2\}\) | Features belong to different effect-size classes (e.g., large, small, null); grouping learned from data |
| Hierarchical (biologically informed) BLR | Group-specific mixtures of variances \(\{\tau_{gk}^2\}\) | Features grouped a priori (e.g., by genes, pathways, or omic layers); within each group, effects can vary in size and sparsity |

These models form a hierarchy of increasing flexibility and biological realism —
from global shrinkage → to data-driven heterogeneity → to biologically structured mixtures that model variation within and between feature sets.

Motivation for Multivariate BLR

Many traits and molecular layers are correlated — they share genetic architecture and biological pathways.

To model these dependencies, we extend BLR to the multivariate setting:

  • Jointly models multiple traits or omic layers
    → captures shared genetic or molecular effects
  • Borrows strength across correlated traits
    → improves fine-mapping resolution and prediction accuracy
  • Estimates cross-trait effect patterns
    → helps identify pleiotropic genes and shared biological pathways

The Multivariate Bayesian Linear Regression (MV-BLR) Model

In the multivariate BLR, multiple correlated outcomes are modeled jointly:

\[ \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E} \]

  • \(\mathbf{Y}\): \((n \times T)\) matrix of outcomes
    (e.g., association measures for \(T\) traits or omic layers)
  • \(\mathbf{X}\): \((n \times p)\) feature matrix
  • \(\mathbf{B}\): \((p \times T)\) matrix of effect sizes
  • \(\mathbf{E}\): \((n \times T)\) residual matrix

Each row of \(\mathbf{Y}\) corresponds to an observation or gene, and each column to a trait, phenotype, or molecular layer.

Multivariate Error and Effect Priors

We extend the univariate priors to the multivariate setting:

\[ \mathbf{e}_{i\cdot} \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_e) \] \[ \mathbf{b}_j \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_b) \]

  • \(\boldsymbol{\Sigma}_e\): residual covariance among traits
  • \(\boldsymbol{\Sigma}_b\): covariance of effect sizes across traits
  • When \(\boldsymbol{\Sigma}_e\) and \(\boldsymbol{\Sigma}_b\) are diagonal, the model reduces to \(T\) independent univariate BLR models.

This structure allows information sharing across correlated traits or omic layers and can be used to identify pleiotropic effects and cross-trait genetic architectures.
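A minimal simulation of this setup, with illustrative dimensions and covariance matrices (here traits 1 and 2 share effects through \(\boldsymbol{\Sigma}_b\)):

```python
# Simulating Y = X B + E with correlated effects and residuals (illustrative).
import numpy as np

rng = np.random.default_rng(5)
n, p, T = 400, 30, 3

Sigma_b = 0.01 * np.array([[1.0, 0.8, 0.0],
                           [0.8, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])    # traits 1-2 share effects
Sigma_e = np.eye(T)                             # independent residuals

X = rng.standard_normal((n, p))
B = rng.multivariate_normal(np.zeros(T), Sigma_b, size=p)   # p x T effects
E = rng.multivariate_normal(np.zeros(T), Sigma_e, size=n)   # n x T residuals
Y = X @ B + E

print(np.corrcoef(B, rowvar=False).round(2))    # cross-trait effect correlation
```

Setting both covariance matrices to diagonal in such a simulation recovers the \(T\) independent univariate models noted above.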

Multivariate BLR (Structured MV-BLR)

The hierarchical structure can be extended to model multiple traits
while preserving biological grouping of features:

\[ \mathbf{b}_j \sim \mathcal{N}_T\!\big(\mathbf{0}, \boldsymbol{\Sigma}_{b, g(j)}\big), \qquad \boldsymbol{\Sigma}_{b, g} \sim p(\boldsymbol{\Sigma}_{b, g}) \]

  • Each biological group \(g\) has its own trait-level covariance matrix \(\boldsymbol{\Sigma}_{b,g}\)
  • \(\boldsymbol{\Sigma}_{b,g}\) captures correlations and scale of effects across traits within that group

This enables information sharing both within biological sets and across correlated traits.

Indicator Variables and PIPs (Multivariate BLR)

In the multivariate BLR, each feature \(j\) may affect multiple outcomes (traits).
We extend the indicator variable to capture cross-trait activity patterns:

\[ \boldsymbol{\delta}_j = \begin{bmatrix} \delta_{j1} \\ \delta_{j2} \\ \vdots \\ \delta_{jT} \end{bmatrix}, \qquad \delta_{jt} = \begin{cases} 1, & \text{if feature $j$ affects trait $t$} \\ 0, & \text{otherwise.} \end{cases} \]

After inference, we estimate \(\text{PIP}_{jt} = P(\delta_{jt} = 1 \mid \text{data})\) — the posterior inclusion probability that feature \(j\) affects trait \(t\).
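Given posterior draws of the indicator array from an MCMC run, both \(\text{PIP}_{jt}\) and a feature-level \(\text{PIP}_j\) (the probability that feature \(j\) affects at least one trait, used in the table below) are simple posterior averages. A sketch with simulated stand-in draws:

```python
# Estimating multivariate PIPs from indicator samples (n_samples x p x T).
# The draws here are random placeholders, not output of a real sampler.
import numpy as np

rng = np.random.default_rng(6)
n_samples, p, T = 2000, 5, 3
delta = rng.random((n_samples, p, T)) < 0.2   # stand-in indicator draws

pip_jt = delta.mean(axis=0)                   # p x T matrix of PIP_jt
pip_j = delta.any(axis=2).mean(axis=0)        # P(feature j affects >= 1 trait)
print(pip_jt.round(2), pip_j.round(2), sep="\n")
```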

Multivariate BLR Outputs

In the multivariate setting, we generalize each posterior quantity:

| Parameter | Interpretation |
|---|---|
| \(\mathbf{B} = [\beta_{jt}]\) | Effect matrix across traits (\(j\): feature, \(t\): trait) |
| \(\text{PIP}_j\) | Probability that feature \(j\) affects ≥ 1 trait |
| \(\boldsymbol{\Sigma}_b\) | Covariance of effects across traits |
| \(\boldsymbol{\Sigma}_e\) | Residual covariance among traits |

These allow us to identify:

  • Shared genetic effects (pleiotropy)
  • Trait-specific vs. shared signals
  • Cross-trait enrichment of biological sets

Overview of BLR Models used in Gene Set Analyses

| Model Type | Feature Integration | Grouping Basis | Prior Structure | What It Captures |
|---|---|---|---|---|
| Single-component BLR | Combines all biological features in one model | None | One global variance (\(\tau^2\)) | All features contribute equally; uniform shrinkage |
| Multiple-component BLR | Integrates all layers but allows heterogeneous contributions | Learned from data | Mixture of variances (\(\{\tau_k^2\}\)) | Large, small, and null effect classes |
| Hierarchical BLR | Groups features by biological structure (e.g., genes, pathways) | Defined a priori | Group-specific mixture of variances (\(\{\tau_{gk}^2\}\)) | Within-group heterogeneity; enrichment and structured shrinkage |
| Multivariate BLR | Jointly models multiple correlated traits or outcomes | None or by trait | Shared covariance (\(\boldsymbol{\Sigma}_b\)) across traits | Genetic/molecular correlations; pleiotropy |
| Hierarchical MV-BLR | Combines biological grouping and multiple outcomes | Defined a priori | Group- and trait-specific covariance mixtures (\(\{\boldsymbol{\Sigma}_{b,gk}\}\)) | Shared biological mechanisms across traits and layers |

Learning at Different Levels

| Model Level | Key Parameters Learned | What They Represent | How They Are Learned | What We Learn Biologically |
|---|---|---|---|---|
| Effect sizes | \(\boldsymbol{\beta}\) | Strength and direction of association for each feature | Posterior mean/median given priors and data | Which features drive the outcome |
| Indicator variables | \(\delta_j\) (single trait), \(\boldsymbol{\delta}_j\) (multi-trait) | Whether feature \(j\) is active (and for which traits) | Estimated as posterior inclusion probabilities (PIPs) | Which features are relevant, and whether effects are shared or trait-specific |
| Variance components | \(\tau^2\), \(\{\tau_k^2\}\), \(\{\tau_{gk}^2\}\) | Magnitude of expected effect sizes; heterogeneity across layers or groups | Inferred hierarchically from the data (via MCMC or EM) | How strongly different groups or omic layers contribute |
| Covariance components | \(\boldsymbol{\Sigma}_b\), \(\{\boldsymbol{\Sigma}_{b,g}\}\) | Correlation of effects across traits or molecular layers | Estimated from the joint posterior | Shared pathways, pleiotropy, and cross-layer architecture |

Estimating Variance Components During Model Fitting

In Bayesian Linear Regression, the variance components are not fixed — they are estimated jointly with the effect sizes \(\boldsymbol{\beta}\).

  • The model defines hierarchical priors: \[ \beta_j \sim \mathcal{N}(0, \tau^2) \qquad \text{and} \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \] where \(\tau^2\) and \(\sigma^2\) are unknown parameters.

  • These variances are updated iteratively during MCMC or EM optimization:

    • \(\tau^2\) reflects the inferred spread of true effects
    • \(\sigma^2\) reflects residual noise or unexplained variability

Inference alternates between sampling (or updating) \(\boldsymbol{\beta}\) and re-estimating the variance components given the data and current effects.
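A compact Gibbs-sampler sketch of this scheme for the single-component model is shown below. It uses the scaled inverse-χ² conditionals implied by the hyperpriors in the next section; the single-site effect updates and all hyperparameter defaults are illustrative choices, not a specific published implementation.

```python
# Gibbs sampler sketch for y = X beta + eps with beta_j ~ N(0, tau2),
# tau2 ~ Inv-chi^2(nu_tau, S2_tau), sigma2 ~ Inv-chi^2(nu_sig, S2_sig).
import numpy as np

def blr_gibbs(y, X, n_iter=2000, nu_tau=4.0, S2_tau=0.01,
              nu_sig=4.0, S2_sig=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    tau2, sigma2 = S2_tau, S2_sig
    xtx = np.sum(X ** 2, axis=0)              # x_j' x_j for each feature
    resid = y.astype(float).copy()            # residuals y - X beta (beta = 0)
    out = {"beta": [], "tau2": [], "sigma2": []}

    for _ in range(n_iter):
        # 1) single-site effect updates: beta_j | rest ~ N(m_j, v_j)
        for j in range(p):
            resid += X[:, j] * beta[j]        # remove j's contribution
            denom = xtx[j] + sigma2 / tau2
            m = X[:, j] @ resid / denom
            beta[j] = rng.normal(m, np.sqrt(sigma2 / denom))
            resid -= X[:, j] * beta[j]        # add back the new draw
        # 2) tau2 | beta ~ scaled Inv-chi^2 with df nu_tau + p
        tau2 = (nu_tau * S2_tau + beta @ beta) / rng.chisquare(nu_tau + p)
        # 3) sigma2 | rest ~ scaled Inv-chi^2 with df nu_sig + n
        sigma2 = (nu_sig * S2_sig + resid @ resid) / rng.chisquare(nu_sig + n)
        out["beta"].append(beta.copy())
        out["tau2"].append(tau2)
        out["sigma2"].append(sigma2)
    return {k: np.array(v) for k, v in out.items()}
```

After discarding burn-in iterations, posterior means of the returned `tau2` and `sigma2` samples summarize the inferred effect-size spread and residual noise.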

Hierarchical Structure: Full Bayesian Estimation

Each variance component has its own prior, enabling uncertainty propagation:

\[ \begin{aligned} \mathbf{Y} &= \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, & \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2\mathbf{I}) \\ \beta_j &\sim \mathcal{N}(0, \tau^2) & \tau^2 \sim \text{Inv-}\chi^2(\nu_\tau, S_\tau^2) \\ \sigma^2 &\sim \text{Inv-}\chi^2(\nu_\sigma, S_\sigma^2) \end{aligned} \]

  • The model integrates over the uncertainty in \(\tau^2\) and \(\sigma^2\),
    not just point estimates
  • Posterior samples of these parameters describe:
    • The degree of polygenicity (via \(\tau^2\))
    • The signal-to-noise ratio (via \(\tau^2 / \sigma^2\))

This full hierarchical treatment improves stability and interpretability compared to fixing variance components a priori.

In Grouped or Multi-Component BLR Models

Variance components can be set-specific or mixture-specific,
and are all estimated from the data:

\[ \beta_j \sim \mathcal{N}\!\big(0, \tau_{g(j)}^2\big), \qquad \tau_g^2 \sim p(\tau_g^2) \]

or

\[ \beta_j \sim \sum_{k=1}^{K} \pi_k\,\mathcal{N}(0, \tau_k^2), \qquad \tau_k^2 \sim p(\tau_k^2) \]

  • Each \(\tau_g^2\) (or \(\tau_k^2\)) is estimated adaptively
  • Groups or components with strong evidence receive larger \(\tau^2\)
    (less shrinkage, more signal)
  • Noisy or irrelevant groups shrink toward smaller \(\tau^2\)

Variance components act as adaptive shrinkage parameters, controlling model complexity based on the data.

In the Multivariate BLR Model

Variance components generalize to covariance matrices:

\[ \mathbf{b}_j \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_b), \qquad \mathbf{e}_{i\cdot} \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_e) \]

  • \(\boldsymbol{\Sigma}_b\): covariance of effects across traits
    → estimated from shared signal among outcomes
  • \(\boldsymbol{\Sigma}_e\): residual covariance among traits
    → estimated from correlated noise or shared environment

Estimating \(\boldsymbol{\Sigma}_b\) and \(\boldsymbol{\Sigma}_e\) enables discovery of pleiotropy and cross-trait genetic structure.
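With an inverse-Wishart prior, the update of \(\boldsymbol{\Sigma}_b\) given the current effect matrix is conjugate. A minimal sketch with assumed prior values:

```python
# Conjugate update: with rows b_j ~ N_T(0, Sigma_b) and prior
# Sigma_b ~ IW(nu0, Psi), the full conditional is IW(nu0 + p, Psi + B'B).
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(7)
p, T = 200, 3
B = rng.multivariate_normal(np.zeros(T), 0.01 * np.eye(T), size=p)  # stand-in

nu0, Psi = T + 2, 0.01 * np.eye(T)     # illustrative prior values
Sigma_b = invwishart.rvs(df=nu0 + p, scale=Psi + B.T @ B, random_state=42)
print(np.round(Sigma_b, 4))
```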

Why Variance Component Estimation Matters

| Parameter | Role | Interpretation |
|---|---|---|
| \(\tau^2\) | Effect-size variance | How much true genetic signal exists |
| \(\sigma^2\) | Residual variance | How much variation remains unexplained |
| \(\tau_g^2\) / \(\tau_k^2\) | Group or component variance | Which sets/components are enriched |
| \(\boldsymbol{\Sigma}_b\) | Cross-trait covariance | Pleiotropy or shared mechanisms |

These variance components are not tuning parameters — they are learned quantities that describe the underlying biology. Their estimation is central to the interpretability of BLR.

Summary: Variance Components as Model-Driven Insights

  • Estimated jointly with effect sizes and inclusion probabilities
  • Control the degree of shrinkage and complexity in the model
  • Reveal biological structure (e.g., gene-set enrichments, trait correlations)
  • Provide model-based evidence for:
    • Polygenicity
    • Pleiotropy
    • Biological pathway relevance

In short, variance components are learned descriptors of architecture, not just technical parameters, and they are a cornerstone of the BLR framework.

References

Sørensen P, Rohde PD. A Versatile Data Repository for GWAS Summary Statistics-Based Downstream Genomic Analysis of Human Complex Traits.
medRxiv (2025). https://doi.org/10.1101/2025.10.01.25337099

Sørensen IF, Sørensen P. Privacy-Preserving Multivariate Bayesian Regression Models for Overcoming Data Sharing Barriers in Health and Genomics.
medRxiv (2025). https://doi.org/10.1101/2025.07.30.25332448

Hjelholt AJ, Gholipourshahraki T, Bai Z, Shrestha M, Kjølby M, Sørensen P, Rohde P. Leveraging Genetic Correlations to Prioritize Drug Groups for Repurposing in Type 2 Diabetes. medRxiv (2025). https://doi.org/10.1101/2025.06.13.25329590

Gholipourshahraki T, Bai Z, Shrestha M, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Evaluation of Bayesian Linear Regression Models for Gene Set Prioritization in Complex Diseases. PLOS Genetics 20(11): e1011463 (2025). https://doi.org/10.1371/journal.pgen.1011463

Bai Z, Gholipourshahraki T, Shrestha M, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Evaluation of Bayesian Linear Regression Derived Gene Set Test Methods. BMC Genomics 25(1): 1236 (2024). https://doi.org/10.1186/s12864-024-11026-2

Shrestha M, Bai Z, Gholipourshahraki T, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Enhanced Genetic Fine Mapping Accuracy with Bayesian Linear Regression Models in Diverse Genetic Architectures. PLOS Genetics 21(7): e1011783 (2025). https://doi.org/10.1371/journal.pgen.1011783

Kunkel D, Sørensen P, Shankar V, Morgante F. Improving Polygenic Prediction from Summary Data by Learning Patterns of Effect Sharing Across Multiple Phenotypes. PLOS Genetics 21(1): e1011519 (2025). https://doi.org/10.1371/journal.pgen.1011519

Rohde P, Sørensen IF, Sørensen P. Expanded Utility of the R Package qgg with Applications within Genomic Medicine. Bioinformatics 39:11 (2023). https://doi.org/10.1093/bioinformatics/btad656

Rohde P, Sørensen IF, Sørensen P. qgg: An R Package for Large-Scale Quantitative Genetic Analyses. Bioinformatics 36(8): 2614–2615 (2020). https://doi.org/10.1093/bioinformatics/btz955