Center for Quantitative Genetics and Genomics, Aarhus University, Denmark
Genomic Medicine, Department of Health Science and Technology, Aalborg University, Denmark

Gene set analysis evaluates the coordinated action of genes or sets of variants within predefined biological pathways or functional groups.
GWAS identify single genetic variants (SNPs) associated with traits or diseases.
Many variants have small individual effects
→ Use larger datasets or make better use of existing data.
Some effects are clustered within functionally related genes or pathways
→ Use prior information on functional marker groups to improve detection power and interpretation.
Some effects are shared across multiple traits
→ Leverage correlated trait information to enhance detection power and prediction accuracy.
Many different gene set analysis approaches have been proposed.
MAGMA: Multi-marker Analysis of GenoMic Annotation (de Leeuw et al., 2015)
PoPS: Polygenic Priority Score (Weeks et al., 2023)
MAGMA fits a linear regression model to test associations between gene sets and traits.
When analyzing thousands of gene sets, several issues arise:
- Overfitting – the model may capture noise rather than true signals.
- Multicollinearity – many gene sets are correlated due to biological overlap.
- Multiple testing – increases false-positive risk.
- Interpretation difficulty – hard to disentangle contributions of overlapping sets.
→ Use regularization and variable selection to improve model robustness and interpretability.
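The multicollinearity issue can be illustrated with a small simulation. This is a minimal sketch, not MAGMA itself (which also adjusts for gene size, gene density, and gene-gene correlation). It assumes gene-level statistics as the response and a binary gene-by-set membership matrix as predictors; all names, sizes, and the ridge penalty are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes, n_sets = 20000, 50
# Hypothetical binary membership matrix: X[g, s] = 1 if gene g is in set s.
X = (rng.random((n_genes, n_sets)) < 0.05).astype(float)
# Make set 1 overlap heavily with set 0 to mimic correlated gene sets.
X[:, 1] = np.where(rng.random(n_genes) < 0.7, X[:, 0], X[:, 1])

# Simulated gene-level statistics: only set 0 carries a true effect.
beta_true = np.zeros(n_sets)
beta_true[0] = 0.5
y = X @ beta_true + rng.standard_normal(n_genes)

# Center predictors and response so no explicit intercept is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Marginal fit: one gene set at a time (simple regression slopes).
beta_marginal = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)

# Joint ridge fit: solve (X'X + lambda * I) beta = X'y for all sets at once.
lam = 10.0  # illustrative penalty, not tuned
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_sets), Xc.T @ yc)

print("marginal slopes, sets 0 and 1:", beta_marginal[:2].round(2))
print("joint ridge,     sets 0 and 1:", beta_ridge[:2].round(2))
```

In the marginal fit, set 1 appears associated only because it shares genes with set 0; the joint fit attributes most of the signal to set 0.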
Additionally, many complex traits are genetically correlated, sharing overlapping biological pathways.
→ Incorporating multi-trait information in MAGMA can increase detection power and reveal shared genetic mechanisms across traits.
Develop and evaluate a Bayesian gene-set prioritization approach using BLR within the MAGMA framework.
The Bayesian MAGMA framework builds on the standard regression model:
\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}) \]
In the Bayesian formulation, each \(\beta_j\) is assigned a prior distribution that encodes assumptions about effect size magnitude, sparsity, or functional grouping.
These priors enable regularization, variable selection, and information sharing across correlated features or biological layers.
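One common prior of this kind is a spike-and-slab mixture, where each effect is either exactly zero or drawn from a normal distribution. The sketch below is an illustrative Gibbs sampler under that assumption, with the inclusion proportion `pi`, residual variance `sigma2`, and slab variance `tau2` held fixed for brevity; a full implementation would typically sample these as well. Function and variable names are hypothetical and are not taken from any specific package.

```python
import numpy as np


def gibbs_spike_slab(X, y, n_iter=600, burn_in=100, pi=0.1,
                     sigma2=1.0, tau2=0.05, seed=0):
    """Gibbs sampler for y = X beta + e with a spike-and-slab prior.

    Prior: beta_j ~ pi * N(0, tau2) + (1 - pi) * (point mass at 0).
    pi, sigma2 and tau2 are held fixed to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    xtx = (X ** 2).sum(axis=0)
    beta = np.zeros(p)
    resid = np.asarray(y, dtype=float).copy()  # equals y - X @ beta
    beta_sum = np.zeros(p)
    delta_sum = np.zeros(p)
    prior_log_odds = np.log(pi / (1.0 - pi))

    for it in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]  # add back feature j's contribution
            rhs = X[:, j] @ resid
            lhs = xtx[j] + sigma2 / tau2
            # Log posterior odds of inclusion with beta_j integrated out.
            log_odds = (prior_log_odds
                        + 0.5 * np.log(sigma2 / (tau2 * lhs))
                        + rhs ** 2 / (2.0 * sigma2 * lhs))
            p_incl = 1.0 / (1.0 + np.exp(-log_odds))
            if rng.random() < p_incl:
                # Slab: draw beta_j from its conditional normal posterior.
                beta[j] = rng.normal(rhs / lhs, np.sqrt(sigma2 / lhs))
                delta = 1.0
            else:
                # Spike: the effect is set exactly to zero.
                beta[j] = 0.0
                delta = 0.0
            resid -= X[:, j] * beta[j]
            if it >= burn_in:
                beta_sum[j] += beta[j]
                delta_sum[j] += delta

    n_kept = n_iter - burn_in
    return beta_sum / n_kept, delta_sum / n_kept


# Simulated example: 3 of 30 features have a true effect.
rng = np.random.default_rng(42)
n, p = 500, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = 0.5
y = X @ beta_true + rng.standard_normal(n)

beta_hat, pip = gibbs_spike_slab(X, y)
print("posterior mean effects (first 5):", beta_hat[:5].round(2))
print("inclusion frequencies  (first 5):", pip[:5].round(2))
```

The returned inclusion frequencies are the fraction of retained samples in which each feature was in the slab component, which is how variable selection falls out of the prior.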
In the multivariate BLR model, we model multiple correlated outcomes jointly:
\[ \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E} \]
Each row of \(\mathbf{Y}\) corresponds to an observation or gene, and each column to a trait, phenotype, or molecular layer.
We extend the univariate priors to the multivariate setting:
\[ \mathbf{e}_{i\cdot} \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_e) \]
\[ \mathbf{b}_{j\cdot} \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_b) \]
\(\boldsymbol{\Sigma}_e\): residual covariance among traits (\(\mathbf{e}_{i\cdot}\) is the vector of residuals for observation \(i\) across all \(T\) traits)
\(\boldsymbol{\Sigma}_b\): covariance of effect sizes across traits (\(\mathbf{b}_{j\cdot}\) is the vector of effect sizes for feature \(j\) across all \(T\) traits)
When the off-diagonal elements of \(\boldsymbol{\Sigma}_b\) and \(\boldsymbol{\Sigma}_e\) are nonzero, the model borrows information across correlated traits, enabling detection of pleiotropic effects and shared genetic factors.
When \(\boldsymbol{\Sigma}_e\) and \(\boldsymbol{\Sigma}_b\) are diagonal, the model reduces to \(T\) independent univariate BLR models.
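To make the role of the two covariance matrices concrete, the sketch below simulates data from the multivariate model. The dimensions and the entries of `Sigma_b` and `Sigma_e` are hypothetical values chosen for illustration: traits 0 and 1 share effects, while trait 2 is independent of both.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, T = 5000, 200, 3  # observations, features, traits (illustrative sizes)

# Effect covariance across traits: traits 0 and 1 are correlated, trait 2 is not.
Sigma_b = 0.01 * np.array([[1.0, 0.8, 0.0],
                           [0.8, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
# Residual covariance among traits.
Sigma_e = np.array([[1.0, 0.3, 0.0],
                    [0.3, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

X = rng.standard_normal((n, p))
# Each row b_j. of B is one draw from N_T(0, Sigma_b).
B = rng.multivariate_normal(np.zeros(T), Sigma_b, size=p)
# Each row e_i. of E is one draw from N_T(0, Sigma_e).
E = rng.multivariate_normal(np.zeros(T), Sigma_e, size=n)
Y = X @ B + E

# Empirical correlation of effects across traits (columns of B).
corr_b = np.corrcoef(B, rowvar=False)
print(corr_b.round(2))
```

Setting the off-diagonal entries of both matrices to zero would make the columns of `Y` independent given `X`, which corresponds to the case of \(T\) separate univariate models.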
Each feature \(j\) may affect multiple outcomes (traits).
We define an indicator vector for cross-trait activity patterns:
\[ \boldsymbol{\delta}_j = \begin{bmatrix} \delta_{j1} \\ \delta_{j2} \\ \vdots \\ \delta_{jT} \end{bmatrix}, \qquad \delta_{jt} = \begin{cases} 1, & \text{if feature $j$ affects trait $t$}, \\ 0, & \text{otherwise.} \end{cases} \]
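With \(T\) traits, the indicator vector can take \(2^T\) distinct values. A short sketch enumerating them, with \(T = 3\) as an arbitrary example and the labels chosen here only for illustration:

```python
from itertools import product

T = 3  # hypothetical number of traits

# Every possible indicator vector delta_j, as a tuple of 0/1 entries.
patterns = list(product((0, 1), repeat=T))

for pattern in patterns:
    n_active = sum(pattern)
    if n_active == 0:
        label = "null (no trait affected)"
    elif n_active == T:
        label = "shared by all traits"
    else:
        label = "affects a subset of traits"
    print(pattern, label)
```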
After Gibbs sampling, the posterior inclusion probability (PIP) is estimated as:
\[ \widehat{\text{PIP}}_{jt} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}\!\left(\delta_{jt}^{(m)} = 1\right) \approx P(\delta_{jt} = 1 \mid \text{data}), \]
representing the probability that feature \(j\) is associated with trait \(t\).
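The PIP estimate is simply an average over the sampled indicators. The sketch below assumes the Gibbs output is stored as an array of shape (samples, features, traits); the tiny hand-made array and the two derived summaries (active for any trait, active for all traits) are illustrative additions.

```python
import numpy as np

# Hypothetical sampled indicators: M = 4 samples, J = 2 features, T = 2 traits.
delta = np.array([
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 1]],
    [[1, 1], [0, 0]],
])

# PIP_jt: fraction of samples in which feature j is active for trait t.
pip = delta.mean(axis=0)

# Probability that feature j is active for at least one trait.
pip_any = delta.any(axis=2).mean(axis=0)

# Probability that feature j is active for all traits simultaneously.
pip_all = delta.all(axis=2).mean(axis=0)

print(pip)      # [[1.   0.75] [0.   0.25]]
print(pip_any)  # [1.   0.25]
print(pip_all)  # [0.75 0.  ]
```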
In the multivariate setting, we generalize each posterior quantity:
| Parameter | Interpretation |
|---|---|
| \(\mathbf{B} = [b_{jt}]\) | Effect matrix across traits (\(j\): feature, \(t\): trait); row \(j\) is \(\mathbf{b}_{j\cdot}\) |
| \(\mathbf{PIP} = [\text{PIP}_{jt}]\) | Posterior inclusion probability matrix (\(j\): feature, \(t\): trait) |
| \(\boldsymbol{\Sigma}_b\) | Covariance of effects across traits |
| \(\boldsymbol{\Sigma}_e\) | Residual covariance among traits |
These posterior quantities allow us to identify:
Evaluate a Bayesian gene-set prioritization approach using BLR within the MAGMA framework.
Simulation study:
- Assessed model performance under varying gene set characteristics and genetic architectures.
- Used UK Biobank genetic data for realistic evaluation.
Comparative analysis:
- Benchmarked Bayesian MAGMA against the standard MAGMA approach.
Applications:
- Applied to nine complex traits using publicly available GWAS data.
- Developed a multi-trait BLR model to integrate GWAS results across traits and uncover shared genetic architecture.

Gholipourshahraki et al., 2024

Compute gene-level (or other feature-level) association statistics:
Bai et al., 2024

Gholipourshahraki et al., 2024

Hjelholt et al., 2025

Advantages
Limitations
Future Work
Sørensen P, Rohde PD. A Versatile Data Repository for GWAS Summary Statistics-Based Downstream Genomic Analysis of Human Complex Traits. medRxiv (2025). https://doi.org/10.1101/2025.10.01.25337099
Sørensen IF, Sørensen P. Privacy-Preserving Multivariate Bayesian Regression Models for Overcoming Data Sharing Barriers in Health and Genomics. medRxiv (2025). https://doi.org/10.1101/2025.07.30.25332448
Hjelholt AJ, Gholipourshahraki T, Bai Z, Shrestha M, Kjølby M, Sørensen P, Rohde P. Leveraging Genetic Correlations to Prioritize Drug Groups for Repurposing in Type 2 Diabetes. medRxiv (2025). https://doi.org/10.1101/2025.06.13.25329590
Gholipourshahraki T, Bai Z, Shrestha M, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Evaluation of Bayesian Linear Regression Models for Gene Set Prioritization in Complex Diseases. PLOS Genetics 20(11): e1011463 (2024). https://doi.org/10.1371/journal.pgen.1011463
Bai Z, Gholipourshahraki T, Shrestha M, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Evaluation of Bayesian Linear Regression Derived Gene Set Test Methods. BMC Genomics 25(1): 1236 (2024). https://doi.org/10.1186/s12864-024-11026-2
Shrestha M, Bai Z, Gholipourshahraki T, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Enhanced Genetic Fine Mapping Accuracy with Bayesian Linear Regression Models in Diverse Genetic Architectures. PLOS Genetics 21(7): e1011783 (2025). https://doi.org/10.1371/journal.pgen.1011783
Kunkel D, Sørensen P, Shankar V, Morgante F. Improving Polygenic Prediction from Summary Data by Learning Patterns of Effect Sharing Across Multiple Phenotypes. PLOS Genetics 21(1): e1011519 (2025). https://doi.org/10.1371/journal.pgen.1011519
Rohde P, Sørensen IF, Sørensen P. Expanded Utility of the R Package qgg with Applications within Genomic Medicine. Bioinformatics 39(11): btad656 (2023). https://doi.org/10.1093/bioinformatics/btad656
Rohde P, Sørensen IF, Sørensen P. qgg: An R Package for Large-Scale Quantitative Genetic Analyses. Bioinformatics 36(8): 2614–2615 (2020). https://doi.org/10.1093/bioinformatics/btz955
| Model Type | Feature Integration | Grouping Basis | Prior Structure | What It Captures |
|---|---|---|---|---|
| Single-component BLR | Combines all biological features in one model | None | One global variance (\(\tau^2\)) | All features contribute equally; uniform shrinkage |
| Multiple-component BLR | Integrates all layers but allows heterogeneous contributions | Learned from data | Mixture of variances (\(\{\tau_k^2\}\)) | Large, small, and null effect classes |
| Hierarchical BLR | Groups features by biological structure (e.g., genes, pathways) | Defined a priori | Group-specific mixture of variances (\(\{\tau_{gk}^2\}\)) | Within-group heterogeneity; enrichment and structured shrinkage |
| Multivariate BLR | Jointly models multiple correlated traits or outcomes | None or by trait | Shared covariance (\(\boldsymbol{\Sigma}_b\)) across traits | Genetic/molecular correlations; pleiotropy |
| Hierarchical MV-BLR | Combines biological grouping and multiple outcomes | Defined a priori | Group- and trait-specific covariance mixtures (\(\{\boldsymbol{\Sigma}_{b,gk}\}\)) | Shared biological mechanisms across traits and layers |
| Model Level | Key Parameters Learned | What They Represent | How They Are Learned | What We Learn Biologically |
|---|---|---|---|---|
| Effect sizes | \(\boldsymbol{\beta}\) | Strength and direction of association for each feature | Posterior mean/median given priors and data | Which features drive the outcome |
| Indicator variables | \(\delta_j\) (single trait), \(\boldsymbol{\delta}_j\) (multi-trait) | Whether feature \(j\) is active (and for which traits) | Estimated as posterior inclusion probabilities (PIPs) | Which features are relevant, and whether effects are shared or trait-specific |
| Variance components | \(\tau^2\), \(\{\tau_k^2\}\), \(\{\tau_{gk}^2\}\) | Magnitude of expected effect sizes; heterogeneity across layers or groups | Inferred hierarchically from the data (via MCMC or EM) | How strongly different groups or omic layers contribute |
| Covariance components | \(\boldsymbol{\Sigma}_b\), \(\{\boldsymbol{\Sigma}_{b,g}\}\) | Correlation of effects across traits or molecular layers | Estimated from joint posterior | Shared pathways, pleiotropy, and cross-layer architecture |