Genomic Medicine, Department of Health Science and Technology, Aalborg University, Denmark
Center for Quantitative Genetics and Genomics, Aarhus University, Denmark
gact provides an infrastructure for efficient processing of large-scale genomic association data, with core functions for:
gact is intended to serve as a practical implementation of integrative genomics, bridging statistical modeling and biological interpretation, and supporting reproducible and extensible workflows.
The gact() function is a single R command that creates and populates the Genomic Association of Complex Traits (GACT) database.
It automates three main tasks:
glist, gstat, gsets, marker, gtex, download, etc.)
gact constructs gene and marker sets from a wide range of curated biological databases:
We plan to add further biological resources to gact.
The gact R package includes utility functions to extract and structure data from the GACT database into analysis-ready inputs — \(\mathbf{Y}\) (e.g., summary statistic outcomes) and \(\mathbf{X}\) (genomic or biological features).
- getMarkerStat(): retrieve GWAS summary statistics (Y’s)
- getFeatureStat(): extract gene-, protein-, or pathway-level results (Y’s)
- getMarkerSets(): define biological groupings (basis for X’s)
- designMatrix(): build feature matrices (X) linking variants or genes to biological feature sets

Together, these functions form a reproducible workflow for generating standardized input data for Bayesian Hierarchical Models and other machine learning approaches.
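To illustrate what a feature matrix of this kind looks like, the sketch below builds a binary gene-by-set indicator matrix in plain Python. It does not call or reproduce the gact API: the function name `build_design_matrix`, the gene identifiers, and the set names are all hypothetical.

```python
def build_design_matrix(genes, gene_sets):
    """Return (X, set_names); X[i][k] is 1 if genes[i] belongs to set k, else 0."""
    set_names = sorted(gene_sets)
    members = {name: set(gene_sets[name]) for name in set_names}
    X = [[1 if gene in members[name] else 0 for name in set_names]
         for gene in genes]
    return X, set_names


# Hypothetical toy input (not taken from the GACT database)
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
gene_sets = {
    "pathway_1": ["GENE_A", "GENE_B"],
    "pathway_2": ["GENE_B", "GENE_D"],
}

X, set_names = build_design_matrix(genes, gene_sets)
for gene, row in zip(genes, X):
    print(gene, row)
```

Each row is a gene and each column a feature set, so a gene that belongs to several sets (here GENE_B) has several non-zero entries.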
Bayesian Hierarchical Models provide a flexible statistical framework for modeling complex biological and healthcare data.
They support key applications such as:
These models build directly on the Y’s and X’s generated by gact, providing a unified framework for integrative genomic analysis.
The Bayesian Linear Regression (BLR) model builds on the standard linear model:
\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}) \]
In the Bayesian formulation, each \(\beta_j\) is assigned a prior distribution that reflects beliefs about effect-size magnitude or sparsity and determines how information is shared across features or biological layers.
Regression effects can be estimated in many ways, but we focus on a Bayesian hierarchical framework because it:
Through their hierarchical structure, BLR models naturally integrate multiple biological layers, linking genomic, transcriptomic, and other molecular data.
The three levels in the model:
| Level | Description | Example |
|---|---|---|
| 1 | Describes how data are generated given parameters | \(y \sim \mathcal{N}(Xb, \sigma^2 I)\) |
| 2 | Describes our beliefs about the parameters before seeing data | \(b_i \sim \mathcal{N}(0, \sigma_b^2)\) |
| 3 | Describes uncertainty about the prior’s parameters | \(\sigma_b^2 \sim \text{Inv-}\chi^2(\nu, S^2)\) |
This hierarchical structure allows the model to learn how strongly to shrink effect estimates from the data, while accounting for uncertainty in prior parameters and automatically regularizing effect sizes.
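The three levels can be made concrete with a small Gibbs sampler. The sketch below is a minimal illustration on simulated data, not the algorithm implemented in qgg: it assumes a scaled inverse chi-square prior on the residual variance as well (reusing the same hypothetical hyperparameters `nu` and `S2` for brevity), and the simulated effects in `true_b` are invented.

```python
import random


def simulate_data(n, true_b, rng):
    """Simulate y = Xb + e with standard-normal features and unit residual variance."""
    X = [[rng.gauss(0.0, 1.0) for _ in true_b] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, true_b)) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


def gibbs_blr(X, y, n_iter=600, burn_in=100, nu=4.0, S2=0.1, seed=1):
    """Gibbs sampler for y ~ N(Xb, sigma2 I), b_j ~ N(0, sigma2_b),
    with scaled inverse chi-square priors on both variances (an assumption)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(m)]
    xtx = [sum(v * v for v in col) for col in cols]
    b = [0.0] * m
    resid = list(y)  # residuals y - Xb; b starts at zero
    sigma2, sigma2_b = 1.0, 1.0
    b_sum, sigma2_sum, sigma2_b_sum, kept = [0.0] * m, 0.0, 0.0, 0
    for it in range(n_iter):
        # Levels 1 and 2: draw each effect from its normal full conditional
        for j, col in enumerate(cols):
            rhs = sum(c * r for c, r in zip(col, resid)) + xtx[j] * b[j]
            lhs = xtx[j] + sigma2 / sigma2_b  # ratio controls the shrinkage
            new_b = rng.gauss(rhs / lhs, (sigma2 / lhs) ** 0.5)
            shift = new_b - b[j]
            resid = [r - c * shift for c, r in zip(col, resid)]
            b[j] = new_b
        # Level 3: draw the effect variance given the current effects
        ssb = sum(v * v for v in b)
        sigma2_b = (nu * S2 + ssb) / rng.gammavariate((nu + m) / 2.0, 2.0)
        # Residual variance given the current residuals
        sse = sum(r * r for r in resid)
        sigma2 = (nu * S2 + sse) / rng.gammavariate((nu + n) / 2.0, 2.0)
        if it >= burn_in:
            kept += 1
            b_sum = [s + v for s, v in zip(b_sum, b)]
            sigma2_sum += sigma2
            sigma2_b_sum += sigma2_b
    return [s / kept for s in b_sum], sigma2_sum / kept, sigma2_b_sum / kept


true_b = [1.0, -0.5, 0.0, 0.0, 0.8]  # hypothetical simulated effects
X, y = simulate_data(200, true_b, random.Random(42))
post_b, post_sigma2, post_sigma2_b = gibbs_blr(X, y)
print([round(v, 2) for v in post_b], round(post_sigma2, 2), round(post_sigma2_b, 2))
```

The ratio `sigma2 / sigma2_b` is the shrinkage term: because the effect variance is itself sampled, the amount of shrinkage is learned from the data rather than fixed in advance.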
Simple and robust, but may not capture diverse effect-size distributions.
Complex traits arise from heterogeneous effect-size distributions — some features have large effects, many have small, and others are likely null. To capture this diversity, the BLR framework can be extended in two ways:
Data-driven grouping of molecular features:
The model learns effect-size classes from the data using a mixture of variances \(\{\tau_k^2\}\) with probabilities \(\{\pi_k\}\).
Biologically informed grouping of molecular features:
The model uses prior biological knowledge to assign features to groups a priori, each with its own variance \(\tau_g^2\) capturing within-group variability.
Both approaches enable the model to adapt to complex genetic and molecular architectures and share information across related features or omic layers.
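One common way to write these two priors is sketched below. The number of classes \(K\) and the group map \(g(j)\), which assigns feature \(j\) to its group, are notational assumptions rather than settings taken from gact or qgg.

\[ \beta_j \mid \boldsymbol{\pi}, \{\tau_k^2\} \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}(0, \tau_k^2), \qquad \sum_{k=1}^{K} \pi_k = 1 \]

\[ \beta_j \mid \tau_{g(j)}^2 \sim \mathcal{N}\big(0, \tau_{g(j)}^2\big) \]

Setting one class variance to zero, for example \(\tau_1^2 = 0\), places a point mass at zero and so represents null features.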
In Bayesian variable selection, each feature \(j\) is assigned an indicator variable:
\[ \delta_j = \begin{cases} 1, & \text{if feature $j$ has a non-zero effect} \\ 0, & \text{if feature $j$ has no effect.} \end{cases} \]
After inference, we estimate the posterior inclusion probability for feature \(j\): \(\text{PIP}_j = P(\delta_j = 1 \mid \text{data})\).
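When the model is fitted by MCMC, a PIP is typically estimated as the fraction of posterior draws in which the indicator equals 1. The sketch below assumes the indicator draws are available as a list of 0/1 vectors; the function name and the toy draws are hypothetical.

```python
def posterior_inclusion_probabilities(indicator_draws):
    """Average 0/1 indicator draws over MCMC iterations, one PIP per feature."""
    n_draws = len(indicator_draws)
    n_features = len(indicator_draws[0])
    return [sum(draw[j] for draw in indicator_draws) / n_draws
            for j in range(n_features)]


# Hypothetical draws: 4 MCMC iterations (rows) for 3 features (columns)
draws = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]

pips = posterior_inclusion_probabilities(draws)
print(pips)  # [1.0, 0.25, 0.25]
```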
| Model Type | Prior Structure | Biological Interpretation |
|---|---|---|
| Single-component BLR | One global variance \(\tau^2\) | All features (across layers) share the same level of shrinkage — equal contribution assumption |
| Multiple-component BLR | Mixture of variances \(\{\tau_k^2\}\) | Features belong to different effect-size classes (e.g., large, small, null); grouping learned from data |
| Hierarchical (Biologically informed) BLR | Group-specific mixtures of variances \(\{\tau_{gk}^2\}\) | Features grouped a priori (e.g., by genes, pathways, or omic layers); within each group, effects can vary in size and sparsity |
These models form a hierarchy of increasing flexibility and biological realism: from global shrinkage, to data-driven heterogeneity, to biologically structured mixtures that model variation within and between feature sets.
Many traits and molecular layers are correlated — they share genetic architecture and biological pathways.
To model these dependencies, we extend BLR to the multivariate setting and model multiple correlated outcomes jointly:
\[ \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E} \]
Each row of \(\mathbf{Y}\) corresponds to an observation or gene, and each column to a trait, phenotype, or molecular layer.
We extend the univariate priors to the multivariate setting:
\[ \mathbf{e}_{i\cdot} \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_e) \]
\[ \mathbf{b}_j \sim \mathcal{N}_T(\mathbf{0}, \boldsymbol{\Sigma}_b) \]
This formulation allows information sharing across correlated traits or omic layers and can be used to identify pleiotropic effects and cross-trait genetic architectures.
The hierarchical structure can be extended to model multiple traits while preserving the biological grouping of features:
\[ \mathbf{b}_j \sim \mathcal{N}_T\!\big(\mathbf{0}, \boldsymbol{\Sigma}_{b, g(j)}\big), \qquad \boldsymbol{\Sigma}_{b, g} \sim p(\boldsymbol{\Sigma}_{b, g}) \]
This enables information sharing both within biological sets and across correlated traits.
In the multivariate BLR, each feature \(j\) may affect multiple outcomes (traits).
We extend the indicator variable to capture cross-trait activity patterns:
\[ \boldsymbol{\delta}_j = \begin{bmatrix} \delta_{j1} \\ \delta_{j2} \\ \vdots \\ \delta_{jT} \end{bmatrix}, \qquad \delta_{jt} = \begin{cases} 1, & \text{if feature $j$ affects trait $t$} \\ 0, & \text{otherwise.} \end{cases} \]
After inference, we estimate the posterior inclusion probability that feature \(j\) affects trait \(t\): \(\text{PIP}_{jt} = P(\delta_{jt} = 1 \mid \text{data})\).
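Because the indicator is a vector, joint posterior draws also show whether an effect is shared or trait-specific. The sketch below summarizes hypothetical draws of \(\boldsymbol{\delta}_j\) for a single feature and two traits; the function name and the draws are illustrative assumptions.

```python
def multi_trait_summary(delta_draws):
    """Summarize joint 0/1 indicator draws for one feature across T traits."""
    n_draws = len(delta_draws)
    n_traits = len(delta_draws[0])
    # Per-trait PIP: fraction of draws in which the feature affects trait t
    pip = [sum(d[t] for d in delta_draws) / n_draws for t in range(n_traits)]
    # Probability that the feature affects every trait (shared effect)
    p_all = sum(1 for d in delta_draws if all(d)) / n_draws
    # Probability that the feature affects at least one trait
    p_any = sum(1 for d in delta_draws if any(d)) / n_draws
    return pip, p_all, p_any


# Hypothetical draws of (delta_j1, delta_j2) over 5 MCMC iterations
draws_j = [(1, 1), (1, 0), (1, 1), (0, 0), (1, 1)]

pip_jt, p_shared, p_any = multi_trait_summary(draws_j)
print(pip_jt, p_shared, p_any)  # [0.8, 0.6] 0.6 0.8
```

The per-trait PIPs alone cannot separate a shared effect from two unrelated trait-specific effects, whereas the joint pattern probabilities can.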
| Model Type | Feature Integration | Grouping Basis | Prior Structure | What It Captures |
|---|---|---|---|---|
| Single-component BLR | Combines all biological features in one model | None | One global variance (\(\tau^2\)) | All features contribute equally; uniform shrinkage |
| Multiple-component BLR | Integrates all layers but allows heterogeneous contributions | Learned from data | Mixture of variances (\(\{\tau_k^2\}\)) | Large, small, and null effect classes |
| Hierarchical BLR | Groups features by biological structure (e.g., genes, pathways) | Defined a priori | Group-specific mixture of variances (\(\{\tau_{gk}^2\}\)) | Within-group heterogeneity; enrichment and structured shrinkage |
| Multivariate BLR | Jointly models multiple correlated traits or outcomes | None or by trait | Shared covariance (\(\boldsymbol{\Sigma}_b\)) across traits | Genetic/molecular correlations; pleiotropy |
| Hierarchical MV-BLR | Combines biological grouping and multiple outcomes | Defined a priori | Group- and trait-specific covariance mixtures (\(\{\boldsymbol{\Sigma}_{b,gk}\}\)) | Shared biological mechanisms across traits and layers |
| Model Level | Key Parameters Learned | What They Represent | How They Are Learned | What We Learn Biologically |
|---|---|---|---|---|
| Effect sizes | \(\boldsymbol{\beta}\) | Strength and direction of association for each feature | Posterior mean/median given priors and data | Which features drive the outcome |
| Indicator variables | \(\delta_j\) (single trait), \(\boldsymbol{\delta}_j\) (multi-trait) | Whether feature \(j\) is active (and for which traits) | Estimated as posterior inclusion probabilities (PIPs) | Which features are relevant, and whether effects are shared or trait-specific |
| Variance components | \(\tau^2\), \(\{\tau_k^2\}\), \(\{\tau_{gk}^2\}\) | Magnitude of expected effect sizes; heterogeneity across layers or groups | Inferred hierarchically from the data (via MCMC or EM) | How strongly different groups or omic layers contribute |
| Covariance components | \(\boldsymbol{\Sigma}_b\), \(\{\boldsymbol{\Sigma}_{b,g}\}\) | Correlation of effects across traits or molecular layers | Estimated from joint posterior | Shared pathways, pleiotropy, and cross-layer architecture |
Bayesian Linear Regression (BLR) provides a probabilistic framework for estimating genome-wide effects used in polygenic risk scores (PRS).
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}) \]
After fitting the BLR model:
\[ \text{PRS}_i = \sum_j X_{ij} \, \hat{\beta}_j \]
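The score is a weighted sum of allele counts. As a minimal sketch, assuming genotypes coded as 0/1/2 allele counts and invented effect estimates (neither taken from any real data set):

```python
def polygenic_scores(genotypes, effects):
    """PRS_i = sum_j X_ij * beta_hat_j for each individual i."""
    return [sum(x * b for x, b in zip(row, effects)) for row in genotypes]


# Hypothetical allele counts: 3 individuals (rows) by 3 markers (columns)
genotypes = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 0],
]
beta_hat = [0.5, -0.25, 0.125]  # hypothetical posterior mean effects

scores = polygenic_scores(genotypes, beta_hat)
print(scores)  # [0.0, 1.125, 0.25]
```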
qgg provides tools for statistical modeling and analysis of large-scale genomic data, including:
qgg handles large-scale genomic data through efficient algorithms and sparse matrix techniques, combined with multi-core processing using OpenMP, multithreaded matrix operations via BLAS libraries (e.g., OpenBLAS, ATLAS, or MKL), and fast, memory-efficient batch processing of genotype data stored in binary formats such as PLINK .bed files.
Gene analysis using VEGAS: Gene-based analysis with the VEGAS (Versatile Gene-based Association Study) approach, using the 1000G LD reference data processed above.
Gene set analysis using Bayesian MAGMA: Pathway prioritization using single- and multiple-trait Bayesian MAGMA models and gene-level statistics derived from VEGAS (Gholipourshahraki et al., 2024).
Gene ranking using PoPS: Polygenic Prioritization Scoring (PoPS) using BLR models and gene-level statistics derived from VEGAS (work in progress).
Fine-mapping using BLR models: Fine-mapping of gene and LD regions using single-trait Bayesian Linear Regression models (Shrestha et al., 2025).
Polygenic scoring using BLR models: Polygenic scoring (PGS) using Bayesian Linear Regression models and biological pathway information (work in progress).
Polygenic scoring using PGS Catalog: Polygenic scoring (PGS) using summary statistics from the PGS Catalog and biological pathway information.
LD score regression: LD score regression for estimating genomic heritability and correlations.
From Data Integration to Modeling
Bridges data integration, statistical modeling and biological interpretation, enabling reproducible and extensible workflows.
Integrates biological information across molecular layers, from genome to pathways, complexes, and drug–gene interactions.
Uses structured priors and hierarchical modeling to share information, regularize effect estimates, and quantify uncertainty.
Enables data-driven discovery and prediction.
Next Steps
💡 We are open to collaboration!
If you’re interested in applying BLR methods or contributing to the gact framework, please reach out.
Further Reading
Sørensen P, Rohde PD. A Versatile Data Repository for GWAS Summary Statistics-Based Downstream Genomic Analysis of Human Complex Traits. medRxiv (2025). https://doi.org/10.1101/2025.10.01.25337099
Sørensen IF, Sørensen P. Privacy-Preserving Multivariate Bayesian Regression Models for Overcoming Data Sharing Barriers in Health and Genomics. medRxiv (2025). https://doi.org/10.1101/2025.07.30.25332448
Hjelholt AJ, Gholipourshahraki T, Bai Z, Shrestha M, Kjølby M, Sørensen P, Rohde P. Leveraging Genetic Correlations to Prioritize Drug Groups for Repurposing in Type 2 Diabetes. medRxiv (2025). https://doi.org/10.1101/2025.06.13.25329590
Gholipourshahraki T, Bai Z, Shrestha M, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Evaluation of Bayesian Linear Regression Models for Gene Set Prioritization in Complex Diseases. PLOS Genetics 20(11): e1011463 (2024). https://doi.org/10.1371/journal.pgen.1011463
Bai Z, Gholipourshahraki T, Shrestha M, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Evaluation of Bayesian Linear Regression Derived Gene Set Test Methods. BMC Genomics 25(1): 1236 (2024). https://doi.org/10.1186/s12864-024-11026-2
Shrestha M, Bai Z, Gholipourshahraki T, Hjelholt A, Rohde P, Fuglsang MK, Sørensen P. Enhanced Genetic Fine Mapping Accuracy with Bayesian Linear Regression Models in Diverse Genetic Architectures. PLOS Genetics 21(7): e1011783 (2025). https://doi.org/10.1371/journal.pgen.1011783
Kunkel D, Sørensen P, Shankar V, Morgante F. Improving Polygenic Prediction from Summary Data by Learning Patterns of Effect Sharing Across Multiple Phenotypes. PLOS Genetics 21(1): e1011519 (2025). https://doi.org/10.1371/journal.pgen.1011519
Rohde P, Sørensen IF, Sørensen P. Expanded Utility of the R Package qgg with Applications within Genomic Medicine. Bioinformatics 39(11): btad656 (2023). https://doi.org/10.1093/bioinformatics/btad656
Rohde P, Sørensen IF, Sørensen P. qgg: An R Package for Large-Scale Quantitative Genetic Analyses. Bioinformatics 36(8): 2614–2615 (2020). https://doi.org/10.1093/bioinformatics/btz955
Gene and biological pathway prioritization can provide valuable insights into the underlying biology of diseases and potential drug targets.
MAGMA: Multi-marker Analysis of GenoMic Annotation (de Leeuw et al., 2015), a generalized gene-set analysis of GWAS data.
Compute gene-level (or other feature-level) association statistics:
Bai et al., 2025
Gholipourshahraki et al., 2024
Gholipourshahraki et al., 2024
Hjelholt et al., 2025