Code
library(qgg)
library(ggplot2)Palle Duun Rohde
Genomic Medicine, Department of Health Science and Technology, Aalborg University, Denmark
Peter Sørensen
Center for Quantitative Genetics and Genomics, Aarhus University, Denmark
This tutorial introduces Bayesian MAGMA — a gene set prioritization approach utilizing Bayesian Linear Regression (BLR) models within the MAGMA gene set analysis framework.
The figure below shows a schematic overview of the workflow.
In the initial step, GWAS summary data for the traits of interest are used to compute gene-level Z-scores using the VEGAS (Versatile Gene-Based Association Study) approach.
Next, a design matrix linking genes to gene sets is constructed to integrate curated pathway databases (e.g., KEGG).
The Bayesian MAGMA model is then fitted using this design matrix as input predictors and the gene-level Z-scores as the response variable.
This results in posterior inclusion probabilities (PIPs) for each gene set, representing the probability that the set is included in the model.
Gene sets with higher PIPs are prioritized, highlighting pathways most likely to explain genetic associations.
The approach can also be extended to multi-trait analyses, enabling cross-trait exploration of pathway relevance.
See full details in Gholipourshahraki T, et al. PLOS Genet. 20, e1011463 (2024).
Figure 1. Schematic overview of the Bayesian MAGMA workflow.
This tutorial demonstrates how to simulate gene-level association statistics and overlapping gene sets (pathways) for enrichment analysis using MAGMA-style approaches, implemented either as OLS joint estimation or via Bayesian inference with BayesC or BayesR priors.
The tutorial includes the following key steps:
magma()-style models (linear and Bayesian).In this step, we simulate gene-level Z-scores for 20,000 genes,
where a subset of 100 are truly associated (causal).
We then visualize the results using a QQ plot that distinguishes
causal genes (red) from null genes (grey).
genes <- paste0("Gene", seq_len(n_genes))
# --- Step 1: Null genes centered around 0 ---
z_scores <- rnorm(n_genes, mean = 0, sd = 1)
names(z_scores) <- genes
# --- Step 2: Define causal genes (stronger, possibly positive or negative) ---
associated_genes <- sample(genes, n_assoc)
z_scores[associated_genes] <- rnorm(n_assoc, mean = effect_mean, sd = effect_sd)
# --- Step 3: Compute two-tailed p-values (VEGAS-style) ---
p_values <- 2 * pnorm(-abs(z_scores))
# --- Step 4: QQ plot ---
df <- data.frame(
Gene = genes,
P = p_values,
causal = genes %in% associated_genes
)
df <- df[order(df$P), ]
df$expected <- -log10(ppoints(n_genes))
df$observed <- -log10(df$P)
col_map <- c("FALSE" = "grey60", "TRUE" = "firebrick")
plot(df$expected, df$observed,
pch = 19, cex = 0.6,
col = col_map[as.character(df$causal)],
xlab = expression(Expected~~-log[10](italic(p))),
ylab = expression(Observed~~-log[10](italic(p))),
main = "QQ Plot of Simulated VEGAS-Style Gene Z-scores")
abline(0, 1, col = "black", lwd = 2)
legend("topleft",
legend = c("Null genes", "Causal genes"),
col = c("grey60", "firebrick"), pch = 19, bty = "n")We use a log-normal distribution to simulate pathway sizes, resulting in a right-skewed distribution (many small pathways, few large ones).
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.00 53.75 75.00 92.77 119.25 259.00
Here, we simulate overlapping pathways. Each pathway shares some genes with previous ones, mimicking realistic biological redundancy.
pathway_list <- vector("list", n_pathways)
names(pathway_list) <- paste0("Pathway_", seq_len(n_pathways))
for (i in seq_len(n_pathways)) {
size <- sizes[i]
if (i > 1 && runif(1) < 0.9) { # 90% chance to overlap
j <- sample(seq_len(i - 1), 1) # pick existing pathway
target_overlap <- round(size * runif(1, 0.20, 0.50)) # 20–50% overlap
overlap_pool <- pathway_list[[j]]
k <- min(target_overlap, length(overlap_pool), size)
overlap_genes <- if (k > 0) sample(overlap_pool, k) else character(0)
remaining_pool <- setdiff(genes, overlap_genes)
n_remaining <- size - length(overlap_genes)
new_genes <- if (n_remaining > 0) sample(remaining_pool, n_remaining) else character(0)
pathway_list[[i]] <- unique(c(overlap_genes, new_genes))
} else {
pathway_list[[i]] <- sample(genes, size)
}
}We enrich 1–2 pathways with truly associated genes to simulate signal.
n_enriched_paths <- 2
enriched_paths <- sample(names(pathway_list), n_enriched_paths)
for (p in enriched_paths) {
n_enriched_genes <- 30
enriched_genes <- sample(associated_genes, n_enriched_genes)
pathway_list[[p]] <- unique(c(pathway_list[[p]], enriched_genes))
}
cat("Created", n_pathways, "pathways (15–500 genes each)\n")Created 100 pathways (15–500 genes each)
Causal pathways: Pathway_75, Pathway_53
sizes_df <- data.frame(Pathway = names(pathway_list), Size = sizes)
ggplot(sizes_df, aes(x = Size)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
theme_minimal(base_size = 13) +
labs(
title = "Distribution of Simulated Pathway Sizes",
x = "Number of genes per pathway",
y = "Count"
)We calculate the Jaccard index (|A∩B| / |A∪B|) between pathways and visualize it with a heatmap.
n_show <- 30 # display subset
overlap_matrix <- matrix(NA, n_show, n_show,
dimnames = list(names(pathway_list)[1:n_show],
names(pathway_list)[1:n_show]))
for (i in seq_len(n_show)) {
for (j in seq_len(n_show)) {
g1 <- pathway_list[[i]]
g2 <- pathway_list[[j]]
overlap_matrix[i, j] <- length(intersect(g1, g2)) / length(union(g1, g2))
}
}
if (requireNamespace("pheatmap", quietly = TRUE)) {
library(pheatmap)
pheatmap(
overlap_matrix,
main = "Pathway Overlap (Jaccard Index)",
color = colorRampPalette(c("white", "red"))(50),
border_color = NA
)
}Here we compare three approaches implemented in qgg::magma():
| Method | Model Type | Concept | Output |
|---|---|---|---|
| Joint (fitM) | Linear regression | Tests all pathways simultaneously | β-coefficients, SEs, p-values |
| BayesC (fitC) | Bayesian spike-and-slab | Selects relevant pathways by posterior inclusion probability (PIP) | Posterior means, PIPs |
| BayesR (fitR) | Bayesian mixture prior | Allows pathways to have small, moderate, or large effects | Posterior means, PIPs |
ID m b seb z p
53 Pathway_53 45 0.0586 0.0072 8.1323 2.105592e-16
75 Pathway_75 76 0.0446 0.0076 5.8493 2.467936e-09
71 Pathway_71 63 0.0191 0.0083 2.2936 1.090724e-02
49 Pathway_49 61 0.0156 0.0082 1.9083 2.817681e-02
82 Pathway_82 85 0.0167 0.0091 1.8455 3.248288e-02
74 Pathway_74 54 0.0130 0.0072 1.7987 3.603322e-02
# Extract comparable metrics
results <- list(
Joint = fitM[, c("ID", "b", "p")],
BayesC = fitC[, c("ID", "b", "PIP")],
BayesR = fitR[, c("ID", "b", "PIP")]
)
# Merge results by Pathway ID
merged_results <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), results)
colnames(merged_results) <- c("Pathway", "b_joint", "p_joint", "b_bayesC", "PIP_bayesC", "b_bayesR", "PIP_bayesR")
merged_results$is_enriched <- merged_results$Pathway %in% enriched_paths
head(merged_results[order(-merged_results$PIP_bayesC), ]) Pathway b_joint p_joint b_bayesC PIP_bayesC b_bayesR PIP_bayesR
50 Pathway_53 0.0586 2.105592e-16 0.0587 1.0000 0.0586 1.0000
74 Pathway_75 0.0446 2.467936e-09 0.0443 1.0000 0.0440 1.0000
22 Pathway_28 -0.0188 9.947193e-01 -0.0001 0.0028 -0.0005 0.0284
60 Pathway_62 0.0134 3.969070e-02 0.0000 0.0018 0.0001 0.0120
77 Pathway_78 0.0104 8.826850e-02 0.0000 0.0014 0.0001 0.0078
45 Pathway_49 0.0156 2.817681e-02 0.0000 0.0012 0.0002 0.0126
is_enriched
50 TRUE
74 TRUE
22 FALSE
60 FALSE
77 FALSE
45 FALSE
library(ggplot2)
ggplot(merged_results, aes(x = -log10(p_joint), y = PIP_bayesC, color = is_enriched)) +
geom_point(size = 2, alpha = 0.8) +
geom_text(data = subset(merged_results, is_enriched), aes(label = Pathway),
vjust = -1, size = 3.2) +
scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "firebrick")) +
labs(
x = expression(-log[10](italic(p))~~"(Joint model)"),
y = "Posterior Inclusion Probability (BayesC)",
title = "Comparison of Classical and Bayesian MAGMA Results"
) +
theme_minimal(base_size = 13)The joint linear model detects a broad set of pathways with small effects but cannot quantify inclusion evidence. BayesC and BayesR provide posterior probabilities reflecting uncertainty and sparsity. In this simulation, truly enriched pathways (in red) show both low p-values and high PIPs (>0.9), illustrating consistency between classical and Bayesian inference while offering a probabilistic measure of pathway relevance.