gact

An R Package for Creating a Database of Genomic Association of Complex Traits

The R package gact is designed for establishing and populating a comprehensive database of genomic associations with complex traits. The package serves two primary functions: infrastructure creation and data acquisition. It facilitates the assembly of a structured repository of single-marker associations, curated to maintain high data quality. Beyond individual genetic markers, the package integrates a broad range of genomic entities, including genes, proteins, chemical and protein complexes, and biological pathways. This integration is intended to aid the biological interpretation of genomic associations with complex traits.

gact provides an infrastructure for efficient processing of large-scale genomic association data, including core functions for:

Establishing and populating a database for genomic association.
Downloading and processing a range of biological databases.
Downloading and processing summary statistics from genome-wide association studies (GWAS).
Conducting bioinformatic procedures to link genetic markers with genes, proteins, metabolites, and biological pathways.
Finemapping of genomic regions using Bayesian Linear Regression models.
Performing advanced gene set enrichment analysis utilizing a variety of tools and methodologies.
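
To make the marker-to-gene linking step listed above more concrete, here is a minimal base R sketch that assigns markers to genes by genomic position, with an optional flanking window. The toy data frames, their column names, and the helper `map_markers_to_genes` are hypothetical illustrations and are not part of the gact API.

```r
# Hypothetical toy data; the column names are assumptions, not gact's schema.
markers <- data.frame(
  rsid = c("rs1", "rs2", "rs3", "rs4"),
  chr  = c(1, 1, 1, 2),
  pos  = c(1000, 5200, 9000, 1500),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  gene  = c("GENE_A", "GENE_B", "GENE_C"),
  chr   = c(1, 1, 2),
  start = c(900, 5000, 1000),
  end   = c(2000, 6000, 2000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each gene, return the markers located within
# [start - flank, end + flank] on the same chromosome.
map_markers_to_genes <- function(markers, genes, flank = 0) {
  sets <- lapply(seq_len(nrow(genes)), function(i) {
    hit <- markers$chr == genes$chr[i] &
      markers$pos >= genes$start[i] - flank &
      markers$pos <= genes$end[i] + flank
    markers$rsid[hit]
  })
  names(sets) <- genes$gene
  sets
}

marker_sets <- map_markers_to_genes(markers, genes, flank = 500)
print(marker_sets)
```

Returning a named list of marker IDs per gene keeps the result easy to reuse as a marker set in downstream gene-level analyses.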

gact constructs gene and genetic marker sets from a range of biological databases, including:

"Ensembl": Gene, protein, transcript sets derived from the Ensembl database.
"Regulation": Regulatory genomic feature sets derived from the Ensembl Regulation database.
"GO": Gene Ontology sets from the GO database.
"Pathways": Pathway sets from the Reactome and KEGG databases.
"ProteinComplexes": Protein complex sets derived from the STRING database.
"ChemicalComplexes": Chemical complex sets derived from the STITCH database.
"DrugGenes": Drug-gene interaction sets derived from the DrugBank database.
"DrugATCGenes": Drug ATC gene sets based on the ATC and DrugBank databases.
"DrugComplexes": Drug gene complex sets combining information from STRING and DrugBank.
"DiseaseGenes": Disease-gene sets based on experiments, text mining, and knowledge bases, derived from the DISEASE database.
"GTEx": GTEx project eQTL sets derived from the GTEx database.
"GWAScatalog": GWAS catalog sets derived from the GWAScatalog database.
"VEP": Variant Effect Predictor sets derived from the Ensembl Variant Effect Predictor database.

Installation of the gact package

To install the most recent versions of the gact and qgg packages from GitHub, use the following commands in R:

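# Install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("psoerensen/gact")
devtools::install_github("psoerensen/qgg")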

Tutorials for downloading and installing the gact database

Below is a set of tutorials used for the gact package:

Download and set up the gact database, which is focused on genomic associations for complex traits:
Download and install gact database

Download and process genotype data from the 1000 Genomes Project (1000G) for different ancestries (European, East Asian, South Asian), used in various genomic analyses:
Download and process 1000G data

Compute sparse linkage disequilibrium (LD) matrices for 1000 Genomes Project (1000G) data across different ancestries, and explore the LD data, which are used in a number of genomic analyses (LD score regression, VEGAS gene analysis, Bayesian Linear Regression models):
Compute sparse LD matrices for 1000G data
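
As a rough illustration of what a sparse LD matrix contains, the base R sketch below keeps correlations only between markers that lie within a fixed window of each other, then derives LD scores from them. The simulated genotypes, the window size, and the helper `sparse_ld` are assumptions made for this example; they do not reflect gact's actual storage format or functions.

```r
set.seed(1)
n <- 200  # individuals
m <- 30   # markers

# Hypothetical genotypes coded as 0/1/2 allele counts (simulated, not 1000G data).
W <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)
colnames(W) <- paste0("rs", seq_len(m))

# Hypothetical helper: for each marker, store correlations only with markers
# at most `window` positions away, instead of the full m x m matrix.
sparse_ld <- function(W, window = 5) {
  m <- ncol(W)
  ld <- lapply(seq_len(m), function(j) {
    idx <- max(1, j - window):min(m, j + window)
    r <- drop(cor(W[, j], W[, idx]))
    names(r) <- colnames(W)[idx]
    r
  })
  names(ld) <- colnames(W)
  ld
}

ld <- sparse_ld(W, window = 5)

# LD score of a marker: sum of squared correlations with its neighbours
# (including itself) inside the window.
ldscores <- vapply(ld, function(r) sum(r^2), numeric(1))
head(ldscores)
```

Storing only the within-window correlations makes memory use grow linearly with the number of markers rather than quadratically, which is the main motivation for a sparse representation.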

Download and process genome-wide association summary statistics and ingest them into the database:
Download and process new GWAS summary statistics

Tutorials for various types of genomic analysis using the gact database

Gene analysis using the VEGAS (Versatile Gene-based Association Study) approach using the 1000G LD reference data processed above:
Gene analysis using VEGAS
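
The idea behind a VEGAS-style gene test can be sketched in a few lines of base R: the gene statistic is the sum of squared marker z-scores, and its null distribution is obtained by simulating correlated z-scores from the LD matrix. The helper `vegas_sketch`, the toy LD matrix, and the z-scores are hypothetical; this is a simplified illustration under those assumptions, not the implementation used by gact or qgg.

```r
# Hypothetical helper: empirical gene-level p-value from marker z-scores `z`
# and the LD (correlation) matrix `R` among the same markers.
vegas_sketch <- function(z, R, nsim = 10000) {
  stat <- sum(z^2)                  # observed gene-level statistic
  L <- chol(R)                      # upper triangular, R = t(L) %*% L
  # Each row of z0 is one draw from N(0, R)
  z0 <- matrix(rnorm(nsim * length(z)), nrow = nsim) %*% L
  stat0 <- rowSums(z0^2)            # simulated null statistics
  p <- (sum(stat0 >= stat) + 1) / (nsim + 1)
  list(stat = stat, p = p)
}

set.seed(123)
m <- 5
R <- 0.6^abs(outer(seq_len(m), seq_len(m), "-"))  # toy LD with decaying correlation

z_null  <- c(0.3, -0.5, 0.1, 0.8, -0.2)  # toy gene without association
z_assoc <- c(4.5, 4.0, 3.8, 3.0, 2.5)    # toy gene with strong association

res_null  <- vegas_sketch(z_null, R)
res_assoc <- vegas_sketch(z_assoc, R)
print(c(null = res_null$p, assoc = res_assoc$p))
```

Adding one to both the numerator and the denominator keeps the empirical p-value strictly positive, so its smallest possible value is bounded by the number of simulations.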

Gene set enrichment analysis (GSEA) based on gene-level statistics derived from Bayesian Linear Regression (BLR) models and on MAGMA (Multi-marker Analysis of GenoMic Annotation) (Bai et al. 2024).
Gene set analysis using MAGMA
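
The core of a MAGMA-style competitive gene set test is a regression of gene-level z-scores on a set-membership indicator. The sketch below shows this with simulated z-scores in base R; it omits the adjustments that a full analysis would typically make (for example gene size, marker density, and correlation between neighbouring genes), and all object names are hypothetical.

```r
set.seed(42)
n_genes <- 1000

# Simulated gene-level z-scores; in practice these would come from a
# gene-based analysis rather than from rnorm().
z <- rnorm(n_genes)
names(z) <- paste0("gene", seq_len(n_genes))

# Hypothetical gene set of 50 genes with an upward shift in association signal
gene_set <- names(z)[1:50]
z[gene_set] <- z[gene_set] + 1

# Competitive test: are genes in the set more associated than genes outside it?
in_set <- as.numeric(names(z) %in% gene_set)
fit <- summary(lm(z ~ in_set))

est  <- fit$coefficients["in_set", "Estimate"]
tval <- fit$coefficients["in_set", "t value"]
# One-sided p-value for enrichment (positive coefficient), using residual df
p_one_sided <- pt(tval, df = fit$df[2], lower.tail = FALSE)
print(c(estimate = est, p = p_one_sided))
```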

Pathway prioritization using single- and multiple-trait Bayesian MAGMA models and gene-level statistics derived from VEGAS (Gholipourshahraki et al. 2024).
Gene set analysis using Bayesian MAGMA

Polygenic Prioritization Scoring (PoPS) using BLR models and gene-level statistics derived from VEGAS (work in progress).
Gene ranking using PoPS

Finemapping with single trait Bayesian Linear Regression models and simulated data (Shrestha et al. 2023).
Finemapping using BLR models on simulated data

Finemapping of gene and LD regions using single trait Bayesian Linear Regression models (Shrestha et al. 2023).
Finemapping using BLR models on real data

Polygenic scoring (PGS) using Bayesian Linear Regression models and biological pathway information (work in progress).
Polygenic scoring using BLR models

Polygenic scoring (PGS) using summary statistics from PGS catalog and biological pathway information.
Polygenic scoring using PGS Catalog
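
At its simplest, a polygenic score is a weighted sum of allele counts, and a pathway-specific score restricts that sum to markers mapped to the pathway. The base R sketch below illustrates both, assuming genotypes are coded as counts of the effect allele; the toy data, column names, and the helper `compute_pgs` are hypothetical and do not represent the gact or PGS Catalog interfaces.

```r
# Toy genotypes: rows are individuals, columns are markers (effect allele counts)
geno <- matrix(
  c(0, 1, 2,
    1, 1, 0,
    2, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("id1", "id2", "id3"), c("rs1", "rs2", "rs3"))
)

# Toy scoring file: marker IDs and effect sizes (rs9 is not genotyped here)
weights <- data.frame(
  rsid   = c("rs3", "rs1", "rs9"),
  effect = c(0.5, -0.2, 0.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: weighted sum over markers shared by genotypes and weights,
# optionally restricted to a marker set such as a pathway.
compute_pgs <- function(geno, weights, marker_set = NULL) {
  common <- intersect(colnames(geno), weights$rsid)
  if (!is.null(marker_set)) common <- intersect(common, marker_set)
  b <- weights$effect[match(common, weights$rsid)]
  drop(geno[, common, drop = FALSE] %*% b)
}

pgs_all     <- compute_pgs(geno, weights)
pgs_pathway <- compute_pgs(geno, weights, marker_set = c("rs3", "rs9"))
print(rbind(all = pgs_all, pathway = pgs_pathway))
```

Matching on marker IDs before multiplying guards against scoring files whose marker order differs from the genotype matrix; a real pipeline would also need to check that the effect alleles agree.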

LD score regression for estimating genomic heritability and correlations.
LD score regression
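
LD score regression rests on the relationship E[chi^2_j] = 1 + N h^2 l_j / M, where l_j is the LD score of marker j, N the sample size, and M the number of markers (ignoring confounding). The base R sketch below simulates data from this relationship and recovers the heritability from the regression slope. It uses ordinary least squares on simulated values, whereas dedicated implementations typically use weighted regression and block jackknife standard errors; all values and names here are assumptions for illustration.

```r
set.seed(7)
M <- 5000        # number of markers
N <- 10000       # GWAS sample size
h2_true <- 0.4   # heritability used in the simulation

# Simulated LD scores and the expected chi-square statistic per marker
ldscore <- runif(M, min = 1, max = 200)
expected_chisq <- 1 + N * h2_true * ldscore / M

# Simulated association statistics with the expected mean (scaled chi-square, 1 df)
chisq <- expected_chisq * rchisq(M, df = 1)

# Regress chi-square statistics on LD scores; the slope estimates N * h2 / M
fit <- lm(chisq ~ ldscore)
h2_hat    <- unname(coef(fit)["ldscore"]) * M / N
intercept <- unname(coef(fit)["(Intercept)"])
print(c(h2 = h2_hat, intercept = intercept))
```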

Funding

These notes and scripts were prepared as part of the BALDER project, funded by the ODIN platform. ODIN is sponsored by the Novo Nordisk Foundation (grant number NNF20SA0061466).

References

Rohde PD, Sørensen IF, Sørensen P. qgg: an R package for large-scale quantitative genetic analyses. Bioinformatics 36:8 (2020). https://doi.org/10.1093/bioinformatics/btz955
Rohde PD, Sørensen IF, Sørensen P. Expanded utility of the R package, qgg, with applications within genomic medicine. Bioinformatics 39:11 (2023). https://doi.org/10.1093/bioinformatics/btad656
Shrestha et al. Evaluation of Bayesian Linear Regression Models as a Fine Mapping Tool. Submitted (2024) https://doi.org/10.1101/2023.09.01.555889
Bai et al. Evaluation of multiple marker mapping methods using single trait Bayesian Linear Regression models. BMC Genomics 25:1236 (2024). https://doi.org/10.1186/s12864-024-11026-2
Gholipourshahraki et al. Evaluation of Bayesian Linear Regression Models for Pathway Prioritization. PLOS Genetics 20(11) e1011463 (2024). https://doi.org/10.1371/journal.pgen.1011463
Kunkel et al. Improving polygenic prediction from summary data by learning patterns of effect sharing across multiple phenotypes. PLOS Genetics 21(1) e1011519 (2025). https://doi.org/10.1371/journal.pgen.1011519
