The R package gact is designed for establishing and populating a comprehensive database of genomic associations with complex traits. The package serves two primary functions: infrastructure creation and data acquisition. It facilitates the assembly of a structured repository of single-marker associations, curated to ensure high data quality. Beyond individual genetic markers, the package integrates a broad range of genomic entities, including genes, proteins, chemical and protein complexes, and biological pathways. This integration is intended to aid the biological interpretation of genomic associations with complex traits.
gact provides an infrastructure for efficient processing of large-scale genomic association data, including core functions for:
gact constructs gene and genetic marker sets from a range of biological databases including:

"Ensembl"
: Gene, protein, and transcript sets from the Ensembl database.

"Regulation"
: Regulatory genomic feature sets from the Ensembl Regulation database.

"GO"
: Gene Ontology sets from the GO database.

"Pathways"
: Pathway sets from the Reactome and KEGG databases.

"ProteinComplexes"
: Protein complex sets from the STRING database.

"ChemicalComplexes"
: Chemical complex sets from the STITCH database.

"DrugGenes"
: Drug-gene interaction sets from the DrugBank database.

"DrugATCGenes"
: Drug ATC gene sets based on the ATC and DrugBank databases.

"DrugComplexes"
: Drug complex sets combining information from STRING and DrugBank.

"DiseaseGenes"
: Disease-gene sets based on experiments, text mining, and knowledge bases from the DISEASE database.

"GTEx"
: GTEx project eQTL sets from the GTEx database.

"GWAScatalog"
: GWAS catalog sets from the GWAScatalog database.

To install the most recent version of the gact package from GitHub, use the following commands in R:
# Install devtools first if needed: install.packages("devtools")
library(devtools)
devtools::install_github("psoerensen/gact")
The tutorials below show how to use the gact package:
Download and set up the gact database of genomic associations for
complex traits:
Download and install gact database
Download and process genome-wide association study (GWAS) summary
statistics and ingest them into the database:
Download and process new GWAS summary statistics
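As a rough illustration of what processing summary statistics typically involves before ingestion, the sketch below applies a few common quality-control steps in base R. It does not use the gact API; the column names (`marker`, `eaf`, `b`, `seb`, and so on), the helper `qc_stat`, and the MAF threshold are assumptions made for this example only.

```r
# Toy GWAS summary statistics; all column names are illustrative assumptions.
stat <- data.frame(
  marker = c("rs1", "rs2", "rs2", "rs3", "rs4"),
  chr    = c(1, 1, 1, 2, 2),
  pos    = c(1000, 2000, 2000, 1500, 3000),
  ea     = c("A", "C", "C", "G", "T"),          # effect allele
  nea    = c("G", "T", "T", "A", "C"),          # non-effect allele
  eaf    = c(0.30, 0.005, 0.005, 0.45, 0.80),   # effect allele frequency
  b      = c(0.05, 0.20, 0.20, -0.03, 0.01),    # effect estimate
  seb    = c(0.01, 0.10, 0.10, 0.01, 0.02),     # standard error of b
  stringsAsFactors = FALSE
)

# Hypothetical helper: basic QC of a summary statistics table.
qc_stat <- function(stat, min_maf = 0.01) {
  # Drop duplicated marker IDs, keeping the first occurrence
  stat <- stat[!duplicated(stat$marker), ]
  # Drop rows with missing or non-positive standard errors
  stat <- stat[!is.na(stat$seb) & stat$seb > 0, ]
  # Filter on minor allele frequency
  maf <- pmin(stat$eaf, 1 - stat$eaf)
  stat <- stat[maf >= min_maf, ]
  # Derive z-scores and two-sided p-values from b and seb
  stat$z <- stat$b / stat$seb
  stat$p <- 2 * pnorm(-abs(stat$z))
  rownames(stat) <- NULL
  stat
}

clean <- qc_stat(stat)
print(clean)
```

The duplicated and low-frequency marker are removed, and z-scores and p-values are recomputed from the effect estimates so that downstream analyses see consistent columns.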
Download and process genotype data from the 1000 Genomes Project (1000G)
for different ancestries (European, East Asian, South Asian), used as
reference data in several genomic analyses:
Download and process 1000G data
Compute sparse linkage disequilibrium (LD) matrices for 1000 Genomes
Project (1000G) data across different ancestries and explore the LD
data, which are used in several genomic analyses (LD score regression,
VEGAS gene analysis, Bayesian Linear Regression models):
Compute sparse LD matrices for 1000G data
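To make the idea of a sparse LD matrix concrete, the following base R sketch stores, for each marker, only its correlations with markers inside a fixed window. The genotypes are simulated, and the function name `sparse_ld`, the window size, and the list-based storage are assumptions for illustration; the package's actual storage format may differ.

```r
set.seed(1)
n <- 500    # individuals
m <- 20     # markers
rho <- 0.8  # correlation between neighbouring latent variables

# A latent AR(1) process across markers induces LD between neighbours
latent <- matrix(rnorm(n * m), n, m)
for (j in 2:m) {
  latent[, j] <- rho * latent[, j - 1] + sqrt(1 - rho^2) * latent[, j]
}
# Convert the latent values to 0/1/2 genotype codes
W <- (latent > -0.5) + (latent > 0.5)

# Hypothetical helper: for each marker keep only the correlations with
# markers at most `window` positions away.
sparse_ld <- function(W, window = 3) {
  n <- nrow(W)
  m <- ncol(W)
  Ws <- scale(W)  # centre and scale each marker
  lapply(seq_len(m), function(j) {
    idx <- max(1, j - window):min(m, j + window)
    r <- as.vector(crossprod(Ws[, j], Ws[, idx])) / (n - 1)
    setNames(r, idx)
  })
}

ld <- sparse_ld(W, window = 3)

# Windowed LD scores: sum of squared correlations within the window
ldscores <- sapply(ld, function(r) sum(r^2))
print(round(ldscores, 2))
```

Storing only within-window correlations keeps memory roughly linear in the number of markers, which is what makes genome-wide LD reference data practical.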
Gene analysis with the VEGAS (Versatile Gene-based Association Study)
approach, using the 1000G LD reference data processed above:
Gene analysis using VEGAS
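The core idea behind a VEGAS-style gene test can be sketched in a few lines of base R: sum the squared marker z-scores within a gene and compare that sum with a null distribution that accounts for LD among the markers. The version below uses Monte Carlo simulation from a multivariate normal distribution; the z-scores, the LD matrix, and the function name `vegas_sim` are made up for illustration, and the tutorial's implementation may compute the p-value differently.

```r
set.seed(42)
m <- 4
# Assumed LD (correlation) matrix for four markers in a hypothetical gene
R <- 0.6^abs(outer(1:m, 1:m, "-"))
# Made-up marker z-scores
z <- c(2.5, 3.1, 1.8, 0.4)

# Hypothetical helper: simulation-based gene-level test
vegas_sim <- function(z, R, nsim = 20000) {
  t_obs <- sum(z^2)   # gene statistic: sum of marker chi-square values
  U <- chol(R)        # upper triangular factor with R = t(U) %*% U
  # Each row of Z0 is one draw from N(0, R)
  Z0 <- matrix(rnorm(nsim * length(z)), nsim) %*% U
  t_null <- rowSums(Z0^2)
  # Empirical p-value with +1 correction so that p is never exactly 0
  p <- (sum(t_null >= t_obs) + 1) / (nsim + 1)
  list(stat = t_obs, p = p)
}

res <- vegas_sim(z, R)
print(res)
```

Ignoring LD and treating the statistic as a plain chi-square with four degrees of freedom would overstate significance here, because correlated markers carry partly redundant evidence.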
Gene set enrichment analysis (GSEA) based on gene-level statistics
derived from a Bayesian Linear Regression (BLR) model and on MAGMA
(Multi-marker Analysis of GenoMic Annotation) (Bai et al. 2024):
Gene set analysis using BLR-MAGMA
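For orientation, a MAGMA-style competitive gene-set test regresses gene-level statistics on an indicator of gene-set membership. The sketch below uses ordinary least squares in base R as a simplified stand-in; it is not the Bayesian Linear Regression model used in the tutorial, and the data and variable names are simulated assumptions.

```r
set.seed(7)
n_genes <- 500
# 50 genes belong to a hypothetical gene set, 450 do not
in_set <- rep(c(1, 0), times = c(50, 450))
# Simulated gene-level z statistics; set genes carry extra signal by construction
z_gene <- rnorm(n_genes) + 0.8 * in_set

# Competitive test: are genes in the set more associated than the rest?
fit <- lm(z_gene ~ in_set)
est <- coef(summary(fit))["in_set", ]

# One-sided p-value for a positive set effect
p_one_sided <- pt(est["t value"], df = fit$df.residual, lower.tail = FALSE)
print(est)
print(p_one_sided)
```

A joint model replaces the single indicator with a matrix of indicators for many gene sets, which is where a shrinkage prior such as that in a BLR model becomes useful.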
Pathway prioritization using a BLR-MAGMA model and gene-level statistics
derived from VEGAS (Gholipourshahraki et al. 2024):
Pathway prioritization using BLR-MAGMA
Fine-mapping of gene regions using single-trait Bayesian Linear
Regression models (Shrestha et al. 2023):
Fine-mapping of gene regions using BLR models
Fine-mapping of LD regions using single-trait Bayesian Linear Regression
models (Shrestha et al. 2023):
Fine-mapping of LD regions using BLR models
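Fine-mapping output is usually summarised as posterior inclusion probabilities (PIPs) and credible sets. The base R sketch below shows how these quantities arise in a much simpler setting than the BLR models of the tutorials: Wakefield's approximate Bayes factors under the assumption of exactly one causal marker per region. The effect estimates, standard errors, and prior variance are made-up values for illustration.

```r
# Made-up effect estimates and standard errors for five markers in one region
b   <- c(0.020, 0.080, 0.075, 0.010, -0.005)
seb <- rep(0.015, 5)
W   <- 0.04^2  # assumed prior variance of the causal effect

z <- b / seb
V <- seb^2

# Wakefield's approximate Bayes factor in favour of association, on the log scale
log_abf <- 0.5 * log(V / (V + W)) + 0.5 * z^2 * W / (V + W)

# PIPs assuming exactly one causal marker and equal prior probabilities
pip <- exp(log_abf - max(log_abf))
pip <- pip / sum(pip)

# 95% credible set: smallest set of markers whose PIPs sum to at least 0.95
ord <- order(pip, decreasing = TRUE)
cs  <- ord[seq_len(which(cumsum(pip[ord]) >= 0.95)[1])]

print(round(pip, 4))
print(cs)
```

Unlike this single-causal-variant sketch, a BLR model fits all markers in the region jointly and can therefore accommodate several causal markers and the LD between them.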
LD score regression for estimating genomic heritability and genetic
correlations:
LD score regression
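The principle behind LD score regression can be illustrated with simulated data: under a polygenic model the expected chi-square statistic of marker j is approximately 1 + N h² ℓ_j / M, so regressing chi-square statistics on LD scores ℓ_j gives a slope proportional to the heritability. The sketch below uses an unweighted `lm` fit in base R; all values are simulated assumptions, and practical implementations typically use weighted regression with block-jackknife standard errors.

```r
set.seed(123)
M  <- 20000   # number of markers
N  <- 50000   # GWAS sample size
h2 <- 0.4     # heritability used to simulate the data

# Simulated LD scores and chi-square statistics following
# E[chi2_j] = 1 + N * h2 * ldscore_j / M
ldscore  <- runif(M, 1, 200)
expected <- 1 + N * h2 * ldscore / M
chi2     <- expected * rchisq(M, df = 1)

# Unweighted regression of chi-square statistics on LD scores
fit <- lm(chi2 ~ ldscore)

# The slope estimates N * h2 / M, so rescale to obtain h2
h2_hat    <- unname(coef(fit)["ldscore"]) * M / N
intercept <- unname(coef(fit)["(Intercept)"])
print(c(h2_hat = h2_hat, intercept = intercept))
```

An intercept close to 1 is consistent with inflation arising from polygenicity, whereas a clearly larger intercept points to confounding such as population stratification.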
These notes and scripts were prepared in the BALDER project, funded by the ODIN platform. ODIN is sponsored by the Novo Nordisk Foundation (grant number NNF20SA0061466).
Rohde PD, Sørensen IF, Sørensen P. 2020. qgg: an R package for large-scale quantitative genetic analyses. Bioinformatics 36:8. doi.org/10.1093/bioinformatics/btz955
Rohde PD, Sørensen IF, Sørensen P. 2023. Expanded utility of the R package, qgg, with applications within genomic medicine. Bioinformatics 39:11. doi.org/10.1093/bioinformatics/btad656
Shrestha et al. 2023. Evaluation of Bayesian Linear Regression Models as a Fine Mapping Tool. Submitted. doi.org/10.1101/2023.09.01.555889
Bai et al. 2024. Evaluation of multiple marker mapping methods using single trait Bayesian Linear Regression models. In preparation.
Gholipourshahraki et al. 2024. Evaluation of Bayesian Linear Regression Models for Pathway Prioritization. In preparation.