The R package gact is designed for establishing and populating a comprehensive database of genomic associations with complex traits. The package serves two primary functions: infrastructure creation and data acquisition. It facilitates the assembly of a structured repository of single-marker associations, curated to maintain high data quality. Beyond individual genetic markers, the package integrates a broad range of genomic entities, including genes, proteins, biological complexes (chemical and protein), and biological pathways. Together, these resources support the biological interpretation of genomic associations with complex traits.
gact provides an infrastructure for efficient processing of large-scale genomic association data, including core functions for:
gact constructs gene and genetic marker sets from a range of biological databases including:
"Ensembl": Gene, protein, and transcript sets derived from the Ensembl database.
"Regulation": Regulatory genomic feature sets derived from the Ensembl Regulation database.
"GO": Gene Ontology sets from the GO database.
"Pathways": Pathway sets from the Reactome and KEGG databases.
"ProteinComplexes": Protein complex sets derived from the STRING database.
"ChemicalComplexes": Chemical complex sets derived from the STITCH database.
"DrugGenes": Drug-gene interaction sets derived from the DrugBank database.
"DrugATCGenes": Drug ATC gene sets based on the ATC and DrugBank databases.
"DrugComplexes": Drug-gene complex sets combining information from STRING and DrugBank.
"DiseaseGenes": Disease-gene sets based on experiments, text mining, and knowledge bases, derived from the DISEASE database.
"GTEx": GTEx project eQTL sets derived from the GTEx database.
"GWAScatalog": GWAS Catalog sets derived from the GWAScatalog database.
"VEP": Variant Effect Predictor sets derived from the Ensembl Variant Effect Predictor database.
To install the most recent versions of the gact and qgg packages from GitHub, use the following commands in R:
library(devtools)
devtools::install_github("psoerensen/gact")
devtools::install_github("psoerensen/qgg")
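Once the packages are installed, it can help to picture the gene and marker sets listed above as named lists of identifiers. The sketch below is not the gact API. It uses base R only, and every name in it (`pathway_sets`, `gene_markers`, `map_sets_to_markers`) and all identifiers are hypothetical toy values chosen for illustration. It shows how a gene-level set could be expanded into a marker-level set through a gene-to-marker mapping.

```r
# Hypothetical toy gene sets: set name -> gene IDs
pathway_sets <- list(
  pathwayA = c("gene1", "gene2"),
  pathwayB = c("gene2", "gene3")
)

# Hypothetical gene-to-marker mapping: gene ID -> marker IDs
gene_markers <- list(
  gene1 = c("rs1", "rs2"),
  gene2 = c("rs3"),
  gene3 = c("rs4", "rs5")
)

# Expand each gene set into the union of markers mapped to its genes
map_sets_to_markers <- function(sets, gene_markers) {
  lapply(sets, function(genes) {
    genes <- intersect(genes, names(gene_markers))  # ignore unmapped genes
    unique(unlist(gene_markers[genes], use.names = FALSE))
  })
}

pathway_marker_sets <- map_sets_to_markers(pathway_sets, gene_markers)
```

Storing sets as plain named lists keeps them easy to subset and to intersect with the markers available in a given GWAS.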
Below is a set of tutorials for the gact package:
Download and set up the gact database, which is focused on genomic associations for complex traits:
Download and install gact database
Download and process genotype data from the 1000 Genomes Project (1000G) for different ancestries (European, East Asian, South Asian) used in different genomic analyses:
Download and process 1000G data
Compute sparse linkage disequilibrium (LD) matrices for 1000 Genomes Project (1000G) data across different ancestries and explore the LD data, which are used in a number of genomic analyses (LD score regression, VEGAS gene analysis, Bayesian Linear Regression models):
Compute sparse LD matrices for 1000G data
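To make the idea of a sparse LD matrix concrete, here is a minimal base R sketch. It is not the tutorial's code: the genotypes are simulated rather than taken from 1000G, the function name `sparse_ld` and the fixed marker window are assumptions, and the sparse matrix is stored as a plain list with one vector of correlations per marker.

```r
set.seed(1)
n <- 200   # individuals (simulated)
m <- 30    # markers (simulated)

# Simulated allele counts (0/1/2), a stand-in for real reference genotypes
W <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)
colnames(W) <- paste0("rs", seq_len(m))

# For each marker, keep only correlations with markers inside a +/- window
sparse_ld <- function(W, window = 5) {
  m <- ncol(W)
  Ws <- scale(W)                      # center and scale each marker
  ld <- vector("list", m)
  names(ld) <- colnames(W)
  for (j in seq_len(m)) {
    idx <- max(1, j - window):min(m, j + window)
    r <- drop(crossprod(Ws[, idx, drop = FALSE], Ws[, j])) / (nrow(W) - 1)
    names(r) <- colnames(W)[idx]
    ld[[j]] <- r
  }
  ld
}

ld <- sparse_ld(W, window = 5)

# LD score of a marker: sum of squared correlations within its window
ld_scores <- vapply(ld, function(r) sum(r^2), numeric(1))
```

Keeping only a window around each marker is what makes the storage sparse: memory grows linearly with the number of markers instead of quadratically.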
Download and process genome-wide association summary statistics and ingest them into the database:
Download and process new GWAS summary statistics
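As an illustration of what processing summary statistics typically involves, the sketch below applies a few common quality-control filters in base R. The column names (`marker`, `ea`, `nea`, `eaf`, `b`, `seb`), the thresholds, and the function `qc_sumstats` are assumptions made for this example and may differ from what the tutorial and the gact database expect.

```r
# Hypothetical summary statistics; column names are assumptions
stat <- data.frame(
  marker = c("rs1", "rs2", "rs3", "rs4"),
  ea     = c("A", "C", "G", "T"),       # effect allele
  nea    = c("G", "T", "A", "A"),       # non-effect allele
  eaf    = c(0.30, 0.005, 0.60, NA),    # effect allele frequency
  b      = c(0.05, 0.20, -0.03, 0.01),  # marker effect
  seb    = c(0.01, 0.10, 0.01, 0.02),   # standard error of effect
  stringsAsFactors = FALSE
)

qc_sumstats <- function(stat, min_maf = 0.01) {
  # Drop rows with missing key columns
  stat <- stat[complete.cases(stat[, c("eaf", "b", "seb")]), ]
  # Drop rare markers and non-positive standard errors
  maf <- pmin(stat$eaf, 1 - stat$eaf)
  stat <- stat[maf >= min_maf & stat$seb > 0, ]
  # Drop strand-ambiguous (palindromic) A/T and C/G markers
  alleles <- paste0(stat$ea, stat$nea)
  stat <- stat[!alleles %in% c("AT", "TA", "CG", "GC"), ]
  # Derive z-scores and two-sided p-values
  stat$z <- stat$b / stat$seb
  stat$p <- 2 * pnorm(-abs(stat$z))
  stat
}

clean <- qc_sumstats(stat)
```

In this toy input, `rs2` is removed for low minor allele frequency and `rs4` for a missing allele frequency, leaving `rs1` and `rs3`.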
Gene analysis using the VEGAS (Versatile Gene-based Association Study) approach with the 1000G LD reference data processed above:
Gene analysis using VEGAS
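The core of a VEGAS-style gene test can be sketched in a few lines of base R. The gene statistic is the sum of squared marker z-scores, and its null distribution is approximated by simulating z-scores from a multivariate normal distribution whose correlation is the LD matrix of the gene's markers. The function name `vegas_test`, the toy LD matrix, and the z-scores are assumptions for illustration. The sketch also assumes the LD matrix is positive definite, which may require regularization with real data.

```r
set.seed(1)

vegas_test <- function(z, R, nsim = 10000) {
  stopifnot(length(z) == nrow(R), nrow(R) == ncol(R))
  t_obs <- sum(z^2)                       # observed gene statistic
  U <- chol(R)                            # upper triangular, t(U) %*% U = R
  # Each column of z_sim is one draw from N(0, R)
  z_sim <- crossprod(U, matrix(rnorm(length(z) * nsim), nrow = length(z)))
  t_sim <- colSums(z_sim^2)               # simulated null statistics
  p <- (sum(t_sim >= t_obs) + 1) / (nsim + 1)
  list(stat = t_obs, p = p)
}

# Toy LD matrix for four markers with correlation decaying with distance
R <- 0.5^abs(outer(1:4, 1:4, "-"))
z <- c(3.1, 2.8, 2.5, 1.9)                # hypothetical marker z-scores

res <- vegas_test(z, R)
```

Adding one to both numerator and denominator keeps the empirical p-value strictly positive, so its resolution is limited by `nsim`.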
Gene set enrichment analysis (GSEA) using MAGMA (Multi-marker Analysis of GenoMic Annotation) and gene-level statistics derived from Bayesian Linear Regression (BLR) models (Bai et al. 2024).
Gene set analysis using MAGMA
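A competitive gene set test in the spirit of MAGMA can be sketched as a regression of gene-level statistics on a set-membership indicator. The example below is a deliberate simplification: it uses ordinary least squares via `lm` on simulated gene z-scores, omits the covariates (such as gene size) and gene-gene correlation handled by the actual method, and all names (`gene_z`, `set_genes`, `gene_set_test`) are hypothetical.

```r
set.seed(1)
n_genes <- 500

# Simulated gene-level z-scores, with an artificial shift for one gene set
gene_z <- rnorm(n_genes)
names(gene_z) <- paste0("gene", seq_len(n_genes))
set_genes <- paste0("gene", 1:25)
gene_z[set_genes] <- gene_z[set_genes] + 1

gene_set_test <- function(gene_z, set_genes) {
  in_set <- as.numeric(names(gene_z) %in% set_genes)  # 1 if gene is in the set
  fit <- summary(lm(gene_z ~ in_set))
  est  <- fit$coefficients["in_set", "Estimate"]
  tval <- fit$coefficients["in_set", "t value"]
  # One-sided test: are genes in the set more associated than the rest?
  p <- pt(tval, df = fit$df[2], lower.tail = FALSE)
  c(beta = est, t = tval, p = p)
}

res <- gene_set_test(gene_z, set_genes)
```

The one-sided p-value reflects the competitive hypothesis that genes in the set carry stronger association signal than genes outside it.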
Pathway prioritization using single- and multiple-trait Bayesian MAGMA models and gene-level statistics derived from VEGAS (Gholipourshahraki et al. 2024).
Gene set analysis using Bayesian MAGMA
Polygenic Prioritization Scoring (PoPS) using BLR models and gene-level
statistics derived from VEGAS (work in progress).
Gene ranking using PoPS
Finemapping of gene and LD regions using single trait Bayesian Linear
Regression models (Shrestha et al. 2023).
Finemapping using BLR models
Polygenic scoring (PGS) using Bayesian Linear Regression models and
biological pathway information (work in progress).
Polygenic scoring using BLR models
Polygenic scoring (PGS) using summary statistics from the PGS Catalog and biological pathway information.
Polygenic scoring using PGS Catalog
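At its simplest, a polygenic score is a weighted sum of allele counts, with weights taken from a scoring file. The base R sketch below shows that computation together with allele matching, which is the step that most often goes wrong. The genotype matrix, the scoring table, its column names, and `compute_pgs` are all hypothetical. The pathway information mentioned above is not modeled here.

```r
# Hypothetical genotypes: counts of the counted allele, individuals x markers
W <- matrix(c(0, 1, 2,
              1, 1, 0,
              2, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("id", 1:3), c("rs1", "rs2", "rs3")))
counted_allele <- c(rs1 = "A", rs2 = "C", rs3 = "G")

# Hypothetical scoring file: effect allele (ea), other allele (nea), weight (b)
weights <- data.frame(
  marker = c("rs1", "rs2", "rs4"),
  ea     = c("A", "T", "G"),
  nea    = c("G", "C", "A"),
  b      = c(0.10, 0.20, 0.30),
  stringsAsFactors = FALSE
)

compute_pgs <- function(W, counted_allele, weights) {
  w <- weights[weights$marker %in% colnames(W), ]  # markers present in W
  ca <- counted_allele[w$marker]
  flip <- ca == w$nea                 # genotype counts the non-effect allele
  keep <- ca == w$ea | flip           # drop markers whose alleles do not match
  Wk <- W[, w$marker[keep], drop = FALSE]
  flipk <- flip[keep]
  Wk[, flipk] <- 2 - Wk[, flipk]      # recode so counts refer to the effect allele
  drop(Wk %*% w$b[keep])
}

pgs <- compute_pgs(W, counted_allele, weights)
```

Here `rs4` is skipped because it is absent from the genotypes, and `rs2` is recoded because the genotypes count the non-effect allele.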
LD score regression for estimating genomic heritability and
correlations.
LD score regression
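The principle behind LD score regression is that, under a polygenic model, the expected chi-square statistic of marker j grows linearly with its LD score: E[chi2_j] = intercept + (N * h2 / M) * l_j, where N is the sample size, M the number of markers, and l_j the LD score. The base R sketch below recovers heritability from the slope using noise-free toy data. It is a simplification: the function name `ldsc_h2` is hypothetical, and it uses unweighted least squares without the regression weights and block jackknife standard errors that a full analysis would include. Genetic correlations are not covered.

```r
# Estimate heritability from the slope of chi-square on LD score
ldsc_h2 <- function(chi2, ldscore, n, m) {
  fit <- lm(chi2 ~ ldscore)
  slope <- unname(coef(fit)["ldscore"])
  c(intercept = unname(coef(fit)["(Intercept)"]),
    h2 = slope * m / n)
}

# Noise-free toy data generated directly from the model (all values assumed)
m <- 1000         # number of markers
n <- 50000        # GWAS sample size
h2_true <- 0.4    # heritability used to generate the data
ldscore <- seq(1, 100, length.out = m)
chi2 <- 1 + n * h2_true / m * ldscore

est <- ldsc_h2(chi2, ldscore, n, m)
```

An intercept close to one suggests that inflation in the test statistics is due to polygenicity rather than confounding such as population stratification.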
These notes and scripts were prepared in the BALDER project, funded by the ODIN platform. ODIN is sponsored by the Novo Nordisk Foundation (grant number NNF20SA0061466).
Rohde PD, Sørensen IF, Sørensen P. 2020. qgg: an R package for large-scale quantitative genetic analyses. Bioinformatics 36:8. doi.org/10.1093/bioinformatics/btz955
Rohde PD, Sørensen IF, Sørensen P. 2023. Expanded utility of the R package, qgg, with applications within genomic medicine. Bioinformatics 39:11. doi.org/10.1093/bioinformatics/btad656
Shrestha et al. 2023. Evaluation of Bayesian Linear Regression Models as a Fine Mapping Tool. Submitted. doi.org/10.1101/2023.09.01.555889
Bai et al. 2024. Evaluation of multiple marker mapping methods using single trait Bayesian Linear Regression models. In preparation.
Gholipourshahraki et al. 2024. Evaluation of Bayesian Linear Regression Models for Pathway Prioritization. In preparation.