Gene set enrichment analysis

The function gsea can perform several different gene set enrichment analyses. The general procedure is to obtain single marker statistics (e.g. summary statistics), from which it is possible to compute and evaluate a test statistic for a set of genetic markers that measures a joint degree of association between the marker set and the phenotype. The marker set is defined by a genomic feature such as genes, biological pathways, gene interactions, gene expression profiles etc.

Currently, four types of gene set enrichment analyses can be conducted with gsea; sum-based, count-based, score-based, and our own developed method, the covariance association test (CVAT). For details and comparisons of test statistics consult doi:10.1534/genetics.116.189498.

The sum test is based on the sum of all marker summary statistics located within the feature set. The single marker summary statistics can be obtained from linear model analyses (from PLINK or using the qgg glma approximation), or from single or multiple component REML analyses (GBLUP or GFBLUP) from the greml function. The sum test is powerful if the genomic feature harbors many genetic markers that have small to moderate effects.

The count-based method is based on counting the number of markers within a genomic feature that show association (or have single marker p-value below a certain threshold) with the phenotype. Under the null hypothesis (that the associated markers are picked at random from the total number of markers, thus, no enrichment of markers in any genomic feature) it is assumed that the observed count statistic is a realization from a hypergeometric distribution.

The score-based approach is based on the product between the scaled genotypes in a genomic feature and the residuals from the liner mixed model (obtained from greml).

The covariance association test (CVAT) is derived from the fit object from greml (GBLUP or GFBLUP), and measures the covariance between the total genomic effects for all markers and the genomic effects of the markers within the genomic feature.

The distribution of the test statistics obtained from the sum-based, score-based and CVAT is unknown, therefore a circular permutation approach is used to obtain an empirical distribution of test statistics.

Usage

gsea(
  stat = NULL,
  sets = NULL,
  Glist = NULL,
  W = NULL,
  fit = NULL,
  g = NULL,
  e = NULL,
  threshold = 0.05,
  method = "sum",
  nperm = 1000,
  ncores = 1
)

Arguments

stat: vector or matrix of single marker statistics (e.g. coefficients, t-statistics, p-values)
sets: list of marker sets - names corresponds to row names in stat
Glist: list providing information about genotypes stored on disk
W: matrix of centered and scaled genotypes (used if method = cvat or score)
fit: list object obtained from a linear mixed model fit using the greml function
g: vector (or matrix) of genetic effects obtained from a linear mixed model fit (GBLUP of GFBLUP)
e: vector (or matrix) of residual effects obtained from a linear mixed model fit (GBLUP of GFBLUP)
threshold: used if method='hyperg' (threshold=0.05 is default)
method: including sum, cvat, hyperg, score
nperm: number of permutations used for obtaining an empirical p-value
ncores: number of cores used in the analysis

Value

Returns a dataframe or a list including

stat: marker set test statistics
m: number of markers in the set
p: enrichment p-value for marker set

Author

Peter Soerensen

Examples



 # Simulate data
 W <- matrix(rnorm(1000000), ncol = 1000)
 colnames(W) <- as.character(1:ncol(W))
 rownames(W) <- as.character(1:nrow(W))
 y <- rowSums(W[, 1:10]) + rowSums(W[, 501:510]) + rnorm(nrow(W))

 # Create model
 data <- data.frame(y = y, mu = 1)
 fm <- y ~ 0 + mu
 X <- model.matrix(fm, data = data)

 # Single marker association analyses
 stat <- glma(y=y,X=X,W=W)

 # Create marker sets
 f <- factor(rep(1:100,each=10), levels=1:100)
 sets <- split(as.character(1:1000),f=f)

 # Set test based on sums
 b2 <- stat[,"stat"]**2
 names(b2) <- rownames(stat)
 mma <- gsea(stat = b2, sets = sets, method = "sum", nperm = 100)
 head(mma)

 # Set test based on hyperG
 p <- stat[,"p"]
 names(p) <- rownames(stat)
 mma <- gsea(stat = p, sets = sets, method = "hyperg", threshold = 0.05)
 head(mma)

# \donttest{
 G <- grm(W=W)
 fit <- greml(y=y, X=X, GRM=list(G=G), theta=c(10,1))

 # Set test based on cvat
 mma <- gsea(W=W,fit = fit, sets = sets, nperm = 1000, method="cvat")
 head(mma)

 # Set test based on score
 mma <- gsea(W=W,fit = fit, sets = sets, nperm = 1000, method="score")
 head(mma)

# }