Hi,
I received a data-set of methylation-data (Illumina EPIC) derived from human blood samples, which quite puzzles me:
The experiment had a paired set-up, consisting of a control and a treatment group.
- First, a baseline measurment of both groups was established pre treatment ("t0")
- Than one group was treated and one was not (Control) -> a 2nd measurment was taken after that ("t1")
- For all patients sex and age is known
I first processed the Illumina data with "minfi" to produce a beta-value matrix, from which I would like to perform downstream analysis. I followed a generic pipeline:
- Remove poor quality samples via
detectionP(rgSet)
and removing those with a value <0.05 - I created a genomic-ratio set with preprocessQuantile-normalisation
- I did probe filtring of the GR-set via
detP()
and kept those with values <0.01 - I did sex-chromosome fitlering
- I removed loci with common SNPs
My first approach was to just calculate differentially mehylated positions (DMPs), using dmpFinder()
between control and treatment after treatment (t1), but I only received a single significant site.
I proceeded to apply a more sophisticated model by correcting for the baseline as a covariate with limma/ANCOVA, but without success.
Therefore, I now switched to use ALL, but mostly non-significant DMPs for GSEA analysis to take into account all sites; In contrast to a hypergeometric test with pvalues<0.05.
I have used methylRRA()
to do so an this time receive about 800 significant sites after BH correction between control and diet at t1, but also about 90 at t0 (which should ideally be 0)
Thus my questions are:
- Can I do this?, I have my doubts as I also get a bunch of significant terms at t0
- If yes, is there a better way to approach it? Like, to adjust for baseline to get a more statistically robust result?
- If no, could you give me another approach worth trying?
Thank you! for Your help
Update
So, I performmed a Limma Model to get differntially expressed CpGs
here::i_am("Limma_DE.R")
library(here)
library(limma)
library(minfi)
library(tidyverse)
load(here("data/processed/mVals.RData"))
#Create levels for model-matrix
Time <- factor(targets$Time, labels=c('t0','t1') )
Group <- factor(targets$Treatment, labels=c('Treat','Ctrl') )
lev=paste0(Group ,".",Time)
design <- model.matrix(~0+lev)
#Create contrast matrix with all factors of interest
cont_matrix <- makeContrasts( t1_VS_t0_Treat = levTreat.t1 - levTreat.t0,
t1_VS_t0_Ctrl = levCtrl.t1 - levCtrl.t0,
Treat_vs_Ctrl_at_t0= levTreat.t0 - levCtrl.t0,
Treat_vs_Ctrl_at_t1= levTreat.t1 - levCtrl.t1,
Interaction = (levTreat.t1-levTreat.t0)-(levCtrl.t1-levCtrl.t0),
levels=design)
#Calculate different expressed sites
fit <- lmFit(mVals, design)
fit <- contrasts.fit(fit, cont_matrix)
fit.eBayes <- eBayes(fit)
summary(decideTests(fit.eBayes, adjust.method="BH", p.value=0.05, lfc = 0.5))
save(fit.eBayes, file="data/processed/DECpG_limma.RData")
Than I run gsameth
here::i_am("GSEA.R")
library(here)
library(limma)
library(minfi)
library(missMethyl)
library(msigdbr)
library(tidyverse)
#Load eBayes fir
load("data/processed/DECpG_limma.RData")
#Extarct CpGs (#4 is Interaction)
DE_CpG <- topTable(fit.eBayes, coef=4, number=Inf, adjust.method="BH", sort.by="logFC")
head(DE_CpG)
#Get msigdb hallmark gene-sets
gs_hallmark <- msigdbr(species="Homo sapiens", category="H")
gs_C2 <- msigdbr(species="Homo sapiens", category="C2")
msig_gsea_format <- function(gs){
#Get msig gene-sets into gsameth-format (-> list of character-vectors)
#gs (data.frame): Object returned by msigdbr()
#ls_gs (list): list of gene-sets consiting of character vectors w/ entrezID
ls_gs <- gs %>% dplyr::select(gs_name,entrez_gene) %>%
tidytable::group_split.(gs_name, .keep = FALSE, .named = TRUE)
ls_gs <- lapply(ls_gs, function(x) as.character(x$entrez_gene))
return(ls_gs)
}
gsea_collection <- msig_gsea_format(gs_C2)
res_gsea <- gsameth(sig.cpg=row.names(DE_CpG %>% filter()),
array.type = "EPIC",
collection= gsea_collection,
genomic.features = "ALL", sig.genes = TRUE)
Now, I still do not have any sig. DE-CPGs. However I get sig. gene-sets after gsameth()
.
My new question would be: Is it okay to input all CpGs, even tough the documentation mentions "Character vector of significant CpG sites to test for gene set enrichment." My problem is that I would rather perform GSEA with random walks, so that I might detect some effect
Hi Lukas
You definitely should not be inputting all CpGs into
gsameth
. In the absence of significant CpGs, I would suggest inputting the top 500 or top 1000 most significant ones ranked by p-value. Thegometh
andgsameth
functions are based on Wallenius' Hypergeometric test with careful consideration of the composition of the arrays to model the bias. So it is a form of over-representation test, so you need to be supplying the functions with a subset of the CpGs that capture the differences you are interested in.Another method you could try is ebGSEA, which we found was a generally good performer. From memory, you don't need to define a subset of significant CpGs but I could be mistaken.
Cheers, Belinda