Question

GSEA for paired methylation data

0

Entering edit mode

Lukas • 0

@ed59e223

Last seen 3.0 years ago

Italy

Hi,

I received a data-set of methylation-data (Illumina EPIC) derived from human blood samples, which quite puzzles me:

The experiment had a paired set-up, consisting of a control and a treatment group.

First, a baseline measurment of both groups was established pre treatment ("t0")
Than one group was treated and one was not (Control) -> a 2nd measurment was taken after that ("t1")
For all patients sex and age is known

I first processed the Illumina data with "minfi" to produce a beta-value matrix, from which I would like to perform downstream analysis. I followed a generic pipeline:

Remove poor quality samples via detectionP(rgSet) and removing those with a value <0.05
I created a genomic-ratio set with preprocessQuantile-normalisation
I did probe filtring of the GR-set via detP() and kept those with values <0.01
I did sex-chromosome fitlering
I removed loci with common SNPs

My first approach was to just calculate differentially mehylated positions (DMPs), using dmpFinder() between control and treatment after treatment (t1), but I only received a single significant site. I proceeded to apply a more sophisticated model by correcting for the baseline as a covariate with limma/ANCOVA, but without success.

Therefore, I now switched to use ALL, but mostly non-significant DMPs for GSEA analysis to take into account all sites; In contrast to a hypergeometric test with pvalues<0.05. I have used methylRRA() to do so an this time receive about 800 significant sites after BH correction between control and diet at t1, but also about 90 at t0 (which should ideally be 0)

Thus my questions are:

Can I do this?, I have my doubts as I also get a bunch of significant terms at t0
If yes, is there a better way to approach it? Like, to adjust for baseline to get a more statistically robust result?
If no, could you give me another approach worth trying?

Thank you! for Your help

Update

So, I performmed a Limma Model to get differntially expressed CpGs

here::i_am("Limma_DE.R")
library(here)
library(limma)
library(minfi)
library(tidyverse)

load(here("data/processed/mVals.RData"))

#Create levels for model-matrix
Time <- factor(targets$Time, labels=c('t0','t1') )
Group <- factor(targets$Treatment, labels=c('Treat','Ctrl') )
lev=paste0(Group ,".",Time)

design <- model.matrix(~0+lev)

#Create contrast matrix with all factors of interest
cont_matrix <- makeContrasts( t1_VS_t0_Treat = levTreat.t1 - levTreat.t0,
                              t1_VS_t0_Ctrl = levCtrl.t1 - levCtrl.t0,
                              Treat_vs_Ctrl_at_t0= levTreat.t0 - levCtrl.t0,
                              Treat_vs_Ctrl_at_t1= levTreat.t1 - levCtrl.t1,
                              Interaction = (levTreat.t1-levTreat.t0)-(levCtrl.t1-levCtrl.t0), 
                              levels=design)

#Calculate different expressed sites
fit <- lmFit(mVals, design) 
fit <- contrasts.fit(fit, cont_matrix) 
fit.eBayes <- eBayes(fit)
summary(decideTests(fit.eBayes, adjust.method="BH", p.value=0.05, lfc = 0.5))

save(fit.eBayes, file="data/processed/DECpG_limma.RData")

Than I run gsameth

here::i_am("GSEA.R")

library(here)
library(limma)
library(minfi)
library(missMethyl)
library(msigdbr)
library(tidyverse)


#Load eBayes fir
load("data/processed/DECpG_limma.RData")
#Extarct CpGs (#4 is Interaction)
DE_CpG <- topTable(fit.eBayes, coef=4, number=Inf, adjust.method="BH", sort.by="logFC")
head(DE_CpG)

#Get msigdb hallmark gene-sets
gs_hallmark <- msigdbr(species="Homo sapiens", category="H")
gs_C2 <- msigdbr(species="Homo sapiens", category="C2")


msig_gsea_format <- function(gs){
  #Get msig gene-sets into gsameth-format (-> list of character-vectors)
  #gs (data.frame): Object returned by msigdbr()
  #ls_gs (list): list of gene-sets consiting of character vectors w/ entrezID 
  ls_gs <- gs %>% dplyr::select(gs_name,entrez_gene) %>% 
                      tidytable::group_split.(gs_name, .keep = FALSE, .named = TRUE)
  ls_gs <- lapply(ls_gs, function(x) as.character(x$entrez_gene))

  return(ls_gs)
}

gsea_collection <- msig_gsea_format(gs_C2)


res_gsea <- gsameth(sig.cpg=row.names(DE_CpG %>% filter()), 
                    array.type = "EPIC",
                    collection= gsea_collection,
                    genomic.features = "ALL", sig.genes = TRUE)

Now, I still do not have any sig. DE-CPGs. However I get sig. gene-sets after gsameth(). My new question would be: Is it okay to input all CpGs, even tough the documentation mentions "Character vector of significant CpG sites to test for gene set enrichment." My problem is that I would rather perform GSEA with random walks, so that I might detect some effect

missMethyl GSEA minfi methylGSA • 2.3k views

ADD COMMENT • link updated 3.1 years ago by belinda.phipson ▴ 40 • written 3.1 years ago by Lukas • 0

0

Entering edit mode

Hi Lukas

You definitely should not be inputting all CpGs into gsameth. In the absence of significant CpGs, I would suggest inputting the top 500 or top 1000 most significant ones ranked by p-value. The gometh and gsameth functions are based on Wallenius' Hypergeometric test with careful consideration of the composition of the arrays to model the bias. So it is a form of over-representation test, so you need to be supplying the functions with a subset of the CpGs that capture the differences you are interested in.

Another method you could try is ebGSEA, which we found was a generally good performer. From memory, you don't need to define a subset of significant CpGs but I could be mistaken.

Cheers, Belinda

ADD REPLY • link 3.1 years ago belinda.phipson ▴ 40

score 2 · Answer 1 · 2023-01-04

Hi,

In terms of your analysis pipeline, it wasn't quite clear exactly what you have done. We have a methylation workflow package that goes through all the filtering and preprocessing (eg normalisation) steps, and walks you through a limma analysis and gene set testing (and some other analysis that might not be of interest). It focuses on 450K arrays but the process is the same for EPIC arrays. It is here in case you haven't seen it: https://www.bioconductor.org/packages/release/workflows/vignettes/methylationArrayAnalysis/inst/doc/methylationArrayAnalysis.html

I have a few other comments:

1.) Do you have male/female samples? You will want to remove CpGs on the X and Y chromosomes.

2.) You should be testing differential methylation on the M-values, not the Beta values.

3.) You are probably interested in the difference of the differences, ie an interaction effect between time and treatment group. In the limma manual section 9.6.1 there is an example of how to set up the contrast matrix to do this (the section is called "Time course experiments"). Essentially you want to test a contrast like this: int = (treat.t1 - treat.t0) - (contr.t1 - contr.t0) on the M-values.

4.) Once you have fitted the contrast above you can take the top X CpGs and use that as input into gometh or gsameth in the missMethyl Bioconductor package. We recently published a paper on gene set testing for methylation arrays and show that the methylRRA methods do not control the false discovery rate, so I would strongly recommend that you don't use those methods. Your significant hits are likely to be false discoveries. Here is the link for the paper if you are interested: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02388-x.

Hope that helps.

Cheers,

Belinda