Question: Subsetting ELists and Applying sva to Iterative Differential Gene Expression Tests
gravatar for adscheid3
17 months ago by
adscheid30 wrote:

Hello! My first question has to do with applying functions from edgeR and limma to identify genes that are expressed at levels that are consistent within individuals over time but differential between individuals. We have replicate A and B samples for individuals from two cohorts (100 and 500 series below), and we're identifying differentially expressed genes separately in each cohort because there are batch effects between them:

     ID Individual Bleed Batch
 101A-0        101     A     1
 101B-0        101     B     1
 102A-0        102     A     1
 102B-0        102     B     1
 103A-0        103     A     1
 103B-0        103     B     1
 107A-0        107     A     1
 107B-0        107     B     1
 108A-0        108     A     1
 108B-0        108     B     1
 109A-0        109     A     1
 109B-0        109     B     1
 110A-0        110     A     1
 110B-0        110     B     1
 111A-0        111     A     1
 111B-0        111     B     1
 112A-0        112     A     1
 112B-0        112     B     1
 502A-0        502     A     2
 502B-0        502     B     2
 503A-0        503     A     2
 503B-0        503     B     2
 504A-0        504     A     2
 504B-0        504     B     2
 505A-0        505     A     2
 505B-0        505     B     2
 506A-0        506     A     2
 506B-0        506     B     2
 507A-0        507     A     2
 507B-0        507     B     2
 508A-0        508     A     2
 508B-0        508     B     2
 509A-0        509     A     2
 509B-0        509     B     2
 510A-0        510     A     2
 510B-0        510     B     2
 511A-0        511     A     2
 511B-0        511     B     2
 512A-0        512     A     2
 512B-0        512     B     2 

I begin by creating a DGEList with data for all 40 samples, filtering out genes expressed at low levels, and then normalizing the data using TMM and voom:

dge <- DGEList(data)
keep <- rowSums(cpm(dge)>=1) >=6
fil_dge <- dge[keep, , keep.lib.size = FALSE]
fil_dge <- calcNormFactors(fil_dge)

Individual <- factor(Dems$Individual)
indDesign <- model.matrix(~0+Individual)  
rownames(indDesign) <- Dems$ID
v1 <- voom(fil_dge,indDesign)

I then filter out genes with absolute value mean fold changes >1.5 between A and B replicates using the log2 CPM values generated by voom:

n_genes <- nrow(fil_dge)
n_individuals <- (nrow(Dems))/2
abs_log_FCs <- matrix(0,n_genes,n_individuals)
row.names(abs_log_FCs) <- rownames(fil_dge)
for (i in 1:n_individuals){
  abs_log_FCs[,i] <- abs((v1$E[,2*i])-(v1$E[,(2*i)-1]))
ltFCthresh <- which((rowMeans(abs_log_FCs))<(log(1.5,2)))
ltFCthresh <- row.names(abs_log_FCs)[ltFCthresh]
fil_v1 <- v1[ltFCthresh,]

Finally, I use fil_v1 to perform differential gene expression analysis between each individual one-by-one (e.g. 101 vs. 102, 101 vs. 103, etc.) separately for each cohort. I determine which genes are differentially expressed by converting F.p.values from limma's eBayes function into FDRs for each cohort, and overlap differentially expressed genes from each cohort to find the genes that are differentially expressed in each case. Are these acceptable applications of edgeR and limma? For example, is it okay to subset the EList from voom normalization like this before performing differential gene expression, or should I do voom normalization again after filtering out genes that are expressed variably between A and B replicates?  

Second, I'd like to iteratively arrange the 20 individuals into random groups of 2 and perform differential gene expression analysis between the groups using the gene set from the above analysis each time. Because I'm combining the cohorts in this analysis and there are batch effects between them I'd like to apply sva. From what I've read the best way to do that would be to specify which groups are being contrasted (group variable below), perform sva for each contrast, derive the surrogate variables and include them in a voom normalization, and subset the previously defined gene set (gene_set variable below) from the resulting EList for differential expression analysis as follows:

mod0 <- model.matrix(~1, data=Dems)
mod1 <- model.matrix(~groups, data=Dems)
sva.obj <- sva(v1$E, mod1, mod0)
mod2 <- model.matrix(~groups+sva.obj$sv, data=Dems)
v2 <- voom(fil_dge, mod2)
fil_v2 <- v2[gene_set,]

Is there a better way to account for the batch effect prior to differential expression analysis (I realize I could block it but I believe I'd lose power that way) or is that the most effective way to apply sva and voom normalization given what I'd like to accomplish? Insight is much appreciated, thanks! All the best!


ADD COMMENTlink modified 17 months ago by Gordon Smyth35k • written 17 months ago by adscheid30
gravatar for Gordon Smyth
17 months ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

It's really much simpler than that! An anova approach will do exactly what you want, finding genes that are different between individuals but consistent within individuals. If you didn't have a batch effect, then this would work:

Individual <- factor(Dems$Individual)
design <- model.matrix(~Individual)
v <- voom(fil_dge, design)
fit <- lmFit(v, design)
fit <- eBayes(v, robust=TRUE)
topTable(fit, coef=2:20)

There's no need to man-handle the analysis by filtering out genes that are different between bleeds. The anova already does that automatically.

Since you want to correct for differences between the batches, a little more care is needed when making the design matrix. First, make a design matrix for the first batch:

Batch <- factor(Dems$Individual)
Ind1 <- factor(Dems$Individual[Batch==1])
design1 <- model.matrix(~Ind1)

Then do the same for batch 2:

Ind2 <- factor(Dems$Individual[Batch==2])
design2 <- model.matrix(~Ind2)

Now put them together with the batch correction:

design.batch <- model.match(~Batch)
design.individuals.within.batches <- blockDiag(design1[,-1],design2[,-1])
design <- cbind(design.batch, design.individuals.within.batches)

Now it's easy:

fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
fit2 <- fit[,-c(1,2)]

There's no need for sva. sva is intended for batch correction when you don't know what the batches are, but here you know exactly what they are.

There's no need to filter genes, or to arrange individuals into random groups, or anything like that.

To look for subgroups within the individuals, just make a plot of the samples and look:

logCPM <- cpm(fil_dge, log=TRUE, prior.count=3)
logCPM <- removeBatchEffect(logCPM, batch=Batch)
plotMDS(logCPM, label=dem$Individual)

or alternatively make a heatmap based on the most DE genes:

o <- order(fit2$F.p.value)

Either of these plots will provide clusterings of the individuals, which is the standard method to check for subgroups.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Gordon Smyth35k

Thanks for your response Dr. Smyth, it's very helpful! My second question actually has to do with looking at potential subgroups within the 20 individuals using genes that are differentially expressed among them, so I would still like to iteratively organize them into groups of 2 and do differential gene expression between the groups each time. To accomplish that would you subset v to genes that are differentially expressed among individuals, block the batch effect and perform each contrast of interest or use sva as I outlined above? Thanks again, all the best!  


ADD REPLYlink written 17 months ago by adscheid30

Great thanks again for your help!


ADD REPLYlink written 17 months ago by adscheid30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 449 users visited in the last hour