Question

Subsetting ELists and Applying sva to Iterative Differential Gene Expression Tests

0

Entering edit mode

adscheid3 • 0

@adscheid3-12893

Last seen 6.9 years ago

Hello! My first question has to do with applying functions from edgeR and limma to identify genes that are expressed at levels that are consistent within individuals over time but differential between individuals. We have replicate A and B samples for individuals from two cohorts (100 and 500 series below), and we're identifying differentially expressed genes separately in each cohort because there are batch effects between them:

Dems
     ID Individual Bleed Batch
 101A-0        101     A     1
 101B-0        101     B     1
 102A-0        102     A     1
 102B-0        102     B     1
 103A-0        103     A     1
 103B-0        103     B     1
 107A-0        107     A     1
 107B-0        107     B     1
 108A-0        108     A     1
 108B-0        108     B     1
 109A-0        109     A     1
 109B-0        109     B     1
 110A-0        110     A     1
 110B-0        110     B     1
 111A-0        111     A     1
 111B-0        111     B     1
 112A-0        112     A     1
 112B-0        112     B     1
 502A-0        502     A     2
 502B-0        502     B     2
 503A-0        503     A     2
 503B-0        503     B     2
 504A-0        504     A     2
 504B-0        504     B     2
 505A-0        505     A     2
 505B-0        505     B     2
 506A-0        506     A     2
 506B-0        506     B     2
 507A-0        507     A     2
 507B-0        507     B     2
 508A-0        508     A     2
 508B-0        508     B     2
 509A-0        509     A     2
 509B-0        509     B     2
 510A-0        510     A     2
 510B-0        510     B     2
 511A-0        511     A     2
 511B-0        511     B     2
 512A-0        512     A     2
 512B-0        512     B     2

I begin by creating a DGEList with data for all 40 samples, filtering out genes expressed at low levels, and then normalizing the data using TMM and voom:

dge <- DGEList(data)
keep <- rowSums(cpm(dge)>=1) >=6
fil_dge <- dge[keep, , keep.lib.size = FALSE]
fil_dge <- calcNormFactors(fil_dge)

Individual <- factor(Dems$Individual)
indDesign <- model.matrix(~0+Individual)  
rownames(indDesign) <- Dems$ID
v1 <- voom(fil_dge,indDesign)

I then filter out genes with absolute value mean fold changes >1.5 between A and B replicates using the log2 CPM values generated by voom:

n_genes <- nrow(fil_dge)
n_individuals <- (nrow(Dems))/2
abs_log_FCs <- matrix(0,n_genes,n_individuals)
row.names(abs_log_FCs) <- rownames(fil_dge)
for (i in 1:n_individuals){
  abs_log_FCs[,i] <- abs((v1$E[,2*i])-(v1$E[,(2*i)-1]))
}
ltFCthresh <- which((rowMeans(abs_log_FCs))<(log(1.5,2)))
ltFCthresh <- row.names(abs_log_FCs)[ltFCthresh]
fil_v1 <- v1[ltFCthresh,]

Finally, I use fil_v1 to perform differential gene expression analysis between each individual one-by-one (e.g. 101 vs. 102, 101 vs. 103, etc.) separately for each cohort. I determine which genes are differentially expressed by converting F.p.values from limma's eBayes function into FDRs for each cohort, and overlap differentially expressed genes from each cohort to find the genes that are differentially expressed in each case. Are these acceptable applications of edgeR and limma? For example, is it okay to subset the EList from voom normalization like this before performing differential gene expression, or should I do voom normalization again after filtering out genes that are expressed variably between A and B replicates?

Second, I'd like to iteratively arrange the 20 individuals into random groups of 2 and perform differential gene expression analysis between the groups using the gene set from the above analysis each time. Because I'm combining the cohorts in this analysis and there are batch effects between them I'd like to apply sva. From what I've read the best way to do that would be to specify which groups are being contrasted (group variable below), perform sva for each contrast, derive the surrogate variables and include them in a voom normalization, and subset the previously defined gene set (gene_set variable below) from the resulting EList for differential expression analysis as follows:

mod0 <- model.matrix(~1, data=Dems)
mod1 <- model.matrix(~groups, data=Dems)
sva.obj <- sva(v1$E, mod1, mod0)
mod2 <- model.matrix(~groups+sva.obj$sv, data=Dems)
v2 <- voom(fil_dge, mod2)
fil_v2 <- v2[gene_set,]

Is there a better way to account for the batch effect prior to differential expression analysis (I realize I could block it but I believe I'd lose power that way) or is that the most effective way to apply sva and voom normalization given what I'd like to accomplish? Insight is much appreciated, thanks! All the best!

Adam

limma edgeR sva RNA-Seq • 1.2k views

ADD COMMENT • link updated 7.0 years ago by Gordon Smyth 50k • written 7.0 years ago by adscheid3 • 0

score 3 · Answer 1 · 2017-04-25

It's really much simpler than that! An anova approach will do exactly what you want, finding genes that are different between individuals but consistent within individuals. If you didn't have a batch effect, then this would work:

Individual <- factor(Dems$Individual)
design <- model.matrix(~Individual)
v <- voom(fil_dge, design)
fit <- lmFit(v, design)
fit <- eBayes(v, robust=TRUE)
topTable(fit, coef=2:20)

There's no need to man-handle the analysis by filtering out genes that are different between bleeds. The anova already does that automatically.

Since you want to correct for differences between the batches, a little more care is needed when making the design matrix. First, make a design matrix for the first batch:

Batch <- factor(Dems$Individual)
Ind1 <- factor(Dems$Individual[Batch==1])
design1 <- model.matrix(~Ind1)

Then do the same for batch 2:

Ind2 <- factor(Dems$Individual[Batch==2])
design2 <- model.matrix(~Ind2)

Now put them together with the batch correction:

design.batch <- model.match(~Batch)
design.individuals.within.batches <- blockDiag(design1[,-1],design2[,-1])
design <- cbind(design.batch, design.individuals.within.batches)

Now it's easy:

fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
fit2 <- fit[,-c(1,2)]
topTable(fit2)

There's no need for sva. sva is intended for batch correction when you don't know what the batches are, but here you know exactly what they are.

There's no need to filter genes, or to arrange individuals into random groups, or anything like that.

To look for subgroups within the individuals, just make a plot of the samples and look:

logCPM <- cpm(fil_dge, log=TRUE, prior.count=3)
logCPM <- removeBatchEffect(logCPM, batch=Batch)
plotMDS(logCPM, label=dem$Individual)

or alternatively make a heatmap based on the most DE genes:

o <- order(fit2$F.p.value)
coolmap(logCPM[o[1:50],])

Either of these plots will provide clusterings of the individuals, which is the standard method to check for subgroups.