Question

ROAST after edgeR glmTreat and average RPKM filter

0

Entering edit mode

b.nota ▴ 370

@bnota-7379

Last seen 4.8 years ago

Netherlands

Hello there,

I am trying to use roast in combination with edgeR glmTreat and an average RPKM filter, but I am not sure how to correctly implement this. Hope someone can help with this.

So our strategy is to use glmTreat for at least 1.5 FC, and subsequently we are only interested in genes that have at least 8 RPKM in at least one group of the pairwise contrast.

My script works as follows:

library(edgeR)

DGE1 <- DGEList(counts,group=Group)

keep <- rowSums(cpm(DGE1)>1) >= 3

DGE2 <- DGE1[keep,]

y1 <- calcNormFactors(DGE2)

y <- estimateGLMCommonDisp(y1,design,verbose=TRUE)

y <- estimateGLMTrendedDisp(y, design, verbose=TRUE)

y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)

# Treat 1.5FC

tr <- glmTreat(fit, contrast=c(-1,0,0,0,0,0,1,0),lfc=log2(1.5))

# Subsequent average RPKM filter

RPKM_A <- rowMeans(RPKM[,c(13,21)])
RPKM_B <- rowMeans(RPKM[,c(6,14,22)])

Ave_RPKM <- cbind(RPKM_A, RPKM_B)

at_least_RPKM8 <- rownames(Ave_RPKM[Ave_RPKM[,1] > 8 | Ave_RPKM[,2] > 8,])

Top <- topTags(tr,n=length(y1$counts), adjust.method="BH", sort.by="p.value")

TopTable <- Top[at_least_RPKM8,]

So I have selected my genes with at least 8 RPKM in one group from my TopTable. And now if I want to go for roast I have to use the fit again, but I don't know how to get the treat 1.5 FC in there nor the > 8 RPKM filter. Can anyone suggest a better way to do this?

# ROAST

roast_res <- roast(y,index=geneSet,design=design,contrast=c(-1,0,0,0,0,0,1,0))

Many thanks in advance for your thoughts and ideas!

Ben

roast edgeR glmtreat • 1.5k views

ADD COMMENT • link updated 8.8 years ago by Aaron Lun ★ 28k • written 8.8 years ago by b.nota ▴ 370

score 3 · Answer 1 · 2016-09-02

There's a couple of problems with your "average RPKM filter". The first is that filtering after applying the BH correction does not guarantee that the FDR will still be properly controlled across the retained genes. For example, if top half of your DE list in Top was removed by at_least_RPKM8, the FDR across the remaining genes in TopTable would be greater than the specified threshold in TopTable$table$FDR. If you filter, you should generally do so before you perform the multiplicity correction.

That brings us to the second problem with your filter, in that it's not independent of the DE statistic. Specifically, your filter will implicitly discard non-DE genes that have similar average expression in your two groups (A and B) that is below the threshold of 8. For example, a non-DE gene that is expressed at RPKMs of 7 in both groups will be filtered out. This means that you've effectively used information related to the DE status of the genes to filter them. However, the multiple testing correction doesn't know that, meaning that it doesn't account for the implicit testing performed during filtering. As a result, you can't apply this particular filter before performing the multiple testing correction.

If you stick with this filtering strategy, you're stuck between a rock and a hard place. Both choices - filtering before or after correction - do not have good statistical properties. Instead, I would advise you to filter before correction with a strategy that is independent of DE, e.g., the average RPKM across both A and B. The average RPKM doesn't provide information about DE between A and B, because the manner in which the total expression gets allocated into each of the two groups is unknown. This ensures that the filtering procedure doesn't use DE-related information, and that the multiple testing correction is valid afterwards. In other words:

keep <- rowMeans(RPKM[,c(6,13,14,21,22)]) > 8
Top <- topTags(tr[keep,], n=Inf, adjust.method="BH", sort.by="p.value")

The same applies for the CPMs, which is what I would use unless there's something about the gene length that you just have to model.

Finally, you can get roast to use the filter by subsetting y by keep. However, as far as I know, roast doesn't support a log-fold change threshold. The null hypothesis being tested is that there's no DE of any kind between conditions for any of the genes in the set.