Clarification MAplot and FDR in EdgeR
2
0
Entering edit mode
@josephbergenstrahle-11312
Last seen 4.0 years ago

Hi!

I'm doing some DE analysis via edgeR. There is one part I have some trouble to understand, and I suspect it may have to do with poor mathematical understanding on my part.

Below I show an MA-plot. The points (genes) are labeled according to some gene lists. The genes are marked red if they are considered significantly deferentially expressed. Or this is my goal anyway. In order to do this, I make use of the following function within edgeR:

(0 in this case is genes that are not flagged as significant)

my understanding is that "efit" have a 5% p-adjusted threshold as default

I do however find this plot somewhat confusing. Differential expression is more pronounced at the extreme in the y-axis, and futher to the right of the x-axis would mean that we have more "confident"/"support" of our differential expression since we have a higher number of counts (i.e. more data to rely on). Hence, would it not be more logical if the red dots appear in the top right / buttom right corner of the graph. Some genes are marked black eventhough they have a higher |FC| AND higher mean of expression.

Is my reasoning wrong here?

Furthermore, there might be something I have got wrong about the structure of the MArrayLM object, since when I do the following

table(efit$F.p.value<0.05) I get a very much higher number of hits. Shouldn't these be the same? Or is this some other padjusted value? Any explanation would be highly appreciated! Thanks! /JB edgeR MAplot rnaseq differential gene expression • 1.1k views ADD COMMENT 0 Entering edit mode Please post the code that you used to perform the DE analyses and to generate this plot. ADD REPLY 0 Entering edit mode #**** Reading in my count files ****** dgeobj = readDGE(files, columns=c(1,2), header=FALSE) #**** Just adding some names ***** samplenames = substring(colnames(dgeobj), 0 ,3) colnames(dgeobj) = samplenames #**** Group according to treatment **** group = factor(c("Medium", "Polyl", "Medium", "Polyl"), levels = c("Medium", "Polyl")) cell = as.factor(c("D68", "D68", "D69", "D69")) dgeobj$samples$group = group dgeobj$samples$cell = cell #**** geneIDs are stored in rownames of DGEList object ***** geneid = rownames(dgeobj) genes = select(Homo.sapiens, keys=geneid, columns=c("SYMBOL"), keytype="ENSEMBL") #**** Check and remove duplicate geneIDs **** dup = genes$ENSEMBL[duplicated(genes$ENSEMBL)] mat = match(geneid, genes$ENSEMBL)
genes = genes[mat,]
dgeobj$genes = genes #**** transforming with log **** lcpm = cpm(dgeobj, log=TRUE) # **** Pre-filtering, remove low count genes ***** keep = rowSums(lcpm>0)>=2 dgeobj = dgeobj[keep,, keep.lib.sizes=FALSE] # **** Normalization ***** dgeobj = calcNormFactors(dgeobj, method="TMM") # ***** Construction of design matrix ***** design = model.matrix(~0+group) colnames(design) = gsub("Condition", "", colnames(design)) # **** Contrusction of contrast matrix ***** contr.matrix = makeContrasts(MediumvsPolyl = groupPolyl-groupMedium, levels = colnames(design)) #***** Voom conversion and fitting of data ****** v <- voom(dgeobj, design, plot=FALSE) vfit <- lmFit(v, design) vfit <- contrasts.fit(vfit, contrasts=contr.matrix) efit <- eBayes(vfit) # ***** MA-plot ****** log2FoldChange = efit$coefficients[,1]
amean = efit$Amean fdr = efit$F.p.value

edgeR_DE = data.frame(amean, log2FoldChange, fdr)
names(edgeR_DE)=c("Amean", "log2FoldChange", "fdr")

plot(x=edgeR_DE$Amean, y=edgeR_DE$log2FoldChange, ylim=c(-12,12), pch=19)
legend(.....)

#Marking the dots with my own gene-lists
(removed not relevant code here)

abline(h=1.5, lty=2, col="red")
abline(h=-1.5, lty=2, col="red")

#This is the index for marking the red dots

#color above FDR threshold
with(edgeR_DE[aboveFDR, ], {
points(Amean, log2FoldChange, col="red", cex=0.5, pch=19)})

 
 
2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 3 hours ago
The city by the bay

Firstly, your question is tagged edgeR but the bulk of your code actually uses limma with voom. There's a difference, so be aware of that when you post questions in the future.

Secondly, the F.p.value field is not the FDR-adjusted p-value. Rather, it is the p-value from the F-test of testing all contrasts simultaneously in an ANOVA (which, because you only have one contrast, will be the same as the raw p-value for that contrast). If you want the adjusted p-value - or more generally, if you want the DE results - you should just do topTable(efit, n=Inf), which will give you a data frame very similar to edgeR_DE. Check out ?topTable for more details, in particular on the sorting of the output rows.

Thirdly, there's a native plotMA function in limma that you can apply directly to efit` to generate this type of plot. This is more convenient and error-free than setting up your own code, unless you're at the final stages of preparing figures for publication (in which case some customized plotting code is inevitably required).

Other than that, I can't see any obvious reason for why some genes with the same mean but higher log-fold changes are not detected. I can only assume that those genes have higher sample variances, which reduces the power to detect them as being DE. Perhaps a volcano plot may be more instructive as it will highlight the relationship between the p-value and log-fold change.

0
Entering edit mode

Excellent answer Aaron. My tagging was clearly a miss from my part.

After labeling the ones that had a certain FC change, I look back at the MA plot (this time via the plotMA in limma :) ):

Noticed that some of the highly FC ones had very low p-values, for example "CXCL11". I went over a bunch of them and looked at raw count values, and indeed that was the case as you mentioned -> high sample variances. CXCL11 had for example (counts):

2,      1263,         13,      12688

1
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia

It is unfortunate that you chose to keep plot=FALSE when you ran voom(), because the plot would have answered many of your questions. Sure, the variances tend to decrease as the average count size increases. But every gene has its own variance, and some genes are highly variable even if they have large counts. So it is totally expected and normal that some genes with large fold changes will not be assessed as significantly DE.

0
Entering edit mode

Yeah, as mentioned in my comment above that was the case. And It all makes total sense, I will call it a newbie error in thinking :D