Question

Clarification MAplot and FDR in EdgeR

0

Entering edit mode

joseph.bergenstrahle • 0

@josephbergenstrahle-11312

Last seen 6.4 years ago

Hi!

I'm doing some DE analysis via edgeR. There is one part I have some trouble to understand, and I suspect it may have to do with poor mathematical understanding on my part.

Below I show an MA-plot. The points (genes) are labeled according to some gene lists. The genes are marked red if they are considered significantly deferentially expressed. Or this is my goal anyway. In order to do this, I make use of the following function within edgeR:

decideTests(efit, adjust.method="BH")!=0

(0 in this case is genes that are not flagged as significant)

my understanding is that "efit" have a 5% p-adjusted threshold as default

I do however find this plot somewhat confusing. Differential expression is more pronounced at the extreme in the y-axis, and futher to the right of the x-axis would mean that we have more "confident"/"support" of our differential expression since we have a higher number of counts (i.e. more data to rely on). Hence, would it not be more logical if the red dots appear in the top right / buttom right corner of the graph. Some genes are marked black eventhough they have a higher |FC| AND higher mean of expression.

Is my reasoning wrong here?

Furthermore, there might be something I have got wrong about the structure of the MArrayLM object, since when I do the following

table(efit$F.p.value<0.05)

I get a very much higher number of hits. Shouldn't these be the same? Or is this some other padjusted value?

Any explanation would be highly appreciated!

Thanks!

/JB

edgeR MAplot rnaseq differential gene expression • 1.8k views

ADD COMMENT • link updated 7.7 years ago by Gordon Smyth 50k • written 7.7 years ago by joseph.bergenstrahle • 0

0

Entering edit mode

Please post the code that you used to perform the DE analyses and to generate this plot.

ADD REPLY • link 7.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

#**** Reading in my count files ******
dgeobj = readDGE(files, columns=c(1,2), header=FALSE) 

#**** Just adding some names *****
samplenames = substring(colnames(dgeobj), 0 ,3)
colnames(dgeobj) = samplenames

#**** Group according to treatment ****
group = factor(c("Medium", "Polyl", "Medium", "Polyl"), levels = c("Medium", "Polyl"))
cell = as.factor(c("D68", "D68", "D69", "D69"))
dgeobj$samples$group = group
dgeobj$samples$cell = cell

#**** geneIDs are stored in rownames of DGEList object *****
geneid = rownames(dgeobj)
genes = select(Homo.sapiens, keys=geneid, columns=c("SYMBOL"), keytype="ENSEMBL")

#**** Check and remove duplicate geneIDs ****
dup = genes$ENSEMBL[duplicated(genes$ENSEMBL)]
mat = match(geneid, genes$ENSEMBL)
genes = genes[mat,]
dgeobj$genes = genes

#**** transforming with log ****
lcpm = cpm(dgeobj, log=TRUE)

# **** Pre-filtering, remove low count genes *****
keep = rowSums(lcpm>0)>=2
dgeobj = dgeobj[keep,, keep.lib.sizes=FALSE]

# **** Normalization *****
dgeobj = calcNormFactors(dgeobj, method="TMM")

# ***** Construction of design matrix *****
design = model.matrix(~0+group)
colnames(design) = gsub("Condition", "", colnames(design))

# **** Contrusction of contrast matrix *****
contr.matrix = makeContrasts(MediumvsPolyl = groupPolyl-groupMedium, levels = colnames(design))

#***** Voom conversion and fitting of data ******
v <- voom(dgeobj, design, plot=FALSE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)

# ***** MA-plot ******
log2FoldChange = efit$coefficients[,1]
amean = efit$Amean
fdr = efit$F.p.value

edgeR_DE = data.frame(amean, log2FoldChange, fdr)
names(edgeR_DE)=c("Amean", "log2FoldChange", "fdr")

  plot(x=edgeR_DE$Amean, y=edgeR_DE$log2FoldChange, ylim=c(-12,12), pch=19)
  legend(.....)
  
#Marking the dots with my own gene-lists
  (removed not relevant code here)
  
  abline(h=1.5, lty=2, col="red")
  abline(h=-1.5, lty=2, col="red")
  
#This is the index for marking the red dots
  aboveFDR = rownames(edgeR_DE[decideTests(efit, adjust.method="BH")!=0,])
  
  #color above FDR threshold
  with(edgeR_DE[aboveFDR, ], {
    points(Amean, log2FoldChange, col="red", cex=0.5, pch=19)})
```

ADD REPLY • link 7.7 years ago joseph.bergenstrahle • 0

score 2 · Answer 1 · 2016-08-17

Firstly, your question is tagged edgeR but the bulk of your code actually uses limma with voom. There's a difference, so be aware of that when you post questions in the future.

Secondly, the F.p.value field is not the FDR-adjusted p-value. Rather, it is the p-value from the F-test of testing all contrasts simultaneously in an ANOVA (which, because you only have one contrast, will be the same as the raw p-value for that contrast). If you want the adjusted p-value - or more generally, if you want the DE results - you should just do topTable(efit, n=Inf), which will give you a data frame very similar to edgeR_DE. Check out ?topTable for more details, in particular on the sorting of the output rows.

Thirdly, there's a native plotMA function in limma that you can apply directly to efit to generate this type of plot. This is more convenient and error-free than setting up your own code, unless you're at the final stages of preparing figures for publication (in which case some customized plotting code is inevitably required).

Other than that, I can't see any obvious reason for why some genes with the same mean but higher log-fold changes are not detected. I can only assume that those genes have higher sample variances, which reduces the power to detect them as being DE. Perhaps a volcano plot may be more instructive as it will highlight the relationship between the p-value and log-fold change.

score 1 · Answer 2 · 2016-08-17

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 23 minutes ago

WEHI, Melbourne, Australia

It is unfortunate that you chose to keep plot=FALSE when you ran voom(), because the plot would have answered many of your questions. Sure, the variances tend to decrease as the average count size increases. But every gene has its own variance, and some genes are highly variable even if they have large counts. So it is totally expected and normal that some genes with large fold changes will not be assessed as significantly DE.

ADD COMMENT • link 7.7 years ago Gordon Smyth 50k

0

Entering edit mode

Yeah, as mentioned in my comment above that was the case. And It all makes total sense, I will call it a newbie error in thinking :D

Thanks for the answer!

ADD REPLY • link 7.7 years ago joseph.bergenstrahle • 0