Different number of DE genes using edgeR and DESEQ2
Good morning to everyone, I'm trying to compare the DE transcripts results obtained by using two different workflows (RNASeqGeneEdgeRQL and rnaseqgene which rely on edgeR quasi-likelihood and DESEQ2 respectively). I obtained different results I'm quite perplexed about, both in terms of number and identity of differentially expressed genes found.
................
I'm not sure the two pipelines are exactly comparable. Before getting to the point, I'm going to summarize what I did before obtained my list of DE transcripts.
The functions I used to align .BAM files to features are featureCounts and summarizeOverlaps respectively, the latter resulting in higher expression levels for different transcripts, and with the following parameters:
For edgeR
fc <- featureCounts(all.bam, annot.ext="pathofmygff3", isGTFAnnotationFile=TRUE, nthreads=16, GTF.featureType="transcript", GTF.attrType="ID", allowMultiOverlap=TRUE, fraction=TRUE, isPairedEnd=TRUE)
targets <- read.delim("pathofmytargetfile", stringsAsFactors=FALSE)
For DESEQ2
se <- summarizeOverlaps(features=ebg_transcripts, reads=bamfiles, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE, inter.feature=FALSE)
..................
With edgeR, I removed any transcript (I'm working on transcripts, not genes) whose expression level is < 10 On a total of 88792, on edgeR 65157 were kept.
Similarly, with DESEQ2, I removed any transcript whose expression level sum over the samples is < 10 This way, 75231 were kept.
..........
I continued with the pipelines sticking to default, performing normalization etc, till I got to perform DE transcript analysis. At this point, I used glmQLFTest for edgeR and secondly I applied a threshold (log FC > 1.5) with glmTreat. Apparently edgeR filters genes showing FDR < 5%. I'm not sure if a p value threshold is applied at all here.
This is what I obtained (on the left, unfiltered transcripts, the right, only logFC > 1.5 transcripts).
https://i.ibb.co/N7pPf45/DE-GOANA.png
I tried to do the same with DESEQ2, comparing the same condition over samples, by specifying:
dds <- DESeq(dds)
res <- results(dds, contrast=c("Condizione","Carica","Scarica"))
summary(res)
8741 upregulated and 9545 downregulated transcripts were found (far more than the number I obtained starting from edgeR quasi-likelihood)
And then (I removed no transcripts on the basis of FDR, just on log FC). Only transcripts for pvalue < 0.05 are kept apparently.
resLFC15 <- results(dds, lfcThreshold=1.5)
table(resLFC15$padj < 0.05)
summary(resLFC15)
But this time I got no DE transcripts! To obtain a similar number of DE genes (about 100) I had to use a lower treshold (log FC > 0.5)... and always for pvalue < 0.05
resLFC05 <- results(dds, lfcThreshold=0.5)
table(resLFC15$padj < 0.05)
summary(resLFC05)
This way, I got 78 upregulated and 36 downregulated ones.
How can the results obtained by applying functionally equivalent scripts over different pipelines differ so much? Maybe it's because they apply very different statistical methods? Is that expected or I did something wrong? Not to mention the transcripts I obtain differ quite a lot in terms of position and priority in my final list (but this may be due to the different counts, what worries me the most is the number of them with apparently similar criteria...) Which one would be better and why does the threshold I chose have a different effect? Am I missing something? Thanks in advance!
Hello, thank you for your answer, I'll try using salmon which is apparently the best choice for transcripts and re-run the analysis.
All the workflows I tried quantified genes, not transcripts, in their vignette example, and I just tried (maybe by over-simplyfing the whole thing) to make them work on transcripts instead by changing parameters.