Question

Different numbers of DE genes using edgeR and DESEQ2

0

Entering edit mode

Raito92 ▴ 60

@raito92-20399

Last seen 3.4 years ago

Italy

Different number of DE genes using edgeR and DESEQ2

Good morning to everyone, I'm trying to compare the DE transcripts results obtained by using two different workflows (RNASeqGeneEdgeRQL and rnaseqgene which rely on edgeR quasi-likelihood and DESEQ2 respectively). I obtained different results I'm quite perplexed about, both in terms of number and identity of differentially expressed genes found.

................

I'm not sure the two pipelines are exactly comparable. Before getting to the point, I'm going to summarize what I did before obtained my list of DE transcripts.

The functions I used to align .BAM files to features are featureCounts and summarizeOverlaps respectively, the latter resulting in higher expression levels for different transcripts, and with the following parameters:

For edgeR

fc <- featureCounts(all.bam, annot.ext="pathofmygff3", isGTFAnnotationFile=TRUE, nthreads=16, GTF.featureType="transcript", GTF.attrType="ID", allowMultiOverlap=TRUE, fraction=TRUE, isPairedEnd=TRUE)
targets <- read.delim("pathofmytargetfile", stringsAsFactors=FALSE)

For DESEQ2

se <- summarizeOverlaps(features=ebg_transcripts, reads=bamfiles, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE, inter.feature=FALSE)

..................

With edgeR, I removed any transcript (I'm working on transcripts, not genes) whose expression level is < 10 On a total of 88792, on edgeR 65157 were kept.

Similarly, with DESEQ2, I removed any transcript whose expression level sum over the samples is < 10 This way, 75231 were kept.

..........

I continued with the pipelines sticking to default, performing normalization etc, till I got to perform DE transcript analysis. At this point, I used glmQLFTest for edgeR and secondly I applied a threshold (log FC > 1.5) with glmTreat. Apparently edgeR filters genes showing FDR < 5%. I'm not sure if a p value threshold is applied at all here.

This is what I obtained (on the left, unfiltered transcripts, the right, only logFC > 1.5 transcripts).

https://i.ibb.co/N7pPf45/DE-GOANA.png

I tried to do the same with DESEQ2, comparing the same condition over samples, by specifying:

dds <- DESeq(dds)
res <- results(dds, contrast=c("Condizione","Carica","Scarica"))
summary(res)

8741 upregulated and 9545 downregulated transcripts were found (far more than the number I obtained starting from edgeR quasi-likelihood)

And then (I removed no transcripts on the basis of FDR, just on log FC). Only transcripts for pvalue < 0.05 are kept apparently.

resLFC15 <- results(dds, lfcThreshold=1.5)
table(resLFC15$padj < 0.05)
summary(resLFC15)

But this time I got no DE transcripts! To obtain a similar number of DE genes (about 100) I had to use a lower treshold (log FC > 0.5)... and always for pvalue < 0.05

resLFC05 <- results(dds, lfcThreshold=0.5)
table(resLFC15$padj < 0.05)
summary(resLFC05)

This way, I got 78 upregulated and 36 downregulated ones.

How can the results obtained by applying functionally equivalent scripts over different pipelines differ so much? Maybe it's because they apply very different statistical methods? Is that expected or I did something wrong? Not to mention the transcripts I obtain differ quite a lot in terms of position and priority in my final list (but this may be due to the different counts, what worries me the most is the number of them with apparently similar criteria...) Which one would be better and why does the threshold I chose have a different effect? Am I missing something? Thanks in advance!

edger deseq2 rnaseq de • 2.2k views

ADD COMMENT • link updated 5.6 years ago by Gordon Smyth 53k • written 5.6 years ago by Raito92 ▴ 60

score 1 · Answer 1 · 2020-04-15

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 7 hours ago

United States

I would strongly recommend using Salmon for quantifying transcripts, rather than summarizeOverlaps, we've even profiled differential transcript expression with Bioconductor DE tools here:

https://f1000research.com/articles/7-952/v3

Note that the live rnaseqGene workflow on Bioconductor uses Salmon and tximeta now:

https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html

I don't really have much comment on DE of transcript counts counted by simple overlaps and not by disambiguation of read assignment via EM, it seems like really non-informative data.

ADD COMMENT • link 5.6 years ago Michael Love 43k

0

Entering edit mode

Hello, thank you for your answer, I'll try using salmon which is apparently the best choice for transcripts and re-run the analysis.

All the workflows I tried quantified genes, not transcripts, in their vignette example, and I just tried (maybe by over-simplyfing the whole thing) to make them work on transcripts instead by changing parameters.

ADD REPLY • link 5.6 years ago Raito92 ▴ 60

score 1 · Answer 2 · 2020-04-15

1

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 6 hours ago

WEHI, Melbourne, Australia

Your analysis seems to have multiple problems, as I have pointed out elsewhere: https://support.bioconductor.org/p/129909/

I agree with Michael Love that Salmon would be an appropriate quantification tool for analysing transcripts, but my advice is different. Given the problematic nature of your data, I strongly recommend that you cut down on the complexity by doing a genewise analysis instead. I very much doubt that the transcripts for your non-model species are sufficiently well annotated to make a transcript-level analysis sensible. Your results would I think be much better if you really did follow the advice in the RNASeqGeneEdgeRQL workflow, which I don't think you are currently doing.

ADD COMMENT • link 5.6 years ago Gordon Smyth 53k

0

Entering edit mode

Hello, thank you for your kind answer!

Forgive me for my somehow 'naive' questions, but I'm not so skilled in the bioinformatics field yet, and this is the first time I perform an analysis on transcripts, after running a sort of simulation on genes instead one year ago... (the one you remembered answering me about).

So maybe as I'm starting to realise I applied a pipeline which would do fine on genes, but not so obviously on transcripts.

I think I'm going to rethink my approach according to the suggestions, starting from the way I quantify reads alignments (salmon).

Your results would I think be much better if you really did follow the advice in the RNASeqGeneEdgeRQL workflow

I actually did, even though that pipeline was intended for genes. What did I do wrong in your opinion? I'm going to answer on the other topic though.

I very much doubt that the transcripts for your non-model species are sufficiently well annotated to make a transcript-level analysis sensible

Kind of, I noticed that even while aligning reads to transcripts... the problem is that my tutors would be more interesting in knowing differences in expression at isoform and then transcript levels... even though I suggested the analysis didn't seem to discriminate well. Do you think we should switch to a gene approach instead or is there any way to increase the sensitivity if we're interested in this?

Above all, could you suggest me an alternative workflow which works on TRANSCRIPTS instead?

Thank you in advance for your tips!

EDIT: I had prepared a detailed answer to my other post (https://support.bioconductor.org/p/129909/) but I have to wait a few hours before the forum would let me post again.