Hi,
I have posted this on biostars but this is the more appropriate forum to ask this question (apologies if you have seen this already).
I am running an enrichment analysis on 3000 differentially expressed genes (mouse) following DESeq2.
The pwf plot for upregulated genes (log2FoldChange > 0) is similar to the one in the vignette, with long genes being more differentially expressed.
However, the plot for significantly downregulated genes is inverted. High proportion of short genes that are DE and low proportion of long DE genes.
If I plot all DE expressed genes no sensible line can be dawn as the bins cancel each other out (high scatter, poor fit). I have seen quite a few of these graphs when searching for "pwf plot goseq" on google including some on the bioconductor forum - but as yet there is no explanation for this.
Any ideas why this might be?
When I manually bin the deseq2 results by baseMean I do not get a difference in for downregulated (=FALSE) vs upregulated (=TRUE) genes.
Jakub
PS:
If images are not displayed, here are some external links:
Image1: https://goo.gl/photos/M8pmZSeaEZPb9F7J7
FYI (although you've already probably noticed), the last 2 images dind't come through.
Also, just curious: did you provide your own gene lengths, or are you using the ones that goseq fetches/provides for you?
Thanks,
yes, I've noticed that the images disappear on my mobile phone, but I can see them both on mac and windows computers. I will provide additional external links to the images to fix that.
I used the bioconductor package
TxDb.Mmusculus.UCSC.mm10.ensGene
.Did you also use TxDb.Mmusculus.UCSC.mm10.ensGene as the gene models to quantify expression? What pipeline (featureCounts, HTSeq, summarizeOverlaps, something else?) did you use to do that?
I used genecounts, our local featurecounts-like package to count reads and then used DESeq2 to quantify expression. I didn't use the TxDb package for that but used mm10 annotations (ensembl81) with a minor modification (for one human gene; transgenic mouse model).
It would be natural and correct to get the gene lengths from your local "featurecounts" package. Otherwise the exon definitions used to compute the counts won't be the same as that used to compute the gene lengths.
Thanks. To avoid a discussion around the fidelity of the genecounts package, I have rerun the analysis with "featurecounts", and get identical results.