Question

Discrepancy between DESeq2 files produced in Galaxy and R(RStudio)

0

Entering edit mode

peTTU • 0

@97afab4e

Last seen 10 months ago

United States

Recently, I reanalyzed some gene expression data using DESeq2. To my surprise, the resulting volcano plots looked quite different from the previous analysis. During the troubleshooting phase of the analysis, I realized that the data was previously generated using a Galaxy-based version of DESeq2 v 1.34.0 and I used version 1.40.1 in RStudio (see below). Volcano Plots generated using Galaxy and R(RStudio) results tables

Close inspection of the results files showed that the calculation of the log2FoldChange was the major reason for the discrepancy. I am including (see below) as an example, information about gene A. I am also including the raw numbers for the gene, the normalized counts coming from the dds object (RStudio), and the VST normalized counts for both Galaxy and RStudio files.

1) Would you please comment which fold change will be closer to reality?

2) Why am I getting such a discrepancy in the log2FoldChange between the analyses?

Thank you for your help!

Platform    baseMean    Log2FoldChange  lfcSE       stat        pvalue      padj
Galaxy      17.102747   -0.44276    0.15465304  -2.86295    0.004197    0.0227
Rstudio     17.1026     -5.92047    1.49447     -3.961587   7.45E-05    0.000820

Samples             CntRNA-1    CntRNA-2    CntRNA-3    KORNA-1 KORNA-2 KORNA-3
Raw counts          1       78      6       1   0   1
Normalizeddds-RStudio       1.120692    92.55239    7.274768    0.68380 0   0.983963
VST-Counts-RStudio      9.446782    9.920987    9.538052    9.43386 9.3877  9.443066
VST-Counts-Galaxy       7.477439    8.417805    7.661466    7.45136 7.3582  7.469936

DESeq2 • 665 views

ADD COMMENT • link updated 3 months ago by kethselly • 0 • written 10 months ago by peTTU • 0

score 0 · Answer 1 · 2023-06-08

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 11 hours ago

United States

I guess that Galaxy may be using some kind of LFC shrinkage, but without seeing their code I wouldn't know for sure.

Shrinkage generally gives much more accurate LFC. I would recommend using apeglm or ashr methods in lfcShrink.

ADD COMMENT • link 10 months ago Michael Love 41k

0

Entering edit mode

Hello @peTTU- did you ever determine what was causing this discrepancy between using DeSeq2 in R/R Studio and Galaxy? I am seeing a similar result as well.

I used the apeglm method for lfcShrink in R/Rstudio (according to the vignette) and filtered out a large number of rows with low counts. However, based on the output below it looks like Galaxy has filtered out 25% of the rows because of low counts too. Perhaps this is why the volcano plots looks so different? However, this looks like it might have occurred not during the pre-filtering step but in the independent filtering of the results.

I've already used the code suggested below but perhaps I should increase the group size to 6 or 7 since I have a total of 13 samples?

smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds2 <- dds[keep,]
dds2

I also looked into the code that it seems like Galaxy is running. I've copied a portion of that below:

Rscript '/cvmfs/main.galaxyproject.org/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/deseq2/9a616afdbda5/deseq2/deseq2.R' -o '/galaxy-repl/main/files/033/938/dataset_33938743.dat' -p '/galaxy-repl/main/files/033/938/dataset_33938744.dat' -n '/galaxy-repl/main/files/033/938/dataset_33938745.dat'                                -H  -f '[["genotype", ...

the code then continues (at the ...) with multiple file names (I'm guessing the featureCounts inputs).

Also the output looks like:

13 samples with counts over 17753 genes
using disperion fit type: parametric 
creating plots
summary of results
genotype: control_pex41 vs mutant_ex3

out of 15636 with nonzero total read count

adjusted p-value < 0.1
LFC > 0 (up)     : 565, 3.6% 

LFC < 0 (down)   : 570, 3.6% 

outliers [1]     : 167, 1.1% 

low counts [2]   : 3881, 25% 

(mean count < 2)

[1] see 'cooksCutoff' argument of ?results

[2] see 'independentFiltering' argument of ?results

So I'm sort of guessing that perhaps my sample might have a large number of samples where there is <2 counts and this is causing the difference in the volcano plots? Any other ideas would be very helpful. thanks in advance!

ADD REPLY • link 3 months ago kethselly • 0

0

Entering edit mode

I may have figured out a little part of this problem... when I output the data into a .csv I used the code below:

write.csv(as.data.frame(resOrdered), 
          file="condition_treated_results.csv")

instead I probably should have used something like:

write.csv(as.data.frame(resLFC_ordered), 
          file="condition_treated_results.csv")

where resLFC was the updated dds object after applying the LFC shrinkage.

ADD REPLY • link 3 months ago kethselly • 0

0

Entering edit mode

One more comment - I think I have this figured out now (at least as far as the differences in the volcano plot go, which should reflect the log2FC as well, I think).

The volcano plot generated from the data from Galaxy uses shrinkLFC(type = 'normal').

When I saved a .csv file from R/R Studio using what I guess was the original normal distribution to shrink the LFC estimates and then used those LFC estimates to create a volcano plot I got something very similar to what Galaxy gave, so I'm guessing that the Galaxy server probably uses that method.

When I used shrinkLFC(type = 'apeglm') and then output the data into a .csv as I mentioned in my comment immediately above, I got something that looked like so:

enter image description here

So there are fewer (none?) values on the x-axis, compared to the original plot you posted from R/R Studio.

ADD REPLY • link 3 months ago kethselly • 0