Hello,
A colleague and I are performing differential expression analysis on a dataset comprising two groups, controls and cases. We wanted to test and correct for batch effects, so we decided to use SVA, and we were both obtaining the initial files from a shared folder.
We were, however, getting different results. We first thought it might be due to small differences in our scripts, but even when running the exact same script the difference persisted, both in the post-SVA correction and in the DESeq2 results (prior to SVA).
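For context, here is a minimal sketch of the kind of workflow we are both running (object names like counts_mat and sample_info, and the choice of n.sv = 2, are illustrative, not our exact script):

library(DESeq2)
library(sva)

# Standard DESeq2 run (condition = control vs case)
dds <- DESeqDataSetFromMatrix(countData = counts_mat,
                              colData   = sample_info,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds)

# Estimate surrogate variables on the normalised counts
dat  <- counts(dds, normalized = TRUE)
dat  <- dat[rowMeans(dat) > 1, ]
mod  <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# Add the surrogate variables to the design and re-run
ddssva <- dds
ddssva$SV1 <- svseq$sv[, 1]
ddssva$SV2 <- svseq$sv[, 2]
design(ddssva) <- ~ SV1 + SV2 + condition
ddssva <- DESeq(ddssva)
ressva <- results(ddssva)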
For comparison, my results:
With DESeq2 package version: 1.16.1 and R version 3.4.0 (2017-04-21):
> summary(res)
out of 47541 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 41, 0.086%
LFC < 0 (down) : 9, 0.019%
outliers [1] : 168, 0.35%
low counts [2] : 29741, 63%
(mean count < 11)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
And post SVA:
> summary(ressva)
out of 47541 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 6, 0.013%
LFC < 0 (down) : 4, 0.0084%
outliers [1] : 0, 0%
low counts [2] : 5378, 11%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
And her results:
With DESeq2 package version: 1.10.1 and R version 3.2.4 (2016-03-10):
> summary(res)
out of 47541 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 26, 0.055%
LFC < 0 (down) : 3, 0.0063%
outliers [1] : 287, 0.6%
low counts [2] : 30152, 63%
(mean count < 12.5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
And post SVA:
> summary(ressva)
out of 47541 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 34, 0.072%
LFC < 0 (down) : 5, 0.011%
outliers [1] : 0, 0%
low counts [2] : 30270, 64%
(mean count < 12.5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
If we print the dds object we also see some differences:
Mine:
> dds
class: DESeqDataSet
dim: 65988 12
metadata(1): version
assays(3): counts mu cooks
rownames(65988): ENSG00000000003 ENSG00000000005 ... ENSG00000282821 ENSG00000282822
rowData names(21): baseMean baseVar ... deviance maxCooks
colnames(12): c1 c2 ... m5 m6
colData names(2): condition sizeFactor
Hers:
> dds
class: DESeqDataSet
dim: 65988 12
exptData(0):
assays(3): counts mu cooks
rownames(65988): ENSG00000000003 ENSG00000000005 ... ENSG00000282821 ENSG00000282822
rowRanges metadata column names(27): baseMean baseVar ... deviance maxCooks
colnames(12): c1 c2 ... m5 m6
colData names(2): condition sizeFactor
After some troubleshooting to make sure the files were synced and everything was working properly, we decided to check whether it was due to the version difference, and as soon as she upgraded her DESeq2 package and R versions we got the same results.
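For reference, the version check itself was just the usual calls (nothing specific to our data):

R.version.string
packageVersion("DESeq2")
packageVersion("sva")
# or the full record, useful for pasting into a report:
sessionInfo()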
Another small thing: we checked the vectors we get from SVA for the correction and they are the same, independent of the difference in the DESeq2 results (which I find curious, since the various metrics (log2FC, post-normalisation counts, adjusted p-values) do differ).
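In case it matters, that check was essentially a direct comparison of the sv matrices from our two runs (svseq_mine and svseq_hers are hypothetical names for the two saved objects):

# surrogate variables from each run
all.equal(svseq_mine$sv, svseq_hers$sv)
# or the largest element-wise difference
max(abs(svseq_mine$sv - svseq_hers$sv))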
So, I think my question for you is whether this big a difference is "normal" given these version differences, and perhaps some insight into why we would see them.
Thank you for your time,
Cheers,
David M. F. Francisco
Thanks for the answer!
Yeah, that is why I said "normal".
What intrigues me is the difference in the post-SVA correction, since the surrogate-variable vectors we obtain are the same, yet the results diverge quite a lot across all metrics.
Anyway,
Thanks again for the answer.