Wrong results of DESeq compared to DESeq2 for some test cases?
2
0
Entering edit mode
@megapode32559-23129
Last seen 4.8 years ago

https://filebin.net/o4opqajy9jhrofg8

  • count_data.tsv
  • meta_data.tsv

I have the above test files. The DEGs are obviously those *_de genes.

count_file='count_data.tsv'
sample_file='meta_data.tsv' 
condA='A'
condB='B'
qthr=0.5

sample_table=read.table(sample_file, header = T, sep = '\t', quote = '', check.names = F, comment.char = '', stringsAsFactors = F)
count_table=read.table(count_file, header = T, sep = '\t', quote = '', check.names = F, comment.char = '', stringsAsFactors = F)[, sample_table[, 1], drop=F]

suppressPackageStartupMessages(library(DESeq))
scaled_count_table=counts(
  estimateSizeFactors(
    newCountDataSet(count_table, sample_table[, 2])
    )
  , normalized=T
  ) 
rs=rowSums(scaled_count_table)
sub_count_table=count_table[rs > quantile(rs, qthr), , drop=F]

cds=estimateDispersions(
  estimateSizeFactors(
    newCountDataSet(sub_count_table, sample_table[, 2])
    )
  , method='per-condition'
  , fitType='local'
  ) 

res=nbinomTest(cds, condA, condB)
sort(res, ~ padj, decreasing = F)

But when I try the above code, I don't get those *_de genes as the DEG. Is there anything wrong with DESeq?

R> head(sort(res, ~ padj, decreasing = F))
           id baseMean baseMeanA baseMeanB foldChange log2FoldChange
64  gene00072 21.04255  30.18614 11.898966  0.3941864     -1.3430499
68  gene00076 16.58478  24.51167  8.657886  0.3532148     -1.5013824
73  gene00085 16.37519  21.81525 10.935135  0.5012611     -0.9963659
75  gene00087 20.99714  29.26174 12.732543  0.4351259     -1.2004952
117 gene00180 16.87296  23.34788 10.398049  0.4453530     -1.1669788
131 gene00207 16.29031  22.26660 10.314017  0.4632058     -1.1102748
           pval      padj
64  0.002871331 0.6846296
68  0.006280698 0.6846296
73  0.038339258 0.6846296
75  0.008285847 0.6846296
117 0.014302842 0.6846296
131 0.025645128 0.6846296

The following DESeq2 code can call the *_de genes as DEG correctly, however.

count_file='/tmp/count_data.tsv'
sample_file='/tmp/meta_data.tsv'
condA='A'
condB='B'
qthr=0.5

sample_table=read.table(sample_file, header = T, sep = '\t', quote = '', check.names = F, comment.char = '', stringsAsFactors = F)
count_table=read.table(count_file, header = T, sep = '\t', quote = '', check.names = F, comment.char = '', stringsAsFactors = F)[, sample_table[, 1], drop=F]

design=as.formula(sprintf('~%s', names(sample_table)[[2]]))

suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
scaled_count_table=counts(
    estimateSizeFactors(
        suppressWarnings(
            DESeqDataSetFromMatrix(
                countData=count_table
                , colData=sample_table[, 2, drop=F]
                , design=design
                )
            )
        )
    , normalized=T
    )
rs=rowSums(scaled_count_table)
sub_count_table=count_table[rs > quantile(rs, qthr), , drop=F]

cds=estimateSizeFactors(
    suppressWarnings(
        DESeqDataSetFromMatrix(
            countData=sub_count_table
            , colData=sample_table
            , design=design
            )
        )
    )

res=data.frame(
    results(
        DESeq(cds, quiet = T)
        , contrast=c(names(sample_table)[[2]], condA, condB)
        )
    )
res = sort(res, ~ padj, decreasing = F)
R> head(res)
             baseMean log2FoldChange     lfcSE      stat       pvalue
gene00016_de 260.5795      -5.568279 0.3368658 -16.52966 2.243893e-61
gene00049_de 249.7879      -5.049353 0.3150445 -16.02743 8.221806e-58
gene00040_de 245.6060      -5.027730 0.3179104 -15.81493 2.454812e-56
gene00041_de 273.9157      -5.212352 0.3482061 -14.96915 1.167964e-50
gene00007_de 276.0693      -4.827268 0.3257409 -14.81935 1.098469e-49
gene00046_de 244.2483      -4.657491 0.3144487 -14.81161 1.232599e-49
                     padj
gene00016_de 1.121946e-58
gene00049_de 2.055451e-55
gene00040_de 4.091354e-54
gene00041_de 1.459956e-48
gene00007_de 1.027166e-47
gene00046_de 1.027166e-47
deseq2 DESeq • 1.6k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 2 days ago
United States

Hi,

I have to save my extra time for specific software usage questions regarding DESeq2 functions. I’d suggest to look over your own code and make sure you’ve not mixed up labels or rows somehow (but finding that bug is something I unfortunately don’t have time to do). I promise that DESeq2 has been extensively tested both internally and by independent groups.

ADD COMMENT
0
Entering edit mode

Is DESeq deprecated? If it is not supported anymore, why not add a message to the package info so that people will not use it?

Regarding the DESeq code that I used, can you take a look whether is anything wrong with it? Thanks.

ADD REPLY
0
Entering edit mode

No, I actually don’t have time for that.

ADD REPLY
0
Entering edit mode

Yes, DESeq is depreacated since at least five years, Maybe the message that we put there five years ago ("Welcome to 'DESeq'. For improved performance, usability and functionality, please consider migrating to 'DESeq2'.") is too mild.

But it is hard to understand why people still use DESeq (first released in 2010 and last updated in 2014), when DESeq2 has now been out for 6 years. Bioinformatics is a fast moving field, and 6 years is an eternity. You can't keep using 10-year old software, because the manpower to maintain superseded tools is simply not there.

ADD REPLY
0
Entering edit mode
Simon Anders ★ 3.8k
@simon-anders-3855
Last seen 4.4 years ago
Zentrum für Molekularbiologie, Universi…

Yes, of course, there is something wrong with DESeq: It is over 10 years old, and was one of the very first attempts to offer inference for RNA-Seq.

With hindsight, its method to estimate dispersion was rather simplistic, compared to the current state of the art. It is not robust enough to offer reliable results in all cases, and this is why we decided, 7 years ago, to collect the experiences people using DESeq on many very diverse data sets and devise an improved and especially more robust approach, which we released as DESeq2.

So, everybody, please stop using old software that has been replaced by more reliable versions years ago. This is a widespread issue: just look at how many new citations our 10-year DESeq-1 paper still gets, how often Bowtie 1 and TopHat 1 are still used, even though they have been superseded first by Bowtie-2 and then Hisat-1 and Hisat-2, etc. It is part of proper bioinformatics research to perform regular literature searches to make sure you are up to the state of the art in the field.

ADD COMMENT
0
Entering edit mode

In that sense, the best way probably is to put DESeq in a deprecated state so that it will not be installed easily with bioconductor. Users have to install it from an archive repo. In this way, it is most clear that it should not be used. According to its current homepage, I cannot see its usage is discouraged. I also don't see when it was last maintained. It looks like it is still maintained.

https://bioconductor.org/packages/release/bioc/html/DESeq.html

But regarding this specific data set, can you be sure it is due to the dispersion estimation problem was done incorrectly in DEseq?

ADD REPLY
0
Entering edit mode

Hi guys,

Just to clarify, DESeq is not formally deprecated. We have a well documented procedure for deprecating Bioconductor packages. See: https://bioconductor.org/developers/package-end-of-life/ We could start the procedure for DESeq if its maintainers are ok with that.

Cheers,

H.

ADD REPLY
0
Entering edit mode

If the maintainer clearly states that the package is not maintained anymore, I think that it should be formally deprecated in Bioconductor to stop wasting users time.

ADD REPLY
0
Entering edit mode

If the maintainer clearly states that the package is not maintained anymore, I think that it should be formally deprecated in Bioconductor to stop wasting users time.

ADD REPLY

Login before adding your answer.

Traffic: 542 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6