Possible bug in DEseq2 1.6.1 ?
1
0
Entering edit mode
skiaphrene ▴ 10
@skiaphrene-6914
Last seen 10.0 years ago
New Zealand

Dear DESeq development team,


My name is Alex SMITH and I am currently using DESeq2 to analyse mouse RNA-seq data across 2 conditions with 3 samples each.

I recently upgraded my DESeq2 version from ~1.5 to 1.6.1 (R version: 3.1.0, session info below) to access the new functionalities. I was investigating the reason why my differentially expressed gene list had increased from ~530 out of ~10 500 tested (~5%) to ~760 out of ~11 250 (~7%) when I came across something that might be a bug (but which is not, in fact, related to the change in DEG counts).

 

After running the DESeq command as below, I checked the maxCooks column, and it is full of NA's:

dgd <- DESeq(dgd, test="Wald", fitType="parametric", parallel=TRUE)
summary( mcols(dgd)$maxCooks ) # => only NA's


I have temporarily replaced the values as calculated using the code in the vignette:

maxCooks <- apply(assays(dgd)[['cooks']],1,max)
summary( maxCooks ) # => no NA's
mcols(dgd)$maxCooks <- maxCooks

 

In my data, replacing the NA's with these calculated values does not impact the DEG count by much (less than 5 different), but this does indicate that adding these values back in does lead to outlier removal.


I have not been able to find an explanation to this column of NA's. Is this a bug? Or is it due to how I am performing the analysis?


Thanks in advance and best regards,


-- Alex

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats4    grDevices utils     datasets  graphics  stats     methods   base     

other attached packages:
 [1] FactoMineR_1.27           BiocInstaller_1.16.0      DESeq2_1.6.1              RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3              
 [6] GenomicRanges_1.18.1      GenomeInfoDb_1.2.0        IRanges_2.0.0             S4Vectors_0.4.0           BiocGenerics_0.12.0      
[11] ellipse_0.3-8             car_2.0-21               

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.0 BBmisc_1.7           BatchJobs_1.4        Biobase_2.26.0       BiocParallel_1.0.0   DBI_0.3.1            Formula_1.1-2       
 [8] Hmisc_3.14-5         MASS_7.3-31          RColorBrewer_1.0-5   RSQLite_0.11.4       XML_3.98-1.1         XVector_0.6.0        acepack_1.3-3.3     
[15] annotate_1.44.0      base64enc_0.1-2      brew_1.0-6           checkmate_1.5.0      cluster_1.15.2       codetools_0.2-8      colorspace_1.2-4    
[22] digest_0.6.4         fail_1.2             foreach_1.4.2        foreign_0.8-61       genefilter_1.48.1    geneplotter_1.44.0   ggplot2_1.0.0       
[29] grid_3.1.0           gtable_0.1.2         iterators_1.0.7      labeling_0.3         lattice_0.20-29      latticeExtra_0.6-26  leaps_2.9           
[36] locfit_1.5-9.1       munsell_0.4.2        nnet_7.3-8           plyr_1.8.1           proto_0.3-10         reshape2_1.4         rpart_4.1-8         
[43] scales_0.2.4         scatterplot3d_0.3-35 sendmailR_1.2-1      splines_3.1.0        stringr_0.6.2        survival_2.37-7      tools_3.1.0         
[50] xtable_1.7-4        

 

 

 

deseq2 cook NA • 2.4k views
ADD COMMENT
0
Entering edit mode

Can you provide the code used to construct the dds in particular the design and show colData(dds)?

ADD REPLY
1
Entering edit mode

hi Alex,

Thanks for your report. I haven't been able to replicate this using the same code on an example data set with condition A and B and 3 samples each. If you can send me a reproducible example I can try to figure out what's going on. 

By two conditions, do you mean condition A/B, or do you mean condition1 A/B and condition2 A/B?

Note that version 1.5 is a development version (odd second digit) which is the branch to allow developer to experiment and add new features and functionality. You can't really count on something like sum(padj < threshold) -- where p-values are tail probabilities which are highly sensitive to slight changes in the model parameters -- staying constant between a development version and a release version.

ADD REPLY
0
Entering edit mode

Hi Michael,

Thanks for your reply! Sorry for my delay - I've been away during the NZ bank holiday...

My main experimental setup is a treatment vs control one: I am doing the DEA of 3 "Nb"-treated samples vs 3 "PBS" control samples.

I agree that I would not expect constant DEGs counts between versions, though the change was more than I would have expected. However I'm quite surprised that I had a development version to begin with!

I have prepared some slightly-reduced code and some partial data to reproduce my observation. If you wish to run it yourself, you can find a ~300Kb file here: https://www.dropbox.com/s/r5c1v0oaubma2rj/test.tar.gz?dl=0

In essence, my "experimental setup" matrix looks like this:

                                 sampleName                                    fileName Condition ERCCMix    Sex    ExpID  BioSampleID
C42K9ACXX-1257-01-1-1 C42K9ACXX-1257-01-1-1 C42K9ACXX-1257-01-1-1/HTSeq_counts_part.txt        Nb    Mix1   Male SCT14-08  SCT14-08-Nb
C42K9ACXX-1257-02-1-1 C42K9ACXX-1257-02-1-1 C42K9ACXX-1257-02-1-1/HTSeq_counts_part.txt        Nb    Mix2   Male SCT14-10  SCT14-10-Nb
C42K9ACXX-1257-03-1-1 C42K9ACXX-1257-03-1-1 C42K9ACXX-1257-03-1-1/HTSeq_counts_part.txt        Nb    Mix2 Female SCT14-11  SCT14-11-Nb
C42K9ACXX-1257-04-1-1 C42K9ACXX-1257-04-1-1 C42K9ACXX-1257-04-1-1/HTSeq_counts_part.txt       PBS    Mix2 Female SCT14-08 SCT14-08-PBS
C42K9ACXX-1257-05-1-1 C42K9ACXX-1257-05-1-1 C42K9ACXX-1257-05-1-1/HTSeq_counts_part.txt       PBS    Mix1   Male SCT14-10 SCT14-10-PBS
C42K9ACXX-1257-06-1-1 C42K9ACXX-1257-06-1-1 C42K9ACXX-1257-06-1-1/HTSeq_counts_part.txt       PBS    Mix1 Female SCT14-11 SCT14-11-PBS

 

My call to prepare the HTSeq count data for DESeq2 is:

dgd <- DESeqDataSetFromHTSeqCount(
  sampleTable = exp.dat2,
  directory   = ".",
  design      = ~ Sex + ERCCMix + Condition
)

 

I get a warning:

factor levels were dropped which had no samples

 

But this is due to how I've designed my pipeline and I can't see how this would affect anything.

The rest you already know.

 

Thanks again and best regards,

-- Alex

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 2 hours ago
United States

hi Alex,

This is not a bug but, the intended behavior of DESeq2. Because you have other variables in the design other than Condition (Sex and ERCCMix), you do not have 3 replicates of a unique combination of the design variables. The count outlier behavior is explained in the vignette section "Approach to count outliers":

"The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates."

With less than 3 replicates per unique combination, there is no filtering of potential count outliers.

ADD COMMENT
0
Entering edit mode

Thank you Michael for your reply!

I had explored the vignette, and had naively thought that it was "3 replicates" for the target factor of interest, not for all unique experimental factor combinations. In the case of 1 sample per unique combination, can the values in assays(dds)[["cooks"]] still be used in any informative way?

Thanks again and best regards,

-- Alex

ADD REPLY
0
Entering edit mode

No we don't recommend filtering in these situations. If you have 2 replicates in a condition, I don't see how one can say which or if any of these are outliers.

ADD REPLY

Login before adding your answer.

Traffic: 914 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6