Question

Possible bug in DEseq2 1.6.1 ?

0

Entering edit mode

skiaphrene ▴ 10

@skiaphrene-6914

Last seen 9.5 years ago

New Zealand

Dear DESeq development team,

My name is Alex SMITH and I am currently using DESeq2 to analyse mouse RNA-seq data across 2 conditions with 3 samples each.

I recently upgraded my DESeq2 version from ~1.5 to 1.6.1 (R version: 3.1.0, session info below) to access the new functionalities. I was investigating the reason why my differentially expressed gene list had increased from ~530 out of ~10 500 tested (~5%) to ~760 out of ~11 250 (~7%) when I came across something that might be a bug (but which is not, in fact, related to the change in DEG counts).

After running the DESeq command as below, I checked the maxCooks column, and it is full of NA's:

dgd <- DESeq(dgd, test="Wald", fitType="parametric", parallel=TRUE)
summary( mcols(dgd)$maxCooks ) # => only NA's

I have temporarily replaced the values as calculated using the code in the vignette:

maxCooks <- apply(assays(dgd)[['cooks']],1,max)
summary( maxCooks ) # => no NA's
mcols(dgd)$maxCooks <- maxCooks

In my data, replacing the NA's with these calculated values does not impact the DEG count by much (less than 5 different), but this does indicate that adding these values back in does lead to outlier removal.

I have not been able to find an explanation to this column of NA's. Is this a bug? Or is it due to how I am performing the analysis?

Thanks in advance and best regards,

-- Alex

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats4    grDevices utils     datasets  graphics  stats     methods   base     

other attached packages:
 [1] FactoMineR_1.27           BiocInstaller_1.16.0      DESeq2_1.6.1              RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3              
 [6] GenomicRanges_1.18.1      GenomeInfoDb_1.2.0        IRanges_2.0.0             S4Vectors_0.4.0           BiocGenerics_0.12.0      
[11] ellipse_0.3-8             car_2.0-21               

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.0 BBmisc_1.7           BatchJobs_1.4        Biobase_2.26.0       BiocParallel_1.0.0   DBI_0.3.1            Formula_1.1-2       
 [8] Hmisc_3.14-5         MASS_7.3-31          RColorBrewer_1.0-5   RSQLite_0.11.4       XML_3.98-1.1         XVector_0.6.0        acepack_1.3-3.3     
[15] annotate_1.44.0      base64enc_0.1-2      brew_1.0-6           checkmate_1.5.0      cluster_1.15.2       codetools_0.2-8      colorspace_1.2-4    
[22] digest_0.6.4         fail_1.2             foreach_1.4.2        foreign_0.8-61       genefilter_1.48.1    geneplotter_1.44.0   ggplot2_1.0.0       
[29] grid_3.1.0           gtable_0.1.2         iterators_1.0.7      labeling_0.3         lattice_0.20-29      latticeExtra_0.6-26  leaps_2.9           
[36] locfit_1.5-9.1       munsell_0.4.2        nnet_7.3-8           plyr_1.8.1           proto_0.3-10         reshape2_1.4         rpart_4.1-8         
[43] scales_0.2.4         scatterplot3d_0.3-35 sendmailR_1.2-1      splines_3.1.0        stringr_0.6.2        survival_2.37-7      tools_3.1.0         
[50] xtable_1.7-4

deseq2 cook NA • 2.3k views

ADD COMMENT • link updated 9.5 years ago by Michael Love 41k • written 9.5 years ago by skiaphrene ▴ 10

0

Entering edit mode

Can you provide the code used to construct the dds in particular the design and show colData(dds)?

ADD REPLY • link 9.5 years ago Michael Love 41k

1

Entering edit mode

hi Alex,

Thanks for your report. I haven't been able to replicate this using the same code on an example data set with condition A and B and 3 samples each. If you can send me a reproducible example I can try to figure out what's going on.

By two conditions, do you mean condition A/B, or do you mean condition1 A/B and condition2 A/B?

Note that version 1.5 is a development version (odd second digit) which is the branch to allow developer to experiment and add new features and functionality. You can't really count on something like sum(padj < threshold) -- where p-values are tail probabilities which are highly sensitive to slight changes in the model parameters -- staying constant between a development version and a release version.

ADD REPLY • link 9.5 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks for your reply! Sorry for my delay - I've been away during the NZ bank holiday...

My main experimental setup is a treatment vs control one: I am doing the DEA of 3 "Nb"-treated samples vs 3 "PBS" control samples.

I agree that I would not expect constant DEGs counts between versions, though the change was more than I would have expected. However I'm quite surprised that I had a development version to begin with!

I have prepared some slightly-reduced code and some partial data to reproduce my observation. If you wish to run it yourself, you can find a ~300Kb file here: https://www.dropbox.com/s/r5c1v0oaubma2rj/test.tar.gz?dl=0

In essence, my "experimental setup" matrix looks like this:

                                 sampleName                                    fileName Condition ERCCMix    Sex    ExpID  BioSampleID
C42K9ACXX-1257-01-1-1 C42K9ACXX-1257-01-1-1 C42K9ACXX-1257-01-1-1/HTSeq_counts_part.txt        Nb    Mix1   Male SCT14-08  SCT14-08-Nb
C42K9ACXX-1257-02-1-1 C42K9ACXX-1257-02-1-1 C42K9ACXX-1257-02-1-1/HTSeq_counts_part.txt        Nb    Mix2   Male SCT14-10  SCT14-10-Nb
C42K9ACXX-1257-03-1-1 C42K9ACXX-1257-03-1-1 C42K9ACXX-1257-03-1-1/HTSeq_counts_part.txt        Nb    Mix2 Female SCT14-11  SCT14-11-Nb
C42K9ACXX-1257-04-1-1 C42K9ACXX-1257-04-1-1 C42K9ACXX-1257-04-1-1/HTSeq_counts_part.txt       PBS    Mix2 Female SCT14-08 SCT14-08-PBS
C42K9ACXX-1257-05-1-1 C42K9ACXX-1257-05-1-1 C42K9ACXX-1257-05-1-1/HTSeq_counts_part.txt       PBS    Mix1   Male SCT14-10 SCT14-10-PBS
C42K9ACXX-1257-06-1-1 C42K9ACXX-1257-06-1-1 C42K9ACXX-1257-06-1-1/HTSeq_counts_part.txt       PBS    Mix1 Female SCT14-11 SCT14-11-PBS

My call to prepare the HTSeq count data for DESeq2 is:

dgd <- DESeqDataSetFromHTSeqCount(
  sampleTable = exp.dat2,
  directory   = ".",
  design      = ~ Sex + ERCCMix + Condition
)

I get a warning:

factor levels were dropped which had no samples

But this is due to how I've designed my pipeline and I can't see how this would affect anything.

The rest you already know.

Thanks again and best regards,

-- Alex

ADD REPLY • link 9.5 years ago skiaphrene ▴ 10

score 2 · Answer 1 · 2014-10-27

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

hi Alex,

This is not a bug but, the intended behavior of DESeq2. Because you have other variables in the design other than Condition (Sex and ERCCMix), you do not have 3 replicates of a unique combination of the design variables. The count outlier behavior is explained in the vignette section "Approach to count outliers":

"The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates."

With less than 3 replicates per unique combination, there is no filtering of potential count outliers.

ADD COMMENT • link 9.5 years ago Michael Love 41k

0

Entering edit mode

Thank you Michael for your reply!

I had explored the vignette, and had naively thought that it was "3 replicates" for the target factor of interest, not for all unique experimental factor combinations. In the case of 1 sample per unique combination, can the values in assays(dds)[["cooks"]] still be used in any informative way?

Thanks again and best regards,

-- Alex

ADD REPLY • link 9.5 years ago skiaphrene ▴ 10

0

Entering edit mode

No we don't recommend filtering in these situations. If you have 2 replicates in a condition, I don't see how one can say which or if any of these are outliers.

ADD REPLY • link 9.5 years ago Michael Love 41k