Question

DESeq2: same count matrix, different design

0

Entering edit mode

da.de ▴ 30

@dade-7723

Last seen 17 months ago

Austria

Dear all,
I am a beginner and have the following problem. I have a dds object and run DESeq two times using exactly the same count matrix but different design.

(1) why is the baseMean of dds1 not the same as the baseMean of dds2? I thought the baseMean is only normalized by sizeFactor.

(2) I also had a look at the cooks distance. But I am not sure how to interpret the results. The samples in the 2nd and 4th plot that show higher cooks distance are always those of groupJ. Why? And does this influence my results?

https://drive.google.com/open?id=1t73Qx48g_b3mJiaXAd16IOyEp9RO9WcX

How should I proceed with those comparisons? Thanks for your help!

table(colData(dds)$Group)
# groupA     4
# groupB    22 
# groupC     7
# groupD    30
# groupE     3    
# groupF     5
# groupG     3
# groupH     9   
# groupI     5
# groupJ    16  
# groupK     7 
# groupL     4
table(colData(dds)$GroupJvsall)
# other      99
# groupJ    16

design(dds) <- ~Group
dds1 <- DESeq(dds,parallel=TRUE, betaPrior=TRUE)
# estimating size factors
# estimating dispersions
# gene-wise dispersion estimates: 16 workers
# mean-dispersion relationship
# final dispersion estimates, MLE betas: 16 workers
# fitting model and testing: 16 workers
# -- replacing outliers and refitting for 283 genes
# -- DESeq argument 'minReplicatesForReplace' = 7
# -- original counts are preserved in counts(dds)
# estimating dispersions
# fitting model and testing

design(dds) <- ~ GroupJvsall
dds2 <- DESeq(dds,parallel=TRUE, betaPrior=TRUE)
# estimating size factors
# estimating dispersions
# gene-wise dispersion estimates: 16 workers
# mean-dispersion relationship
# final dispersion estimates, MLE betas: 16 workers
# fitting model and testing: 16 workers
# -- replacing outliers and refitting for 3209 genes
# -- DESeq argument 'minReplicatesForReplace' = 7
# -- original counts are preserved in counts(dds)
# estimating dispersions
# fitting model and testing

stopifnot(identical(rownames(dds1), rownames(dds2)))
stopifnot(identical(colnames(dds1), colnames(dds2)))
stopifnot(identical(rowData(dds1)$baseMean, rowData(dds2)$baseMean))
# Error: identical(rowData(dds1)$baseMean, rowData(dds2)$baseMean) is not TRUE

design(dds) <- ~Group
dds1no <- DESeq(dds,parallel=TRUE, betaPrior=TRUE, minReplicatesForReplace=Inf)
design(dds) <- ~ GroupJvsall
dds2no <- DESeq(dds,parallel=TRUE, betaPrior=TRUE, minReplicatesForReplace=Inf)

stopifnot(identical(rowData(dds1no)$baseMean, rowData(dds1)$baseMean))
#Error: identical(rowData(dds1no)$baseMean, rowData(dds1)$baseMean) is not TRUE
stopifnot(identical(rowData(dds1no)$baseMean, rowData(dds2no)$baseMean))

pdf(file="boxplot.pdf", height=10, width=15,onefile=TRUE)
par(mfrow=c(1,4))
boxplot(log10(assays(dds1)[["cooks"]]), range=0, las=2, main="dds1_Group")
boxplot(log10(assays(dds2)[["cooks"]]), range=0, las=2, main="dds2_GroupJvsall")
boxplot(log10(assays(dds1no)[["cooks"]]), range=0, las=2, main="dds1no_minReplicatesForReplace=Inf")
boxplot(log10(assays(dds2no)[["cooks"]]), range=0, las=2, main="dds1no_minReplicatesForReplace=Inf")
graphics.off()

deseq2 baseMean cooks distance • 1.4k views

ADD COMMENT • link updated 6.5 years ago by Michael Love 42k • written 6.5 years ago by da.de ▴ 30

score 0 · Answer 1 · 2018-02-02

The baseMean has to change when you replace outliers. If you have a gene with counts of ~10 and then one count of 100,000, when this count is labelled as an outlier, the baseMean needs to be update to give the gene a more reasonable mean, and therefore a more reasonable dispersion prior.

It looks like you have many genes with outliers when you use the second design, most likely because the "all" group is not compatible with a single mean value, and NB counts dispersed around that mean. I'd recommend actually turning off outlier replacement, minReplicatesForReplace=Inf, and instead inspecting the top genes by eye to make sure they are not driven by individual counts.