Hi,
I've got a question regarding the Cooks distance calculation in DESeq2.
My dataset consists of 4 groups with 5-6 biological replicates per group. For some exploratory QC plotting I performed the GLM fitting with different design matrices, one with the batch as covariate (dds.bc) and one with only my group of interest (dds.nbc):
# Read summarized experiment
se <- readRDS(se.path)
# Prepare colData
group = factor(se$group)
batch = factor(se$exp)
col.data <- data.frame(row.names=colnames(se), group, batch)
# Create dds with batch correction
dds.bc <- DESeq2::DESeqDataSetFromMatrix(countData = assay(se, "counts"),
colData = col.data,
design = ~batch+group)
# Create dds without batch correction
dds.nbc <- DESeq2::DESeqDataSetFromMatrix(countData = assay(se, "counts"),
colData = col.data,
design = ~group)
# Calculate differential expression
dds.bc <- DESeq(dds.bc, betaPrior = FALSE, fitType = 'local')
dds.nbc <- DESeq(dds.nbc, betaPrior = FALSE, fitType = 'local')
Now for some reason the maxCooks value (rowData(dds)['maxCooks']) is only set for the analysis using only the group in the design formula (dds.nbc). In dds.bc the values are all set to 'NA':
> head(rowData(dds.bc)['maxCooks'])
DataFrame with 6 rows and 1 column
maxCooks
<logical>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
> head(rowData(dds.nbc)['maxCooks'])
DataFrame with 6 rows and 1 column
maxCooks
<numeric>
1 2.5255207003168
2 0.657178470968108
3 1.27111489329578
4 0.144701162255673
5 2.65777960268025
6 13.1324009789259
The Cooks distances are calculated (assay(dds, 'cooks')) but not transferred to the column in the rowData:
> head(assay(dds.nbc, 'cooks'))
a1 a2 a3 a4 a5 a6 b1 b2 b3 b4 b5
0610007P14Rik 0.056467175 0.032335019 1.441970e+00 0.057708134 0.058649497 0.0577954102 0.043284253 0.063377144 0.014263303 0.09582077 1.806441441
0610009L18Rik 0.657178471 0.018659529 3.290204e-02 0.032453661 0.033013610 0.0325055582 0.006252503 0.009041238 0.013313577 0.01367133 0.013652200
0610009O20Rik 0.076308209 0.066229282 1.133886e-02 0.076655314 1.271114893 0.0363556309 0.005891674 0.008305582 0.011777309 0.01205639 0.012041508
0610010B08Rik 0.004603687 0.003851429 2.935343e-02 0.004630839 0.004650915 0.0046327188 0.003165467 0.003751235 0.004327378 0.00436450 0.004362548
0610010F05Rik 0.066643541 0.108785212 5.065448e-05 0.011146832 0.024336550 0.0005165392 0.101602398 2.657779603 0.119600118 0.12444698 0.124415077
0610010K14Rik 0.097618110 0.027128868 3.158805e-01 0.049495746 0.050334776 0.0495735232 0.006252688 0.009041506 0.013313970 0.01367173 0.013652602
b6 c1 c2 c3 c4 c5 c6 d1 d2 d3
0610007P14Rik 0.086250911 7.766767e-03 7.774940e-03 1.349925e-02 1.373434e-02 1.178278e-02 1.309203e-02 0.059908254 0.145012806 2.5255207003
0610009L18Rik 0.012288280 7.766634e-03 7.774807e-03 1.349902e-02 1.373411e-02 1.178258e-02 1.309180e-02 0.020024088 0.048928772 0.5657196744
0610009O20Rik 0.010967809 9.780299e-03 9.789584e-03 1.585565e-02 1.608741e-02 1.412400e-02 7.858235e-02 0.045306524 0.075009889 1.2260605501
0610010B08Rik 0.004213122 5.358793e-03 5.360122e-03 5.938908e-03 1.447012e-01 5.817130e-03 5.912585e-03 0.007867464 0.008202417 0.0009390616
0610010F05Rik 0.088157912 2.292114e-07 2.294538e-07 3.999231e-07 4.069522e-07 3.486694e-07 3.877527e-07 0.046491910 0.079683380 1.2482338185
0610010K14Rik 0.012288643 7.766659e-03 7.774831e-03 1.349906e-02 1.373415e-02 1.178262e-02 1.309184e-02 0.492206223 0.880840891 0.9371058211
d4 d5
0610007P14Rik 0.14581105 0.101974129
0610009L18Rik 0.04922949 0.033502302
0610009O20Rik 0.07519935 0.062800918
0610010B08Rik 0.05747505 0.008097566
0610010F05Rik 0.07991326 0.065384214
0610010K14Rik 13.13240098 0.525006091
Is there a reason for this that I am missing? Is it okay to just manually add the values?
rowData(dds.bc)['maxCooks'] <- apply(assay(dds.bc, 'cooks'), 1, max)
Also, when only using the group in the design formula, the analysis is detecting 20% outliers. Would you object to setting the minReplicatesForReplace to 5 or 6 to deal with this?
Thank you very much,
Joe
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.20.0 SummarizedExperiment_1.10.1 DelayedArray_0.6.5 BiocParallel_1.14.2 matrixStats_0.54.0
[6] Biobase_2.40.0 GenomicRanges_1.32.6 GenomeInfoDb_1.16.0 IRanges_2.14.10 S4Vectors_0.18.3
[11] BiocGenerics_0.26.0
Thank you Michael, I obviously missed that note! And thank you also for the advice regarding the outlier replacement! :)