To the developers of DESeq2,
First of all, thank you for creating such a great tool!
I am currently trying to analyze 16S rRNA gene amplicon sequencing data of samples which were taken from the oral microbiota of patients at two timepoints: before and after treatment.
My colData is as below:
Group ParticipantID
1 BEFORE Participant_ONE
2 BEFORE Participant_TWO
3 BEFORE Participant_THREE
4 BEFORE Participant_FOUR
5 AFTER Participant_ONE
6 AFTER Participant_TWO
7 AFTER Participant_THREE
My planned analysis is as below
dds <- DESeqDataSetFromMatrix(count, colData, design = ~ ParticipantID + Group)
dds <- DESeq(dds, fitType = "local")
res <- results(dds, contrast = c("Group", "BEFORE", "AFTER"))
I noticed that in a previous post you mentioned that "samples without pair doesn't help you estimate the treatment effect".
However, because the sample size of my study is quite small, I would like to keep the non-paired sample (sample belonging to Participant_FOUR) in the analysis instead of removing them completely.
If you could please give some idea on whether this makes sense or not in the current analysis, that would be great.
P.S. Here is the sessionInfo() if needed:
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/appli/freeware/OpenBLAS/0.3.6.gcc.THREAD/lib/libopenblas_haswellp-r0.3.6.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] biom_0.3.12 DESeq2_1.16.1
[3] SummarizedExperiment_1.6.5 DelayedArray_0.2.7
[5] matrixStats_0.54.0 Biobase_2.36.2
[7] GenomicRanges_1.28.6 GenomeInfoDb_1.12.3
[9] IRanges_2.10.5 S4Vectors_0.14.7
[11] BiocGenerics_0.22.1 optparse_1.6.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.4.0 Formula_1.2-3
[4] assertthat_0.2.1 latticeExtra_0.6-28 blob_1.1.1
[7] GenomeInfoDbData_0.99.0 pillar_1.4.1 RSQLite_2.1.1
[10] backports_1.1.2 lattice_0.20-35 glue_1.3.1
[13] digest_0.6.15 RColorBrewer_1.1-2 XVector_0.16.0
[16] checkmate_1.9.3 colorspace_1.3-2 htmltools_0.3.6
[19] Matrix_1.2-17 plyr_1.8.4 XML_3.98-1.11
[22] pkgconfig_2.0.2 genefilter_1.58.1 zlibbioc_1.22.0
[25] purrr_0.3.2 xtable_1.8-2 scales_1.0.0
[28] getopt_1.20.2 BiocParallel_1.10.1 htmlTable_1.13.1
[31] tibble_2.1.1 annotate_1.54.0 ggplot2_3.1.1
[34] nnet_7.3-12 lazyeval_0.2.1 RJSONIO_1.3-0
[37] survival_2.42-3 magrittr_1.5 crayon_1.3.4
[40] memoise_1.1.0 foreign_0.8-70 tools_3.4.0
[43] data.table_1.12.2 stringr_1.4.0 locfit_1.5-9.1
[46] munsell_0.5.0 cluster_2.0.7-1 AnnotationDbi_1.38.2
[49] compiler_3.4.0 rlang_0.3.4 grid_3.4.0
[52] RCurl_1.95-4.10 rstudioapi_0.10 htmlwidgets_1.3
[55] bitops_1.0-6 base64enc_0.1-3 gtable_0.2.0
[58] DBI_1.0.0 R6_2.4.0 gridExtra_2.3
[61] knitr_1.23 dplyr_0.8.1 bit_1.1-14
[64] Hmisc_4.1-1 stringi_1.4.3 Rcpp_1.0.1
[67] geneplotter_1.54.0 rpart_4.1-13 acepack_1.4.1
[70] tidyselect_0.2.5 xfun_0.7
Dear James,
Thank you for your reply and sorry for my question being unclear.
What I should have asked was "if I have an unpaired sample, would the other paired samples still be tested for the effect of treatment (the second factor, i.e. "Group") and be controlled for the effect of participant (the first factor, i.e. "ParticipantID")?"
Also, if the answer is "NO", would it still be acceptable to analyze my sample with DESeq2 using a single factor design such as "design = ~ Group"?
Many thanks in advance for your patience.
Sincerely, Ken
If you want to include the unpaired sample, why not use duplicateCorrelation as suggested in the link?
Dear Michael,
Thank you for your reply!
Please correct me if I missed something, but duplicateCorrelation() is a function in limma and I need to work with DESeq2 for the time being for various reasons.
If it is not acceptable to keep the unpaired sample in my dataset with "design ~ ParticipantID + Group", would it be appropriate to use "design ~ Group" instead?
Much appreciation for your patience with me in advance.
Sincerely, Ken
In DESeq2 the unpaired sample won’t be used to estimate treatment unless you drop the pairing variable from the design.
Dear Michael,
I now understand the point I was missing! Thanks for making it clear to me the reason why keeping unpaired samples would not make any difference in the current design.
Please allow me ask one last question:
Because my sample size is quite limited, I am thinking of dropping the pairing variable (i.e. Group), so that all of my samples can be tested for the effect of treatment, but would it be wrong to do so?
Many thanks for your kind help.
Sincerely, Ken
I'd say the optimal method would be to use duplicateCorrelation in limma, which figures out the degree to which the samples are correlated according to pair by looking across all genes. You could also diagnose this by eye using plotPCA. If the samples are highly correlated and close in the PCA plot, relative to the difference due to treatment, then you're better off dropping the unpaired sample in DESeq2.