Hello,
I'm trying to analyse an RNA-Seq dataset using DESeq2 that has four experimental factors: disease, treatment, donor and sequencing run.
This is my colData (C is control):
treatment |
disease |
donor |
run |
R |
Disease |
p1 |
three |
C |
Disease |
p1 |
three |
C |
Disease |
p2 |
one |
R |
Disease |
p2 |
one |
C |
Disease |
p3 |
one |
R |
Disease |
p3 |
one |
C |
Disease |
p4 |
one |
R |
Disease |
p4 |
one |
C |
Disease |
p5 |
one |
R |
Disease |
p5 |
one |
C |
Disease |
p6 |
two |
R |
Disease |
p6 |
two |
C |
Disease |
p7 |
two |
R |
Disease |
p7 |
two |
R |
Disease |
p8 |
two |
C |
Disease |
p8 |
two |
R |
Disease |
p9 |
two |
C |
Disease |
p9 |
two |
R |
Disease |
p10 |
three |
C |
Disease |
p10 |
three |
R |
Disease |
p11 |
three |
C |
Disease |
p11 |
three |
R |
Disease |
p12 |
three |
C |
Disease |
p12 |
three |
R |
C |
p1 |
three |
C |
C |
p1 |
three |
R |
C |
p2 |
three |
C |
C |
p2 |
three |
R |
C |
p3 |
three |
C |
C |
p3 |
three |
R |
C |
p4 |
three |
C |
C |
p4 |
three |
R |
C |
p5 |
three |
C |
C |
p5 |
three |
R |
C |
p6 |
three |
C |
C |
p6 |
three |
C |
C |
p7 |
one |
R |
C |
p7 |
one |
C |
C |
p8 |
two |
R |
C |
p8 |
two |
C |
C |
p9 |
two |
R |
C |
p9 |
two |
C |
C |
p10 |
two |
R |
C |
p10 |
two |
I have been having problems trying to find the correct design matrix for this dataset. So far I have been using
design = ~ disease + treatment + donor + disease:treatment
Adding the sequencing run factor to the formula gives me very few DE genes. When I run DESeq2 using the above formula, I get the following
resultsNames(dds) [1] "Intercept" "diseaseDisease" "diseaseC" [4] "treatmentC" "treatmentR" "donorp1" [7] "donorp10" "donorp11" "donorp12" [10] "donorp2" "donorp3" "donorp4" [13] "donorp5" "donorp6" "donorp7" [16] "donorp8" "donorp9" "diseaseDisease.treatmentC" [19] "diseaseC.treatmentC" "diseaseDisease.treatmentR" "diseaseC.treatmentR"
I am interested in genes that are DE between individuals with diseaseDisease that have received treatmentR vs diseaseC individuals that have received treatmentR. Is the correct contrast formula
resdds <- results(dds, contrast = list(c( "diseaseDisease", "diseaseDisease.treatmentR"), c("diseaseC","diseaseC.treatmentR")))
or
resdds <- results(dds, contrast = list(c( "diseaseDisease.treatmentR", "diseaseC.treatmentR")))
In addition, is the design formula I'm using correct for my dataset and the question I want to answer? Is there a way to integrate sequencing run into the design formula without reducing DE genes? (I've looked at a PCA plot of the samples and there does seem to be variation between sequencing runs.)
Thanks!
Output of sessionInfo()
R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocParallel_1.2.22 DESeq2_1.8.2 [3] RcppArmadillo_0.6.100.0.0 Rcpp_0.12.1 [5] GenomicRanges_1.20.8 GenomeInfoDb_1.4.3 [7] IRanges_2.2.9 S4Vectors_0.6.6 [9] BiocGenerics_0.14.0 edgeR_3.10.5 [11] limma_3.24.15 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 [4] XVector_0.8.0 futile.options_1.0.0 tools_3.2.2 [7] rpart_4.1-10 digest_0.6.8 RSQLite_1.0.0 [10] annotate_1.46.1 gtable_0.1.2 lattice_0.20-33 [13] DBI_0.3.1 proto_0.3-10 gridExtra_2.0.0 [16] genefilter_1.50.0 stringr_1.0.0 cluster_2.0.3 [19] locfit_1.5-9.1 nnet_7.3-11 grid_3.2.2 [22] Biobase_2.28.0 AnnotationDbi_1.30.1 XML_3.98-1.3 [25] survival_2.38-3 foreign_0.8-66 latticeExtra_0.6-26 [28] Formula_1.2-1 geneplotter_1.46.0 ggplot2_1.0.1 [31] reshape2_1.4.1 lambda.r_1.1.7 magrittr_1.5 [34] scales_0.3.0 Hmisc_3.17-0 MASS_7.3-44 [37] splines_3.2.2 xtable_1.7-4 colorspace_1.2-6 [40] stringi_0.5-5 acepack_1.3-3.3 munsell_0.4.2