Hello,
I am working on a large breast cancer dataset with 298 samples from 35 patients (myData). I have 3 different Tissue Type (TT1, TT2, TT3) and I saw that there was a batch effect due to the technician that did the experiment. As some of my patients have only 1 sample at the moment I am not using it in my design so when I wanted to compare TT1 with TT2 I used the following code:
dds <- DESeqDataSetFromMatrix(countData = myData,colData=batch[,c("Sample_ID","Tissue_Type","Technician")],design = ~ Technician+Tissue_Type)
dedds<-DESeq(dds)
res1=results(dedds,contrast=c("Tissue_Type","TT1","TT2"))
Now, I am interested in knowing in a specific patient (PAT1) what are the differentially expressed genes between TT1 and TT2 as I know it can be patient specific. PAT1 has 12 samples TT1, 5 samples TT2 and 3 samples TT3. What I did is that I redefined the Tissue Type in the following way:
Tissue_Type=rep("Other",298)
Tissue_Type[intersect(grep("TT1",batch$Tissue_Type),which(batch$Patient_ID=="PAT1"))]="TT1_PAT1"
Tissue_Type[intersect(grep("TT2",batch$Tissue_Type),which(batch$Patient_ID=="PAT1"))]="TT2_PAT1"
batch$Tissue_Type=factor(Tissue_Type)
dds <- DESeqDataSetFromMatrix(countData = myData,colData=batch[,c("Sample_ID","Tissue_Type","Technician")],design = ~ Technician+Tissue_Type)
dedds<-DESeq(dds)
res2=results(dedds,contrast=c("Tissue_Type","TT1_PAT1","TT2_PAT1"))
Is it correct? Or should I split "Other" in another way (like TT1_Other, TT2_Other etc.)? Or should I just take the dataset with only the 17 samples I am interested in? I tried this last solution and it gives me a totally different list of genes, which seems right because the size factor and the correction for the technician would be totally different when using only 17 samples.
Thank you for your help!
sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] sva_3.12.0 genefilter_1.48.1 [3] mgcv_1.8-6 nlme_3.1-120 [5] DESeq2_1.6.3 RcppArmadillo_0.4.650.1.1 [7] Rcpp_0.11.5 GenomicRanges_1.18.4 [9] GenomeInfoDb_1.2.5 IRanges_2.0.1 [11] S4Vectors_0.4.0 BiocGenerics_0.12.1 loaded via a namespace (and not attached): [1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.2 [4] base64enc_0.1-2 BatchJobs_1.6 BBmisc_1.9 [7] Biobase_2.26.0 BiocParallel_1.0.3 brew_1.0-6 [10] checkmate_1.5.2 cluster_2.0.1 codetools_0.2-11 [13] colorspace_1.2-6 DBI_0.3.1 digest_0.6.8 [16] fail_1.2 foreach_1.4.2 foreign_0.8-63 [19] Formula_1.2-1 geneplotter_1.44.0 ggplot2_1.0.1 [22] grid_3.1.1 gtable_0.1.2 Hmisc_3.15-0 [25] iterators_1.0.7 lattice_0.20-31 latticeExtra_0.6-26 [28] locfit_1.5-9.1 MASS_7.3-40 Matrix_1.2-0 [31] munsell_0.4.2 nnet_7.3-9 plyr_1.8.1 [34] proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1 [37] rpart_4.1-9 RSQLite_1.0.0 scales_0.2.4 [40] sendmailR_1.2-1 splines_3.1.1 stringr_0.6.2 [43] survival_2.38-1 tools_3.1.1 XML_3.98-1.1 [46] xtable_1.7-4 XVector_0.6.0
I agree with Ryan.
In addition, you need to update to the latest version of R and Bioconductor. You are using DESeq2 v1.6 which is 1.5 years old and no longer supported. Here are instructions:
http://www.bioconductor.org/install
Thank you for the fast answer it helps me understand how I can have the results for the patient-specific tissue effects.
About my other question with the "remaining" samples, let's say I want to know the patient-specific tissue effects of only 3 patients Pat1, Pat2, Pat3, what do I do with the other samples? Like I said I have 35 patients but some of them have only 1 sample and even if I had more it would take forever to run your model with 35 patients, 3 technician and 3 Tissue Type for 298 samples. How do you do in this case? Do you create a new dataset with only your 3 patients and 45 samples? Or do you just define the Patient factor as: Pat1 (15 samples), Pat2 (15 samples), Pat 3 (15 samples) and Other (remaining 253 samples) (to have only 4 levels) ?
For the latest version of R, unfortunately it's not my decision, I am working on our cluster and they didn't update the R version yet.
It's up to you as the analyst what to do here, you can either subset and perform the analysis on a subset or include all the samples and extract specific terms.
This is a FAQ we discuss in the DESeq2 vignette: "If I have multiple groups, should I run all together or split into pairs of groups?"
It's absolutely worth an email to the cluster maintainers to update the version of R. This is their job and it will take someone just a few hours to do. Bioinformatics moves too fast for you to be working with 1.5 year old software.
Great, thank you so much for the help, I read your vignette 1000 times but for some reason I have skipped this part apparently.