I am working on a large breast cancer dataset with 298 samples from 35 patients (myData). There are three tissue types (TT1, TT2, TT3), and I saw a batch effect due to the technician who ran the experiment. Because some of my patients have only one sample, I am not including the patient in my design for now. To compare TT1 with TT2, I used the following code:
dds <- DESeqDataSetFromMatrix(countData = myData,
                              colData = batch[, c("Sample_ID", "Tissue_Type", "Technician")],
                              design = ~ Technician + Tissue_Type)
dedds <- DESeq(dds)
res1 <- results(dedds, contrast = c("Tissue_Type", "TT1", "TT2"))
Now I would like to know, for one specific patient (PAT1), which genes are differentially expressed between TT1 and TT2, since I know this can be patient-specific. PAT1 has 12 TT1 samples, 5 TT2 samples and 3 TT3 samples. I redefined the tissue type as follows:
Tissue_Type <- rep("Other", 298)
Tissue_Type[intersect(grep("TT1", batch$Tissue_Type), which(batch$Patient_ID == "PAT1"))] <- "TT1_PAT1"
Tissue_Type[intersect(grep("TT2", batch$Tissue_Type), which(batch$Patient_ID == "PAT1"))] <- "TT2_PAT1"
batch$Tissue_Type <- factor(Tissue_Type)
dds <- DESeqDataSetFromMatrix(countData = myData,
                              colData = batch[, c("Sample_ID", "Tissue_Type", "Technician")],
                              design = ~ Technician + Tissue_Type)
dedds <- DESeq(dds)
res2 <- results(dedds, contrast = c("Tissue_Type", "TT1_PAT1", "TT2_PAT1"))
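To make the relabelling easier to check, here is a minimal base R sketch of the same grouping on a toy sample sheet. The data frame `batch_toy`, its values and the column name `Group` are made up for illustration; only the column names `Sample_ID`, `Patient_ID` and `Tissue_Type` come from my real `batch` table. It also shows that PAT1's TT3 samples end up in "Other" along with every other patient.

```r
# Toy sample sheet standing in for `batch` (hypothetical values, not the real data)
batch_toy <- data.frame(
  Sample_ID   = paste0("S", 1:8),
  Patient_ID  = c("PAT1", "PAT1", "PAT1", "PAT1", "PAT2", "PAT2", "PAT3", "PAT3"),
  Tissue_Type = c("TT1", "TT1", "TT2", "TT3", "TT1", "TT2", "TT1", "TT3"),
  stringsAsFactors = FALSE
)

# Samples from PAT1 that are TT1 or TT2 get their own label; everything else
# (other patients, and PAT1's TT3 samples) is pooled into "Other"
is_pat1 <- batch_toy$Patient_ID == "PAT1"
group <- ifelse(is_pat1 & batch_toy$Tissue_Type %in% c("TT1", "TT2"),
                paste0(batch_toy$Tissue_Type, "_PAT1"),
                "Other")
batch_toy$Group <- factor(group)

# Tabulate the new labels against patient to verify the assignment
print(table(batch_toy$Patient_ID, batch_toy$Group))
```

Keeping the new labels in a separate column (here `Group`) instead of overwriting `Tissue_Type` leaves the original annotation available for later comparisons.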
Is this correct? Or should I split "Other" differently (for example TT1_Other, TT2_Other, etc.)? Or should I restrict the dataset to the 17 samples I am interested in? I tried this last option and it gives a completely different list of genes, which I would expect because the size factors and the technician correction would be quite different when using only 17 samples.
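For concreteness, the two alternatives I am asking about could be set up roughly as follows. This is only a sketch on toy data: `batch_toy` and `counts_toy` are hypothetical stand-ins for `batch` and `myData`, and I am not claiming either setup is the recommended one.

```r
# Toy sample sheet and count matrix (hypothetical stand-ins for `batch` and `myData`)
batch_toy <- data.frame(
  Sample_ID   = paste0("S", 1:8),
  Patient_ID  = c("PAT1", "PAT1", "PAT1", "PAT1", "PAT2", "PAT2", "PAT3", "PAT3"),
  Tissue_Type = c("TT1", "TT1", "TT2", "TT3", "TT1", "TT2", "TT1", "TT3"),
  stringsAsFactors = FALSE
)
counts_toy <- matrix(seq_len(3 * 8), nrow = 3,
                     dimnames = list(paste0("gene", 1:3), batch_toy$Sample_ID))

is_pat1 <- batch_toy$Patient_ID == "PAT1"

# Alternative 1: keep all samples, but split "Other" by tissue type
# (levels such as TT1_Other, TT1_PAT1, TT2_Other, ...)
group_split <- factor(paste(batch_toy$Tissue_Type,
                            ifelse(is_pat1, "PAT1", "Other"),
                            sep = "_"))

# Alternative 2: keep only PAT1's TT1 and TT2 samples, subsetting the
# sample sheet and the count matrix with the same logical index
keep       <- is_pat1 & batch_toy$Tissue_Type %in% c("TT1", "TT2")
batch_sub  <- batch_toy[keep, ]
counts_sub <- counts_toy[, keep, drop = FALSE]

print(table(group_split))
print(colnames(counts_sub))
```

In the subset version, whether the technician term can still be estimated depends on how the 17 samples are distributed across technicians.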
Thank you for your help!
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
sva_3.12.0 genefilter_1.48.1 mgcv_1.8-6 nlme_3.1-120
DESeq2_1.6.3 RcppArmadillo_0.4.650.1.1 Rcpp_0.11.5 GenomicRanges_1.18.4
GenomeInfoDb_1.2.5 IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1

loaded via a namespace (and not attached):
acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.2
base64enc_0.1-2 BatchJobs_1.6 BBmisc_1.9
Biobase_2.26.0 BiocParallel_1.0.3 brew_1.0-6
checkmate_1.5.2 cluster_2.0.1 codetools_0.2-11
colorspace_1.2-6 DBI_0.3.1 digest_0.6.8
fail_1.2 foreach_1.4.2 foreign_0.8-63
Formula_1.2-1 geneplotter_1.44.0 ggplot2_1.0.1
grid_3.1.1 gtable_0.1.2 Hmisc_3.15-0
iterators_1.0.7 lattice_0.20-31 latticeExtra_0.6-26
locfit_1.5-9.1 MASS_7.3-40 Matrix_1.2-0
munsell_0.4.2 nnet_7.3-9 plyr_1.8.1
proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1
rpart_4.1-9 RSQLite_1.0.0 scales_0.2.4
sendmailR_1.2-1 splines_3.1.1 stringr_0.6.2
survival_2.38-1 tools_3.1.1 XML_3.98-1.1
xtable_1.7-4 XVector_0.6.0