Question: DESeq2 how to define my conditions when I want the results comparing a subset of samples
gravatar for ldetorrente
3.4 years ago by
ldetorrente10 wrote:



I am working on a large breast cancer dataset with 298 samples from 35 patients (myData). I have 3 different Tissue Type (TT1, TT2, TT3) and I saw that there was a batch effect due to the technician that did the experiment. As some of my patients have only 1 sample at the moment I am not using it in my design so when I wanted to compare TT1 with TT2 I used the following code:

dds <- DESeqDataSetFromMatrix(countData = myData,colData=batch[,c("Sample_ID","Tissue_Type","Technician")],design = ~ Technician+Tissue_Type)

Now, I am interested in knowing in a specific patient (PAT1) what are the differentially expressed genes between TT1 and TT2 as I know it can be patient specific. PAT1 has 12 samples TT1, 5 samples TT2 and 3 samples TT3. What I did is that I redefined the Tissue Type in the following way:

dds <- DESeqDataSetFromMatrix(countData = myData,colData=batch[,c("Sample_ID","Tissue_Type","Technician")],design = ~ Technician+Tissue_Type)

Is it correct? Or should I split "Other" in another way (like TT1_Other, TT2_Other etc.)? Or should I just take the dataset with only the 17 samples I am interested in? I tried this last solution and it gives me a totally different list of genes, which seems right because the size factor and the correction for the technician would be totally different when using only 17 samples.

Thank you for your help!

R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] sva_3.12.0                genefilter_1.48.1
 [3] mgcv_1.8-6                nlme_3.1-120
 [5] DESeq2_1.6.3              RcppArmadillo_0.4.650.1.1
 [7] Rcpp_0.11.5               GenomicRanges_1.18.4
 [9] GenomeInfoDb_1.2.5        IRanges_2.0.1
[11] S4Vectors_0.4.0           BiocGenerics_0.12.1

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.2
 [4] base64enc_0.1-2      BatchJobs_1.6        BBmisc_1.9
 [7] Biobase_2.26.0       BiocParallel_1.0.3   brew_1.0-6
[10] checkmate_1.5.2      cluster_2.0.1        codetools_0.2-11
[13] colorspace_1.2-6     DBI_0.3.1            digest_0.6.8
[16] fail_1.2             foreach_1.4.2        foreign_0.8-63
[19] Formula_1.2-1        geneplotter_1.44.0   ggplot2_1.0.1
[22] grid_3.1.1           gtable_0.1.2         Hmisc_3.15-0
[25] iterators_1.0.7      lattice_0.20-31      latticeExtra_0.6-26
[28] locfit_1.5-9.1       MASS_7.3-40          Matrix_1.2-0
[31] munsell_0.4.2        nnet_7.3-9           plyr_1.8.1
[34] proto_0.3-10         RColorBrewer_1.1-2   reshape2_1.4.1
[37] rpart_4.1-9          RSQLite_1.0.0        scales_0.2.4
[40] sendmailR_1.2-1      splines_3.1.1        stringr_0.6.2
[43] survival_2.38-1      tools_3.1.1          XML_3.98-1.1
[46] xtable_1.7-4         XVector_0.6.0
rnaseq deseq2 • 530 views
ADD COMMENTlink modified 3.4 years ago by Ryan C. Thompson7.4k • written 3.4 years ago by ldetorrente10
Answer: DESeq2 how to define my conditions when I want the results comparing a subset of
gravatar for Ryan C. Thompson
3.4 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.4k wrote:

It sounds like you want to fit a patient-specific tissue effect for each tissue. This is an interaction model, and the following approach will probably give you what you're looking for. First, I'll create some example data similar to your own, with 3 tissue types, 3 patients, 5 replicate samples for each combination of patient & tissue, and then randomly split the samples between 3 lab techs:

tissue <- rep(c("TT1", "TT2", "TT3"), times=3)
pat <- rep(c("PAT1", "PAT2", "PAT3"), each=3)
replicate <- rep(sprintf("REP%i", 1:5), each=length(pat))
tech <- sample(c("TECH1", "TECH2", "TECH3"), size=length(replicate), replace=TRUE)
sampletable <- data.frame(Tissue=tissue, Patient=pat, Replicate=replicate, Technician=tech)

Now, the design you probably want is ~Patient*Tissue - Tissue + Technician, which will give you the following coefficients to test:

> colnames(model.matrix(~Patient*Tissue - Tissue + Technician, samples))
[1] "(Intercept)"           "PatientPAT2"           "PatientPAT3"        
[4] "TechnicianTECH2"       "TechnicianTECH3"       "PatientPAT1:TissueTT2"
[7] "PatientPAT2:TissueTT2" "PatientPAT3:TissueTT2" "PatientPAT1:TissueTT3"
[10] "PatientPAT2:TissueTT3" "PatientPAT3:TissueTT3"

Coefficients 6 through 11 are the patient-specific tissue effects. Each coefficient compares either tissue 2 or 3 vs tissue 1, which is used as the baseline for tissue comparisons. If you want to compare tissues 2 and 3, just contrast the two tissue coefficients for a given patient.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Ryan C. Thompson7.4k

I agree with Ryan.

In addition, you need to update to the latest version of R and Bioconductor. You are using DESeq2 v1.6 which is 1.5 years old and no longer supported. Here are instructions:

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Michael Love25k

Thank you for the fast answer it helps me understand  how I can have the results for the patient-specific tissue effects.

About my other question with the "remaining" samples, let's say I want to know the patient-specific tissue effects of only 3 patients Pat1, Pat2, Pat3, what do I do with the other samples? Like I said I have 35 patients but some of them have only 1 sample and even if I had more it would take forever to run your model with 35 patients, 3 technician and 3 Tissue Type for 298 samples. How do you do in this case? Do you create a new dataset with only your 3 patients and 45 samples? Or do you just define the Patient factor as: Pat1 (15 samples), Pat2 (15 samples), Pat 3 (15 samples) and Other (remaining 253 samples) (to have only 4 levels) ?

For the latest version of R, unfortunately it's not my decision, I am working on our cluster and they didn't update the R version yet. 



ADD REPLYlink written 3.4 years ago by ldetorrente10

It's up to you as the analyst what to do here, you can either subset and perform the analysis on a subset or include all the samples and extract specific terms.

This is a FAQ we discuss in the DESeq2 vignette: "If I have multiple groups, should I run all together or split into pairs of groups?"

It's absolutely worth an email to the cluster maintainers to update the version of R. This is their job and it will take someone just a few hours to do. Bioinformatics moves too fast for you to be working with 1.5 year old software.

ADD REPLYlink written 3.4 years ago by Michael Love25k

Great, thank you so much for the help, I read your vignette 1000 times but for some reason I have skipped this part apparently. 

ADD REPLYlink written 3.4 years ago by ldetorrente10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 319 users visited in the last hour