Question

DESeq2 how to define my conditions when I want the results comparing a subset of samples

0

Entering edit mode

ldetorrente ▴ 10

@ldetorrente-10730

Last seen 8.0 years ago

Hello,

I am working on a large breast cancer dataset with 298 samples from 35 patients (myData). I have 3 different Tissue Type (TT1, TT2, TT3) and I saw that there was a batch effect due to the technician that did the experiment. As some of my patients have only 1 sample at the moment I am not using it in my design so when I wanted to compare TT1 with TT2 I used the following code:

dds <- DESeqDataSetFromMatrix(countData = myData,colData=batch[,c("Sample_ID","Tissue_Type","Technician")],design = ~ Technician+Tissue_Type)
dedds<-DESeq(dds)
res1=results(dedds,contrast=c("Tissue_Type","TT1","TT2"))

Now, I am interested in knowing in a specific patient (PAT1) what are the differentially expressed genes between TT1 and TT2 as I know it can be patient specific. PAT1 has 12 samples TT1, 5 samples TT2 and 3 samples TT3. What I did is that I redefined the Tissue Type in the following way:

Tissue_Type=rep("Other",298)
Tissue_Type[intersect(grep("TT1",batch$Tissue_Type),which(batch$Patient_ID=="PAT1"))]="TT1_PAT1"
Tissue_Type[intersect(grep("TT2",batch$Tissue_Type),which(batch$Patient_ID=="PAT1"))]="TT2_PAT1"
batch$Tissue_Type=factor(Tissue_Type)
dds <- DESeqDataSetFromMatrix(countData = myData,colData=batch[,c("Sample_ID","Tissue_Type","Technician")],design = ~ Technician+Tissue_Type)
dedds<-DESeq(dds)
res2=results(dedds,contrast=c("Tissue_Type","TT1_PAT1","TT2_PAT1"))

Is it correct? Or should I split "Other" in another way (like TT1_Other, TT2_Other etc.)? Or should I just take the dataset with only the 17 samples I am interested in? I tried this last solution and it gives me a totally different list of genes, which seems right because the size factor and the correction for the technician would be totally different when using only 17 samples.

Thank you for your help!

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] sva_3.12.0                genefilter_1.48.1
 [3] mgcv_1.8-6                nlme_3.1-120
 [5] DESeq2_1.6.3              RcppArmadillo_0.4.650.1.1
 [7] Rcpp_0.11.5               GenomicRanges_1.18.4
 [9] GenomeInfoDb_1.2.5        IRanges_2.0.1
[11] S4Vectors_0.4.0           BiocGenerics_0.12.1

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.2
 [4] base64enc_0.1-2      BatchJobs_1.6        BBmisc_1.9
 [7] Biobase_2.26.0       BiocParallel_1.0.3   brew_1.0-6
[10] checkmate_1.5.2      cluster_2.0.1        codetools_0.2-11
[13] colorspace_1.2-6     DBI_0.3.1            digest_0.6.8
[16] fail_1.2             foreach_1.4.2        foreign_0.8-63
[19] Formula_1.2-1        geneplotter_1.44.0   ggplot2_1.0.1
[22] grid_3.1.1           gtable_0.1.2         Hmisc_3.15-0
[25] iterators_1.0.7      lattice_0.20-31      latticeExtra_0.6-26
[28] locfit_1.5-9.1       MASS_7.3-40          Matrix_1.2-0
[31] munsell_0.4.2        nnet_7.3-9           plyr_1.8.1
[34] proto_0.3-10         RColorBrewer_1.1-2   reshape2_1.4.1
[37] rpart_4.1-9          RSQLite_1.0.0        scales_0.2.4
[40] sendmailR_1.2-1      splines_3.1.1        stringr_0.6.2
[43] survival_2.38-1      tools_3.1.1          XML_3.98-1.1
[46] xtable_1.7-4         XVector_0.6.0

deseq2 rnaseq • 1.2k views

ADD COMMENT • link updated 8.6 years ago by Ryan C. Thompson ★ 7.9k • written 8.6 years ago by ldetorrente ▴ 10

score 3 · Accepted Answer · 2016-05-18

3

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 9 weeks ago

Icahn School of Medicine at Mount Sinai…

It sounds like you want to fit a patient-specific tissue effect for each tissue. This is an interaction model, and the following approach will probably give you what you're looking for. First, I'll create some example data similar to your own, with 3 tissue types, 3 patients, 5 replicate samples for each combination of patient & tissue, and then randomly split the samples between 3 lab techs:

tissue <- rep(c("TT1", "TT2", "TT3"), times=3)
pat <- rep(c("PAT1", "PAT2", "PAT3"), each=3)
replicate <- rep(sprintf("REP%i", 1:5), each=length(pat))
tech <- sample(c("TECH1", "TECH2", "TECH3"), size=length(replicate), replace=TRUE)
sampletable <- data.frame(Tissue=tissue, Patient=pat, Replicate=replicate, Technician=tech)

Now, the design you probably want is ~Patient*Tissue - Tissue + Technician, which will give you the following coefficients to test:

> colnames(model.matrix(~Patient*Tissue - Tissue + Technician, samples))
[1] "(Intercept)"           "PatientPAT2"           "PatientPAT3"        
[4] "TechnicianTECH2"       "TechnicianTECH3"       "PatientPAT1:TissueTT2"
[7] "PatientPAT2:TissueTT2" "PatientPAT3:TissueTT2" "PatientPAT1:TissueTT3"
[10] "PatientPAT2:TissueTT3" "PatientPAT3:TissueTT3"

Coefficients 6 through 11 are the patient-specific tissue effects. Each coefficient compares either tissue 2 or 3 vs tissue 1, which is used as the baseline for tissue comparisons. If you want to compare tissues 2 and 3, just contrast the two tissue coefficients for a given patient.

ADD COMMENT • link 8.6 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I agree with Ryan.

In addition, you need to update to the latest version of R and Bioconductor. You are using DESeq2 v1.6 which is 1.5 years old and no longer supported. Here are instructions:

http://www.bioconductor.org/install

ADD REPLY • link 8.6 years ago Michael Love 43k

0

Entering edit mode

Thank you for the fast answer it helps me understand how I can have the results for the patient-specific tissue effects.

About my other question with the "remaining" samples, let's say I want to know the patient-specific tissue effects of only 3 patients Pat1, Pat2, Pat3, what do I do with the other samples? Like I said I have 35 patients but some of them have only 1 sample and even if I had more it would take forever to run your model with 35 patients, 3 technician and 3 Tissue Type for 298 samples. How do you do in this case? Do you create a new dataset with only your 3 patients and 45 samples? Or do you just define the Patient factor as: Pat1 (15 samples), Pat2 (15 samples), Pat 3 (15 samples) and Other (remaining 253 samples) (to have only 4 levels) ?

For the latest version of R, unfortunately it's not my decision, I am working on our cluster and they didn't update the R version yet.

ADD REPLY • link 8.6 years ago ldetorrente ▴ 10

1

Entering edit mode

It's up to you as the analyst what to do here, you can either subset and perform the analysis on a subset or include all the samples and extract specific terms.

This is a FAQ we discuss in the DESeq2 vignette: "If I have multiple groups, should I run all together or split into pairs of groups?"

It's absolutely worth an email to the cluster maintainers to update the version of R. This is their job and it will take someone just a few hours to do. Bioinformatics moves too fast for you to be working with 1.5 year old software.

ADD REPLY • link 8.6 years ago Michael Love 43k

0

Entering edit mode

Great, thank you so much for the help, I read your vignette 1000 times but for some reason I have skipped this part apparently.

ADD REPLY • link 8.6 years ago ldetorrente ▴ 10