Question

Analyze time course experiment with multiple samples from multiple animals with DESeq2

0

Entering edit mode

Caleb Bostwick ▴ 30

@caleb-bostwick-6580

Last seen 2.2 years ago

United States

Hello, I am a graduate student who would like to know if I am designing my differential expression experiment correctly using DESeq2. I performed RNA-seq on tissues originating from 14 total animals at 4 time points of exposure to a drug. 3 animals were exposed for 0 hours (no exposure; control), 3 animals were exposed for 0.5 hours, 4 animals were exposed for 1 hour, and 4 animals were exposed for 2 hours. From each animal, I extracted multiple tissue samples - from the controls I got 8 tissue samples from each animal, for the drug-exposed animals I was only (due to experimental difficulties) able to get some of the same tissues (the best was 7 of 8 tissues samples from animal #6, the worst was 3 of 8 from animals #8 and #9). Additionally, during extraction, some of the tissues were slightly damaged; due to this, they were excluded from drug treatment (e.g. Ac5_LPeG has 0hr in the treatment column below, while the remaining tissues from Ac5 were subjected to 2hr of drug treatment). My sample information column data is as follows:

colData
                 Animal       Tissue        Treatment
Ac1c_AG            Ac1c           AG              0hr
Ac1c_CG            Ac1c           CG              0hr
Ac1c_LPeG          Ac1c         LPeG              0hr
Ac1c_RPeG          Ac1c         RPeG              0hr
Ac1c_LPlG          Ac1c         LPlG              0hr
Ac1c_RPlG          Ac1c         RPlG              0hr
Ac1c_LBG           Ac1c          LBG              0hr
Ac1c_RBG           Ac1c          RBG              0hr
Ac2c_AG            Ac2c           AG              0hr
Ac2c_CG            Ac2c           CG              0hr
Ac2c_LPeG          Ac2c         LPeG              0hr
Ac2c_RPeG          Ac2c         RPeG              0hr
Ac2c_LPlG          Ac2c         LPlG              0hr
Ac2c_RPlG          Ac2c         RPlG              0hr
Ac2c_LBG           Ac2c          LBG              0hr
Ac2c_RBG           Ac2c          RBG              0hr
Ac3c_AG            Ac3c           AG              0hr
Ac3c_CG            Ac3c           CG              0hr
Ac3c_LPeG          Ac3c         LPeG              0hr
Ac3c_RPeG          Ac3c         RPeG              0hr
Ac3c_LPlG          Ac3c         LPlG              0hr
Ac3c_RPlG          Ac3c         RPlG              0hr
Ac3c_LBG           Ac3c          LBG              0hr
Ac3c_RBG           Ac3c          RBG              0hr
Ac1_AG              Ac1           AG            0.5hr
Ac1_CG              Ac1           CG            0.5hr
Ac1_LPeG            Ac1         LPeG            0.5hr
Ac1_LPlG            Ac1         LPlG            0.5hr
Ac1_BG              Ac1           BG            0.5hr
Ac2_AG              Ac2           AG            0.5hr
Ac2_CG              Ac2           CG            0.5hr
Ac2_LPeG            Ac2         LPeG              0hr
Ac2_RPeG            Ac2         RPeG            0.5hr
Ac2_RPlG            Ac2         RPlG            0.5hr
Ac2_BG              Ac2           BG            0.5hr
Ac3_AG              Ac3           AG            0.5hr
Ac3_CG              Ac3           CG            0.5hr
Ac3_LPeG            Ac3         LPeG            0.5hr
Ac3_LPlG            Ac3         LPlG            0.5hr
Ac3_BG              Ac3           BG            0.5hr
Ac4_AG              Ac4           AG              2hr
Ac4_CG              Ac4           CG              2hr
Ac4_LPeG            Ac4         LPeG              2hr
Ac4_BG              Ac4           BG              2hr
Ac5_AG              Ac5           AG              2hr
Ac5_CG              Ac5           CG              2hr
Ac5_LPeG            Ac5         LPeG              0hr
Ac5_RPeG            Ac5         RPeG              2hr
Ac5_RPlG            Ac5         RPlG              2hr
Ac5_BG              Ac5           BG              2hr
Ac6_AG              Ac6           AG              2hr
Ac6_CG              Ac6           CG              2hr
Ac6_LPeG            Ac6         LPeG              2hr
Ac6_RPeG            Ac6         RPeG              2hr
Ac6_LPlG            Ac6         LPlG              2hr
Ac6_RPlG            Ac6         RPlG              2hr
Ac6_BG              Ac6           BG              2hr
Ac7_AG              Ac7           AG              1hr
Ac7_CG              Ac7           CG              1hr
Ac7_LPeG            Ac7         LPeG              0hr
Ac7_RPeG            Ac7         RPeG              1hr
Ac7_RPlG            Ac7         RPlG              1hr
Ac7_BG              Ac7           BG              1hr
Ac8_AG              Ac8           AG              1hr
Ac8_CG              Ac8           CG              1hr
Ac8_LPeG            Ac8         LPeG              1hr
Ac9_LPeG            Ac9         LPeG              1hr
Ac9_RPlG            Ac9         RPlG              1hr
Ac9_BG              Ac9           BG              1hr
Ac10_AG            Ac10           AG              2hr
Ac10_CG            Ac10           CG              2hr
Ac10_LPeG          Ac10         LPeG              2hr
Ac10_LPlG          Ac10         LPlG              2hr
Ac10_BG            Ac10           BG              2hr
Ac11_AG            Ac11           AG              1hr
Ac11_CG            Ac11           CG              1hr
Ac11_RPeG          Ac11         RPeG              1hr
Ac11_RPlG          Ac11         RPlG              1hr
Ac11_BG            Ac11           BG              1hr

I want to make several multifactor comparisons.

First, I would like to compare a time course of the drug effects (treatment) on each tissue type (ideally for each tissue, e.g AG 0hr vs AG 2hr - however I realize some tissues/timepoints don't have at least 3 replicates and I am prepared to disregard those). Examining the column data, one can see the 3 control animals have tissues LBG and RBG, while the drug-exposed animals only have BG (due to experimental difficulties). The LBG and RBG are half-portions of the whole BG tissue in the animals. To compare these tissues, I have averaged the counts for each row (gene) from the LBG and RBG to get a "pseudo-BG" count for each control animal - is this correct/reasonable to do, and if not, is there an accepted way to deal with this issue?
I would also like to compare differences between the tissues (e.g. 0hr AG vs 0hr CG). I am unsure how to design the formula to control for the treatment variation (and inter-animal variation?) in order to increase the sensitivity for finding differences due to the tissue type.
Finally, I would like to compare differences between individual animals using all the tissues extracted from that animal (particularly those in the same treatment group - eg. Ac1c vs Ac2c or Ac4 vs Ac6 - here I wonder if I should exclude "damaged" samples like Ac5_LPeG from the overall Ac5 "profile" ?)

So far, I have combined the tissue type and treatment time into one factor (group) as follows:

dds$group <- factor(paste0(dds$Tissue, dds$Treatment))

and used this as my experimental design:

design(dds) <- ~ group

to give the following colData:

colData(dds)
DataFrame with 76 rows and 5 columns
                 Animal       Tissue        Treatment    group sizeFactor
               <factor>     <factor>         <factor> <factor>  <numeric>
Ac1c_AG            Ac1c           AG              0hr    AG0hr   1.004482
Ac1c_CG            Ac1c           CG              0hr    CG0hr   1.629999
Ac1c_LPeG          Ac1c         LPeG              0hr  LPeG0hr   1.412265
Ac1c_RPeG          Ac1c         RPeG              0hr  RPeG0hr   2.084133
Ac1c_LPlG          Ac1c         LPlG              0hr  LPlG0hr   2.176424
...                 ...          ...              ...      ...        ...
Ac11_AG            Ac11           AG              1hr    AG1hr  1.5952341
Ac11_CG            Ac11           CG              1hr    CG1hr  0.4082796
Ac11_RPeG          Ac11         RPeG              1hr  RPeG1hr  2.3312163
Ac11_RPlG          Ac11         RPlG              1hr  RPlG1hr  0.6341157
Ac11_BG            Ac11           BG              1hr    BG1hr  1.6152299

I have been using this design to make contrasts such as:

AGres2v0hr <-  results(dds, contrast=c("group","AG2hr","AG0hr"))

to answer my first comparison question about the time course of the treatment on each tissue, as well as:

AGvCGres0hr <-  results(dds, contrast=c("group","AG0hr","CG0hr"))

to compare differences between the tissues.

However, I am not confident this design is always best for controlling variation and making the 3 types of comparisons I listed previously. Should I be using different designs to make the various comparisons I want? I am concerned about not controlling for inter-animal variation because when I perform an exploratory PCA after using the variance stabilizing transformation:

var <- varianceStabilizingTransformation(dds, blind = F)
pca <- plotPCA(var, intgroup = c("Treatment", "Tissue")

I get a similar overall clustering pattern with each tissue - i.e. Ac1c and Ac3c are always closely together to the right on the x-axis (PC1) while Ac4 is always to the far left on the x-axis for each tissue group. My collaborators are concerned this demonstrates that inter-animal variation has a greater effect on clustering than the effects of my drug treatment. My last question is: can we discern that is true from this PCA plot and is it possibly due to my experimental design?

My apologies for the extremely long post. I have highlighted key questions from the post to assist in locating them. If there is any additional information or clarification needed, please just let me know and I can provide it. I greatly appreciate any insight or help on these topics. Thank you very much for taking the time to read this. Finally, below is my R session information:

Best,

Caleb

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggplot2_2.2.1              DESeq2_1.16.1              SummarizedExperiment_1.6.3 DelayedArray_0.2.7         matrixStats_0.52.2        
 [6] Biobase_2.36.2             GenomicRanges_1.28.4       GenomeInfoDb_1.12.2        IRanges_2.10.2             S4Vectors_0.14.3          
[11] BiocGenerics_0.22.0        edgeR_3.18.1               limma_3.32.5              

loaded via a namespace (and not attached):
 [1] genefilter_1.58.1       locfit_1.5-9.1          splines_3.4.1           lattice_0.20-35         colorspace_1.3-2        htmltools_0.3.6        
 [7] base64enc_0.1-3         blob_1.1.0              survival_2.41-3         XML_3.98-1.9            rlang_0.1.1             DBI_0.7                
[13] foreign_0.8-69          BiocParallel_1.10.1     bit64_0.9-7             RColorBrewer_1.1-2      GenomeInfoDbData_0.99.0 plyr_1.8.4             
[19] stringr_1.2.0           zlibbioc_1.22.0         munsell_0.4.3           gtable_0.2.0            htmlwidgets_0.9         memoise_1.1.0          
[25] labeling_0.3            latticeExtra_0.6-28     knitr_1.16              geneplotter_1.54.0      AnnotationDbi_1.38.2    htmlTable_1.9          
[31] Rcpp_0.12.12            acepack_1.4.1           xtable_1.8-2            scales_0.4.1            backports_1.1.0         checkmate_1.8.3        
[37] Hmisc_4.0-3             annotate_1.54.0         XVector_0.16.0          bit_1.1-12              gridExtra_2.2.1         digest_0.6.12          
[43] stringi_1.1.5           grid_3.4.1              tools_3.4.1             bitops_1.0-6            magrittr_1.5            RSQLite_2.0            
[49] lazyeval_0.2.0          RCurl_1.95-4.8          tibble_1.3.3            Formula_1.2-2           cluster_2.0.6           Matrix_1.2-10          
[55] data.table_1.10.4       rpart_4.1-11            nnet_7.3-12             compiler_3.4.1

deseq2 multiple factor design multifactorial design multiple time points design and contrast matrix • 1.3k views

ADD COMMENT • link updated 6.7 years ago by Michael Love 41k • written 6.7 years ago by Caleb Bostwick ▴ 30

score 0 · Answer 1 · 2017-08-17

hi Caleb,

DESeq2 estimates fixed effects for coefficients. And this means you can't control for the animal-level variation and compare directly across groups in which the animals are nested, e.g. the time point (labeled "treatment" above), which is your goal (1) above. The reason is that these effects are confounded. If you take a look at the vignette, you may have seen this section, where we show how you can make comparisons of effects that are within animal (e.g. tissue), and then compare those effects across timepoint: are the differences between these two tissue different across time point. But you can't directly compare e.g. time 1 to time 2 and try to simultaneously control for animal variation using the fixed effects which DESeq2 uses. If you want to do this, you have to use something like random effects modeling. An implementation that accounts for known correlations between samples nested within groups that you wish to compare is limma's duplicateCorrelation(). You may want to try that framework (limma-voom for RNA-seq), if those direct comparisons are necessary.

You don't want to create pseudosamples comprised of biological replicates and give these to the statistical models as if they were a single biological replicate. It's critical that you not collapse the biological variation for certain samples.