Question

DESeq2 time course with two genotypes

0

Entering edit mode

cp1015 • 0

@cp1015-13112

Last seen 2.8 years ago

United States

Hello All,

I have 90 samples in total for my dataset with two genotypes, A (wild-type) and B (single gene mutant). Each genotype is broken down into 5 treatments groups, which represent the time since the treatment was applied. Each treatment has 3 biological replicates and this experiment setup was repeated 3 times. So there are 3 biological replicates from each experiment (E1,E2,E3), giving a total of 9 biological replicates per treatment. So for the genotype A, the breakdown is:

Treatment	Biological Replicates
0hr/Control	9 samples (3 from E1, 3 from E2, 3 from E3)
12hrs	9 samples (3 from E1, 3 from E2, 3 from E3)
24hrs	9 samples (3 from E1, 3 from E2, 3 from E3)
36hrs	9 samples (3 from E1, 3 from E2, 3 from E3)
48hrs	9 samples (3 from E1, 3 from E2, 3 from E3)

(The same setup exists for genotype B)

My initial thought was to run the analyses separately for each genotype, but I saw in other posts that it would probably be better to combine the genotypes into the same dataset. I did so using the following code:

The design is meant to test for the effect of treatment time (12hrs, 24hrs, 36hrs, 48hrs) while controlling for the effect of the different experiments (E1, E2, E3) and genotype.

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design =~ experiment + genotype + treatment)

colData(ddsHTSeq)$treatment <- factor(colData(ddsHTSeq)$treatment, levels = c("Control", "12hrs", "24hrs", "36hrs", "48hrs"))
colData(ddsHTSeq)$experiment <- factor(colData(ddsHTSeq)$experiment, levels = c("E1", "E2", "E3"))
colData(ddsHTSeq)$genotype <- factor(colData(ddsHTSeq)$genotype, levels = c("A", "B))

dds <- DESeq(ddsHTSeq)

I am looking to find the genes that change across the treatment relative to the 0h/Control. I want to find genes that have a large (+/-) log fold-change at either 12hrs or 24hrs and then come back to 0 in 36hrs and 48hrs. My thought was to perform a pair-wise comparison between the treatments and the Control. I would then take the significant genes for 12hrs and 24hrs and plot the log fold-changes for the comparisons. I also used the lfcShrink function with the apeglm setting. (See attached plot) To find the genes of interest, I would then perform some clustering.

LFCs for significant genes from 12hrs and 24hrs

I think my approach is appropriate, but I wanted some confirmation of this.

Also, because I am including genotype in the design formula, the log fold-changes for the genes in the plot are the average between the two genotypes or only the log fold-changes for the wild-type genotype?

Thanks in advance!

>   sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] tcltk     parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Mfuzz_2.44.0                DynDoc_1.62.0               widgetTools_1.62.0          e1071_1.7-2                 pheatmap_1.0.12            
 [6] scales_1.0.0                tidyr_0.8.3                 RColorBrewer_1.1-2          gplots_3.0.1.1              ggplot2_3.2.1              
[11] dplyr_0.8.3                 tibble_2.1.3                readr_1.3.1                 DESeq2_1.24.0               SummarizedExperiment_1.14.1
[16] DelayedArray_0.10.0         BiocParallel_1.18.1         matrixStats_0.55.0          Biobase_2.44.0              GenomicRanges_1.36.1       
[21] GenomeInfoDb_1.20.0         IRanges_2.18.2              S4Vectors_0.22.0            BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            tools_3.6.1            backports_1.1.4        R6_2.4.0               rpart_4.1-15          
 [7] KernSmooth_2.23-15     Hmisc_4.2-0            DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1       nnet_7.3-12           
[13] withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3          bit_1.1-14             compiler_3.6.1         htmlTable_1.13.1      
[19] labeling_0.3           caTools_1.17.1.2       checkmate_1.9.4        genefilter_1.66.0      stringr_1.4.0          digest_0.6.20         
[25] foreign_0.8-72         XVector_0.24.0         base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6        htmlwidgets_1.3       
[31] rlang_0.4.0            rstudioapi_0.10        RSQLite_2.1.2          gtools_3.8.1           acepack_1.4.1          RCurl_1.95-4.12       
[37] magrittr_1.5           GenomeInfoDbData_1.2.1 Formula_1.2-3          Matrix_1.2-17          Rcpp_1.0.2             munsell_0.5.0         
[43] stringi_1.4.3          zlibbioc_1.30.0        grid_3.6.1             blob_1.2.0             gdata_2.18.0           crayon_1.3.4          
[49] lattice_0.20-38        splines_3.6.1          annotate_1.62.0        hms_0.5.1              locfit_1.5-9.1         zeallot_0.1.0         
[55] knitr_1.24             pillar_1.4.2           tkWidgets_1.62.0       geneplotter_1.62.0     XML_3.98-1.20          glue_1.3.1            
[61] latticeExtra_0.6-28    data.table_1.12.2      vctrs_0.2.0            gtable_0.3.0           purrr_0.3.2            assertthat_0.2.1      
[67] xfun_0.9               xtable_1.8-4           class_7.3-15           survival_2.44-1.1      AnnotationDbi_1.46.1   memoise_1.1.0         
[73] cluster_2.1.0

deseq2 timecourse michael love multiple time points • 946 views

ADD COMMENT • link 4.6 years ago cp1015 • 0

score 0 · Answer 1 · 2019-10-02

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 5 hours ago

United States

hi,

I also used the lfcShrink function with the apeglm setting. (See attached plot) To find the genes of interest, I would then perform some clustering.

I've had a similar dataset in the past, and this is exactly what I did as well: calculate posterior LFCs for each time point against 0 and then clustered the genes by their posterior LFCs. It really helped distinguish signal from noise, because apeglm will shrink noisy LFCs to 0. I wouldn't bother with p-values, but you can just use LFC alone or with svalues from lfcShrink.

Also, because I am including genotype in the design formula, the log fold-changes for the genes in the plot are the average between the two genotypes or only the log fold-changes for the wild-type genotype?

The average.

ADD COMMENT • link 4.6 years ago Michael Love 41k

0

Entering edit mode

Great, thank you for your response Michael.

For the second part, regarding the log fold-changes, if I wanted only the log fold-changes for the wild-type genotype I would add an interaction term to the design to make it:

design =~ experiment + genotype + condition + condition:genotype

Then the wild-type log fold-changes would be represented by the main condition effect, right?

When I tried this though, I got much fewer significant genes for the wild-type and no significant genes for the mutant genotype compared to the design without the interaction, as shown below.

Design =~ experiment + genotype + condition

genotypeBvsA
- 216
condition12hrsvs0h
- 205
condition24hrsvs0h
- 553
condition36hrsvs0h
- 1
condition48hrsvs0h
- 30

Design: =~ experiment + genotype * condition

genotypeBvsA
- 37
condition12hrsvs0h
- 21
condition24hrsvs0h
- 30
condition36hrsvs0h
- 2
condition48hrsvs0h
- 1
genotypeB.condition12hrs
- 0
genotypeB.condition24hrs
- 0
genotypeB.condition36hrs
- 0
genotypeB.condition48hrs
- 0

These may be novice questions but could you please provide some insight into:

What is leading to these different results as far as the number of significant genes?
What research question would the initial design (without the interaction) be testing?

ADD REPLY • link 4.6 years ago cp1015 • 0

0

Entering edit mode

1) One reason you can see less DEG thresholding on adjusted p-value is that you've added parameters to estimate while keeping your sample size the same.

2) The initial one is looking at the condition effect across both genotypes (assuming that it is identical) and controlling for any shift that is due to genotype.

ADD REPLY • link 4.6 years ago Michael Love 41k