Hi,
I'm looking for help with a complex experiment in DESeq2. I've already read the vignette and help functions, but I'm still not sure if these data can actually be analyzed properly with this strategy.
My experiment involves two different library types (RNAseq and riboSeq) and due to the difference in preparation method, I ended up with about an order of magnitude difference in total counts between the two sample types, plus variation in library size within each sequencing type (table below). I know that DESeq2 can handle random variation in library sizes, but does this apply when there's an experiment-wide confounding factor like this? Should I be changing any of the normalization/dispersion functions to non-default methods in order to better handle this situation?
My experiment design is below, as is my info table (with an extra column showing the total counts in each sample).
I'm mainly interested in the interaction term Condition:SeqType for each drug condition in comparison to the control condition, but I'm also using subsetted counts/info tables to look separately at DE in RNAseq and DE in separate DESeqDataSets. I am getting results (as in I have the code working and it is outputting something normal-looking) but I don't know if I should trust them because of this issue. This is my first time doing RNAseq analysis outside a class/workshop setting so I'd really appreciate any insights anyone can offer - thanks!
dds <- DESeqDataSetFromMatrix(countData = counts, colData = info, design = ~ Batch + Condition + SeqType + Condition:SeqType)
dds$Condition <- relevel(dds$Condition, "ctrl")
dds$SeqType <- relevel(dds$SeqType, "rna")
results <- results(dds, name="Conditiondrug1.SeqTyperibo")
SampleID Condition SeqType Batch counts
1 ctrl ribo a 1942364
2 drug1 ribo a 3001103
3 drug2 ribo a 3044069
4 drug3 ribo a 3907332
5 drug4 ribo a 3882473
6 drug5 ribo a 7170015
7 drug1 ribo b 2185257
8 drug3 ribo b 3113696
9 drug5 ribo b 2526094
10 ctrl ribo b 2969142
11 drug4 ribo b 2595123
12 ctrl rna a 6956456
13 drug1 rna a 19859915
14 drug2 rna a 17286784
15 drug3 rna a 14246168
16 drug4 rna a 14623399
17 drug5 rna a 18566015
18 drug1 rna b 12868576
19 drug3 rna b 16169977
20 drug5 rna b 12427488
21 ctrl rna b 19013439
22 drug2 rna b 17368432
23 drug4 rna b 15375688
Thank you for your reply! I ended up filtering by counts with a little bit less restrictive parameters than you mentioned here (something like genes with >1 count in 90% of ribosome profiling samples & >1 count in 90% of RNAseq samples), as I have a lot of samples and it seemed pretty restrictive to require expression in every single one. I ended up with ~11500 genes in my counts table input to DEseq and I'm satisfied with that for this preliminary work at least.
I have a somewhat related followup question -
I had previously been inputting the filtered and then subsetted riboSeq and RNAseq counts tables into separate DEseq objects to do regular differential expression in each sequencing type, ie:
I'm now revisiting this and because I have a known batch effect that exists across assay type, I'm concerned about accounting for that separately in the two different seqType-specific DE analyses, and thought to maybe handle it like this instead, by combining
seqType
+condition
into onegroup
term (i.e.group <- factor(paste0(seqType, condition))
as described in the vignette):However, I think I understood that this approach is possibly problematic because of the differences in library size? So would it then be better to get the per-seqType DE using dds.all with the size factors determined by seqType (so separate library size normalization for each subset of samples - riboSeq and RNAseq) as described in this post?
When I did the analysis this way (with seqType-specific size factors but also just using the dds.all with sizeFactors estimated across all samples), the results mostly overlap with the original method, except that it seems like there's a greater statistical power for the riboSeq-only DE at the expense of some power in the RNAseq-only DE (I have more significant hits in riboSeq and lose some in RNAseq). I'm comfortable with the tradeoff based on my biological question/downstream plans but I just would like to know if this is a reasonable approach.
If it's appropriate to use the dds.all and seqType-specific sizeFactor strategy, would I then want to get the interaction term from
drug1.dTE.res <- results(dds.dTE, name="conditiondrug1.seqTyperibo")
using the original calculated size factors across all samples (from dds.dTE), or using the seqType-specific size factors? Again there's a good amount of overlap between both approaches so I'm mostly just hoping to hear your thoughts about if it's reasonable to use this strategy.Take a look at the example here:
DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling)
Essentially, yes we do recommend using an interaction term.
If you deal with the confounded sequencing depth using one of the approaches above, then I would recommend to just follow standard DESeq2 steps.