Question

deseq2 - paired samples in 2 sequencing types with very different library sizes

0

Entering edit mode

cja • 0

@cja-23661

Last seen 2.9 years ago

United States

Hi,

I'm looking for help with a complex experiment in DESeq2. I've already read the vignette and help functions, but I'm still not sure if these data can actually be analyzed properly with this strategy.

My experiment involves two different library types (RNAseq and riboSeq) and due to the difference in preparation method, I ended up with about an order of magnitude difference in total counts between the two sample types, plus variation in library size within each sequencing type (table below). I know that DESeq2 can handle random variation in library sizes, but does this apply when there's an experiment-wide confounding factor like this? Should I be changing any of the normalization/dispersion functions to non-default methods in order to better handle this situation?

My experiment design is below, as is my info table (with an extra column showing the total counts in each sample).

I'm mainly interested in the interaction term Condition:SeqType for each drug condition in comparison to the control condition, but I'm also using subsetted counts/info tables to look separately at DE in RNAseq and DE in separate DESeqDataSets. I am getting results (as in I have the code working and it is outputting something normal-looking) but I don't know if I should trust them because of this issue. This is my first time doing RNAseq analysis outside a class/workshop setting so I'd really appreciate any insights anyone can offer - thanks!


dds <- DESeqDataSetFromMatrix(countData = counts, colData = info, design = ~ Batch + Condition + SeqType  + Condition:SeqType)
dds$Condition <- relevel(dds$Condition, "ctrl")
dds$SeqType <- relevel(dds$SeqType, "rna")

results <- results(dds, name="Conditiondrug1.SeqTyperibo")


SampleID    Condition   SeqType Batch   counts
1   ctrl    ribo    a   1942364
2   drug1   ribo    a   3001103
3   drug2   ribo    a   3044069
4   drug3   ribo    a   3907332
5   drug4   ribo    a   3882473
6   drug5   ribo    a   7170015
7   drug1   ribo    b   2185257
8   drug3   ribo    b   3113696
9   drug5   ribo    b   2526094
10  ctrl    ribo    b   2969142
11  drug4   ribo    b   2595123
12  ctrl    rna a   6956456
13  drug1   rna a   19859915
14  drug2   rna a   17286784
15  drug3   rna a   14246168
16  drug4   rna a   14623399
17  drug5   rna a   18566015
18  drug1   rna b   12868576
19  drug3   rna b   16169977
20  drug5   rna b   12427488
21  ctrl    rna b   19013439
22  drug2   rna b   17368432
23  drug4   rna b   15375688

deseq2 • 1.2k views

ADD COMMENT • link updated 3.1 years ago by Michael Love 41k • written 3.9 years ago by cja • 0

score 1 · Answer 1 · 2020-06-09

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 10 hours ago

United States

"experiment-wide confounding factor like this"

We actually mention that this is a pathological case for library size normalization in the 2014 paper.

There are some threads on the support site that get into the details, previously I have recommended filter such that all samples have a non-zero count. The issue arises when the lowly sequenced samples have 0's and the highly sequenced samples have non-zero, and the comparisons are confounded perfectly with sequencing depth. It's an unfortunate situation to have to remove a substantial amount of genes, but hard to avoid this problem with this type of technical confounding.

Another approach, besides filtering out the genes where 0's appear, would be to downsample the counts from the higher sequenced samples.

Give a shot at finding the previous posts, but these were the main suggestions I've offered previously.

ADD COMMENT • link 3.9 years ago Michael Love 41k

0

Entering edit mode

Thank you for your reply! I ended up filtering by counts with a little bit less restrictive parameters than you mentioned here (something like genes with >1 count in 90% of ribosome profiling samples & >1 count in 90% of RNAseq samples), as I have a lot of samples and it seemed pretty restrictive to require expression in every single one. I ended up with ~11500 genes in my counts table input to DEseq and I'm satisfied with that for this preliminary work at least.

I have a somewhat related followup question -

I had previously been inputting the filtered and then subsetted riboSeq and RNAseq counts tables into separate DEseq objects to do regular differential expression in each sequencing type, ie:

dds.dTE <- DESeqDataSetFromMatrix(countData = filtered_counts, colData = info, design = ~ batch + seqType + condition + condition:seqType)
dds.rna <- DESeqDataSetFromMatrix(countData = filtered_counts[ , 15:28], colData = info[15:28, ], design = ~ batch + condition)
dds.ribo <- DESeqDataSetFromMatrix(countData = filtered_counts[ , 1:14], colData = info[1:14, ], design = ~ batch + condition)
drug1.dTE.res <- results(dds.dTE, name="conditiondrug1.seqTyperibo")
drug1.rna.res <- results(dds.rna, name="condition_drug1_vs_ctrl")
drug1.ribo.res <- results(dds.ribo, name="condition_drug1_vs_ctrl")

I'm now revisiting this and because I have a known batch effect that exists across assay type, I'm concerned about accounting for that separately in the two different seqType-specific DE analyses, and thought to maybe handle it like this instead, by combining seqType+condition into one group term (i.e. group <- factor(paste0(seqType, condition)) as described in the vignette):

dds.all <- DESeqDataSetFromMatrix(countData = filtered_counts, colData = info[, c("id","batch","group")], design = ~ batch + group)
    drug1.rna.res.v2 <- results(dds.all, contrast = c("group", "drug1.rna","ctrl.rna"))
    drug1.ribo.res.v2 <- results(dds.all, contrast = c("group", "drug1.ribo","ctrl.ribo"))

However, I think I understood that this approach is possibly problematic because of the differences in library size? So would it then be better to get the per-seqType DE using dds.all with the size factors determined by seqType (so separate library size normalization for each subset of samples - riboSeq and RNAseq) as described in this post?

When I did the analysis this way (with seqType-specific size factors but also just using the dds.all with sizeFactors estimated across all samples), the results mostly overlap with the original method, except that it seems like there's a greater statistical power for the riboSeq-only DE at the expense of some power in the RNAseq-only DE (I have more significant hits in riboSeq and lose some in RNAseq). I'm comfortable with the tradeoff based on my biological question/downstream plans but I just would like to know if this is a reasonable approach.

If it's appropriate to use the dds.all and seqType-specific sizeFactor strategy, would I then want to get the interaction term from drug1.dTE.res <- results(dds.dTE, name="conditiondrug1.seqTyperibo") using the original calculated size factors across all samples (from dds.dTE), or using the seqType-specific size factors? Again there's a good amount of overlap between both approaches so I'm mostly just hoping to hear your thoughts about if it's reasonable to use this strategy.

ADD REPLY • link 3.1 years ago cja • 0

0

Entering edit mode

I had previously been inputting the filtered and then subsetted riboSeq and RNAseq counts tables into separate DESeqDataSet objects

Take a look at the example here:

DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling)

Essentially, yes we do recommend using an interaction term.

If you deal with the confounded sequencing depth using one of the approaches above, then I would recommend to just follow standard DESeq2 steps.

ADD REPLY • link 3.1 years ago Michael Love 41k