Question

DESeq2 DE analysis of host-pathogen samples (model separately or jointly?)

1

dudutchy ▴ 10

@55dd470a

Last seen 5 days ago

United States

Hello,

We're working on a differential expression (DE) analysis using DESeq2 for a dataset involving a eukaryotic host experimentally infected with a virus.

Dataset: Our design includes comparisons across different infection treatments, between infected samples and non-infected controls, and across four time points. Combining these three variables gives 16 different "conditions", and we have 4-5 biological replicates per condition after QC.

Read quantification: We quantified transcript expression using Salmon, with a combined index that includes both host and viral transcriptomes. We used tximport to map the transcripts to host and viral genes. Across all samples, about 99% of reads map to the host.

Our goal is to analyze differential expression in both host and viral genes. This leads to a key question:

Should we perform DEG analysis on the combined host + virus transcriptome in DESeq2, or analyze host and viral genes separately?

A different post here suggested that a joint analysis is often the best choice unless "the inter-sample variability (e.g. the spread of points as you could see in a PCA plot -- see vignette) is vastly different across subsets." This is certainly the case here, if I understand this point correctly. A PCA of VST-transformed counts separates time point 3 from everything else on PC1 (~35% of variance explained) for both the joint analysis and the host alone (these two are very similar), but PC1 of the virus-only PCA looks very different (~84%, separating time point 1). This makes sense biologically to us, but we're not sure whether it qualifies as one of the situations that warrants a separate DE analysis.

We've noticed substantial differences in results depending on the approach. For example, in a particular contrast between infected treatments, we identify 172 significant viral DEGs when analyzing the full dataset (all of them in the same direction compared to the control), but only a handful when restricting the analysis to viral genes alone.

Subsetting to viral genes results in much smaller library sizes, and we're also considering that host and viral genes may be influenced by different biological processes or technical factors. Given these considerations, what would be the most appropriate strategy for this kind of analysis?

Thanks in advance for your insights!

DESeq2 • 2.1k views

score 2 · Answer 1 · 2025-05-27

2

Michael Love 43k

@mikelove

Last seen 10 days ago

United States

It makes sense to me to separate into the two groups. You may want to normalize in one go, since normalization accounts for shared technical bias. But the dispersion could differ between the two groups, so the rest of the pipeline can be run separately.

E.g.

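dds <- estimateSizeFactors(dds)  # estimate size factors once, on host + viral genes together
# then separate, e.g. with a logical vector marking the viral rows
# (is_viral is a hypothetical name, not part of DESeq2)
dds_host <- dds[!is_viral, ]
dds_viral <- dds[is_viral, ]
# then assign SF from above
sizeFactors(dds_host) <- sizeFactors(dds)
sizeFactors(dds_viral) <- sizeFactors(dds)
# then continue with DESeq() -- it will not re-compute SF
dds_host <- DESeq(dds_host)
dds_viral <- DESeq(dds_viral)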
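
To illustrate why the size factors might be estimated in one go, here is a minimal base-R sketch of the median-of-ratios idea applied to a toy count matrix. The counts, the gene and sample names, and the helper `median_of_ratios()` are all made up for illustration; this is not the DESeq2 implementation and says nothing about the dataset in the question.

```r
# Toy counts (made-up numbers): rows are genes, columns are samples.
# Samples s3 and s4 are sequenced about twice as deeply as s1 and s2.
host <- matrix(
  c(100, 110,  200, 220,
    500, 480, 1000, 960,
     50,  55,  100, 110),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("host", 1:3), paste0("s", 1:4))
)
# Viral load is tenfold higher in s2 and s4, on top of the depth difference.
viral <- matrix(
  c(2, 20, 4, 40,
    1, 10, 2, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(paste0("virus", 1:2), paste0("s", 1:4))
)

# Hypothetical helper: per-sample median of (count / per-gene geometric mean).
median_of_ratios <- function(counts) {
  log_geo_means <- rowMeans(log(counts))
  apply(counts, 2, function(col) {
    keep <- is.finite(log_geo_means) & col > 0
    exp(median((log(col) - log_geo_means)[keep]))
  })
}

sf_joint <- median_of_ratios(rbind(host, viral))  # host + virus together
sf_viral <- median_of_ratios(viral)               # viral genes only

# Divide each sample (column) of the viral counts by its size factor.
viral_norm_joint <- t(t(viral) / sf_joint)
viral_norm_viral <- t(t(viral) / sf_viral)

print(round(sf_joint, 3))
print(round(sf_viral, 3))
print(round(viral_norm_joint, 2))
print(round(viral_norm_viral, 2))
```

In this toy case the joint size factors follow sequencing depth, so the viral counts normalized with them still differ roughly ninefold between low-load and high-load samples. The virus-only size factors absorb the change in viral load itself, and the normalized viral counts come out flat across samples. Real data will be noisier, so treat this only as an illustration of the mechanism.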

0

Thank you so much for your reply. This is in line with what I was thinking. I will do that asap and will update this post. Cheers

7 months ago dudutchy ▴ 10