Hello Bioconductor community,
I have used DESeq2 to test for differences in the overall transcriptional activity of species in a microbiome sample. However, i am not quite sure whether what I'm doing is correct. Thus, I would be very thankful for somebody confirming that this solution makes sense or pointing me to an error.
I am comparing two different biological conditions. For each condition, i have sequencing data from DNA and RNA libraries. This coldata-table describes the data:
sample | condition | type |
---|---|---|
l_1_dna | low | dna |
l_2_dna | low | dna |
l_1_rna | low | rna |
l_2_rna | low | rna |
h_1_dna | high | dna |
h_2_dna | high | dna |
h_1_rna | high | rna |
h_2_rna | high | rna |
I use metagenome assembled genomes representing the microbial species which are present in the sample. To generate count data, i map DNA and RNA reads to these genomes. The DNA and RNA libraries have vastly different sequencing depth and read length, but from what i understand about the DESeq2 normalization I assume that this won't be an issue.
So i end up with a count table that looks like this (counts made up):
genome | dna_l1 | dna_l2 | dna_h1 | dna_h2 | rna_l1 | rna_l2 | rna_h1 | rna_h2 |
---|---|---|---|---|---|---|---|---|
species_A | 20000 | 20000 | 20000 | 2000 | 10000 | 10000 | 2000000 | 2000000 |
species_B | 1000 | 1000 | 1000 | 1000 | 5000000 | 5000000 | 10000 | 10000 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
What i want to test is whether individual species change their transcriptional activity. So something like: "Is species A pumping out more mRNA per individual cell under condition high?". To that end, I'm using the likelihood ratio test.
dds_raw <- DESeqDataSetFromMatrix(countData=count, colData=col, design = ~ condition + type + condition:type)
dds <- DESeq(dds_raw, test="LRT", reduced= ~ condition + type)
results_lrt <- results(dds)
I then interpret low p-values to mean that the activity of the respective species has changed, since the differences between samples cannot be sufficiently explained without taking into account whether we are looking at genome- or transcriptome-abundance.
Is this the correct way of using DESeq2 to find out what i want to know? Is it even valid to use the package like that? It seems like I have strayed pretty far from the original purpose of the software..
Thank you for any input,
Tom
Thank you for your answer and especially for pointing me into the direction of zinbwave and the issue of zero inflation.
I will look into how i may take that into account. Regarding my initial question: i don't expect any zeros in the count matrix at all, given that reads are counted for whole genomes instead of individual genes. But it might be very relevant to the other analyses that i work on.
I have found publications about zero inflation in scRNA-seq and have started reading those. However, so far i couldn't find any discussion of zero inflation with regards to metatranscriptomics. If anybody can provide a link that would be appreciated. I assume this is out of the scope of this site, though, so i will mark the thread as answered.
Thanks again and have a nice day,
Tom
"...reads are counted for whole genomes instead of individual genes"
Oh I see. I didn't catch that. You may have no problem then with the setup above.