Hi there, Michael,
First, thanks for the great software and the excellent documentation you all have provided. My question is more theory-based: I'm interested in characterizing a mixed community as a whole. I've co-assembled metatranscriptomes, called genes, and used those coding sequences as my reference library when mapping individual samples to create my count matrix, which I then normalized to reads per kilobase (RPK) using the length of each coding sequence. Not surprisingly, many of the coding sequences are orthologs. After annotating, with KEGG Orthologs for instance, I've been considering collapsing all coding sequences within a specific KEGG Ortholog (i.e., summing the RPK values for all CDS within an ortholog into one row) and then running DESeq2 on that table. I've spent a fair amount of time getting familiar with your manual and thinking this over, and it seems plausible to me (when the intent is to assess the mixed community as a whole). Do you have any thoughts on this, or are there required assumptions I'd be blatantly violating that make this a poor idea?
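Just to make the collapsing step concrete, here's a rough Python sketch of what I mean (all CDS IDs, KO IDs, and counts are made up for illustration; this just shows the row-summing per ortholog, not any normalization):

```python
from collections import defaultdict

# Hypothetical CDS-level count matrix: CDS id -> counts across 3 samples.
cds_counts = {
    "cds_001": [10, 0, 4],
    "cds_002": [3, 7, 1],
    "cds_003": [0, 2, 9],
}

# Hypothetical CDS -> KEGG Ortholog annotation.
cds_to_ko = {"cds_001": "K00001", "cds_002": "K00001", "cds_003": "K00845"}

# Collapse: one row per KO, summing the rows of its member CDS.
ko_counts = defaultdict(lambda: [0, 0, 0])
for cds, counts in cds_counts.items():
    ko = cds_to_ko[cds]
    for j, c in enumerate(counts):
        ko_counts[ko][j] += c

print(dict(ko_counts))  # {'K00001': [13, 7, 5], 'K00845': [0, 2, 9]}
```

(The sketch sums whatever values are in the matrix; whether those should be raw counts or length-normalized RPK values is exactly my question.)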
Happy Sunday,
Mike
Thanks for the quick response, and sorry for the confusion; I'm sure you know what it's like trying to succinctly convey a process you've been entrenched in for some time, ha. We don't have to worry too much about the rest, but just to be clear on the reads-per-kb component:
The coding sequences I recruited my reads to are assembled from metatranscriptomes, so in theory they represent full transcripts built de novo. I thought normalizing to these coding-sequence lengths (i.e., converting my raw counts to reads per kb) was comparable to, or the same as, inputting transcript abundances rather than raw counts, which section 1.3.4 of the DESeq2 package PDF discusses (i.e., importing transcript abundance from something like Sailfish). Does that part make sense?
Thanks again
"I thought normalizing to these coding sequence lengths (as in converting my raw counts to reads per kb) was comparable to, or the same as, inputting a transcript abundance rather than raw counts"
No, the tximport-DESeq2 pipeline is definitely not feeding normalized counts to DESeq2.
We have this in the vignette:
"Full details on the motivation and methods for importing transcript level abundance and count estimates, summarizing to gene-level count matrices and producing an offset which corrects for potential changes in average transcript length across samples are described in [8]."
The transcript abundance import is handled by the tximport package, with a hand-off through the DESeqDataSetFromTximport() constructor, which puts the necessary information where it will be picked up during the standard DESeq() steps.
In short: tximport reads in estimated counts, abundances, and effective lengths per transcript. On the DESeq2 side, inference is, as always, on the gene-level un-normalized counts. The per-transcript effective lengths and abundances are collapsed into a per-gene average transcript length, which is converted into a matrix of normalization factors; size factors (for sequencing depth) are then estimated taking the changes in average transcript length across samples into account, and the normalization factors are updated to incorporate those size factors. But the important thing is that the inference is not on normalized counts.
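To make the offset idea concrete, here is a rough numeric sketch of the average-transcript-length construction (illustrative numbers and plain Python only, not the actual tximport/DESeq2 code; the real pipeline also folds in the sequencing-depth size factors, which this sketch omits):

```python
import math

# One hypothetical gene with two transcripts, across three samples.
# Per-transcript abundances (e.g. TPM) and effective lengths:
abundance = [[10.0, 2.0, 5.0],   # transcript 1
             [ 1.0, 8.0, 5.0]]   # transcript 2
eff_len   = [[1500.0, 1500.0, 1500.0],
             [ 500.0,  500.0,  500.0]]
n_samples = 3

# 1) Collapse to a per-gene average transcript length per sample,
#    weighting each transcript's length by its abundance.
avg_tx_len = []
for j in range(n_samples):
    num = sum(abundance[t][j] * eff_len[t][j] for t in range(2))
    den = sum(abundance[t][j] for t in range(2))
    avg_tx_len.append(num / den)

# 2) Center by the geometric mean across samples, so the resulting
#    factors only reflect *changes* in average transcript length.
log_geo_mean = sum(math.log(x) for x in avg_tx_len) / n_samples
norm_factors = [x / math.exp(log_geo_mean) for x in avg_tx_len]

# The raw counts themselves are never divided by anything; these
# factors enter the GLM as offsets.
print(avg_tx_len)    # sample 1 skews long, sample 2 skews short
print(norm_factors)  # centered around 1; their product is 1
```

The point of the sketch: isoform switching changes the average transcript length per gene per sample, and that change is absorbed into the offset rather than into the counts.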
Ahh, thank you. I started doing this with just linear models, but then thought I'd get creative and try this hybrid approach with your software. But apparently the very foundation of my thought process was misguided, so you just saved me from a potentially huge waste of time. Thanks for clearing that up!