Hi there, Michael,
First, thanks for the great software and the excellent documentation you all have provided. My question is more theory-based: I'm interested in characterizing a mixed community as a whole. I've co-assembled metatranscriptomes, called genes, and used those coding sequences as my reference library when mapping individual samples to create my count matrix, which I then normalized to reads per kilobase (RPK) using the length of each coding sequence. Not surprisingly, many of the coding sequences are orthologs. After annotating, with KEGG Orthologs for instance, I've been considering collapsing all coding sequences within a specific KEGG Ortholog (i.e., summing the RPK values for all CDS within an ortholog into one row) and then running DESeq2 on that table. I've spent a fair amount of time getting familiar with your manual and thinking this over, and it seems plausible to me (when the intent is to assess the mixed community as a whole). Do you have any thoughts on this, or are there required assumptions I'd be blatantly violating that make this a poor idea?
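Just to make the collapsing step concrete, here's a rough Python sketch of what I mean (all CDS IDs, KO IDs, and counts are made up for illustration; this just shows the row-summing per ortholog, not any normalization):

```python
from collections import defaultdict

# Hypothetical CDS-level count matrix: CDS id -> counts across 3 samples.
cds_counts = {
    "cds_001": [10, 0, 4],
    "cds_002": [3, 7, 1],
    "cds_003": [0, 2, 9],
}

# Hypothetical CDS -> KEGG Ortholog annotation.
cds_to_ko = {"cds_001": "K00001", "cds_002": "K00001", "cds_003": "K00845"}

# Collapse: one row per KO, summing the rows of its member CDS.
ko_counts = defaultdict(lambda: [0, 0, 0])
for cds, counts in cds_counts.items():
    ko = cds_to_ko[cds]
    for j, c in enumerate(counts):
        ko_counts[ko][j] += c

print(dict(ko_counts))  # {'K00001': [13, 7, 5], 'K00845': [0, 2, 9]}
```

(The sketch sums whatever values are in the matrix; whether those should be raw counts or length-normalized RPK values is exactly my question.)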
Happy Sunday,
Mike
Thanks for the quick response, and sorry for the confusion; I'm sure you know what it's like trying to succinctly convey a process you've been entrenched in for some time, ha. We don't have to worry too much about the rest, but just to be clear on the reads-per-kb component:
The coding sequences I recruited my reads to are assembled from metatranscriptomes, so in theory they represent full transcripts built de novo. I thought normalizing to these coding-sequence lengths (i.e., converting my raw counts to reads per kb) was comparable to, or the same as, inputting transcript abundances rather than raw counts, which section 1.3.4 of the DESeq2 package PDF discusses (i.e., importing transcript abundance from something like Sailfish). Does that part make sense?
Thanks again
"I thought normalizing to these coding sequence lengths (as in converting my raw counts to reads per kb) was comparable to, or the same as, inputting a transcript abundance rather than raw counts"
No, the tximport-DESeq2 pipeline is definitely not feeding normalized counts to DESeq2.
We have this in the vignette:
"Full details on the motivation and methods for importing transcript level abundance and count estimates, summarizing to gene-level count matrices and producing an offset which corrects for potential changes in average transcript length across samples are described in [8]."
The transcript abundance import is handled by the tximport package, with a hand-off through the DESeqDataSetFromTximport() constructor, which puts the necessary information where it will be picked up during the standard DESeq() steps.
In short: tximport reads in estimated counts, abundances, and effective lengths per transcript. On the DESeq2 side, inference is, as always, on the gene-level un-normalized counts. The per-transcript effective lengths and abundances are collapsed into a per-gene average transcript length, which is converted into a matrix of normalization factors; size factors (for sequencing depth) are then estimated taking the changes in average transcript length across samples into account, and the normalization factors are updated to incorporate those size factors. But the important thing is that the inference is not on normalized counts.
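To make the offset idea concrete, here is a rough numeric sketch of the average-transcript-length construction (illustrative numbers and plain Python only, not the actual tximport/DESeq2 code; the real pipeline also folds in the sequencing-depth size factors, which this sketch omits):

```python
import math

# One hypothetical gene with two transcripts, across three samples.
# Per-transcript abundances (e.g. TPM) and effective lengths:
abundance = [[10.0, 2.0, 5.0],   # transcript 1
             [ 1.0, 8.0, 5.0]]   # transcript 2
eff_len   = [[1500.0, 1500.0, 1500.0],
             [ 500.0,  500.0,  500.0]]
n_samples = 3

# 1) Collapse to a per-gene average transcript length per sample,
#    weighting each transcript's length by its abundance.
avg_tx_len = []
for j in range(n_samples):
    num = sum(abundance[t][j] * eff_len[t][j] for t in range(2))
    den = sum(abundance[t][j] for t in range(2))
    avg_tx_len.append(num / den)

# 2) Center by the geometric mean across samples, so the resulting
#    factors only reflect *changes* in average transcript length.
log_geo_mean = sum(math.log(x) for x in avg_tx_len) / n_samples
norm_factors = [x / math.exp(log_geo_mean) for x in avg_tx_len]

# The raw counts themselves are never divided by anything; these
# factors enter the GLM as offsets.
print(avg_tx_len)    # sample 1 skews long, sample 2 skews short
print(norm_factors)  # centered around 1; their product is 1
```

The point of the sketch: isoform switching changes the average transcript length per gene per sample, and that change is absorbed into the offset rather than into the counts.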
Ahh, thank you. I started doing this with just linear models, but then thought I'd get creative and try this hybrid approach with your software. But apparently the very foundation of my thought process was misguided, so you just saved me from a potentially huge waste of time. Thanks for clearing that up!