Question

DESeq2 normalization prior to identification of highly-expressed functional categories

Matt • 0

@matt-12117

Last seen 6.9 years ago

I've been using DESeq2 for differential expression analysis of microbial (meta)transcriptomic datasets and have been very happy with its performance. I've started to overlay pathway analyses onto these differential expression results to identify functional groupings of genes (via KEGG or SEED) that are over- or under-represented in these DE gene sets. In parallel, I'd also like to be able to take a dataset, order the genes from most- to least-expressed, and look for enrichment of certain functional groupings in the most highly-expressed genes in a given dataset. My question is whether it makes sense to normalize, specifically via a DESeq2-performed size factor, rlog, or vst normalization, prior to ordering the genes from greatest to least expression?

I'm aware of the value of these normalization strategies for preparing datasets for differential expression analyses but would greatly appreciate an opinion on whether these are also appropriate methods for preparing a transcriptional dataset for the types of analysis I described.

deseq2 pathway analysis order normalization • 1.7k views

ADD COMMENT • link updated 8.2 years ago by Michael Love 43k • written 8.2 years ago by Matt • 0

score 0 · Answer 1 · 2017-01-06

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

The gene counts don't tell you about the order of expression across genes. For that you need to estimate a quantity like TPM, where ideally there is normalization for transcript length, fragment length distribution, and various other sample-specific biases. You can estimate TPMs very quickly with software like Salmon, Sailfish, or kallisto, and then import these into R with the tximport package. These are also my preferred way to generate count matrices for DESeq2, as we mention in the current version of the vignette and workflow.
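As a rough illustration of why raw counts cannot simply be ranked across genes, here is a minimal Python sketch of the TPM idea: divide each gene's count by its effective length, then rescale so the values sum to one million within the sample. The gene names, counts, and effective lengths are made up, and the sketch ignores the fragment-length and bias modelling that Salmon, Sailfish, or kallisto perform, so it is not a substitute for their estimates.

```python
def tpm(counts, eff_lengths):
    """Convert per-gene counts to TPM using per-gene effective lengths (bp)."""
    # Fragments per base: removes the advantage long genes have in raw counts.
    rates = {g: counts[g] / eff_lengths[g] for g in counts}
    total = sum(rates.values())
    # Rescale so the values sum to one million within this sample.
    return {g: r / total * 1e6 for g, r in rates.items()}


# Hypothetical toy data for a single sample.
counts = {"geneA": 1000, "geneB": 1000, "geneC": 300}
eff_lengths = {"geneA": 4000, "geneB": 500, "geneC": 100}

tpms = tpm(counts, eff_lengths)
ranked = sorted(tpms, key=tpms.get, reverse=True)
print(ranked)  # ['geneC', 'geneB', 'geneA']
```

In this toy case geneC has the fewest counts but ranks first once length is accounted for, while geneA and geneB, tied on counts, end up far apart.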

ADD COMMENT • link 8.2 years ago Michael Love 43k

Mike,

Thanks. On further consideration, I agree that TPM is the appropriate measure for comparing different genes within a given library. As for your recommendation of those software packages and tximport: you don't mean that TPM is imported as the primary data type for DE analysis, right? TPM would be an alternative treatment of the counts, used for analyses other than DE calling, wouldn't it?

ADD REPLY • link 8.2 years ago Matt • 0

Take a look at the tximport vignette and the associated citation for details. In short, if you use the suggested code I've laid out there (you can also find it in the DESeq2 vignette), DESeq2 will use the estimated fragment *counts* summarized to the gene level, and then internally it computes normalization factors for those counts which account for technical biases as well as potential changes in average transcript length per gene across samples. So it's still a count-based method, and DESeq2 will round the incoming estimated counts to integers, which are stored in counts(dds). The user-facing part is just: point tximport to the quantification files, then use DESeqDataSetFromTximport instead of the other constructor functions.
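To make the point about changes in average transcript length per gene across samples more concrete, here is a toy Python sketch. It assumes the gene-level length in each sample is taken as the abundance-weighted mean of that gene's transcript lengths; the transcript names and numbers are invented, and this is only an illustration of the idea, not the tximport or DESeq2 code.

```python
def avg_tx_length(tx_abundance, tx_len):
    """Abundance-weighted mean transcript length for one gene in one sample."""
    total = sum(tx_abundance.values())
    return sum(tx_abundance[t] * tx_len[t] for t in tx_abundance) / total


# Hypothetical gene with a long and a short isoform (lengths in bp).
tx_len = {"tx_long": 3000, "tx_short": 1000}

# Same gene in two samples with different isoform usage (abundances, e.g. TPM).
sample1 = {"tx_long": 90.0, "tx_short": 10.0}
sample2 = {"tx_long": 10.0, "tx_short": 90.0}

len1 = avg_tx_length(sample1, tx_len)
len2 = avg_tx_length(sample2, tx_len)
print(len1, len2)  # 2800.0 1200.0
```

At equal molar abundance of the gene, sample 1 would be expected to yield more fragments than sample 2 simply because its transcripts are longer, which is the kind of effect a per-gene, per-sample length adjustment is meant to absorb.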

ADD REPLY • link 8.2 years ago Michael Love 43k

Thanks again for the clarification. Using an external package and tximport doesn't fit our current workflow but may as we evolve our analysis strategy. As for your previous recommendations for using TPM to compare genes your advice is well-taken.

ADD REPLY • link 8.2 years ago Matt • 0