I've been using DESeq2 for differential expression analysis of microbial (meta)transcriptomic datasets and have been very happy with its performance. I've started to overlay pathway analyses onto these differential expression results to identify functional groupings of genes (via KEGG or SEED) that are over- or under-represented in the DE gene sets. In parallel, I'd also like to take a dataset, order the genes from most to least expressed, and look for enrichment of certain functional groupings among the most highly expressed genes. My question is whether it makes sense to normalize, specifically via DESeq2 size factors or the rlog or vst transformations, before ordering the genes from highest to lowest expression.
I'm aware of the value of these normalization strategies for preparing datasets for differential expression analyses but would greatly appreciate an opinion on whether these are also appropriate methods for preparing a transcriptional dataset for the types of analysis I described.
Mike,
Thanks. On further consideration, I agree that TPM is the appropriate measure for comparing different genes within a given library. As for your recommendation of the various software packages and tximport, you don't mean that you are importing TPM as the primary data type for DE analysis, right? TPM would be an alternative treatment of the counts, used for analyses other than DE calling, wouldn't it?
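To make the within-library comparison concrete, here is a minimal Python sketch of how TPM could be computed from counts and effective gene lengths, and how the resulting ranking can differ from a ranking on raw counts. The gene names, counts, and lengths are made up for illustration, and `tpm` is a hypothetical helper, not a function from any of the packages discussed.

```python
def tpm(counts, eff_lengths):
    """Convert per-gene counts from ONE library to transcripts per million."""
    # Reads per base: divide each count by that gene's effective length.
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    # Rescale so the values in this library sum to one million.
    return [r / total * 1e6 for r in rates]


# Hypothetical toy data (not from a real dataset).
genes = ["geneA", "geneB", "geneC"]
counts = [100, 300, 600]
eff_lengths = [500, 3000, 1500]

values = tpm(counts, eff_lengths)

# Rank genes from most to least expressed, by TPM and by raw count.
ranked_by_tpm = [g for g, _ in sorted(zip(genes, values),
                                      key=lambda gv: gv[1], reverse=True)]
ranked_by_count = [g for g, _ in sorted(zip(genes, counts),
                                        key=lambda gc: gc[1], reverse=True)]

print(ranked_by_tpm)
print(ranked_by_count)
```

In this toy example geneB has more reads than geneA but is six times longer, so the two swap places once length is accounted for. That is the reason a length-aware measure seems preferable to size-factor-scaled counts when ordering genes within a single library.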
Take a look at the tximport vignette and the associated citation for details. In short, if you use the suggested code I've laid out there (you can also find it in the DESeq2 vignette), DESeq2 will use the estimated fragment *counts* summarized to the gene level. Internally, it then computes normalization factors for those counts, which account for technical biases as well as potential changes in average transcript length per gene across samples. So it's still a count-based method, and DESeq2 will round the incoming estimated counts to integers, which are stored in counts(dds). The user-facing part is just: point tximport to the quantification files, then use DESeqDataSetFromTximport instead of one of the other constructor functions.
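As a rough illustration of what "estimated counts summarized to the gene level" and "average transcript length per gene" mean for a single sample, here is a simplified Python sketch. It is not the tximport or DESeq2 code (those are R packages), and it folds the rounding step into the same function for brevity. The function name, the input layout, and the toy numbers are all assumptions made for the example.

```python
from collections import defaultdict


def summarize_to_gene(tx_rows, tx2gene):
    """Collapse transcript-level quantifications for ONE sample to gene level.

    tx_rows: iterable of (transcript_id, est_count, tpm, eff_length)
    tx2gene: dict mapping transcript_id -> gene_id
    """
    count_sum = defaultdict(float)
    tpm_sum = defaultdict(float)
    weighted_len = defaultdict(float)
    for tx_id, est_count, tpm, eff_len in tx_rows:
        gene = tx2gene[tx_id]
        count_sum[gene] += est_count         # gene count = sum over transcripts
        tpm_sum[gene] += tpm
        weighted_len[gene] += tpm * eff_len  # numerator of the weighted mean

    result = {}
    for gene in count_sum:
        # Abundance-weighted average transcript length; left undefined here
        # when the gene has zero abundance (a simplification of this sketch).
        avg_len = weighted_len[gene] / tpm_sum[gene] if tpm_sum[gene] > 0 else None
        result[gene] = {
            "count": round(count_sum[gene]),  # integer count, as a count model expects
            "avg_length": avg_len,
        }
    return result


# Hypothetical toy input: two transcripts of geneX, one of geneY.
tx_rows = [
    ("tx1", 10.4, 20.0, 1000.0),
    ("tx2", 30.3, 60.0, 2000.0),
    ("tx3", 5.2, 20.0, 500.0),
]
tx2gene = {"tx1": "geneX", "tx2": "geneX", "tx3": "geneY"}

gene_level = summarize_to_gene(tx_rows, tx2gene)
print(gene_level)
```

The point of carrying the average length alongside the count is that a shift in which transcripts of a gene are expressed changes the expected count even when the gene's overall expression does not. The per-sample average length is what lets the model correct for that while still working on counts rather than on TPM.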