Question

Gene level exploratory data analysis from tximport results

Steve Lianoglou ★ 13k

The tximport vignette shows two ways to import results for differential expression analysis at the gene level.

The first approach (used with DESeq2 and edgeR) is to use the estimated counts from your quantification tool of choice, and let (or set) DESeq2 (or edgeR) use an offset to account for changes in average transcript length across samples.

The second way is to import with countsFromAbundance="lengthScaledTPM" directly (no offsets used here), which is then suitable for use by voom. Presumably these can also be used (but suboptimal) in the DESeq2 and edgeR world (right?).

I'm curious about which version of the imported data to use in the "exploratory data analysis" phase. For instance, in DESeq2 one would apply vst or rlog to the data before clustering, etc., and in the edgeR world you'd simply run cpm with a higher prior.count.

In these situations, is it better to use the data as you would have imported it for use with voom? I see that in different places in the DESeq2 codebase (fpm and fpkm, for instance) there is support for an avgTxLength matrix, but calls to vst and rlog seem to interact directly with the counts function, which doesn't look at this transcript length.

Similarly, edgeR's cpm function doesn't use offset information ...

My guess is that I should technically use the data as I would import it for voom in this case, although in practice it would most likely make little difference during EDA either way. I thought I'd get input from the experts.

tximport edgeR DESeq2 • 2.9k views

updated 8.5 years ago by Michael Love 43k • written 8.5 years ago by Steve Lianoglou ★ 13k

score 2 · Accepted Answer · 2016-06-20

"presumably these can also be used (but suboptimal) in the DESeq2 and edgeR world (right?)"

I like the idea of counts and offsets, but I think it would be hard to find a big performance difference. In Fig 3, there's not much of a difference between the offset and scaledTPM approach. I slightly prefer "lengthScaledTPM" over "scaledTPM" because the former creates count-scale quantities that should be closer to the true (unobservable) counts. The difference is that lengthScaledTPM multiplies by feature length before scaling to library size, while scaledTPM just scales TPM to library size. Charlotte had some results on this but I don't believe they made it into the F1000R article.
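To make the difference between the two scalings concrete, here is a minimal Python sketch of the arithmetic as described above. It is not tximport's code: the function names and the toy numbers are made up for illustration, and it assumes the "feature length" is each gene's average transcript length, averaged across samples.

```python
def column_sums(matrix):
    """Sum each column (sample) of a genes x samples list-of-lists."""
    return [sum(col) for col in zip(*matrix)]


def rescale_to_library_size(values, counts):
    """Scale each column of `values` so it sums to the column sum of `counts`."""
    target = column_sums(counts)
    current = column_sums(values)
    return [[v * t / c for v, t, c in zip(row, target, current)] for row in values]


def scaled_tpm(tpm, counts):
    """TPM scaled up to library size, with no length information."""
    return rescale_to_library_size(tpm, counts)


def length_scaled_tpm(tpm, lengths, counts):
    """TPM multiplied by the per-gene mean length, then scaled to library size."""
    mean_len = [sum(row) / len(row) for row in lengths]  # assumption: mean over samples
    weighted = [[v * m for v in row] for row, m in zip(tpm, mean_len)]
    return rescale_to_library_size(weighted, counts)


# Toy data (hypothetical): two genes, two samples.
# Both genes have the same TPM, but gene B is three times longer,
# so it attracts three times as many fragments.
tpm = [[500000.0, 500000.0], [500000.0, 500000.0]]
lengths = [[1000.0, 1000.0], [3000.0, 3000.0]]
counts = [[100.0, 200.0], [300.0, 600.0]]

stpm = scaled_tpm(tpm, counts)
lstpm = length_scaled_tpm(tpm, lengths, counts)
```

In this toy case both results sum to the library size in each sample, but only the length-scaled version puts the longer gene back on the scale of its estimated counts, which is the sense in which it is "count-scale".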

"but calls to vst and rlog seem to interact directly with the counts function, which doesn't look at this transcript length"

Calls to vst() and rlog() use counts(dds, normalized=TRUE), and if you have run estimateSizeFactors() after DESeqDataSetFromTximport(), the avgTxLength matrix has been used to create normalization factors. So avgTxLength will be incorporated into the normalization when calling vst() and rlog().
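As a rough illustration of how an average-length matrix can feed into gene-by-sample normalization factors, here is a small Python sketch. It is an assumption-laden toy, not DESeq2's implementation: it computes median-of-ratios size factors on length-corrected counts, multiplies them by the lengths, and centres each gene's factors to a geometric mean of 1. All function names and numbers are hypothetical.

```python
import math
from statistics import median


def size_factors(matrix):
    """Median-of-ratios size factors, one per column (sample)."""
    # Use only rows where every value is positive, so the logs are defined.
    rows = [r for r in matrix if all(v > 0 for v in r)]
    log_geo_means = [sum(math.log(v) for v in r) / len(r) for r in rows]
    n_samples = len(matrix[0])
    return [
        math.exp(median(math.log(r[j]) - g for r, g in zip(rows, log_geo_means)))
        for j in range(n_samples)
    ]


def normalization_factors(counts, avg_tx_length):
    """Gene x sample factors combining library size and transcript length."""
    corrected = [[c / l for c, l in zip(cr, lr)] for cr, lr in zip(counts, avg_tx_length)]
    sf = size_factors(corrected)
    centred = []
    for lengths_row in avg_tx_length:
        row = [l * s for l, s in zip(lengths_row, sf)]
        geo_mean = math.exp(sum(math.log(v) for v in row) / len(row))
        centred.append([v / geo_mean for v in row])  # row geometric mean becomes 1
    return centred


def normalized_counts(counts, factors):
    """Divide each count by its gene- and sample-specific factor."""
    return [[c / f for c, f in zip(cr, fr)] for cr, fr in zip(counts, factors)]


# Toy data (hypothetical): gene 1 doubles in count only because its
# average transcript length doubles in sample 2.
counts = [[100.0, 200.0], [300.0, 300.0], [50.0, 50.0]]
avg_tx_length = [[1000.0, 2000.0], [1500.0, 1500.0], [500.0, 500.0]]

nf = normalization_factors(counts, avg_tx_length)
norm = normalized_counts(counts, nf)
```

Here the length-driven change in gene 1 disappears after normalization, so a transformation applied to the normalized counts would not mistake it for a change in expression.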