The tximport vignette shows two ways to import results for differential expression analysis at the gene level.
The first approach (used with DESeq2 and edgeR) is to take the estimated counts from your quantification tool of choice and have DESeq2 (or edgeR) use an offset that accounts for changes in average transcript length across samples.
The second way is to import with countsFromAbundance="lengthScaledTPM" directly (no offsets are used here), which gives counts that are suitable for use by voom. Presumably these could also be used in the DESeq2 and edgeR world, though that would be suboptimal (right?).
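To make the second option concrete, here is a minimal sketch of the arithmetic I understand lengthScaledTPM to perform: scale each gene's abundance (TPM) by that gene's average length across samples, then rescale every sample so its column sum matches the original estimated-count library size. It is written in plain Python so that it stands alone; the function name length_scaled_tpm and the toy matrices are made up for illustration, and this is not tximport's actual code.

```python
def length_scaled_tpm(abundance, avg_length, counts):
    """Sketch of length-scaled TPM counts.

    All inputs are lists of rows (genes) by columns (samples):
    abundance holds TPM, avg_length the per-sample average transcript
    length of each gene, counts the original estimated counts.
    """
    n_genes = len(abundance)
    n_samples = len(abundance[0])

    # Average each gene's length across samples, so the scaling factor
    # for a gene is the same in every sample.
    mean_len = [sum(row) / n_samples for row in avg_length]

    # Scale abundance by the across-sample mean length.
    scaled = [[abundance[g][s] * mean_len[g] for s in range(n_samples)]
              for g in range(n_genes)]

    # Rescale each sample so its total equals the original library size.
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        target = sum(counts[g][s] for g in range(n_genes))
        current = sum(scaled[g][s] for g in range(n_genes))
        for g in range(n_genes):
            result[g][s] = scaled[g][s] * target / current
    return result


# Toy data (hypothetical): 2 genes x 2 samples.
abundance = [[10.0, 20.0], [30.0, 20.0]]
avg_length = [[1000.0, 2000.0], [500.0, 500.0]]
counts = [[100.0, 300.0], [200.0, 100.0]]

scaled_counts = length_scaled_tpm(abundance, avg_length, counts)
```

Because the per-sample length differences are folded into the values themselves, no separate offset matrix is needed downstream, which is why this form is convenient for voom.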
I'm curious about which version of the imported data to use in the "exploratory data analysis" phase. For instance, in DESeq2, one would either vst or rlog the data first before doing clustering, etc., and in the edgeR world, you'd simply run cpm with a higher prior.count.
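For reference, this is a rough sketch of the log-CPM calculation with a prior count as I understand it: the prior is scaled by each sample's library size relative to the mean, added to the counts, and twice that amount is added to the library size. The function name log_cpm and the toy counts are hypothetical, and this plain-Python version only mimics the arithmetic, it is not edgeR's implementation. Note that only counts and library sizes enter the formula; there is no transcript length term.

```python
import math


def log_cpm(counts, prior_count=2.0):
    """Sketch of log2 counts-per-million with a prior count.

    counts is a list of rows (genes) by columns (samples).
    """
    n_samples = len(counts[0])

    # Library size = column sum of the counts.
    lib_size = [sum(row[s] for row in counts) for s in range(n_samples)]
    mean_lib = sum(lib_size) / n_samples

    # The prior is proportional to library size, so larger libraries
    # receive a proportionally larger pseudo-count.
    prior_scaled = [prior_count * size / mean_lib for size in lib_size]
    adj_lib = [size + 2.0 * p for size, p in zip(lib_size, prior_scaled)]

    return [[math.log2((row[s] + prior_scaled[s]) / adj_lib[s] * 1e6)
             for s in range(n_samples)]
            for row in counts]


# Toy data (hypothetical): gene 1 is absent in sample 1, low in sample 2.
toy_counts = [[0.0, 10.0], [1000.0, 990.0]]

low_prior = log_cpm(toy_counts, prior_count=0.5)
high_prior = log_cpm(toy_counts, prior_count=5.0)

# Log-fold difference for the low-count gene under each prior.
diff_low_prior = low_prior[0][1] - low_prior[0][0]
diff_high_prior = high_prior[0][1] - high_prior[0][0]
```

With the higher prior, the apparent difference for the low-count gene shrinks, which is the reason to raise prior.count before clustering.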
In these situations, is it better to use the data as you would have imported it for use with voom? I see that in different places in the DESeq2 codebase (fpm and fpkm, for instance) there is support for an avgTxLength matrix, but calls to vst and rlog seem to interact directly with the counts function, which doesn't look at this transcript length.
Similarly, edgeR's cpm function doesn't use offset information ...
My guess is that, technically, I should use the data as I would import it for voom in this case, although in practice it would most likely make little difference during EDA either way. Still, I thought I'd get input from the experts.