Question: Correct usage of tximport counts in edgeR without offset matrix.
0
5 months ago by
ttekath0 wrote:

Hi everyone,

first I want to thank the authors of tximport and edgeR for their very informative vignettes – these are super helpful.
Nonetheless I have two small questions still bugging me:

1. If I am using the method “bias corrected counts without an offset” from the tximport vignette to get my tximport counts into edgeR: would I still need to run edgeR's calcNormFactors() method?
As far as I understand it should not be necessary, because using countsFromAbundance = "lengthScaledTPM" in tximport has already corrected for library size differences, correct?

Here is an exemplified rundown:

txi <- tximport::tximport(files = files, type = "salmon",
                          tx2gene = tx2gene,
                          countsFromAbundance = "lengthScaledTPM")
dge <- edgeR::DGEList(counts = txi$counts, group = grouping_variable)
design <- model.matrix(~ batch_variable + group, data = dge$samples)

# filter lowly expressed genes
keep <- edgeR::filterByExpr(dge, design)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# necessary?
dge <- edgeR::calcNormFactors(dge, method = "TMM")

dge <- edgeR::estimateDisp(dge, design, robust = TRUE)
# ...


2. Is one of the methods “bias corrected counts without an offset” and “original counts and offset” recommended over the other? Because the cpm() method of edgeR does not take the offset matrix into account, it is much easier to get log-transformed CPM for plotting (e.g. heatmaps) without the offset approach.

edger rna-seq tximport
modified 5 months ago by James W. MacDonald • written 5 months ago by ttekath0
Answer: Correct usage of tximport counts in edgeR without offset matrix.
2
5 months ago by
James W. MacDonald wrote:
1. Yes. There is a difference between scaling counts by relative transcript abundance (where you are accounting for the fact that a sample with predominantly shorter transcripts should have fewer counts for a gene than a sample with longer transcripts for that gene, all things equal) and generating an offset to account for differences in library size. If you look at the vignette where the counts are used directly, you will note that normalization factors are still computed.
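To make the point concrete: even with length-scaled counts, TMM normalization factors are still computed on the DGEList. A minimal sketch (the simulated matrix below is only a stand-in for txi$counts; edgeR must be installed):

```r
library(edgeR)

# Simulated stand-in for txi$counts from tximport with
# countsFromAbundance = "lengthScaledTPM"
set.seed(1)
counts <- matrix(rpois(6 * 100, lambda = 50), ncol = 6,
                 dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))

dge <- DGEList(counts = counts)
# TMM factors are still computed on the length-scaled counts;
# they capture composition/library-size differences between samples
dge <- calcNormFactors(dge, method = "TMM")
dge$samples$norm.factors  # one factor per sample
```

edgeR adjusts the factors so that they multiply to one across samples, so they act as relative corrections on top of the raw library sizes.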
2. This question, to me, is asking about orthogonal things. What you use for modeling and what you plot are (in this case) not the same thing, regardless (you aren't using counts for your heatmap are you?), so what does it matter? If you think the length-scaled TPM data will make a more interpretable heatmap, then have at it. I would be surprised if the difference in color gradations of a heatmap would really be noticeable, so personally I would put this in the list of things I don't really worry about, but I may have a much longer list of such things than other people do.
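For the plotting side of question 2, a common route is cpm() with log = TRUE, which returns normalized log2-CPM suitable for heatmaps regardless of which count matrix was used for modeling. A sketch on simulated data (the matrix below is a placeholder, not real data):

```r
library(edgeR)

# Placeholder count matrix standing in for tximport output
set.seed(1)
counts <- matrix(rpois(6 * 100, lambda = 50), ncol = 6,
                 dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))
dge <- calcNormFactors(DGEList(counts))

# log2-CPM; prior.count damps the variability of low counts
logcpm <- cpm(dge, log = TRUE, prior.count = 2)
# e.g. heatmap(logcpm[1:20, ])  # plot a subset of genes, not raw counts
```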
1

Agree with James:

1) Yes, you need to calculate normalization factors for both count matrices (note that the column sums of both of these matrices equal the column sums of the estimated counts, so they still contain library size differences).

2) I very slightly prefer original counts with offset, for the same reason that we don't apply library size normalization directly to counts. Arguably though the differences in counts are slight, because often the changes in average transcript length induced by DTU are slight. So it probably doesn't matter. Counts-from-abundance are convenient to produce and work with as not all methods can incorporate an offset matrix. It's basically original counts with DTU-induced differences in average transcript length regressed out.
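For comparison, the "original counts and offset" route follows the pattern in the tximport vignette: build an offset matrix from the average transcript lengths and effective library sizes, then attach it with scaleOffset(). A self-contained sketch (the simulated txi list below stands in for a real tximport result with countsFromAbundance = "no"):

```r
library(edgeR)

# Simulated stand-in for a tximport result:
# txi$counts = estimated counts, txi$length = average transcript lengths
set.seed(1)
txi <- list(
  counts = matrix(rpois(6 * 100, lambda = 50), ncol = 6),
  length = matrix(runif(6 * 100, 500, 3000), ncol = 6)
)

cts <- txi$counts
normMat <- txi$length
# Scale lengths to have geometric mean 1 per gene, then fold in
# effective library sizes, as in the tximport vignette
normMat <- normMat / exp(rowMeans(log(normMat)))
normCts <- cts / normMat
eff.lib <- calcNormFactors(normCts) * colSums(normCts)
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)

y <- DGEList(cts)
y <- scaleOffset(y, normMat)  # counts stay original; offsets carry length + library size
```

Here the counts themselves are untouched, and all the length and library-size corrections live in the offset matrix used during model fitting.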