Question

Trouble with Tximport for edgeR

Xiang Wang • 0

I have several questions about tximport results used for edgeR.

According to the tximport vignette, the ideal method for edgeR analysis is to provide the estimated counts from the default setting (countsFromAbundance="no") together with an offset that corrects for changes in the average transcript length across samples. The vignette's example of creating a DGEList for use with edgeR is as follows:

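library(edgeR)
cts <- txi$counts
normMat <- txi$length
# Scale each gene's lengths by their geometric mean across samples
normMat <- normMat/exp(rowMeans(log(normMat)))
# Log effective library sizes computed from the length-corrected counts
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
# Combine the length term and the library-size term into one offset matrix
y$offset <- t(t(log(normMat)) + o)
# y is now ready for the dispersion estimation functions; see the edgeR User's Guide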
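
As a small illustration of what the geometric-mean scaling step in the code above does, here is a base-R sketch on a made-up length matrix. The values and the names len and row_geo_mean are hypothetical stand-ins, not tximport output.

```r
# Hypothetical average transcript lengths: 3 genes (rows) x 4 samples (columns).
len <- matrix(c(1000, 1200,  900, 1100,
                2000, 2000, 2000, 2000,
                 500,  800,  650,  700),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("gene", 1:3), paste0("sample", 1:4)))

# Divide each row by its geometric mean, as in the vignette code.
normMat <- len / exp(rowMeans(log(len)))

# Each gene now has geometric mean 1 across samples, so log(normMat)
# averages to zero within every row and only between-sample differences remain.
row_geo_mean <- exp(rowMeans(log(normMat)))
print(round(normMat, 3))
print(row_geo_mean)
```

A gene whose length is the same in every sample (gene2 here) gets a scaling factor of 1 throughout, so it contributes nothing to the length part of the offset.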

A basic edgeR analysis procedure is listed below:

y <- DGEList(counts=..., genes=..., group=...)
keep <- rowSums(cpm(y)>...) >= ...
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(...)
y <- estimateDisp(y, design, robust=TRUE)
fit <- glmQLFit(y, design, robust=TRUE)
# or, for the classic approach: et <- exactTest(y, pair=...)

Q1: How do I incorporate y (with the offset) into the edgeR analysis procedure? That is, at which edgeR step should y (with the offset) enter? Can it be passed directly to "y <- estimateDisp(y, design, robust=TRUE)"?

If so, is there no need to normalize y (with the offset) further by library size using y <- calcNormFactors(y)?

Q2: Which step in the edgeR analysis procedure uses the offset information to correct the final results? It seems that edgeR's cpm function doesn't use it.

Q3: If countsFromAbundance="lengthScaledTPM" is used to generate scaled counts, can the step y <- calcNormFactors(y) be omitted, given that tximport has already scaled these counts by the average transcript length (averaged over samples) and by library size?

tximport edger • 1.1k views

ADD COMMENT • link updated 6.2 years ago by Gordon Smyth 50k • written 6.2 years ago by Xiang Wang • 0

score 4 · Accepted Answer · 2018-02-08

Gordon Smyth 50k

WEHI, Melbourne, Australia

Just omit the calcNormFactors() step. The tximport offsets are already intended to normalize, and you shouldn't normalize twice.

The offsets are automatically used by estimateDisp() and glmQLFit(). You don't have to do anything. The same is true of all the glm functions in edgeR, including glmFit(), glmLRT() and so on.

The fact that cpm() doesn't use offsets is not important, as the offsets don't have a substantial effect on the filtering. You might though consider using our new function:

keep <- filterByExpr(y, design)

instead.
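
The advice above can be sketched end to end as follows. This is only an illustration under stated assumptions: the count and length matrices are simulated stand-ins for txi$counts and txi$length, the two-group layout is made up, and the offset construction is copied from the vignette code quoted in the question.

```r
library(edgeR)

set.seed(1)
ngenes <- 200
group <- factor(rep(c("A", "B"), each = 3))   # hypothetical two-group design
nsamples <- length(group)

# Simulated stand-ins for txi$counts and txi$length (not real tximport output).
cts <- matrix(rnbinom(ngenes * nsamples, mu = 100, size = 10), nrow = ngenes,
              dimnames = list(paste0("gene", 1:ngenes), paste0("s", 1:nsamples)))
len <- matrix(runif(ngenes * nsamples, 500, 3000), nrow = ngenes,
              dimnames = dimnames(cts))

# Offset construction as in the vignette code quoted in the question.
normMat <- len / exp(rowMeans(log(len)))
o <- log(calcNormFactors(cts / normMat)) + log(colSums(cts / normMat))
y <- DGEList(cts, group = group)
y$offset <- t(t(log(normMat)) + o)

# Filter with filterByExpr(); subsetting y also subsets the offset matrix.
design <- model.matrix(~group)
keep <- filterByExpr(y, design)
y <- y[keep, ]

# No calcNormFactors(y) here: the offsets already carry the normalization.
# The glm-based functions pick up y$offset without further arguments.
y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)
res <- glmQLFTest(fit, coef = 2)
topTags(res)
```

Because the offset matrix is stored inside y, it travels with the object through filtering and model fitting, which is why no extra normalization call appears between the filter and estimateDisp().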

ADD COMMENT • link 6.2 years ago Gordon Smyth 50k

Thank you very much! I have two additional questions. 1. If the classic edgeR approach is used to make pairwise comparisons between the groups, are the offsets automatically used by exactTest()? 2. If I want to use CPM or log-CPM values for clustering and heatmaps, how can I obtain values that are corrected by the offsets? Thanks in advance.

ADD REPLY • link 6.2 years ago Xiang Wang • 0

No, offsets are not used by exactTest(). Offsets are only used by the glm-based functions.
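
Since only the glm-based functions use the offsets, one way to make a pairwise comparison while keeping them is to test a contrast in a glm fit instead of calling exactTest(). The following is a minimal sketch under assumptions: the data are simulated stand-ins for txi$counts and txi$length, and the three groups A, B and C are hypothetical.

```r
library(edgeR)

set.seed(2)
ngenes <- 200
group <- factor(rep(c("A", "B", "C"), each = 3))   # hypothetical groups
nsamples <- length(group)

# Simulated stand-ins for txi$counts and txi$length (not real tximport output).
cts <- matrix(rnbinom(ngenes * nsamples, mu = 100, size = 10), nrow = ngenes,
              dimnames = list(paste0("gene", 1:ngenes), paste0("s", 1:nsamples)))
len <- matrix(runif(ngenes * nsamples, 500, 3000), nrow = ngenes,
              dimnames = dimnames(cts))

# Offsets built as in the vignette code quoted in the question.
normMat <- len / exp(rowMeans(log(len)))
o <- log(calcNormFactors(cts / normMat)) + log(colSums(cts / normMat))
y <- DGEList(cts, group = group)
y$offset <- t(t(log(normMat)) + o)

# One coefficient per group, so any pair can be compared with a contrast.
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)

# Pairwise comparison B vs A through the glm framework, which uses y$offset.
BvsA <- makeContrasts(B - A, levels = design)
res <- glmQLFTest(fit, contrast = BvsA)
topTags(res)
```

Other pairs follow the same pattern by changing the contrast, for example makeContrasts(C - B, levels = design).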

ADD REPLY • link 6.2 years ago Gordon Smyth 50k