Question

Offsets and normalization in edgeR

ognjen011 • 0

@ognjen011-22005

Last seen 3.5 years ago

I am trying to understand the choices when using quantification pseudoaligners like kallisto for per-gene estimates. The official edgeR documentation mentions that we can use tximport, "which produces gene-level estimated counts and an associated edgeR offset matrix". In another place I read that edgeR ignores estimated normalization factors if it detects provided offsets. Finally, I've read that GC content can be used to generate offsets as well.

1) How big a difference do these make if we are doing alternative splicing and per-transcript analysis as well? 2) Is it true that calculated offsets are used instead of internal normalization? Is there an explanation somewhere of how the y$offset variable is handled in each function? 3) If we want to normalize on multiple criteria, can we add all those offsets, and is that recommended?

Thanks!

edgeR DifferentialExpression • 1.4k views

updated 3.5 years ago by Gordon Smyth 50k • written 3.5 years ago by ognjen011 • 0

score 0 · Answer 1 · 2020-11-03

If you are using tximport to input data to edgeR, just follow the advice in the tximport vignette about how to do that. Once you create the DGEList object for edgeR, you can proceed with a standard edgeR analysis.
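For reference, the edgeR recipe in the tximport vignette runs roughly as sketched here. This is an illustration under stated assumptions, not a substitute for the vignette: `cts` and `lens` are simulated stand-ins for `txi$counts` and `txi$length` (the output of `tximport()`), so that the code runs without any quantification files.

```r
library(edgeR)

# Simulated stand-ins (assumption): in a real analysis these would be
# txi$counts and txi$length as returned by tximport().
set.seed(1)
ngenes <- 200
nsamples <- 4
cts  <- matrix(rnbinom(ngenes * nsamples, mu = 100, size = 5), ngenes, nsamples)
lens <- matrix(runif(ngenes * nsamples, 500, 3000), ngenes, nsamples)

# Per-observation length factors, scaled so each gene's geometric mean is 1
normMat <- lens / exp(rowMeans(log(lens)))
normCts <- cts / normMat

# Effective library sizes computed from the length-scaled counts
eff.lib <- calcNormFactors(normCts) * colSums(normCts)

# Combine length factors with effective library sizes, then take logs
# to obtain offsets suitable for a log-link GLM
normMat <- log(sweep(normMat, 2, eff.lib, "*"))

# Build the DGEList and attach the offset matrix
y <- DGEList(cts)
y <- scaleOffset(y, normMat)
```

After this, `y$offset` holds one offset per gene and sample, and the usual steps (filtering, dispersion estimation, model fitting) proceed on `y`.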

1) How big a difference do these make if we are doing alternative splicing and per-transcript analysis as well?

Gene-level differential expression, transcript-level differential expression and testing for alternative splicing are quite different things and need three different approaches to quantification and normalization. The tximport import protocol and offset matrix are only for gene-level differential expression.

2) Is it true that calculated offsets are used instead of internal normalization?

Yes. Offsets are normalization and encode observation-specific effective library sizes. There would be no point in supplying an offset matrix to edgeR if edgeR then overwrote it.
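One way to see this precedence is through `getOffset()`, the edgeR accessor that returns `y$offset` when it is set and otherwise falls back to the log of `lib.size * norm.factors`. A small sketch on simulated counts follows; the data and the offset matrix `off.mat` are made up purely for illustration.

```r
library(edgeR)

set.seed(2)
counts <- matrix(rnbinom(400, mu = 50, size = 5), 100, 4)
y <- DGEList(counts)
y <- calcNormFactors(y)

# No offset matrix present: the offsets are the log effective library sizes
off.default <- getOffset(y)
all.equal(as.numeric(off.default),
          log(y$samples$lib.size * y$samples$norm.factors))

# Hypothetical observation-specific offsets (illustration only)
off.mat <- matrix(log(y$samples$lib.size), 100, 4, byrow = TRUE) +
  rnorm(400, sd = 0.1)
y2 <- scaleOffset(y, off.mat)

# With y2$offset set, the matrix is what gets returned;
# the norm.factors in y2$samples no longer enter the offsets
dim(getOffset(y2))
```

The practical consequence is that calling `calcNormFactors()` after an offset matrix has been attached does not change the offsets used in model fitting.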

Is there an explanation somewhere of how the y$offset variable is handled in each function?

Every function has a help page. Basically, offsets are used throughout. The offsets are used whenever edgeR fits a glm and hence the offset becomes part of any downstream analysis such as dispersion estimation or testing.

3) If we want to normalize on multiple criteria, can we add all those offsets, and is that recommended?

edgeR accepts offset matrices from external normalization packages such as EDASeq, cqn or tximport but does not create observation-specific offset matrices itself. If you want to create your own offset matrix according to your own criteria, then making sure the offset matrix is sensible is your responsibility. I would not recommend just adding up separate offset matrices. If you are worried about GC content, you could use Salmon, which already adjusts for GC content as part of the transcript quantification.