edgeR + tximport: calcNormFactors and the usage of offsets
2
0
Entering edit mode
Last seen 16 months ago

Hi all, Hope you are all doing well during this pandemic. I have a few questions that hopefully can be addressed here. I am running both DESeq2 and edgeR on two experimental designs I have (one with paired samples and one without). I assembled a de novo transcriptome to align and quantify reads with salmon, and I annotated this transcriptome accordingly. I now have aligned reads as well as a tx2gene file to perform differential expression analysis. I have successfully run tximport on this data for DESeq2, and I was able to push the program all the way to the end with edgeR, however, when going through this tutorial online I noticed some inconsistencies on the usage of both calcNormFactors and of offsets.

Onto my first question: In the tximport vignette the authors do a very good job guiding the users on how to format their data as to import it into DESeq2 and edgeR. At the end of the tximport edgeR part of the vignette the authors state:

'y <- y[keep, ] y is now ready for estimate dispersion functions see edgeR User's Guide'

As a consequence, I originally followed the edgeR vignette and estimateDisp(y, design=design), where design is the experimental design I was interested in. After running the pipeline there was broad overlap in the results of edgeR and DESeq2, which was satisfying. Nevertheless, I noticed that I might not have used the offset appropriately. I ran into this post which talks about differences in the final set of singificant results depending on whether offsets are explicitly incorporate during the glmFit function, or if they are incorporated as in tximport. Which of the two is the appropriate way to go?

My second question is related to the first: After going back to my data, I realized that by following the steps of the tximport vignette I end up with a DGEList object that does indeed have $offset, but whose norm.factors within$samples are all equal to 1. After checking the tutorial I mentioned above I noticed that after the authors of the tutorial incorporated the offsets into the DGEList they then ran calcNormFactors on the list, which leads the norm.factors to change. My question is whether or not I should run:

y <- calcNormFactors(y)

After having incorporated the offsets to the DGEList. From the edgeR vignette under subsection 2.8.6 the authors mention that "gene-specific correction factors can be entered into the glm functions of edgeR as offsets. In the latter case, the offset matrix will be assumed to account for all normalization issues, including sequencing depth and RNA composition." Will edgeR be able to incorporate the offset in addition to TMM, or will it by default not perform the TMM normalization to the data, assuming that the offset will have already taken that into account?

Lastly I'd like to ask the simplest of all questions, which I have seen addressed elsewhere, but have always been somewhat confusing. I am interested in performing a gene set enrichment analysis of my genes in GSEA, to see if genes from particular pathways are upregulated in given tissue types. To do so, I need to obtain TMM values for each sample and gene. It is my understanding that I can obtain these values 'by hand' simply by calculating the following for each gene:

y$counts[i]/(y$samples$lib.size[j]*y$samples\$norm.factors[j])

For every 'i' gene across every 'j' sample. Could I simply obtain this same value by using the cpm function in edgeR? This would then mean that with the cpm I would obtain the TMM value *1M. Is that correct?

I am using: R 3.6 edgeR3.28.1 tximport1.14.2

Thank you so much for your time.

Best,

Pietro de Mello

edgeR tximport differential expression • 276 views
1
Entering edit mode

My question from last year on that matter could help here as well: https://support.bioconductor.org/p/121087/#121104

The code that Aaron provided there is now part of the tximport vignette.

0
Entering edit mode

Hi there Gordon, Michael and AT,

AT, I did see your post, it's just that there is so much stuff online saying so many different things that I got lost. I do agree that it would have originally have answered my question and I appreciate it.

Michael, thanks for the feedback, DESeq2, tximport and all these software you all develop. I appreciate it.

Gordon, thanks so much for your explanation. This makes very good sense, and as I mentioned above, it is easy to get lost with so many different posts. I'm glad you were able to answer so quickly and I appreciate it. I also appreciate the feedback on how to use CPM (actually, log CPM as you mentioned) and the suggestion to look into the fry() and camera() functions.

I consider this post to be answered and it can be closed.

Thank you all much.

P

0
Entering edit mode

Just use the "accept" button (next to the thumbs-up button) to show that you accept one or more of the answers and hence consider the post to be resolved.

Regards Gordon

1
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

I believe that the current code in the tximport vignette was vetted by Aaron (at some point we added scaleOffset instead of doing this manually), but I'll let some edgeR folks reply on the first and second questions.

1
Entering edit mode
@gordon-smyth
Last seen 24 minutes ago
WEHI, Melbourne, Australia

Just follow the tximport vignette, which shows you how to run calcNormFactors, how to set the edgeR offsets and how to compute CPMs.

In other words, to use the tximport offsets in edgeR, follow the documentation that comes with tximport and edgeR.

Background

In edgeR, the offsets are just log library sizes. In the traditional edgeR pipeline, calcNormFactors is used to convert the raw library sizes to effective (normalized) library sizes, then these are converted to offsets (usually by estimateDisp). Once the offsets are formed, they take precedence over the library sizes and norm.factors and the latter are never used again. The offsets are stored in edgeR output objects and are re-used by any downstream edgeR functions.

In the traditional edgeR pipeline, there is one library size and one offset for each sample and the offsets are computed by internal edgeR code rather than set by the user. edgeR can however accommodate a matrix of offsets, which means we can allow packages like cqn, EDAseq or tximport to input a matrix of offsets to edgeR. This allows those packages to implement fancy observation-specific normalization methods, providing sample-gene specific adjustments for GC content, gene length or transcript length, anything in fact that the external package authors think is appropriate.

As always, the offset matrix that is input to edgeR will take precedence over the library sizes. If you set the offset matrix, then you have to make sure it reflects both the library sizes and observation-specific adjustments. Once the offset matrix is input to edgeR, it will be used by all downstream edgeR functions and there is no purpose in doing any further library size normalization. Any further use of calcNormFactors for example will just be ignored.

You can theoretically input the offset matrix to edgeR at any point in the pipeline but, generally speaking, there's no point in inputing it anywhere other than at the start, because the offsets are needed for almost every calculation. There's no point in using different log library sizes for different steps in the pipeline.

I ran into this post

There are multiple usage errors in the Biostars question (wrong offset matrix etc) so just ignore it. If you read the answer, it advises the OP to consult the EDAseq vignette regarding how to input an EDAseq offset matrix to edgeR. The offset matrix naturally needs to be input at the start of the pipeline, not part way through.

After checking the tutorial I mentioned

I'll have to leave this for the tximport authors. The online tutorial claims to reproduce the tximport vignette but actually it doesn't, perhaps because the tximport vignette has changed. As noted above, running calcNormFactors after you've set the offset matrix will be ignored, so that line of code is incorrect.

It is my understanding that I can obtain these values 'by hand'

No you can't. You need log-CPMs for input to GSEA rather than CPMs, and you presumably want to use your fancy observation-specific tximport log library sizes instead of the simple library sizes. So you need to use the code shown in the tximport vignette.

I am interested in performing a gene set enrichment analysis of my genes in GSEA

You could also consider the pathway analysis functions camera and fry provided in edgeR.