Hi all, Hope you are all doing well during this pandemic. I have a few questions that hopefully can be addressed here. I am running both DESeq2 and edgeR on two experimental designs I have (one with paired samples and one without). I assembled a de novo transcriptome to align and quantify reads with salmon, and I annotated this transcriptome accordingly. I now have aligned reads as well as a tx2gene file to perform differential expression analysis. I have successfully run tximport on this data for DESeq2, and I was able to push the program all the way to the end with edgeR, however, when going through this tutorial online I noticed some inconsistencies on the usage of both calcNormFactors and of offsets.
Onto my first question: In the tximport vignette the authors do a very good job guiding the users on how to format their data as to import it into DESeq2 and edgeR. At the end of the tximport edgeR part of the vignette the authors state:
'y <- y[keep, ] y is now ready for estimate dispersion functions see edgeR User's Guide'
As a consequence, I originally followed the edgeR vignette and estimateDisp(y, design=design), where design is the experimental design I was interested in. After running the pipeline there was broad overlap in the results of edgeR and DESeq2, which was satisfying. Nevertheless, I noticed that I might not have used the offset appropriately. I ran into this post which talks about differences in the final set of singificant results depending on whether offsets are explicitly incorporate during the glmFit function, or if they are incorporated as in tximport. Which of the two is the appropriate way to go?
My second question is related to the first: After going back to my data, I realized that by following the steps of the tximport vignette I end up with a DGEList object that does indeed have $offset, but whose norm.factors within $samples are all equal to 1. After checking the tutorial I mentioned above I noticed that after the authors of the tutorial incorporated the offsets into the DGEList they then ran calcNormFactors on the list, which leads the norm.factors to change. My question is whether or not I should run:
y <- calcNormFactors(y)
After having incorporated the offsets to the DGEList. From the edgeR vignette under subsection 2.8.6 the authors mention that "gene-specific correction factors can be entered into the glm functions of edgeR as offsets. In the latter case, the offset matrix will be assumed to account for all normalization issues, including sequencing depth and RNA composition." Will edgeR be able to incorporate the offset in addition to TMM, or will it by default not perform the TMM normalization to the data, assuming that the offset will have already taken that into account?
Lastly I'd like to ask the simplest of all questions, which I have seen addressed elsewhere, but have always been somewhat confusing. I am interested in performing a gene set enrichment analysis of my genes in GSEA, to see if genes from particular pathways are upregulated in given tissue types. To do so, I need to obtain TMM values for each sample and gene. It is my understanding that I can obtain these values 'by hand' simply by calculating the following for each gene:
For every 'i' gene across every 'j' sample. Could I simply obtain this same value by using the cpm function in edgeR? This would then mean that with the cpm I would obtain the TMM value *1M. Is that correct?
I am using: R 3.6 edgeR3.28.1 tximport1.14.2
Thank you so much for your time.
Pietro de Mello