I'm very new to the world of transcriptomics, and I have some questions about the normalization of datasets when using tximport to estimate gene-level counts from salmon's transcript-level estimates, in order to work with edgeR.
More specifically, how is salmon's TPM (which normalizes for transcript length and also for library size) different from the two recommended options, and why is it not generally used?
I'm sorry if this question has already been asked in one form or another; I searched and did not find an answer.
Thanks in advance for your help, and let me know if my question is not clear enough.
For 1 and 2: what salmon calculates are transcript-level TPMs. What tximport calculates are gene-level counts corrected for the average transcript length in each sample, so that a sample expressing longer or shorter transcripts of a gene than another sample does not get more or fewer counts based purely on average transcript length. That is the whole point of the tximport method.

Say cell type A expresses very long transcripts of a gene, and cell type B expresses very short transcripts of that same gene. At the same expression level, longer transcripts give more counts; that is how full-length (= standard) RNA-seq works. After summing transcript-level counts to the gene level, one therefore needs to correct for average transcript length, to avoid that gene being called differential simply because of the length of the expressed transcripts rather than the actual expression level. That is what tximport does.

For 3: this correction can basically be done in two ways. First, DESeq2 internally calculates offsets to use in the model-based analysis to correct for the average transcript length. Alternatively, one can directly correct the gene-level counts for average transcript length; that is what scaledTPM / lengthScaledTPM do (see also "difference among tximport scaledTPM, lengthScaledTPM and the original TPM output by salmon/kallisto"), so the resulting counts can be used right away in any downstream analysis.

Methods such as limma do not accept offset matrices, so there you are forced to use scaledTPM / lengthScaledTPM. Methods such as DESeq2 (as mentioned above) handle this natively, doing all the necessary magic under the hood. For edgeR you need a few lines of code to combine the length offsets and normalization factors into an offset matrix ready for downstream analysis, as shown in the tximport vignette.

Does that make sense?
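To make the length-bias argument concrete, here is a small Python sketch with invented numbers (the function and values are mine, purely for illustration): two samples express the same gene at the same true level, but via transcripts of different lengths, so the raw gene-level counts differ; dividing by each sample's average transcript length removes the artifact.

```python
# Hypothetical example: same underlying expression (number of molecules),
# different transcript lengths, full-length RNA-seq.
# Expected read count is proportional to molecules * transcript_length.

def expected_counts(molecules, length, depth_factor=0.01):
    """Reads sampled from a transcript grow with its length."""
    return molecules * length * depth_factor

# Sample A expresses a long isoform, sample B a short isoform of the same gene.
molecules = 1000            # same true expression in both samples
len_A, len_B = 3000, 1000   # average transcript length per sample (bp)

counts_A = expected_counts(molecules, len_A)  # 30000.0
counts_B = expected_counts(molecules, len_B)  # 10000.0

# Naive gene-level counts differ 3-fold despite identical expression:
assert counts_A / counts_B == 3.0

# Correcting each sample's counts by its average transcript length
# (the idea behind tximport's length correction) removes the bias:
corrected_A = counts_A / len_A
corrected_B = counts_B / len_B
assert corrected_A == corrected_B  # both 10.0
```

This is exactly the celltype A vs. celltype B scenario above: without the correction, a differential-expression test would flag this gene even though nothing changed.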
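As for what scaledTPM and lengthScaledTPM actually compute, here is my understanding of the arithmetic, sketched in Python with toy matrices (all values invented; see `?tximport` and its `countsFromAbundance` argument for the authoritative definition): scaledTPM rescales the TPM columns to the observed library sizes, while lengthScaledTPM first multiplies each gene by its average transcript length (averaged over samples) before rescaling, which bakes the length correction into the counts.

```python
import numpy as np

# Toy matrices: 3 genes x 2 samples (values invented for illustration).
tpm = np.array([[100.0, 200.0],
                [300.0, 100.0],
                [600.0, 700.0]])
avg_len = np.array([[1500.0, 1400.0],   # average transcript length per
                    [2000.0, 2100.0],   # gene and sample, as reported
                    [1000.0, 1000.0]])  # by tximport
lib_size = np.array([2e6, 3e6])         # observed counts per sample

# scaledTPM: rescale TPM columns so they sum to the library size.
scaled_tpm = tpm / tpm.sum(axis=0) * lib_size

# lengthScaledTPM: multiply each gene by its average transcript length
# over samples, then rescale columns to the library size. The result is
# count-like but no longer carries per-sample length bias.
gene_len = avg_len.mean(axis=1, keepdims=True)
x = tpm * gene_len
length_scaled_tpm = x / x.sum(axis=0) * lib_size

# Both versions preserve the per-sample sequencing depth:
assert np.allclose(scaled_tpm.sum(axis=0), lib_size)
assert np.allclose(length_scaled_tpm.sum(axis=0), lib_size)
```

Either matrix can then be fed to limma-voom or edgeR like an ordinary count matrix, which is why this route works for methods that cannot take offsets.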
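The offset route for edgeR boils down to a little arithmetic as well. Below is a Python rendition of that arithmetic as I understand it from the tximport vignette (variable names are mine; the real code is R and additionally folds TMM normalization factors into the effective library sizes): center each gene's lengths across samples so they only encode relative differences, then combine them with effective library sizes into a log-scale offset matrix.

```python
import numpy as np

# Toy data: 2 genes x 2 samples (values invented for illustration).
counts = np.array([[30.0, 10.0],
                   [50.0, 55.0]])
lengths = np.array([[3000.0, 1000.0],
                    [1800.0, 2200.0]])

# Center each gene's length across samples (geometric mean = 1) so the
# offset only encodes relative length differences between samples:
norm_mat = lengths / np.exp(np.log(lengths).mean(axis=1, keepdims=True))

# Effective library size: column sums of length-corrected counts
# (the vignette multiplies in calcNormFactors() results here too).
eff_lib = (counts / norm_mat).sum(axis=0)

# Final offset matrix on the log scale, one entry per gene and sample;
# in edgeR this is what gets passed via scaleOffset().
offsets = np.log(norm_mat * eff_lib)
```

In R, this whole dance is the handful of vignette lines mentioned above, ending in `scaleOffset()` on the `DGEList`; DESeq2 users never see it because `DESeqDataSetFromTximport()` does the equivalent internally.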