Hi,
I wish to use tximport followed by DESeq2 for differential expression analysis of RNA-Seq data quantified by Salmon (from fastq files). The libraries are 3'-end biased and carry limited isoform-level information (a situation similar to that shown in Figure 1C of the Soneson et al. paper on the tximport package). I am trying to find the best set of parameters for DE analysis. Is it better to use the transcript length offset for normalization in this case?
Thanks
Sunil Sukumaran
Research Associate
Monell Chemical Senses Center
Philadelphia PA
Hi Michael,
I want to dig into this a bit more. As you mentioned, there is some splicing info in 3'-biased data, but the number of reads drops off precipitously as we move towards the 5' end of genes in my data set. When I used featureCounts, normalizing for gene length by FPKM/RPKM really screwed up the analysis and I got very few differentially expressed genes. DESeq2 normalization, which does not factor in gene length, is more meaningful for this data set. I am aware that salmon models 3' (and 5') bias, but my understanding is that it does so for regular libraries where the bias is only at the very 3' or 5' ends (~200 bp or so). It would be wonderful if salmon could model the 3' bias resulting from RNA amplification, but I suspect this is not the case; perhaps it is asking for too much...
The effective length provided by salmon drops quite a bit for transcripts that are expressed at very low levels, but it is quite close to the full length when they are even moderately expressed. So would you recommend dropping the effective length normalization altogether and just summing the counts at the gene level?
Thank you.
Sunil
Not related to salmon, but why would the (R|F)PKM screw up your differential expression analysis when you used featureCounts?
I mean, if you ran featureCounts to get counts over your gene features, the most natural thing I could think of doing would be to then feed those counts directly into edgeR, edgeR->voom, or DESeq2 ... no (R|F)PKMs in sight ... know what I mean?
Are you saying that the effective length is incredibly variable across samples and this is something you further want to try to correct for?
Agree with Steve: unless there are differences in effective length across samples, it won't affect the tools that use counts (even when using the average transcript length offset from tximport). Whether the effective length is the same as the transcript length or much smaller (or larger), differences across genes/transcripts will all be zeroed out before being imported as an offset for DESeq2. All that remains is the difference for a gene/transcript across samples.
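To make the "zeroed out" point concrete, here is a rough sketch in plain Python (not the actual DESeq2 code). It assumes the length matrix is centered by dividing each gene's row by its geometric mean across samples, which is my understanding of how the offset is prepared, so please check the DESeq2 source before relying on that detail. The function name center_lengths and all the numbers are made up for illustration.

```python
import math

def center_lengths(lengths):
    """Divide each gene's per-sample lengths by that gene's geometric mean.

    `lengths` is a list of rows (one row per gene, one value per sample).
    This is an illustrative stand-in for the offset centering, not DESeq2 code.
    """
    centered = []
    for row in lengths:
        log_mean = sum(math.log(x) for x in row) / len(row)
        geo_mean = math.exp(log_mean)
        centered.append([x / geo_mean for x in row])
    return centered

# Hypothetical effective lengths: rows are genes, columns are samples.
eff_len = [
    [2000.0, 2000.0, 2000.0],  # long gene, same length in every sample
    [300.0, 300.0, 300.0],     # short gene, same length in every sample
    [1000.0, 1000.0, 500.0],   # length changes in the third sample
]

factors = center_lengths(eff_len)
for gene_factors in factors:
    print([round(f, 3) for f in gene_factors])
```

The long and the short gene both end up with factors of 1 in every sample, so the fact that one is much longer than the other plays no role. Only the third gene, whose length changes between samples, keeps factors different from 1.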
Yes, I agree... Actually, I got confused with a comparison of a cuffdiff analysis with the STAR-featureCounts-DESeq2 pipeline that I did a while back. At that time I interpreted the difference as an effect of the FPKM normalization, but perhaps it is more because cuffdiff is very conservative compared to DESeq2. For what it is worth, I will run DESeq2 using the Salmon results with and without length normalization and post a summary here.
Thanks!