I am working on a project comparing RNAseq quantification results between Illumina short-reads and Nanopore long-reads and I have a couple questions about comparing the quantification results from these two technologies. More specifically I need some help with figuring out how to normalize the data for the comparisons within samples and between samples. So far I have come up with the following plan:
Using CPM to compare gene/transcript expression within each sample sequenced with Nanopore. For example, comparing whether gene.X transcripts are more abundant than gene.Y transcripts within sample_1 sequenced with Nanopore.
- Using CPM instead of TPM for Nanopore seems like a good option since our Nanopore runs do not have transcript length bias. Does this sound like a good strategy?
Using TPM to compare gene/transcript expression within each sample sequenced with Illumina. For example, comparing whether gene.X transcripts are more abundant than gene.Y transcripts within sample_1 sequenced with Illumina.
- Using TPM instead of CPM for Illumina seems like a good option since Illumina has transcript length bias (a single long transcript will yield more counts than a single short transcript). Does this sound like a good strategy?
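To make the difference between the two normalizations concrete, here is a minimal Python sketch. The gene names, counts, and lengths are made up for illustration, and the TPM function divides by the plain transcript length, whereas tools such as Salmon use an effective length, so treat it as an approximation of the idea rather than what any particular tool computes.

```python
def cpm(counts):
    """Counts per million: scale by library size only (no length correction)."""
    total = sum(counts.values())
    return {name: c / total * 1e6 for name, c in counts.items()}


def tpm(counts, lengths):
    """Transcripts per million: divide by length first, then scale to one million."""
    rate = {name: counts[name] / lengths[name] for name in counts}
    total_rate = sum(rate.values())
    return {name: r / total_rate * 1e6 for name, r in rate.items()}


# Hypothetical short-read example: gene.X is three times longer than gene.Y
# and collects three times as many reads.
counts = {"gene.X": 300, "gene.Y": 100}
lengths = {"gene.X": 3000, "gene.Y": 1000}

print(cpm(counts))           # gene.X looks three times more abundant
print(tpm(counts, lengths))  # after length correction both look equally abundant
```

With fragment-based short reads the CPM ratio reflects length as well as abundance, which is the bias described above. If each long read represents one transcript molecule, the count is already proportional to abundance and CPM is enough.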
Here is where I am having trouble coming up with a good normalization strategy: comparing gene/transcript expression between the same sample sequenced with Illumina and Nanopore, e.g., performing a Spearman correlation between gene expression in sample_1 sequenced with Illumina and sample_1 sequenced with Nanopore. I am not sure what would work here since Illumina has transcript length bias and Nanopore does not. Do you have any suggestions?
Any help here will be greatly appreciated.
Best, Bernardo
I suggest posting this over at biostars.org to reach a broader audience of long-read people.
Just did that, thank you for the suggestion.
Discussion for this question continuing on Biostars: https://www.biostars.org/p/9552419/
You could have a look at the bambu package, which is designed for long-read/Nanopore RNA-Seq and handles the normalisation issue for long-read quantification: http://bioconductor.org/packages/release/bioc/html/bambu.html
I am already using Bambu, that is where my Nanopore gene/transcript counts matrix comes from. By the way thank you and your team for the great support on GitHub!
My question is more on how to normalize Illumina and Nanopore data so that the comparison between them (as outlined in the question) is "fair" and has little to no bias introduced by the normalization process.
Great to hear that you are using Bambu. Yes, I think that part of your question does not have one correct answer; there are many ways in which this could be done. If you use, for example, Salmon or Kallisto, the gene and transcript expression estimates should be comparable with the Bambu transcript expression estimates regardless of transcript length, since that is already accounted for (though other factors will influence the comparison). You can test this out with spike-in data; https://github.com/GoekeLab/sg-nex-data might be a helpful resource.
Thank you, the spike-in data should help test out some of these scenarios.
Your explanation makes a lot of sense given what we are seeing. We find the best correlation at the gene level between Salmon CPM and Bambu CPM. I was a bit skeptical about this since CPM does not account for transcript length and should not be a good normalization method for Illumina data. But you pointed out that the count estimate output by Salmon already takes into account transcript length, so CPM should do the trick for both Bambu and Salmon.
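A gene-level CPM of the kind compared here can be sketched as follows: sum transcript-level counts per gene, then scale by the library size. The transcript IDs, the `tx2gene` mapping, and the counts are hypothetical. Note that Salmon's `NumReads` column is an estimated read count, while its `TPM` column is the one scaled by effective length, so which column goes in changes what the resulting value means.

```python
from collections import defaultdict


def gene_level_cpm(tx_counts, tx2gene):
    """Sum transcript counts per gene, then convert to counts per million."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    total = sum(gene_counts.values())
    return {gene: c / total * 1e6 for gene, c in gene_counts.items()}


# Hypothetical transcript-level estimated counts and transcript-to-gene mapping.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
tx_counts = {"tx1": 10.0, "tx2": 30.0, "tx3": 60.0}

print(gene_level_cpm(tx_counts, tx2gene))
```

The same function can be applied to the count matrix from each tool, one sample at a time, before correlating the two platforms.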
TL;DR
Would Salmon TPM for short reads and oarfish with some sequencing depth normalisation help when comparing short-read with long-read data?
In more verbose terms am I right in thinking that:
Salmon
- Gives NumReads, an adjusted count that balances the unique and ambiguous reads of a transcript following certain models, but is not normalized for sequencing depth.
- NumReads is then normalized to give TPM, which accounts for sequencing depth and for the length bias of short reads.
- This TPM should be used for short-read data in any downstream analysis.
- Salmon can be used for long reads, especially with the --ont model, but it would be more correct to use the NumReads count rather than TPM.
oarfish
- Is designed for long reads only and distributes reads with ambiguous mappings across transcripts in a way similar to Salmon.
- As it only gives NumReads output, we should use NumReads for downstream analysis.
Do we need a further normalization to account for sequencing depth so that we can then compare across samples and/or platforms? I am also trying to compare the sensitivity of transcript detection between some short-read and long-read data from a similar-ish source of RNA.
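One possible way to frame the detection comparison, sketched under stated assumptions: scale each library's NumReads to CPM so that depth differences do not decide which transcripts pass a cutoff, then count the transcripts detected by each platform. The `min_cpm` threshold of 1, the function names, and the toy counts are arbitrary illustrative choices, not recommendations from Salmon or oarfish.

```python
def num_reads_to_cpm(num_reads):
    """Scale estimated read counts to counts per million for one library."""
    total = sum(num_reads.values())
    if total == 0:
        return {tx: 0.0 for tx in num_reads}
    return {tx: n / total * 1e6 for tx, n in num_reads.items()}


def detected(num_reads, min_cpm=1.0):
    """Set of transcripts at or above the CPM cutoff (hypothetical threshold)."""
    return {tx for tx, v in num_reads_to_cpm(num_reads).items() if v >= min_cpm}


def detection_summary(short_reads, long_reads, min_cpm=1.0):
    """Count transcripts detected by one platform only or by both."""
    s = detected(short_reads, min_cpm)
    l = detected(long_reads, min_cpm)
    return {"short_only": len(s - l), "long_only": len(l - s), "both": len(s & l)}


# Hypothetical NumReads from a deep short-read library and a shallow long-read one.
short_reads = {"tx1": 500000, "tx2": 500000, "tx3": 0}
long_reads = {"tx1": 10, "tx2": 0, "tx3": 10}
print(detection_summary(short_reads, long_reads))
```

A fixed CPM cutoff still favours the deeper library at the low end, since a transcript needs at least one read to be seen at all. Downsampling both libraries to a comparable depth before counting would be an alternative design.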