Use of pseudoalignment estimated counts as input to edgeR and Limma-Voom
2
0
Entering edit mode
raf4 ▴ 30
@raf4-8249
Last seen 23 months ago
United States

Dear List,

Our sequencing core recently switched from STAR to Kallisto. The genewise output was given as both TPM values and counts. Gordon Smyth has written that TPMs should not be used as input to EdgeR. In a recent paper (F1000 Research 2016,4:1521) Soneson, Love, and Robinson have presented evidence that the method they call simplesum_avextl for deriving reads-per-gene, from numbers-of-transcripts is as good or better for differential expression than alignment+featureCounts. In a comment on that paper, Lior Pachter advocates simply adding the reads assigned to each transcript to get the reads for a gene.

I would greatly appreciate Gordon's input, or that of someone from his group, as to whether there is a proper way to get counts from pseudoalignment-based numbers-or-transcripts as input to edgeR or limma-voom. Or alternatively, is it recommended that I generate the counts from the fastq files with an alignment program, such as Rsubread.

Thanks and best wishes, Rich Richard Friedman, Columbia University

“My father justifies mice. You need to make sure that you have enough mice for an experiment and that you do not have too many. He makes sure that no mouse dies in vain. You are not allowed to use chimps, so you have to use mice.”- Rose Friedman, age 22

RNASeq-quantification edgeR TPM • 3.3k views
ADD COMMENT
5
Entering edit mode
@gordon-smyth
Last seen 32 minutes ago
WEHI, Melbourne, Australia

Dear Rich,

I'll tell you what I use. I use:

  1. Rsubread::align + featureCounts + limma or edgeR-QL for gene level expression analyses

  2. Rsubread::subjunc + featureCounts + limma or edgeR for differential splicing analyses

  3. Salmon (with bootstrapping) + edgeR::catchSalmon for transcript level expression analysis.

The short answer to your question is that getting genewise counts from kallisto is ok as input to either limma and edgeR. Of course you should work from the counts and not the TPMs. Either the tximport genewise pipeline or else just adding up counts over transcript for each gene are both ok from this point of view.

In fact, getting total transcript counts from kallisto is in principle the same as running featureCounts on the same transcriptome annotation except that kallisto is restricted to perfect match alignments so featureCounts should manage to count more reads.

There's a much longer answer, but I don't really want to write at length here or now. I will say that kallisto makes too many assumptions and approximations for my taste and I personally wouldn't hesitate to use Rsubread instead. Just to name one issue of many, more than 10% of Ensembl transcriptome annotation is made up of duplicates, i.e., the exact same transcript sequence but with multiple entries under different names, and kallisto does no checking of this. Genewise inference is robust whereas transcript-level inference is inherently noisy, so using transcripts as an intermediary to get genewise counts seems a peculiar choice to me.

Responding to your comment about Soneson et al (2016), they were not actually able to show any improvement for tximport over featureCounts for any real dataset. Of course their main point is true. If a gene has a two transcripts of very different lengths, and the short transcript is expressed more highly than the long, and the two transcripts are both DE but in opposite directions, then it is possible for total genewise count and total gene expression (sum of transcript expressions) to show fold changes in opposite directions. My understanding is that the tximport refinement is intended to get the genewise counts to better align with sum of transcript expression. This scenario isn't very common in practice but if you run a simulation for which this scenario is the norm then of course you can show a difference. My view is that the tximport refinement isn't a complete solution, as it would be better to detect the discordant DE, and this refinement is in any case dwarfed by other issues with kallisto and imperfect annotation.

ADD COMMENT
0
Entering edit mode

Dear Gordon,'

Thank you as always for your helpful reply. I will ask our sequencing core to align with Rsubread, and if not do it myself.

Best wishes, Rich

ADD REPLY
0
Entering edit mode

If you already have an alignment pipeline you're happy with, I wouldn't bother switching to another aligner. It's generally not going to make much of a difference. Just use featureCounts with your existing aligned BAM files.

ADD REPLY
1
Entering edit mode
@steve-lianoglou-2771
Last seen 22 months ago
United States

I'm not in Gordon's group, but I'll still offer an answer:

Just follow the instructions outlined in the tximport vignette. They provide instructions on how to roll up transcript level counts into gene-level counts for your favorite DGE pipeline ... as long as your favorite is one of edgeR, DESeq2, or limma/voom.

ADD COMMENT
0
Entering edit mode

Steve,

Thanks. I know that procedure. Gordon has written :""In my opinion, there is no good way to do a DE analysis of RNA-seq data starting from the TPM values. TPMs just throw away too much information about the original count sizes."

I am asking if pseudoalignment algorithms also lose that information.

Best wishes, Rich

ADD REPLY
0
Entering edit mode

For DESeq2 and edgeR, the procedure described in the tximport vignette doesn't suffer from this problem, because it converts the TPMs and average gene lengths into counts and offsets. This means you are getting the gene counts that were inferred by Kallisto, even if they might be back-calculated from TPMs.

ADD REPLY
0
Entering edit mode

Dear Ryan,

Thank you for your reply,

Best wishes, Rich

ADD REPLY
0
Entering edit mode

But isn’t it true that the tximport countsFromAbundance options scaledTPM and lengthScaledTPM are not the same as regular TPM? I remember the quote from Dr . Smyth you mentioned but he was referring to attempting to do DE using regular TPMs as this is only a within sample normalization method and therefore not suitable for cross sample comparison. scaledTPM and lengthScaledTPM are more equivalent to gene-level counts and therefore you can use edgeR TMM normalization on these before use with limma-voom.

ADD REPLY

Login before adding your answer.

Traffic: 1017 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6