Question: Using Sailfish estimated read-counts for DESeq2 analysis
0
4.8 years ago by
jceledon0
jceledon0 wrote:

Hi there,

Has anyone tried to use Sailfish estimated RPKM as input for DESeq2 for differential gene expression analysis?

Patro, R., Mount, S.M. and Kingsford, C. (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol., 1–6.

http://www.ncbi.nlm.nih.gov/pubmed/24752080

I noticed that the DESeq2 article explicitly says "DESeq2 performs analysis on counts of reads which can be uniquely assigned to genes, while a number of other algorithms [25, 26] perform differential analysis on a probabilistic assignment of reads to transcripts."

And it is not clear tom me how Sailfish RPKMs are calculated.

Thank you!

Jose

modified 4.8 years ago by Simon Anders3.6k • written 4.8 years ago by jceledon0
2
4.8 years ago by
Denali
Steve Lianoglou12k wrote:

You say:

  And it is not clear tom me how Sailfish RPKMs are calculated

This doesn't answer your sailfish question explicitly, but it doesn't really matter how they are calculated for your question: the "official party line" that you'll uncover if you do a bit of searching on here is that RPKMs are not valid inputs to any of the GLM based RNAseq differential expression based approaches, so using DESeq2 on sailfish data is a non-starter.

2

To elaborate on Steve's answer; transforming raw counts into RPKMs discards information about the absolute size of the counts. This information is required to model the mean-variance relationship. If you use RPKMs instead of raw counts as inputs to, e.g., edgeR, you will incorrectly estimate this relationship. In addition, most GLM-based methods use discrete distributions. This means that the continuous nature of RPKMs will not be accurately modelled, especially at low values.

1

hi Jose,

Aaron has laid out the exact point about why count-based methods such as DESeq2, edgeR and others require gene-level raw counts for statistical inference. The library size is accounted for within the model, rather than by transforming the input data. Similarly, sample-specific effects on gene-level counts of covariates such as GC content can be included (if estimated by another package such as cqn or EDASeq) in the model using normalizationFactors in DESeq2 (or an offset in edgeR), and if one has obtained estimates of different average transcript length for each gene across samples (RSEM produces such a column), this can be included in a model for gene-level counts as well using the 'normMatrix' argument of estimateSizeFactors. But the counts slot of the DESeqDataSet needs to contain raw counts for each gene.

(edited to emphasize gene-level counts)

2
4.8 years ago by
Simon Anders3.6k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.6k wrote:

The real issue is this: DESeq2 is meant for testing for differential expression of genes, not isoforms. Quantification of different isoforms for the same genes will always have correlated errors, and unless these correlations are explictely accounted for, any test for differential expression is bound to produce loads of false positives.

So, just counting reads for each isoform independently and running DESeq (or any other testing method) is a big no. (See this old post if it is unclear why: A: DESeq and transcript-wise analysis)

Methods like RSEM (and similarly Sailfish, I suppose) avoid double-counting reads for several isoforms, so the problem is not exactly as described in the post above. Still, the errors (i.e., deviations of RSEM's estimates from the true values) will be correlated between different isoforms of the same gene. RSEM is actually able to estimate and report these correlation matrices, and a method that used this additional information would be able to perform correct inference. However, I am not aware of any tool that is designed to process RSEM's error estimates for purposes of comparing isoform expression.

So, if you are just interested in comparing on the level of genes, use a method that quantifies genes, not isoforms.

If you are interested in comparing on the level of isoforms, use a method that is able to estimate these correlations and account for them in testing. Tools like BitSeq or cuffdiff integrate both the quantification and testing in a single tool or tool-chain, precisely for the reason that the correlation estimates must be passed from the quantification to the testing stage to avoid false positives, and this is typically not possible if you combine a quantification tool from one author with a testing tool from somebody else.

If, however, you are willing to follow our argument that testing on the exon level instead of the isoform level not only avoids all these difficulties but also provides more interpreatble results, use our DEXSeq package (and read our paper on it).

OK. So then, what to do if you want correct, gene-level counts from RNAseq data with a de-novo assembly (e.g. Trinity or Bridger) in which there often multiple isoforms for each gene or component, and reads may map multiply to those isoforms?   IS the solution to keep only the longest isoform (or isoform with longest ORF) ? I know that  much can be lost there, but atleast the counts are more "accurate"?

You're replying to a very old question, and I think it'll probably best to post a new question with the your specific issue.

That having been said, if you want to incorporate the gene length information and such from tools like salmon (and I guess kallisto, although I haven't played with it much), but you still want to test for differential expression at the gene level, then use the tximport package.

If you want to stick with differential expression of isoforms, then you might try using sleuth. It was meant to consume data from kallisto, but the wasabi package should be able to post-process output form sailfish and salmon so that sleuth can consume it, too.