I like to use DESeq2 for making PCA-plots and heat maps, but on a current dataset we only have count values from cufflinks/cummeRbund (exported using count() in cummeRbund). I know DESeq2 needs raw counts, but can I use these counts only for plotting/visualization? And can I perform the rlog-transformation on cufflinks normalized counts?
'Raw counts' from Tuxedo are not really raw counts; they are "raw pseudo counts". So you won't get the type of data that DESeq2 expects (short of rounding the values you get out of count() in cummeRbund).
cummeRbund offers a method to perform a PCA of FPKM values. However, if you want to use the DESeq2 methods, I'd recommend you follow the DESeq2 workflow: htseq-count from alignments -> DESeq2, rather than trying to manipulate the output of cummeRbund.
We require integers as input to protect against users accidentally inputting FPKM or normalized counts (counts corrected for library size). In both of these cases, the precision has been altered from what is expected by the statistical model, so this really breaks the assumptions of our software. I will say that I've used the EDA and DE routines of DESeq2 on rounded estimated counts before, but only when I made sure that each value is an estimate of the count of fragments assigned to a gene (not a transcript), and that it has not been divided by a library size correction. One concern with this approach, though: if you use software which distributes fragments that could be assigned to many homologous genes, then DE in one of those genes could be attributed to all of them.
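To illustrate the point about precision, here is a toy Python sketch (not DESeq2 code). It assumes pure Poisson sampling noise, which is a simplification I am making for illustration; the sample names, counts and library sizes are made up. Two samples give the same library-size-normalized value, but the deeper one measures it much more precisely, and that information is lost once you only keep the normalized number.

```python
import math

def cpm(count, library_size):
    """Counts per million: a simple library-size normalization."""
    return count / library_size * 1e6

def poisson_cv(count):
    """Relative sampling noise (sd / mean) of a Poisson count with this mean.
    Assumption: pure Poisson noise, ignoring biological variability."""
    return 1.0 / math.sqrt(count)

# Hypothetical gene observed in two samples: (raw count, library size).
samples = {
    "shallow": (10, 1_000_000),
    "deep": (100, 10_000_000),
}

for name, (count, lib_size) in samples.items():
    print(f"{name}: raw={count}  cpm={cpm(count, lib_size):.1f}  "
          f"relative noise={poisson_cv(count):.2f}")
```

Both samples come out at the same CPM, but the raw count of 10 carries roughly three times the relative noise of the raw count of 100. A count-based model can only see that difference if it is given the un-normalized counts.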
I think rounded estimated *gene* counts are fine for DESeq2, but estimated
transcript counts are negatively correlated within a gene -- there is a lot
of additional variance from estimation uncertainty. DESeq2 is not built for
transcript level analysis.
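A toy simulation of the negative correlation, as a minimal sketch under assumptions of my own: one gene with two isoforms, every fragment ambiguous between them, and an estimated isoform proportion that is noisy around 0.5. The numbers (fragment range, noise level, seed) are arbitrary and this is not how any particular quantification tool works.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n_reps=200, seed=0):
    """Simulate estimated counts for two isoforms of one gene."""
    rng = random.Random(seed)
    est_a, est_b, totals = [], [], []
    for _ in range(n_reps):
        n_frag = rng.randint(900, 1100)  # fragments from the gene
        # Noisy estimate of isoform A's share (assumed model), kept in [0, 1].
        p_hat = min(1.0, max(0.0, rng.gauss(0.5, 0.1)))
        est_a.append(n_frag * p_hat)          # estimated count, isoform A
        est_b.append(n_frag * (1.0 - p_hat))  # estimated count, isoform B
        totals.append(n_frag)
    return est_a, est_b, totals

est_a, est_b, totals = simulate()
r = pearson(est_a, est_b)
gene_est = [a + b for a, b in zip(est_a, est_b)]

print(f"correlation between isoform estimates: {r:.2f}")
print("gene-level sums match the fragment totals:",
      all(abs(g - t) < 1e-6 for g, t in zip(gene_est, totals)))
```

In this toy model, whatever gets over-assigned to one isoform is taken away from the other, so the two estimates move in opposite directions, while their sum (the gene-level count) is unaffected by the assignment noise.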
Hello Michael:
Why is it not ok to round the counts assigned to transcripts?
Thanks,
Nik
What methods take that into account? Cuffdiff? ALDEx2?