Hi
I have a normalisation question. I am seeing inconsistencies between mRNA abundance estimates from DESeq2 and StringTie-Ballgown. I get that there are many differences between the two, but if you consider one gene that has one transcript, and use the same BAM input file, the main difference between the two algorithms is the normalisation, correct?
Attached is the bamCoverage track of such a gene. Below are the FPKM values estimated by DESeq2 (gene level) and StringTie-Ballgown (transcript level); the commands used are at the end of this post:
Source             AMP (blue track)   DLM (green track)
FPKM by Ballgown   40.6               5.1
FPKM by DESeq2     21.3               13.1
The fold change between the two conditions according to StringTie is much closer to what you see on the pileup. Is that because StringTie and bamCoverage use the same kind of normalisation algorithm? And if so, which is closer to the "biological truth": DESeq2 or StringTie/read coverage?
Thanks!
Commands used:

StringTie:
    stringtie -e -B -G ${GTF} -o transcripts.gtf -A gene_abundances.tsv input.rmdup.bam

DESeq2 (using featureCounts counts):
    featureCounts -T $threads -p -F GTF -t exon -g gene_id -s 2 -a ${GTF} -o out.featurecount input.rmdup.bam

FPKM values calculated in DESeq2 with:
    fpkmNormalisedCounts <- as.data.frame(fpkm(analysisObject, robust = TRUE))

Bigwig:
    bamCoverage -b input.rmdup.bam --ignoreDuplicates --effectiveGenomeSize 142573017 --normalizeUsing RPKM --filterRNAstrand forward -of bigwig -o output.bw
Thanks for the fast reply, Michael. I understand that, but I'm not concerned about the fact that I'm getting different values. I'm concerned that I'm getting different fold changes (the gene/transcript length would be the same for both conditions, AMP vs DLM).
I'm still confused as to why the DESeq2 FPKM values don't match the read coverage. I guess I'm going back to: which is closer to the biological truth?
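To make the concern concrete, the fold changes implied by the reported values can be computed directly. Because FPKM is count / (length in kb × library size in millions), a length shared by both conditions cancels in the ratio, so only the counts and the library-size estimates can move the fold change. A minimal Python sketch; the counts, lengths, and library sizes in the second half are made up for illustration and are not from this dataset:

```python
# FPKM values as reported in the post: (AMP, DLM).
reported = {
    "Ballgown": (40.6, 5.1),
    "DESeq2": (21.3, 13.1),
}

# Fold change AMP/DLM for each tool.
fold_changes = {tool: amp / dlm for tool, (amp, dlm) in reported.items()}
for tool, fc in fold_changes.items():
    print(f"{tool}: AMP/DLM = {fc:.2f}")


def fpkm(count, length_bp, lib_size):
    """Fragments per kilobase of feature per million fragments."""
    return count / (length_bp / 1e3) / (lib_size / 1e6)


# Hypothetical counts and library sizes (assumptions, not real data).
# The same gene is scored with two different annotated lengths.
fc_len_1kb = fpkm(400, 1000, 20e6) / fpkm(100, 1000, 25e6)
fc_len_3kb = fpkm(400, 3000, 20e6) / fpkm(100, 3000, 25e6)

# Both ratios are 5: the length changes each FPKM value but not the fold change.
print(fc_len_1kb, fc_len_3kb)
```

So a disagreement about gene length between two tools would explain different FPKM values, but not by itself a different fold change between AMP and DLM.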
FPKM is counts of reads scaled by gene length and library size. StringTie and featureCounts don't agree on gene length. Then the DESeq2 part is just library size. If you use robust=FALSE you will get classic division by the total sum of counts. If you use robust=TRUE we adapt to provide a better estimate of library size than the total sum. So you have a variety of different components and ways to compute FPKM here.
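The effect of the library-size choice on a fold change can be illustrated with a small sketch. This is not the DESeq2 code; it is a plain-Python approximation that assumes the robust library size is a median-of-ratios size factor multiplied by the geometric mean of the column sums, which is how the DESeq2 documentation describes robust=TRUE. The count matrix and gene length are invented for illustration:

```python
import math
from statistics import median

# Made-up counts: rows are genes, columns are two samples (A, B).
# The last gene is very highly expressed in sample B only, so it inflates
# B's total count without reflecting a deeper sequencing run.
counts = [
    [100, 100],
    [200, 200],
    [300, 300],
    [400, 400],
    [1000, 9000],
]
n_samples = len(counts[0])


def size_factors(counts):
    """Median-of-ratios size factors; genes with any zero count are skipped."""
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(len(counts[0])):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def fpkm(count, length_bp, lib_size):
    return count / (length_bp / 1e3) / (lib_size / 1e6)


# "Classic" library size: total counts per sample (the robust=FALSE idea).
totals = [sum(row[j] for row in counts) for j in range(n_samples)]

# "Robust" library size (assumed form): size factor x geometric mean of totals.
sf = size_factors(counts)
geo_mean_total = math.exp(sum(math.log(t) for t in totals) / n_samples)
robust_lib = [s * geo_mean_total for s in sf]

length_bp = 2000  # hypothetical gene length
gene = counts[0]  # identical raw count in both samples

fc_total = fpkm(gene[0], length_bp, totals[0]) / fpkm(gene[1], length_bp, totals[1])
fc_robust = fpkm(gene[0], length_bp, robust_lib[0]) / fpkm(gene[1], length_bp, robust_lib[1])

print("fold change A/B, total-sum library size:", round(fc_total, 3))
print("fold change A/B, robust library size:  ", round(fc_robust, 3))
```

In this toy example a gene with the same raw count in both samples shows a 5-fold difference under total-sum scaling and no difference under the robust scaling, because one dominant gene drives the totals apart. Real data will be less extreme, but it shows that two FPKM pipelines can report different fold changes from the same counts.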