Question: tximport discrepancy between TPMs and lengthScaledTPMs
0
4 weeks ago by
igor20
United States
igor20 wrote:

I am importing Salmon output with tximport to get gene-level expression levels using the suggested approach:

txi = tximport(quant_files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")


Generally, this works as expected. However, sometimes I notice a large discrepancy between TPMs (txi$abundance) and counts/lengthScaledTPMs (txi$counts).

For example, I am looking at one sample where the TPMs for the top two genes are 676,935 and 54,165 (this is clearly a problematic library), but the top two counts/lengthScaledTPMs are 661,979 and 3,917. Top 10 genes are 83% of total for TPMs, but 99.7% for counts. 97% of the genes end up with counts between 0 and 0.01. For comparison, in the original Salmon estimates, top 10 are 33% of the total. I am confident in the tximport results, but what would cause such behavior? Can I trust any of the values?

tximport • 78 views
modified 4 weeks ago by Michael Love24k • written 4 weeks ago by igor20
Answer: tximport discrepancy between TPMs and lengthScaledTPMs
0
4 weeks ago by
Michael Love24k
United States
Michael Love24k wrote:

I don’t necessarily see any problem with what you’ve presented above. Counts are not proportional to expression across genes, and they shouldn’t be expected to be.

There may be some issues with this library though. With this library where 67% of the total gene expression is from a single gene, how many genes have original estimated counts of 10 or more?

My concern was that the top gene is 67% of total by TPMs, but 98% by counts. I understand the relative increase or decrease. The part that worries me is that a single gene is soaking up nearly all of the reads. Regardless of any scaling or normalization, it's hard to reconcile all other genes essentially disappearing.

By original estimates, I assume you mean the ones direct from Salmon (without tximport). There are 993 genes with over 10 counts, so the library is substantially more balanced. The top gene is only 42k (less than 10% of total).

Long genes do take all the counts, this is a property of RNA-seq data.

Ok so 993 genes with original counts greater than 10, how many for lengthScaledTPM?

For lengthScaledTPM, only 33 with counts greater than 10.

I think if this sample is peculiar among your other samples as having 67% of the TPM from one gene, then the counts from abundance approach may not work. But you may also need to discard that samples for QC reasons?

The problem is that I have a lot of samples that are not much better, so it's hard to define a cutoff.

I would think I am doing something wrong, but I don't see such huge discrepancies when dealing with good quality samples.

There is probably not a clear answer, but I was hoping to figure out what might be the cause.