Question: tximport discrepancy between TPMs and lengthScaledTPMs
0
gravatar for igor
4 months ago by
igor30
United States
igor30 wrote:

I am importing Salmon output with tximport to get gene-level expression levels using the suggested approach:

txi = tximport(quant_files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")

Generally, this works as expected. However, sometimes I notice a large discrepancy between TPMs (txi$abundance) and counts/lengthScaledTPMs (txi$counts).

For example, I am looking at one sample where the TPMs for the top two genes are 676,935 and 54,165 (this is clearly a problematic library), but the top two counts/lengthScaledTPMs are 661,979 and 3,917. Top 10 genes are 83% of total for TPMs, but 99.7% for counts. 97% of the genes end up with counts between 0 and 0.01. For comparison, in the original Salmon estimates, top 10 are 33% of the total. I am confident in the tximport results, but what would cause such behavior? Can I trust any of the values?

tximport • 158 views
ADD COMMENTlink modified 4 months ago by Michael Love25k • written 4 months ago by igor30
Answer: tximport discrepancy between TPMs and lengthScaledTPMs
0
gravatar for Michael Love
4 months ago by
Michael Love25k
United States
Michael Love25k wrote:

I don’t necessarily see any problem with what you’ve presented above. Counts are not proportional to expression across genes, and they shouldn’t be expected to be.

There may be some issues with this library though. With this library where 67% of the total gene expression is from a single gene, how many genes have original estimated counts of 10 or more?

ADD COMMENTlink written 4 months ago by Michael Love25k

My concern was that the top gene is 67% of total by TPMs, but 98% by counts. I understand the relative increase or decrease. The part that worries me is that a single gene is soaking up nearly all of the reads. Regardless of any scaling or normalization, it's hard to reconcile all other genes essentially disappearing.

By original estimates, I assume you mean the ones direct from Salmon (without tximport). There are 993 genes with over 10 counts, so the library is substantially more balanced. The top gene is only 42k (less than 10% of total).

ADD REPLYlink written 4 months ago by igor30

Long genes do take all the counts, this is a property of RNA-seq data.

Ok so 993 genes with original counts greater than 10, how many for lengthScaledTPM?

ADD REPLYlink written 4 months ago by Michael Love25k

For lengthScaledTPM, only 33 with counts greater than 10.

ADD REPLYlink written 4 months ago by igor30

I think if this sample is peculiar among your other samples as having 67% of the TPM from one gene, then the counts from abundance approach may not work. But you may also need to discard that samples for QC reasons?

ADD REPLYlink written 4 months ago by Michael Love25k

The problem is that I have a lot of samples that are not much better, so it's hard to define a cutoff.

I also tried the more traditional STAR/featureCounts approach and had only 64k total counts. The top gene from Salmon only had 20. Clearly, there is something very odd about this sample.

I would think I am doing something wrong, but I don't see such huge discrepancies when dealing with good quality samples.

There is probably not a clear answer, but I was hoping to figure out what might be the cause.

ADD REPLYlink written 4 months ago by igor30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 344 users visited in the last hour