Why counts to tpm gives different values compared to tpm from kallisto?
Entering edit mode
m93 • 0
Last seen 20 months ago

I have RNA-Seq data with raw counts extracted from bam files using featureCounts. I wanted to convert counts to tpm, because I need TPM for some analysis.

Initially, I got the coding length for each Gene like below:

gtf_txdb <- makeTxDbFromGFF("gencode.v27.annotation.gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(gtf_txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))

Now, I have two data frames like data is with raw counts and other data2 with gene name and Coding_Length

tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate, na.rm = TRUE) * 1e6

final_tpm <- apply(data, 2, function(x) tpm(x, data2$Coding_Length))

final_tpm using the function for some of the genes:

gene            tpm
RPSAP8          3.79
AL645608.8      4.14
RNF223          1.38

And I also used other function like below on dataframe df which have two columns raw counts and effLength calculated from Coding_Length

df$effLength <- df$Coding_Length - 203.7 + 1

countToTpm <- function(counts, effLen)
  rate <- log(counts) - log(effLen)
  denom <- log(sum(exp(rate)))
  exp(rate - denom + log(1e6))

df$tpm <- with(df, countToTpm(counts, effLength))

With the above way the output of few genes is like below:

gene_name       tpm
RPSAP8          1.57
AL645608.8      1.46
RNF223          0.496

Along with the above way, I also used kallisto to get tpm. And when I compared both results, they doesn't match and not even close.

target_id          length      eff_length        est_counts     tpm
RPSAP8             889          747.358           4.10538       0.0635304
AL645608.8        2086          1944.36              116         0.689984
RNF223            1902          1760.36           50.0024        0.328508

I did the sanity check, the results from both functions give sum to one million.

I understand that kallisto tpm is better, But I have the raw counts and instead of running kallisto again on all 300 samples, I wanted to convert raw counts to tpm.

Which of these functions is to be used for conversion of counts to tpm? Why they don't match to kallisto tpm?

rnaseq counts tpm GenomicFeatures kallisto • 1.3k views
Entering edit mode

Michael Love I saw your comment DESeq2: Is it possible to convert read counts to expression values via TPM and return these values? May I know your thoughts about my doubts. thank you.

Entering edit mode

This doesn’t involve any software I develop...

In tximport the counts, effective length and TPM are related via multiplication/division of terms.


Login before adding your answer.

Traffic: 468 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6