Why counts to tpm gives different values compared to tpm from kallisto?
0
0
Entering edit mode
m93 • 0
@m93-20797
Last seen 20 months ago

I have RNA-Seq data with raw counts extracted from bam files using featureCounts. I wanted to convert counts to tpm, because I need TPM for some analysis.

Initially, I got the coding length for each Gene like below:

library(GenomicFeatures)
gtf_txdb <- makeTxDbFromGFF("gencode.v27.annotation.gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(gtf_txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))


Now, I have two data frames like data is with raw counts and other data2 with gene name and Coding_Length

tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate, na.rm = TRUE) * 1e6
}

final_tpm <- apply(data, 2, function(x) tpm(x, data2$Coding_Length))  final_tpm using the function for some of the genes: gene tpm RPSAP8 3.79 AL645608.8 4.14 RNF223 1.38  And I also used other function like below on dataframe df which have two columns raw counts and effLength calculated from Coding_Length df$effLength <- df$Coding_Length - 203.7 + 1 countToTpm <- function(counts, effLen) { rate <- log(counts) - log(effLen) denom <- log(sum(exp(rate))) exp(rate - denom + log(1e6)) } df$tpm <- with(df, countToTpm(counts, effLength))


With the above way the output of few genes is like below:

gene_name       tpm
RPSAP8          1.57
AL645608.8      1.46
RNF223          0.496


Along with the above way, I also used kallisto to get tpm. And when I compared both results, they doesn't match and not even close.

target_id          length      eff_length        est_counts     tpm
RPSAP8             889          747.358           4.10538       0.0635304
AL645608.8        2086          1944.36              116         0.689984
RNF223            1902          1760.36           50.0024        0.328508


I did the sanity check, the results from both functions give sum to one million.

I understand that kallisto tpm is better, But I have the raw counts and instead of running kallisto again on all 300 samples, I wanted to convert raw counts to tpm.

Which of these functions is to be used for conversion of counts to tpm? Why they don't match to kallisto tpm?

rnaseq counts tpm GenomicFeatures kallisto • 1.3k views
0
Entering edit mode

Michael Love I saw your comment DESeq2: Is it possible to convert read counts to expression values via TPM and return these values? May I know your thoughts about my doubts. thank you.

0
Entering edit mode

This doesn’t involve any software I develop...

In tximport the counts, effective length and TPM are related via multiplication/division of terms.