Hi @codden.c,
Question 1
Regarding the first question, I know that there's this https://github.com/sonali-bioc/UncertaintyRNA/blob/master/03CreatingTPMTCGAData_objects.Rmd#L206 which was used in https://www.biorxiv.org/content/10.1101/445601v2 by Sonali Arora et al. That is, they used the formula
TPM = FPKM / (sum of FPKM over all genes/transcripts) * 10^6
Arora et al mention in their code that the formula comes from https://doi.org/10.1093/bioinformatics/btp692; specifically 1.1.1 Comparison to RPKM estimation
where they mention an important assumption: Under the assumption of uniformly distributed reads, we note that RPKM measures are estimates of 109 × νi/ℓi, which is an unnormalized value of τi.
There's also a blog post by Harold Pimentel explaining the relationship between FPKM and TPM https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/.
To compute the TPMs from the gene-level RPKMs, Arora et al used:
So you can estimate TPMs from gene-level RPKMs as they did by using:
## Now part of recount
getTPM <- function(rse, length_var = 'bp_length', mapped_var = NULL) {
# From https://support.bioconductor.org/p/124265/
rpkm <- getRPKM(rse, length_var = length_var, mapped_var = mapped_var)
apply(rpkm, 2, function(x) { (x / sum(x)) * 10^6 })
}
Note that using mapped_var = "mapped_read_count"
like Arora et al did, is equivalent to using the default mapped_var = NULL
from getRPKM()
when you haven't subsetted the rse_gene
object.
library('testthat')
## What Arora et al did:
tpm1 <- getTPM(rse_gene_SRP009615, mapped_var = "mapped_read_count")
## The default in getTPM(), to keep the same defaults as getRPKM()
tpm2 <- getTPM(rse_gene_SRP009615, mapped_var = NULL)
expect_equivalent(tpm1, tpm2)
The function getTPM()
is now 1.11.13 in BioC 3.10 (current devel) and 1.10.13 in BioC 3.9 (current release).
Question 2
Jack Fu @jmfu, the lead author of https://doi.org/10.1101/247346 might have an answer for you. In any case, https://support.bioconductor.org/p/114956/ is related I believe.
For Jack Fu et al:
Jack, I guess that we can go from transcript counts to TPMs using things like:
sum(width(rowRanges(rse_tx)))
colSums(assays(rse_tx)$fragments, na.rm = TRUE)
Actually, based on Pimentel's blog post https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ I think that this would work:
getTPM_tx <- function(rse_tx) {
count_div_len <- sweep(assays(rse_tx)$fragments, 1, sum(width(rowRanges(rse_tx))), '/')
count_div_len_sums <- colSums(count_div_len, na.rm = TRUE)
tpm <- sweep(count_div_len, 2, 1e6 / count_div_len_sums, '*')
## TPM can have NAs
## sums add up to 1 million
# colSums(tpm, na.rm = TRUE)
return(tpm)
}
Does that seem right to you?
Best,
Leonardo
Very helpful and just what I was looking for, thanks!