Question

Obtaining recount project data in TPM?

0

Entering edit mode

codden.c • 0

@coddenc-21751

Last seen 4.6 years ago

1) Using the Bioconductor recount package, is there a way to get recount project data, say for a hundred SRAs, in the form of transcripts per million (TPM)?

I haven’t seen other examples of this online, but it seems possible to get TPM formatting within recount, possibly from rse_tx objects which I’ve been able to load.

2) Also within recount, is it recommend to scale rsetx objects (e.g., scalecounts(rse_tx)) as is done with gene, exon, and junction rse objects prior to further analysis?

recount • 1.6k views

ADD COMMENT • link updated 4.6 years ago by Leonardo Collado Torres ★ 1.0k • written 4.6 years ago by codden.c • 0

score 0 · Answer 1 · 2019-09-03

Hi @codden.c,

Question 1

Regarding the first question, I know that there's this https://github.com/sonali-bioc/UncertaintyRNA/blob/master/03CreatingTPMTCGAData_objects.Rmd#L206 which was used in https://www.biorxiv.org/content/10.1101/445601v2 by Sonali Arora et al. That is, they used the formula

TPM = FPKM / (sum of FPKM over all genes/transcripts) * 10^6

Arora et al mention in their code that the formula comes from https://doi.org/10.1093/bioinformatics/btp692; specifically 1.1.1 Comparison to RPKM estimation where they mention an important assumption: Under the assumption of uniformly distributed reads, we note that RPKM measures are estimates of 109 × νi/ℓi, which is an unnormalized value of τi. There's also a blog post by Harold Pimentel explaining the relationship between FPKM and TPM https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/.

To compute the TPMs from the gene-level RPKMs, Arora et al used:

recount::getRPKM(rse_gene, mapped_var = "mapped_read_count") as in https://github.com/sonali-bioc/UncertaintyRNA/blob/master/01AcquiringTCGADatavarious_sources.Rmd#L501-L510
apply(rpkm_mat, 2, function(x){ (x/sum(x))*10^6 }) as in https://github.com/sonali-bioc/UncertaintyRNA/blob/master/03CreatingTPMTCGAData_objects.Rmd#L236-L238

So you can estimate TPMs from gene-level RPKMs as they did by using:

## Now part of recount
getTPM <- function(rse, length_var = 'bp_length', mapped_var = NULL) {
    # From https://support.bioconductor.org/p/124265/
    rpkm <- getRPKM(rse, length_var = length_var, mapped_var = mapped_var)
    apply(rpkm, 2, function(x) { (x / sum(x)) * 10^6 })
}

Note that using mapped_var = "mapped_read_count" like Arora et al did, is equivalent to using the default mapped_var = NULL from getRPKM() when you haven't subsetted the rse_gene object.

library('testthat')
## What Arora et al did:
tpm1 <- getTPM(rse_gene_SRP009615, mapped_var = "mapped_read_count")
## The default in getTPM(), to keep the same defaults as getRPKM()
tpm2 <- getTPM(rse_gene_SRP009615, mapped_var = NULL)
expect_equivalent(tpm1, tpm2)

The function getTPM() is now 1.11.13 in BioC 3.10 (current devel) and 1.10.13 in BioC 3.9 (current release).

Question 2

Jack Fu @jmfu, the lead author of https://doi.org/10.1101/247346 might have an answer for you. In any case, https://support.bioconductor.org/p/114956/ is related I believe.

For Jack Fu et al:

Jack, I guess that we can go from transcript counts to TPMs using things like:

sum(width(rowRanges(rse_tx)))
colSums(assays(rse_tx)$fragments, na.rm = TRUE)

Actually, based on Pimentel's blog post https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ I think that this would work:

getTPM_tx <- function(rse_tx) {
    count_div_len <- sweep(assays(rse_tx)$fragments, 1, sum(width(rowRanges(rse_tx))), '/')
    count_div_len_sums <- colSums(count_div_len, na.rm = TRUE)

    tpm <- sweep(count_div_len, 2, 1e6 / count_div_len_sums, '*')

    ## TPM can have NAs
    ## sums add up to 1 million
    # colSums(tpm, na.rm = TRUE)

    return(tpm)
}

Does that seem right to you?

Best, Leonardo