Obtaining recount project data in TPM?
1
0
Entering edit mode
codden.c • 0
@coddenc-21751
Last seen 5.3 years ago

1) Using the Bioconductor recount package, is there a way to get recount project data, say for a hundred SRAs, in the form of transcripts per million (TPM)?

I haven’t seen other examples of this online, but it seems possible to get TPM formatting within recount, possibly from rse_tx objects which I’ve been able to load.

2) Also within recount, is it recommend to scale rsetx objects (e.g., scalecounts(rse_tx)) as is done with gene, exon, and junction rse objects prior to further analysis?

recount • 2.0k views
ADD COMMENT
0
Entering edit mode
@lcolladotor
Last seen 6 weeks ago
United States

Hi @codden.c,

Question 1

Regarding the first question, I know that there's this https://github.com/sonali-bioc/UncertaintyRNA/blob/master/03CreatingTPMTCGAData_objects.Rmd#L206 which was used in https://www.biorxiv.org/content/10.1101/445601v2 by Sonali Arora et al. That is, they used the formula

TPM = FPKM / (sum of FPKM over all genes/transcripts) * 10^6

Arora et al mention in their code that the formula comes from https://doi.org/10.1093/bioinformatics/btp692; specifically 1.1.1 Comparison to RPKM estimation where they mention an important assumption: Under the assumption of uniformly distributed reads, we note that RPKM measures are estimates of 109 × νi/ℓi, which is an unnormalized value of τi. There's also a blog post by Harold Pimentel explaining the relationship between FPKM and TPM https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/.

To compute the TPMs from the gene-level RPKMs, Arora et al used:

So you can estimate TPMs from gene-level RPKMs as they did by using:

## Now part of recount
getTPM <- function(rse, length_var = 'bp_length', mapped_var = NULL) {
    # From https://support.bioconductor.org/p/124265/
    rpkm <- getRPKM(rse, length_var = length_var, mapped_var = mapped_var)
    apply(rpkm, 2, function(x) { (x / sum(x)) * 10^6 })
}

Note that using mapped_var = "mapped_read_count" like Arora et al did, is equivalent to using the default mapped_var = NULL from getRPKM() when you haven't subsetted the rse_gene object.

library('testthat')
## What Arora et al did:
tpm1 <- getTPM(rse_gene_SRP009615, mapped_var = "mapped_read_count")
## The default in getTPM(), to keep the same defaults as getRPKM()
tpm2 <- getTPM(rse_gene_SRP009615, mapped_var = NULL)
expect_equivalent(tpm1, tpm2)

The function getTPM() is now 1.11.13 in BioC 3.10 (current devel) and 1.10.13 in BioC 3.9 (current release).

Question 2

Jack Fu @jmfu, the lead author of https://doi.org/10.1101/247346 might have an answer for you. In any case, https://support.bioconductor.org/p/114956/ is related I believe.

For Jack Fu et al:

Jack, I guess that we can go from transcript counts to TPMs using things like:

sum(width(rowRanges(rse_tx)))
colSums(assays(rse_tx)$fragments, na.rm = TRUE)

Actually, based on Pimentel's blog post https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ I think that this would work:

getTPM_tx <- function(rse_tx) {
    count_div_len <- sweep(assays(rse_tx)$fragments, 1, sum(width(rowRanges(rse_tx))), '/')
    count_div_len_sums <- colSums(count_div_len, na.rm = TRUE)

    tpm <- sweep(count_div_len, 2, 1e6 / count_div_len_sums, '*')

    ## TPM can have NAs
    ## sums add up to 1 million
    # colSums(tpm, na.rm = TRUE)

    return(tpm)
}

Does that seem right to you?

Best, Leonardo

ADD COMMENT
0
Entering edit mode

Very helpful and just what I was looking for, thanks!

ADD REPLY

Login before adding your answer.

Traffic: 846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6