Question

Number of reads from recount are greater than target library size (40 million) for TCGA and GTEx projects

1

Entering edit mode

Sheldon Pang ▴ 20

@sheldon-pang-20309

Last seen 4.5 years ago

I just quickly check the data from TCGA project and found that some number of reads are greater than the target library size 40 million. Any ideas? Similarly for GTEx too.

The code is as follows:

library("recount")
accession = "TCGA"
downloadstudy(accession, type = "rse-gene")
load(file.path(accession, "rsegene.Rdata"))
rsegenescaledTCGA <- scalecounts(rsegene, round = FALSE)
summary(colSums(assays(rsegenescaledTCGA)$counts))/1e6
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.809 38.455 39.433 38.582 40.092 42.734

GTEx Min. 1st Qu. Median Mean 3rd Qu. Max.
10.60 35.96 37.40 37.01 38.40 41.60

recount • 946 views

ADD COMMENT • link updated 5.0 years ago by Leonardo Collado Torres ★ 1.0k • written 5.1 years ago by Sheldon Pang ▴ 20

score 0 · Answer 1 · 2019-04-11

Hi Sheldon,

Apologies for the delay in answering. I was busy trying to finish a paper myself.

The answer is a bit complicated. The short version is that some base pairs of the genome are counted more than once in recount2's gene level files. Within a given gene, a base pair is counted only once as illustrated in Figures 5 , 6, 7 and 8 of https://f1000research.com/articles/6-1558/v1. However, exonic base pairs from two or more genes can be overlapping and thus counted twice (or more). That's because GenomicRanges::disjoin() works that way on a GRangesList object. For ongoing work, we plan to move to a more fine-grained representation using GenomicFeatures::exonicParts() from which you can re-construct the disjoint exons.

Princy Parsana, a JHU CS PhD student in Alexis Battle's lab, asked me the same question offline a few months ago. I explored this in more detail back then leading to the following gist https://gist.github.com/lcolladotor/55d3d52de3505b0bfdb5c42e0bb88b2a. There you can see that if you use the exon annotation information, you can find which genes do not overlap at all. If you exclude those genes, then you will get not get values about 40 million after scaling.

library('recount')
gene_info <- reproduce_ranges('both')
ov_gene_by_exon <- countOverlaps(gene_info$exon)

nread_tcga_noov2 <- colSums(assays(rse_scaled_tcga)$counts[ov_gene_by_exon == 1, ]) / 1e6
table(nread_tcga_noov2 > 40)
summary(nread_tcga_noov2)

# > table(nread_tcga_noov2 > 40)
#
# FALSE
# 11284
# > summary(nread_tcga_noov2)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   6.284  31.240  32.374  31.668  33.185  36.570

So currently, if you use all the genes in the recount2 RSE gene objects and scale the counts, if you get more than 40 million you definitely have some double counting (which you can be ok with and still use edgeR or some other package to take this into account for your DE analysis). If you want to avoid this, you can exclude the genes that overlap each other as shown above. If you want to re-quantify everything using exonicParts() yourself, that's doable following the options I described in this other response https://support.bioconductor.org/p/119384/#120010.

Princy Parsana might have other suggestions.

Best, Leo