Question: Number of reads from recount are greater than target library size (40 million) for TCGA and GTEx projects
1
gravatar for Sheldon Pang
26 days ago by
Sheldon Pang10
Sheldon Pang10 wrote:

I just quickly check the data from TCGA project and found that some number of reads are greater than the target library size 40 million. Any ideas? Similarly for GTEx too.

The code is as follows:

library("recount")
accession = "TCGA"
downloadstudy(accession, type = "rse-gene")
load(file.path(accession, "rse
gene.Rdata"))
rsegenescaledTCGA <- scalecounts(rsegene, round = FALSE)
summary(colSums(assays(rse
genescaledTCGA)$counts))/1e6
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.809 38.455 39.433 38.582 40.092 42.734

GTEx Min. 1st Qu. Median Mean 3rd Qu. Max.
10.60 35.96 37.40 37.01 38.40 41.60

recount • 55 views
ADD COMMENTlink modified 11 days ago by Leonardo Collado Torres610 • written 26 days ago by Sheldon Pang10
Answer: Number of reads from recount are greater than target library size (40 million)
0
gravatar for Leonardo Collado Torres
11 days ago by
United States
Leonardo Collado Torres610 wrote:

Hi Sheldon,

Apologies for the delay in answering. I was busy trying to finish a paper myself.

The answer is a bit complicated. The short version is that some base pairs of the genome are counted more than once in recount2's gene level files. Within a given gene, a base pair is counted only once as illustrated in Figures 5 , 6, 7 and 8 of https://f1000research.com/articles/6-1558/v1. However, exonic base pairs from two or more genes can be overlapping and thus counted twice (or more). That's because GenomicRanges::disjoin() works that way on a GRangesList object. For ongoing work, we plan to move to a more fine-grained representation using GenomicFeatures::exonicParts() from which you can re-construct the disjoint exons.

Princy Parsana, a JHU CS PhD student in Alexis Battle's lab, asked me the same question offline a few months ago. I explored this in more detail back then leading to the following gist https://gist.github.com/lcolladotor/55d3d52de3505b0bfdb5c42e0bb88b2a. There you can see that if you use the exon annotation information, you can find which genes do not overlap at all. If you exclude those genes, then you will get not get values about 40 million after scaling.

library('recount')
gene_info <- reproduce_ranges('both')
ov_gene_by_exon <- countOverlaps(gene_info$exon)

nread_tcga_noov2 <- colSums(assays(rse_scaled_tcga)$counts[ov_gene_by_exon == 1, ]) / 1e6
table(nread_tcga_noov2 > 40)
summary(nread_tcga_noov2)

# > table(nread_tcga_noov2 > 40)
#
# FALSE
# 11284
# > summary(nread_tcga_noov2)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   6.284  31.240  32.374  31.668  33.185  36.570

So currently, if you use all the genes in the recount2 RSE gene objects and scale the counts, if you get more than 40 million you definitely have some double counting (which you can be ok with and still use edgeR or some other package to take this into account for your DE analysis). If you want to avoid this, you can exclude the genes that overlap each other as shown above. If you want to re-quantify everything using exonicParts() yourself, that's doable following the options I described in this other response https://support.bioconductor.org/p/119384/#120010.

Princy Parsana might have other suggestions.

Best, Leo

ADD COMMENTlink modified 11 days ago • written 11 days ago by Leonardo Collado Torres610
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 252 users visited in the last hour