9 weeks ago by
Apologies for the delay in answering. I was busy trying to finish a paper myself.
The answer is a bit complicated. The short version is that some base pairs of the genome are counted more than once in recount2's gene level files. Within a given gene, a base pair is counted only once as illustrated in Figures 5 , 6, 7 and 8 of https://f1000research.com/articles/6-1558/v1. However, exonic base pairs from two or more genes can be overlapping and thus counted twice (or more). That's because
GenomicRanges::disjoin() works that way on a
GRangesList object. For ongoing work, we plan to move to a more fine-grained representation using
GenomicFeatures::exonicParts() from which you can re-construct the disjoint exons.
Princy Parsana, a JHU CS PhD student in Alexis Battle's lab, asked me the same question offline a few months ago. I explored this in more detail back then leading to the following gist https://gist.github.com/lcolladotor/55d3d52de3505b0bfdb5c42e0bb88b2a. There you can see that if you use the exon annotation information, you can find which genes do not overlap at all. If you exclude those genes, then you will get not get values about 40 million after scaling.
gene_info <- reproduce_ranges('both')
ov_gene_by_exon <- countOverlaps(gene_info$exon)
nread_tcga_noov2 <- colSums(assays(rse_scaled_tcga)$counts[ov_gene_by_exon == 1, ]) / 1e6
table(nread_tcga_noov2 > 40)
# > table(nread_tcga_noov2 > 40)
# > summary(nread_tcga_noov2)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 6.284 31.240 32.374 31.668 33.185 36.570
So currently, if you use all the genes in the recount2 RSE gene objects and scale the counts, if you get more than 40 million you definitely have some double counting (which you can be ok with and still use edgeR or some other package to take this into account for your DE analysis). If you want to avoid this, you can exclude the genes that overlap each other as shown above. If you want to re-quantify everything using
exonicParts() yourself, that's doable following the options I described in this other response https://support.bioconductor.org/p/119384/#120010.
Princy Parsana might have other suggestions.