I work with Leo on recount-related projects.
He told me about your question because it relates to Snaptron, a project I primarily work on.
Leo's answer is a reasonable solution, but he thought I might be interested in offering an alternative approach using Snaptron.
Snaptron complements recount2 in that it applies gene, exon, splice junction, and base coverage indexing to the same data that recount2 provides.
A current beta feature of Snaptron is the ability to query the raw coverage (derived from the recount2 BigWigs) of any region of the genome using chromosome coordinates (e.g. chr2:29446395-29446500).
The result is a tab-delimited file where the rows are contiguous bases (one row per base) and the columns are raw read counts per sample in the original study (e.g. TCGA’s ~11K samples).
I think this might be applicable to your case, since you’re interested in a single gene that’s not currently annotated.
You could query Snaptron’s base coverage using the chromosome:start-end coordinates of the disjoint exons from your gene of interest, sum the counts over all bases of all exons for each sample to get a gene-level sum, and then load that into a RangedSummarizedExperiment in R.
It’s far from a perfect solution, and lacks the R support that Leo’s approach has,
but it’s quite possible that it’ll be more efficient than accessing every BigWig from TCGA.
Snaptron can be accessed directly via the Web (using REST web services).
An example query on the command line accessing the base coverage of TCGA would be:
curl "http://snaptron.cs.jhu.edu/tcga/bases?regions=chr2:29446395-29446500" | gzip > coverage.tsv.gz
Where you’d substitute a single exon’s coordinates for
And then do that for all disjoint exons in your gene.
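The per-exon queries and the per-sample summing described above could be scripted roughly as follows. This is only a minimal sketch in Python, not an official Snaptron client: the function names are made up for illustration, and the number of leading non-sample columns (N_META_COLS) and the presence of a single header row are assumptions you should check against the actual output before trusting the sums.

```python
import csv
import io
from collections import defaultdict
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASES_URL = "http://snaptron.cs.jhu.edu/tcga/bases?regions={region}"

# ASSUMPTION: how many leading columns describe the base (e.g. chromosome,
# start, end) before the per-sample count columns begin. Inspect a real
# response and adjust this value.
N_META_COLS = 3


def fetch_bases(region):
    """Download the base-coverage TSV for one region and return it as text."""
    with urlopen(BASES_URL.format(region=region)) as response:
        return response.read().decode("utf-8")


def add_region_sums(tsv_text, totals, n_meta_cols=N_META_COLS):
    """Add each sample's coverage, summed over all bases in one region, to totals.

    ASSUMPTION: the first line is a header whose sample columns are labeled
    by rail_id, and every following line is one base.
    """
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(rows, None)
    if header is None:  # empty response, nothing to add
        return totals
    sample_ids = header[n_meta_cols:]
    for row in rows:
        if not row:  # skip blank lines
            continue
        for sample_id, value in zip(sample_ids, row[n_meta_cols:]):
            totals[sample_id] += float(value)
    return totals


def gene_sums(exon_regions, fetch=fetch_bases):
    """Return {rail_id: summed coverage} over all disjoint exons of a gene."""
    totals = defaultdict(float)
    for region in exon_regions:
        add_region_sums(fetch(region), totals)
    return dict(totals)
```

You would call gene_sums with the list of your gene's disjoint exon coordinates, e.g. gene_sums(["chr2:29446395-29446500"]) with one entry per exon. The fetch argument is there so the summing logic can be checked on a small local file before querying the server.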
The sample columns in the output are labeled by rail_id, which is mapped to other IDs/accessions/metadata in this file:
In any case, feel free to follow up, either here or on the Snaptron Gitter channel: