Question

Recount data for unannotated genes

2

Entering edit mode

francesco.gandolfi ▴ 20

@francescogandolfi-20307

Last seen 6.9 years ago

Hi everybody, I'm writing here for a specific issue concerning the TCGA RNAseq data provided on the Recount2 website. Briefly, we would need of gene counts of a particular gene for the same set of TCGA individuals provided as RangedSummarizedExperiment objects. The issue concerning this is that the gene is not annotated yet in the Ensembl/Gencode database (it's the gene of a novel lncRNA) , thus it's not possibile to find it across Recount data. For the moment, the only solution I have in my mind is to work directly with the raw fastq files and map them following the Recount pipeline, but including annotation for the novel gene. I understand that this can be very difficult (especially for getting access to TCGA raw data). I was asking if maybe there are some other strategies avoding fastq mapping but , yes, unfortunately I realize our goal is not so easy. Thank you so much in advance if someone can help me in this. Best regards Francesco

recount RNAseq gencode lncRNA gene annotation • 2.1k views

ADD COMMENT • link updated 6.8 years ago by chris.wilks ▴ 70 • written 6.9 years ago by francesco.gandolfi ▴ 20

score 1 · Answer 1 · 2019-04-16

Hi Francesco,

I work with Leo on recount-related projects.

He informed me of your question as it's connected to a project I primarily work on called Snaptron.
Leo's answer is a reasonable solution, but he thought I might be interested in offering an alternative approach using Snaptron.

Snaptron complements recount2 in that it applies gene, exon, splice junction, and base coverage indexing to the same data that recount2 provides.

A beta-feature of Snaptron currently is the ability to query the raw coverage (derived from the recount2 BigWigs) of any region of the genome using chromosome coordinates (e.g. chr2:29446395-29446500). The result is a tab delimited file where the rows are contiguous bases (1 row per base) and the columns are raw read counts per sample in the original study (e.g. TCGA’s ~11K samples).

I think this might be applicable to your case, since you’re interested in a single gene that’s not currently annotated. You could query Snaptron’s base coverage using the chromosome:start-end coordinates of the disjoint exons from your gene of interest, sum across all the bases for all exons for each sample to get the gene sum, and then load that into a RangedSummarizedExperiment in R.

It’s far from a perfect solution, and lacks the R support that Leo’s approach has, but it’s quite possible that it’ll be more efficient than accessing every BigWig from TCGA.

Snaptron can be accessed directly via the Web (using REST web services).

An example query on the command line accessing the base coverage of TCGA would be:

curl "http://snaptron.cs.jhu.edu/tcga/bases?regions=chr2:29446395-29446500" | gzip > coverage.tsv.gz

Where you’d substitute a single exon’s coordinates for chr2:29446395-29446500.
And then do that for all disjoint exons in your gene.

The sample columns are labeled by the rail_id in the output, which are mapped to other IDs/accessions/metadata in this file:

http://snaptron.cs.jhu.edu/data/tcga/samples.tsv

In any case feel free to follow up if you want, either here or on the Snaptron gitter channel:

https://gitter.im/snaptron/Lobby

Chris

score 0 · Answer 2 · 2019-04-10

Hi Francesco,

Apologies for the delay in answering. I was busy trying to finish a paper myself.

The recount2 project includes base pair coverage BigWig files (see https://f1000research.com/articles/6-1558/v1 for more details on this) that you can leverage to explore any new annotation or un-annotated features with hg38 coordinates. We currently provide two options for doing so. One of them is the recount::coverage_matrix() function that requires a GRanges object with the coordinates of interest that you want to quantify. That function uses derfinder behind the scenes to quantify the information and ultimately relies on rtracklayer::import.bw() for querying the BigWig files from the web. Because of the way import.bw() is written, it cannot run on Windows. Now, since this involves data transfers from the web, you might want to download the BigWig files before hand. The recount package vignette talks about coverage_matrix() under the "using another/newer annotation" section at http://bioconductor.org/packages/release/bioc/vignettes/recount/inst/doc/recount-quickstart.html#61_using_anothernewer_annotation.

If you do download the BigWig files, you can also use https://github.com/LieberInstitute/recount.bwtool which runs bwtool locally (or you can write your own scripts to do so). This is much faster and I've occasionally run recount.bwtool for collaborators, for example https://doi.org/10.3389/fimmu.2018.02679.

You can alternatively use SciServer where you can access the BigWig files "locally" and run recount::coverage_matrix() there or install bwtool there as well. See the recount vignette for more information about using SciServer http://bioconductor.org/packages/release/bioc/vignettes/recount/inst/doc/recount-quickstart.html#10_accessing_recount_via_sciserver. The recount.bwtool includes some code Ben Langmead used to install bwtool on SciServer https://github.com/LieberInstitute/recount.bwtool#install-bwtool-on-sciserver a while back.

Best, Leo