soevertje ▴ 20

Hi all,

I have been using the R package recount3 for some time now. For some calculations I need the read length of each sample in the GTEx and TCGA projects. So far I have used the column 'gtex.smrdlgth', assuming it holds the read length used for that specific sample. However, TCGA projects have no equivalent column 'tcga.smrdlgth'. So my question is: where in the metadata can I find the read length for the GTEx and TCGA projects?

Seline

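library('recount3')
library('dplyr')

# TCGA project of interest
selected_project <- 'BRCA'

# List all human projects available in recount3
human_recount3_projects <- available_projects()

# Build the RangedSummarizedExperiment for the selected project
project_rse <- create_rse(
    human_recount3_projects %>%
        filter(project == selected_project)
)

# Inspect the sample metadata
colData(project_rse)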

recountWorkflow recount3 • 284 views

Hi soevertje,

Thank you for your interest in recount3. You might find the documentation at http://rna.recount.bio/docs/quality-check-fields.html useful.

Best, Leo


Hi again,

I have a bit more time now. The issue is that each data source (GTEx, SRA, TCGA) can have metadata variables unique to that data source, so the one you were using is only present in GTEx datasets. However, Monorail also creates common metadata variables that include information about read length, so you might want to use those instead, for example:

min_len: minimum read length


or

average_input_read_length: Average length of a read
average_mapped_length: Average length of an alignment
deletion_average_length: Average length of a genomic deletion, i.e. genomic gaps
deletion_rate_per_base: Genomic deletions per mapped base
insertion_average_length: Average length of a genomic insertion, i.e. read gaps
insertion_rate_per_base: Genomic insertions per mapped base


as documented at http://rna.recount.bio/docs/quality-check-fields.html.
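
To see which of these variables a given project actually carries, one option is to search the metadata column names rather than hard-code them. The following is a minimal sketch, not part of recount3: `find_read_length_columns()` and `read_length_pattern` are hypothetical names, the suffixes are simply the ones mentioned in this thread, and the example column names are made up for illustration. It assumes that any data-source or tool prefix sits in front of the variable name, so it matches on the end of each column name only.

```r
# Assumption: read-length columns end in one of the names mentioned in this
# thread. Prefixes are deliberately not hard-coded because they may differ
# between data sources.
read_length_pattern <- paste0(
    "(min_len|avg_len|average_input_read_length|",
    "average_mapped_length|smrdlgth)$"
)

# Hypothetical helper: return the column names that look like read-length
# fields, given a character vector of metadata column names.
find_read_length_columns <- function(column_names) {
    grep(read_length_pattern, column_names, value = TRUE)
}

# Toy column names for illustration only (not real recount3 output).
example_columns <- c(
    "external_id",
    "qc.avg_len",
    "star.average_input_read_length",
    "star.deletion_average_length"
)
find_read_length_columns(example_columns)
```

With an object built by `create_rse()`, the same helper could be called as `find_read_length_columns(colnames(colData(project_rse)))`. The matched columns can then be compared across GTEx and TCGA projects.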

Best, Leo


Hi Leo,

Thank you for the detailed explanation. For now I am using 'avg_len'. The identified junctions seem to include only intrachromosomal junctions, since they are written as contig:start-end, a notation that allows only one contig. Is there also an option to look at junctions that span different chromosomes, i.e. interchromosomal junctions? Especially in the TCGA data source these might be interesting.

Best, Seline


Hi Seline,

Leo forwarded me your question as I'm the one who generated the junctions for recount3/snaptron.

I agree that having interchromosomal (chimeric) junctions would be useful, especially for cancer (as you point out).

That said, I chose not to include them in the official set of recount3/snaptron junctions to streamline processing/indexing. Related to this, STAR 2.7.3a reports interchromosomal junctions separately from intrachromosomal junctions, which we were focused on.

I recognize this is potentially suboptimal for your use case (and others). In the design and running of Monorail to produce recount3/snaptron we made a number of tradeoffs, some of which were influenced by the size of the datasets involved and the general nature of our target audience.

I can't promise anything for the future, but if we were to do it again we'd consider making the interchromosomal junctions available as well.

Chris


Thanks Chris for the detailed reply!


Hi Seline,

These past months have been challenging for me with my mom's cancer diagnosis. She passed away on Thursday last week.

I've forwarded your message to others who might know the answer to your last question, though you might be more on your own for this one.

Sorry, Leo