I have been using the R package recount3 for some time now. For some calculations I need the read length of each sample in the GTEx and TCGA projects. I previously used the column 'gtex.smrdlgth', assuming it holds the read length used for that specific sample. However, for TCGA projects the equivalent column 'tcga.smrdlgth' does not exist. So my question is: where in the metadata can I find the read length used in the GTEx and TCGA projects?
Thanks in advance
library('recount3')
library('dplyr')
selected_project <- 'BRCA'
human_recount3_projects <- available_projects()
project_rse <- create_rse(human_recount3_projects %>% filter(project == selected_project))
colData(project_rse)
I have a bit more time right now. The issue here is that each data source (GTEx, SRA, TCGA) can have metadata variables unique to that data source, and the one you were using is only present in GTEx datasets. However, Monorail also creates common metadata variables for every sample, and some of them include information about read length (for example 'avg_len'), so you might just want to use those instead. They are documented at http://rna.recount.bio/docs/quality-check-fields.html.
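For anyone landing here later, this is a minimal sketch of how the common read-length variables could be pulled out of the sample metadata. The two column names are my assumption of how the Monorail quality-check fields appear in colData(), and get_read_length() is a hypothetical helper, not part of recount3, so check the names against colnames(colData(project_rse)) first. The toy table only stands in for real metadata and its values are made up.

```r
# Candidate read-length columns. These names are assumptions based on the
# Monorail quality-check fields; verify them with colnames(colData(rse)).
read_length_cols <- c(
    "recount_seq_qc.avg_len",
    "recount_qc.star.average_input_read_length"
)

# Hypothetical helper: return the candidate columns that are actually
# present in a metadata table (a data.frame or a DataFrame such as colData).
get_read_length <- function(metadata, candidates = read_length_cols) {
    present <- intersect(candidates, colnames(metadata))
    if (length(present) == 0) {
        stop("None of the candidate read-length columns were found")
    }
    as.data.frame(metadata[, present, drop = FALSE])
}

# Toy metadata standing in for colData(project_rse); values are made up.
toy <- data.frame(
    external_id = c("s1", "s2"),
    recount_seq_qc.avg_len = c(48, 76)
)
get_read_length(toy)
```

With a real object the call would be get_read_length(colData(project_rse)), which should behave the same for GTEx, TCGA and SRA projects because these variables are not specific to one data source. As far as I know, the STAR-based value counts both mates together for paired-end samples, so it may not be directly comparable to 'avg_len'.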
Thank you for the detailed explanation; for now I am using 'avg_len'. The identified junctions only seem to include intrachromosomal junctions, since they are written as contig:start-end, which leaves room for only one contig. Is there also an option to look at junctions that span different chromosomes, i.e. interchromosomal junctions? Especially in the TCGA data source these might be interesting.
Leo forwarded me your question as I'm the one who generated the junctions for recount3/snaptron.
I agree that having interchromosomal (chimeric) junctions would be useful, especially for cancer (as you point out).
That said, I chose not to include them in the official set of recount3/snaptron junctions to streamline processing/indexing. Related to this, STAR 2.7.3a reports interchromosomal junctions separately from intrachromosomal junctions, and the intrachromosomal ones were our focus.
I recognize this is potentially suboptimal for your use case (and others). In the design and running of Monorail to produce recount3/snaptron we made a number of tradeoffs, some of which were influenced by the size of the datasets involved and the general nature of our target audience.
I can't promise anything for the future, but if we were to do it again we'd consider making the interchromosomal junctions available as well.
Thanks Chris for the detailed reply!
These past months have been challenging for me with my mom's cancer diagnosis. She passed away on Thursday last week.
I've forwarded your message to others who might know the answer to your last question, though you might be more on your own for this one.