Question: How to check the read length for all the TCGA data samples in GDC?
6
gravatar for Biologist
12 months ago by
Biologist70
Biologist70 wrote:

Hi,

I would like to check the read length, Is Paired end information for all the samples for TCGA Lung data. For that I can go to the data portal and should check each sample. Instead of that is there way to get all the samples with read length and other information at a time?

Is there any package for that to get the information?

Thanks

tcga wes gdc gdc data • 604 views
ADD COMMENTlink modified 12 months ago by Sean Davis21k • written 12 months ago by Biologist70

@Sean Davis Is there a way to get the information using "Genomic Data Commons" package? I looked into the tutorial but couldn't find anything about this. Could you please help me in this?

ADD REPLYlink modified 12 months ago • written 12 months ago by Biologist70
Answer: How to check the read length for all the TCGA data samples in GDC?
4
gravatar for Sean Davis
12 months ago by
Sean Davis21k
United States
Sean Davis21k wrote:

Try:

library(GenomicDataCommons)
# You may want to change the filter criteria, but you 
# get the idea....
z = files() %>% 
  filter(~ data_format=='BAM') %>% 
  select('file_id') %>% 
  expand('analysis.metadata.read_groups') %>%
  results()
# You can view the data interactively
library(listviewer)
jsonedit(z)

The read length is burried at analysis.metadata.read_groups. To get at it, try this:

sapply(z$analysis$metadata$read_groups,'[[','read_length')
ADD COMMENTlink modified 12 months ago by Martin Morgan ♦♦ 23k • written 12 months ago by Sean Davis21k

Thanks a lot for the reply. I tried in the following way and I have all the file_id's but dont have read length. Little confused in getting that information. Need your help.

q = files() %>%
  GenomicDataCommons::select(available_fields('files')) %>%
  filter(~ cases.project.project_id == 'TCGA-COAD' &
           data_type == 'Aligned Reads' &
           experimental_strategy == 'WGS' &
           data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()

In this way I have all the 2175 file_ids but don't have read_length. Ofcourse I know that I have to use analysis.metadata.read_groups to get read_length, but not sure how to. Could you please help me in this?

Thank you

ADD REPLYlink modified 12 months ago • written 12 months ago by Biologist70
2
q = files() %>% 
  filter(~ cases.project.project_id == 'TCGA-BRCA' 
           & data_type == 'Aligned Reads' 
           & experimental_strategy == 'WXS' 
           & data_format == 'BAM') %>% select('file_id') %>% 
  expand('analysis.metadata.read_groups') 
file_ids = ids(q) 
z = results_all(q) 
read_length_list = sapply(z$analysis$metadata$read_groups,'[[','read_length')
ADD REPLYlink modified 12 months ago • written 12 months ago by Sean Davis21k
1

Thank you again. This is fine But after the last step, I see a "list of 2175" with only read_length. Is there a way to make a dataframe for "z$analysis$metadata$read_groups" where I can see the sample name, Is paired end, and read_length information also.

ADD REPLYlink written 12 months ago by Biologist70
2

Sure.

library(dplyr) 
rg_info = bind_rows(z$analysis$metadata$read_groups)

You may need to play with things a bit to get exactly what you want. The str() function can be quite useful, as are things like class() to see how things are stored in these complex structures.

ADD REPLYlink modified 12 months ago • written 12 months ago by Sean Davis21k
2

Wow! That's amazing! I ended with

> z$analysis$metadata$read_groups %>% bind_rows() %>% as_tibble()
# A tibble: 4,408 x 18
   library_name           is_paired_end library_strategy updated_datetime      
   <chr>                  <lgl>         <chr>            <chr>                 
 1 H_LS-AO-A1KR-10A-01D-… TRUE          WXS              2017-03-05T20:49:00.8…
 2 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 3 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 4 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 5 H_LS-BH-A0BF-11A-31D-… TRUE          WXS              2017-03-05T21:50:13.6…
 6 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-05T19:45:05.4…
 7 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-05T19:45:05.4…
 8 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-04T16:41:36.4…
 9 H_LS-D8-A1JL-01A-11D-… TRUE          WXS              2017-03-05T19:45:05.4…
10 H_LS-D8-A1JL-01A-11D-… TRUE          WXS              2017-03-05T19:45:05.4…
# ... with 4,398 more rows, and 14 more variables: created_datetime <chr>,
#   read_length <int>, target_capture_kit_target_region <chr>,
#   read_group_id <chr>, state <chr>, submitter_id <chr>, platform <chr>,
#   experiment_name <chr>, target_capture_kit_name <chr>,
#   target_capture_kit_catalog_number <chr>, read_group_name <chr>,
#   target_capture_kit_vendor <chr>, sequencing_date <chr>,
#   sequencing_center <chr>
ADD REPLYlink written 12 months ago by Martin Morgan ♦♦ 23k

Thanks, Martin, for completing the thought!

ADD REPLYlink written 12 months ago by Sean Davis21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 158 users visited in the last hour