How to check the read length for all the TCGA data samples in GDC?
1
7
Entering edit mode
Biologist ▴ 120
@biologist-9801
Last seen 4.8 years ago

Hi,

I would like to check the read length, Is Paired end information for all the samples for TCGA Lung data. For that I can go to the data portal and should check each sample. Instead of that is there way to get all the samples with read length and other information at a time?

Is there any package for that to get the information?

Thanks

gdc tcga wes gdc data • 3.8k views
ADD COMMENT
0
Entering edit mode

@Sean Davis Is there a way to get the information using "Genomic Data Commons" package? I looked into the tutorial but couldn't find anything about this. Could you please help me in this?

ADD REPLY
5
Entering edit mode
@sean-davis-490
Last seen 4 months ago
United States

Try:

library(GenomicDataCommons)
# You may want to change the filter criteria, but you 
# get the idea....
z = files() %>% 
  filter(~ data_format=='BAM') %>% 
  select('file_id') %>% 
  expand('analysis.metadata.read_groups') %>%
  results()
# You can view the data interactively
library(listviewer)
jsonedit(z)

The read length is burried at analysis.metadata.read_groups. To get at it, try this:

sapply(z$analysis$metadata$read_groups,'[[','read_length')
ADD COMMENT
0
Entering edit mode

Thanks a lot for the reply. I tried in the following way and I have all the file_id's but dont have read length. Little confused in getting that information. Need your help.

q = files() %>%
  GenomicDataCommons::select(available_fields('files')) %>%
  filter(~ cases.project.project_id == 'TCGA-COAD' &
           data_type == 'Aligned Reads' &
           experimental_strategy == 'WGS' &
           data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()

In this way I have all the 2175 file_ids but don't have read_length. Ofcourse I know that I have to use analysis.metadata.read_groups to get read_length, but not sure how to. Could you please help me in this?

Thank you

ADD REPLY
2
Entering edit mode
q = files() %>% 
  filter(~ cases.project.project_id == 'TCGA-BRCA' 
           & data_type == 'Aligned Reads' 
           & experimental_strategy == 'WXS' 
           & data_format == 'BAM') %>% select('file_id') %>% 
  expand('analysis.metadata.read_groups') 
file_ids = ids(q) 
z = results_all(q) 
read_length_list = sapply(z$analysis$metadata$read_groups,'[[','read_length')
ADD REPLY
1
Entering edit mode

Thank you again. This is fine But after the last step, I see a "list of 2175" with only read_length. Is there a way to make a dataframe for "z$analysis$metadata$read_groups" where I can see the sample name, Is paired end, and read_length information also.

ADD REPLY
2
Entering edit mode

Sure.

library(dplyr) 
rg_info = bind_rows(z$analysis$metadata$read_groups)

You may need to play with things a bit to get exactly what you want. The str() function can be quite useful, as are things like class() to see how things are stored in these complex structures.

ADD REPLY
2
Entering edit mode

Wow! That's amazing! I ended with

> z$analysis$metadata$read_groups %>% bind_rows() %>% as_tibble()
# A tibble: 4,408 x 18
   library_name           is_paired_end library_strategy updated_datetime      
   <chr>                  <lgl>         <chr>            <chr>                 
 1 H_LS-AO-A1KR-10A-01D-… TRUE          WXS              2017-03-05T20:49:00.8…
 2 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 3 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 4 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 5 H_LS-BH-A0BF-11A-31D-… TRUE          WXS              2017-03-05T21:50:13.6…
 6 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-05T19:45:05.4…
 7 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-05T19:45:05.4…
 8 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-04T16:41:36.4…
 9 H_LS-D8-A1JL-01A-11D-… TRUE          WXS              2017-03-05T19:45:05.4…
10 H_LS-D8-A1JL-01A-11D-… TRUE          WXS              2017-03-05T19:45:05.4…
# ... with 4,398 more rows, and 14 more variables: created_datetime <chr>,
#   read_length <int>, target_capture_kit_target_region <chr>,
#   read_group_id <chr>, state <chr>, submitter_id <chr>, platform <chr>,
#   experiment_name <chr>, target_capture_kit_name <chr>,
#   target_capture_kit_catalog_number <chr>, read_group_name <chr>,
#   target_capture_kit_vendor <chr>, sequencing_date <chr>,
#   sequencing_center <chr>
ADD REPLY
0
Entering edit mode

Thanks, Martin, for completing the thought!

ADD REPLY

Login before adding your answer.

Traffic: 428 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6