Question

How to check the read length for all the TCGA data samples in GDC?

7

Entering edit mode

Biologist ▴ 120

@biologist-9801

Last seen 5.8 years ago

Hi,

I would like to check the read length, Is Paired end information for all the samples for TCGA Lung data. For that I can go to the data portal and should check each sample. Instead of that is there way to get all the samples with read length and other information at a time?

Is there any package for that to get the information?

Thanks

gdc tcga wes gdc data • 4.6k views

ADD COMMENT • link updated 7.7 years ago by Sean Davis 21k • written 7.7 years ago by Biologist ▴ 120

0

Entering edit mode

@Sean Davis Is there a way to get the information using "Genomic Data Commons" package? I looked into the tutorial but couldn't find anything about this. Could you please help me in this?

ADD REPLY • link 7.7 years ago Biologist ▴ 120

Martin Morgan · Answer 1 · 2018-04-17

5

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 10 months ago

United States

Try:

library(GenomicDataCommons)
# You may want to change the filter criteria, but you 
# get the idea....
z = files() %>% 
  filter(~ data_format=='BAM') %>% 
  select('file_id') %>% 
  expand('analysis.metadata.read_groups') %>%
  results()
# You can view the data interactively
library(listviewer)
jsonedit(z)

The read length is burried at analysis.metadata.read_groups. To get at it, try this:

sapply(z$analysis$metadata$read_groups,'[[','read_length')

ADD COMMENT • link updated 7.7 years ago by Martin Morgan 25k • written 7.7 years ago by Sean Davis 21k

0

Entering edit mode

Thanks a lot for the reply. I tried in the following way and I have all the file_id's but dont have read length. Little confused in getting that information. Need your help.

q = files() %>%
  GenomicDataCommons::select(available_fields('files')) %>%
  filter(~ cases.project.project_id == 'TCGA-COAD' &
           data_type == 'Aligned Reads' &
           experimental_strategy == 'WGS' &
           data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()

In this way I have all the 2175 file_ids but don't have read_length. Ofcourse I know that I have to use analysis.metadata.read_groups to get read_length, but not sure how to. Could you please help me in this?

Thank you

ADD REPLY • link 7.7 years ago Biologist ▴ 120

2

Entering edit mode

q = files() %>% 
  filter(~ cases.project.project_id == 'TCGA-BRCA' 
           & data_type == 'Aligned Reads' 
           & experimental_strategy == 'WXS' 
           & data_format == 'BAM') %>% select('file_id') %>% 
  expand('analysis.metadata.read_groups') 
file_ids = ids(q) 
z = results_all(q) 
read_length_list = sapply(z$analysis$metadata$read_groups,'[[','read_length')

ADD REPLY • link 7.7 years ago Sean Davis 21k

1

Entering edit mode

Thank you again. This is fine But after the last step, I see a "list of 2175" with only read_length. Is there a way to make a dataframe for "z$analysis$metadata$read_groups" where I can see the sample name, Is paired end, and read_length information also.

ADD REPLY • link 7.7 years ago Biologist ▴ 120

2

Entering edit mode

Sure.

library(dplyr) 
rg_info = bind_rows(z$analysis$metadata$read_groups)

You may need to play with things a bit to get exactly what you want. The str() function can be quite useful, as are things like class() to see how things are stored in these complex structures.

ADD REPLY • link 7.7 years ago Sean Davis 21k

2

Entering edit mode

Wow! That's amazing! I ended with

> z$analysis$metadata$read_groups %>% bind_rows() %>% as_tibble()
# A tibble: 4,408 x 18
   library_name           is_paired_end library_strategy updated_datetime      
   <chr>                  <lgl>         <chr>            <chr>                 
 1 H_LS-AO-A1KR-10A-01D-… TRUE          WXS              2017-03-05T20:49:00.8…
 2 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 3 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 4 H_LS-AN-A0FV-10A-01W-… TRUE          WXS              2017-03-05T20:49:00.8…
 5 H_LS-BH-A0BF-11A-31D-… TRUE          WXS              2017-03-05T21:50:13.6…
 6 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-05T19:45:05.4…
 7 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-05T19:45:05.4…
 8 H_LS-BH-A0GY-01A-11W-… TRUE          WXS              2017-03-04T16:41:36.4…
 9 H_LS-D8-A1JL-01A-11D-… TRUE          WXS              2017-03-05T19:45:05.4…
10 H_LS-D8-A1JL-01A-11D-… TRUE          WXS              2017-03-05T19:45:05.4…
# ... with 4,398 more rows, and 14 more variables: created_datetime <chr>,
#   read_length <int>, target_capture_kit_target_region <chr>,
#   read_group_id <chr>, state <chr>, submitter_id <chr>, platform <chr>,
#   experiment_name <chr>, target_capture_kit_name <chr>,
#   target_capture_kit_catalog_number <chr>, read_group_name <chr>,
#   target_capture_kit_vendor <chr>, sequencing_date <chr>,
#   sequencing_center <chr>