2 questions: why there is nothing for file_ids? How could I generate a manifest file with filtering of Race and Ethnicity?
1
1
Entering edit mode
@xiaofeiwang18266-13498
Last seen 8 months ago
Singapore

Dear there,

I am learning GenomicDataCommons to generate a manifest file. I got 2 questions and would really appreciate for any help.

First, here are the code as below I am trying to be familiar with GenomicDataCommons following the manual. Why there is nothing for my "file_ids"? Is it right for the output of manifest(q), only 10 rows?

Second, if I'd like to create a manifest file with filtering of race and ethnicity, how could I reflect that in coding of filter?

I tried use the codes as below, but got error.


q = files() %>%
+      filter(~ cases.project.project_id == 'TCGA-COAD' &
+                 data_type == 'Aligned Reads' &
+                 experimental_strategy == 'RNA-Seq' &
+                 data_format == 'BAM' &
+                 demographic.race == 'white')
manifest(q)
Error: No method asJSON S3 class: formula

Thanks a lot!

q = files() %>%
+     GenomicDataCommons::select(available_fields('files')) %>%
+     filter(~ cases.project.project_id == 'TCGA-GBM' &
+                data_type == 'Aligned Reads' &
+                experimental_strategy == 'RNA-Seq' &
+                data_format == 'BAM')
manifest(q)
Rows: 10 Columns: 5                                                                                                                                                          
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): id, filename, md5, state
dbl (1): size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 5
   id                                   filename                                                  md5                          size state 
 * <chr>                                <chr>                                                     <chr>                       <dbl> <chr> 
 1 dc5dd9bb-8571-48b7-8535-26eeec64faa9 fddd420e-1b94-4b88-993e-355291d8d74a_gdc_realn_rehead.bam 3218d630f398aa953e086…    1.20e10 relea…
 2 7475c0d7-c776-48ae-aba6-b4c38692bf75 fe622e5b-335d-4775-801f-45fd21a18574_gdc_realn_rehead.bam 2d11cfd5873ba9c84de86…    9.82e 9 relea…
 3 31c1ec3f-8fb6-41b8-9c89-59ca3ab2d36c 454c9f5b-ab50-452b-931c-0779c1051ea6_gdc_realn_rehead.bam 2773a569071c454a7a9d4…    1.04e10 relea…
 4 23768895-9695-4f95-a075-d7a52ac3545f 1116895b-c2c7-4e8f-b648-4a86e984696c_gdc_realn_rehead.bam 601d084b69f034dbcfb4e…    1.37e10 relea…
 5 bb39b827-c4f4-4d13-88cb-01f258d06843 cf83d3ed-12b5-4ff0-9f18-af066c54b6f4_gdc_realn_rehead.bam d6be345dd8ea7156f9ba6…    8.95e 9 relea…
 6 7e26b92e-629a-45c1-9c00-f8f0da99c317 959f6dd3-f4c8-4134-8a27-f535cfae096f_gdc_realn_rehead.bam 995c1e0b98a89c34dd197…    1.11e10 relea…
 7 c7c72460-42c0-45dc-900f-7c7e36a5344c 2d43a193-e00f-4502-88ee-1d13f9acb04b_gdc_realn_rehead.bam 5d47613879ece69138e28…    7.84e 9 relea…
 8 a50ad948-b240-491e-b5c1-bfe82cbb881a d9218a68-50c9-4557-a53e-ff1835f1e68f_gdc_realn_rehead.bam 20e2f573a9f25ff00d64c…    8.34e 9 relea…
 9 eb9a1990-3d15-4d78-a118-2ba30df48ac2 67488c82-715d-4f6d-bbe8-43169ba1f4e6_gdc_realn_rehead.bam 056ae2109ad28067d710e…    9.62e 9 relea…
10 aab06154-3863-4249-8904-a332f5306893 8538437d-ae45-4c6d-a861-9f899448ca0f_gdc_realn_rehead.bam 13a398431bdd4a28f6bd7…    1.17e10 relea…
file_ids = q %>% response_all() %>% ids()
file_ids
character(0)

# include your problematic code here with any corresponding output 
# please also include the results of running the following in an R session 

sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicDataCommons_1.14.0 magrittr_2.0.1           

loaded via a namespace (and not attached):
 [1] compiler_4.0.3              pillar_1.6.2                GenomeInfoDb_1.26.7         XVector_0.30.0             
 [5] MatrixGenerics_1.2.1        bitops_1.0-7                tools_4.0.3                 zlibbioc_1.36.0            
 [9] bit_4.0.4                   jsonlite_1.7.2              lifecycle_1.0.0             tibble_3.1.3               
[13] lattice_0.20-44             pkgconfig_2.0.3             rlang_0.4.11                Matrix_1.3-4               
[17] cli_3.0.1                   DelayedArray_0.16.3         DBI_1.1.1                   curl_4.3.2                 
[21] parallel_4.0.3              GenomeInfoDbData_1.2.4      withr_2.4.2                 xml2_1.3.2                 
[25] dplyr_1.0.7                 httr_1.4.2                  hms_1.1.0                   generics_0.1.0             
[29] S4Vectors_0.28.1            vctrs_0.3.8                 IRanges_2.24.1              rappdirs_0.3.3             
[33] bit64_4.0.5                 stats4_4.0.3                grid_4.0.3                  tidyselect_1.1.1           
[37] glue_1.4.2                  Biobase_2.50.0              R6_2.5.0                    fansi_0.5.0                
[41] vroom_1.5.4                 tzdb_0.1.2                  readr_2.0.0                 purrr_0.3.4                
[45] matrixStats_0.60.0          ellipsis_0.3.2              BiocGenerics_0.36.1         GenomicRanges_1.42.0       
[49] assertthat_0.2.1            SummarizedExperiment_1.20.0 utf8_1.2.2                  RCurl_1.98-1.3             
[53] crayon_1.4.1
GenomicDataCommons • 1.2k views
ADD COMMENT
1
Entering edit mode
@sean-davis-490
Last seen 3 months ago
United States
library(GenomicDataCommons,quietly = TRUE)
## 
## Attaching package: 'GenomicDataCommons'
## The following object is masked from 'package:stats':
## 
##     filter

I made a change to the filtering expression approach based on changes to lazy evaluation best practices. There is now no need to include the ~ in the filter expression. So for recent versions of GenomicDataCommons, proceed like this:

q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

And get a count of the results:

count(q)
## [1] 521

And the manifest.

manifest(q)
## # A tibble: 521 x 5
##    id                filename                    md5                 size state 
##  * <chr>             <chr>                       <chr>              <dbl> <chr> 
##  1 58bc1f92-80bf-4b… f846cd96-7f2b-475d-9303-9b… 74cb17bc01108c…  5.03e 9 relea…
##  2 3f26333a-9c85-4f… d2759841-bf5f-4fc8-a714-c2… 6a4c257f66cf57…  4.17e 9 relea…
##  3 4de88051-0d80-41… 678b82f0-66bc-44b0-8cec-de… 789f1499f61a2e…  7.25e 9 relea…
##  4 d2a3396b-b2f3-47… 1119cfc0-3447-4d66-abce-91… 2cf397d48e2f01…  4.93e 9 relea…
##  5 2341bc87-50c7-4f… cccefb79-0e9e-43be-a760-7e… 38e155a1badd28…  9.01e 8 relea…
##  6 488b0103-138b-40… 8d225c49-bc80-4e37-8388-5a… 069aaf00339735…  9.91e 9 relea…
##  7 f61b9cd4-288e-4a… 8127208d-6aac-46de-a96a-ad… 4bd8c02061f1a8…  1.01e10 relea…
##  8 a36dda91-00aa-47… 79da1cd2-e9c9-49f0-9d8a-1b… 08fed660c806d0…  1.95e 9 relea…
##  9 ad1cc824-7bbc-48… 453479a2-8b52-4ac7-bc91-44… c50a823483dde9…  2.38e 9 relea…
## 10 739f4d3b-d73e-4c… c4ab7acf-ad5c-4071-a0bc-e6… 2ea057dbf48617…  1.78e 9 relea…
## # … with 511 more rows

Your question about race and ethnicity is a good one.

all_fields = available_fields(files())

And we can grep for race or ethnic to get potential matching fields to look at.

grep('race|ethnic',all_fields,value=TRUE)
## [1] "cases.demographic.ethnicity"                 
## [2] "cases.demographic.race"                      
## [3] "cases.follow_ups.hormonal_contraceptive_type"
## [4] "cases.follow_ups.hormonal_contraceptive_use" 
## [5] "cases.follow_ups.scan_tracer_used"

Now, we can check available values for each field to determine how to complete our filter expressions.

available_values('files',"cases.demographic.ethnicity")
## [1] "not hispanic or latino" "not reported"           "hispanic or latino"    
## [4] "unknown"                "_missing"
available_values('files',"cases.demographic.race")
##  [1] "white"                                    
##  [2] "not reported"                             
##  [3] "black or african american"                
##  [4] "asian"                                    
##  [5] "unknown"                                  
##  [6] "other"                                    
##  [7] "american indian or alaska native"         
##  [8] "native hawaiian or other pacific islander"
##  [9] "not allowed to collect"                   
## [10] "_missing"

We can complete our filter expression now to limit to white race only.

q_white_only = q %>%
  GenomicDataCommons::filter(cases.demographic.race=='white')
count(q_white_only)
## [1] 249
manifest(q_white_only)
## # A tibble: 249 x 5
##    id                filename                    md5                 size state 
##  * <chr>             <chr>                       <chr>              <dbl> <chr> 
##  1 58bc1f92-80bf-4b… f846cd96-7f2b-475d-9303-9b… 74cb17bc01108c…  5.03e 9 relea…
##  2 3f26333a-9c85-4f… d2759841-bf5f-4fc8-a714-c2… 6a4c257f66cf57…  4.17e 9 relea…
##  3 4de88051-0d80-41… 678b82f0-66bc-44b0-8cec-de… 789f1499f61a2e…  7.25e 9 relea…
##  4 d2a3396b-b2f3-47… 1119cfc0-3447-4d66-abce-91… 2cf397d48e2f01…  4.93e 9 relea…
##  5 488b0103-138b-40… 8d225c49-bc80-4e37-8388-5a… 069aaf00339735…  9.91e 9 relea…
##  6 f61b9cd4-288e-4a… 8127208d-6aac-46de-a96a-ad… 4bd8c02061f1a8…  1.01e10 relea…
##  7 ad1cc824-7bbc-48… 453479a2-8b52-4ac7-bc91-44… c50a823483dde9…  2.38e 9 relea…
##  8 69ea4b44-0ae7-42… d68c3434-a04c-4096-9d2c-9b… e5a9c176c10e8c…  6.25e 9 relea…
##  9 b12289f8-8259-42… 81023237-0f6d-43e1-9868-72… a389bb4770a7c0…  8.92e 9 relea…
## 10 fda49a14-df20-4f… 2fa46723-da52-4bab-bd0a-ad… ae0ffffcd54434…  7.19e 9 relea…
## # … with 239 more rows
ADD COMMENT
1
Entering edit mode

Hi Sean Davis ,

Thank you so much, it is very clear to understand for each step. But, I still have a question about the numbers of rows after filtering. Why it is different with the webpage?

  1. There are 461 cases for COAD project, why there are 521 rows for the manifest file?

  2. By the code below, the count is 244 for white but not hispanic or latino. But, from the website, why it is 209 as the screenshot

below

or the link here https://portal.gdc.cancer.gov/exploration?facetTab=clinical&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.demographic.ethnicity%22%2C%22value%22%3A%5B%22not%20hispanic%20or%20latino%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.demographic.race%22%2C%22value%22%3A%5B%22white%22%5D%7D%7D%2C%7B%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-COAD%22%5D%7D%2C%22op%22%3A%22in%22%7D%5D%7D .

q_white_notHis_notLa = q %>%
  GenomicDataCommons::filter(cases.demographic.race=='white' &
  cases.demographic.ethnicity == 'not hispanic or latino'
  )
count(q_white_notHis_notLa)
# [1] 244

For the african american, it is 66 but 59 from https://portal.gdc.cancer.gov/exploration?facetTab=clinical&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.demographic.race%22%2C%22value%22%3A%5B%22black%20or%20african%20american%22%5D%7D%7D%2C%7B%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-COAD%22%5D%7D%2C%22op%22%3A%22in%22%7D%5D%7D .

```q_afr = files() %>% GenomicDataCommons::filter( cases.project.project_id == 'TCGA-COAD' & data_type == 'Aligned Reads' & experimental_strategy == 'RNA-Seq' & data_format == 'BAM' & cases.demographic.race=='black or african american' )

count(q_afr)

1 66

```

enter image description here

ADD REPLY
1
Entering edit mode

If you want a count of unique cases, you can either post-process in R or start with the cases() endpoint. Note that the reason for the difference is that cases can have 0 or more files, so if a case has more than one file associated, the number of files (after filtering by case) will be greater than the number of cases.

cases() %>% 
  GenomicDataCommons::filter(
    project.project_id == 'TCGA-COAD' & 
    demographic.race=='white' & 
    demographic.ethnicity=='not hispanic or latino') %>% 
  count()
## 209
ADD REPLY
0
Entering edit mode

Thanks a lot!

ADD REPLY

Login before adding your answer.

Traffic: 863 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6