1 66

Question

2 questions: why there is nothing for file_ids? How could I generate a manifest file with filtering of Race and Ethnicity?

1

Entering edit mode

xiaofeiwang18266 ▴ 60

@xiaofeiwang18266-13498

Last seen 12 months ago

Singapore

Dear there,

I am learning GenomicDataCommons to generate a manifest file. I got 2 questions and would really appreciate for any help.

First, here are the code as below I am trying to be familiar with GenomicDataCommons following the manual. Why there is nothing for my "file_ids"? Is it right for the output of manifest(q), only 10 rows?

Second, if I'd like to create a manifest file with filtering of race and ethnicity, how could I reflect that in coding of filter?

I tried use the codes as below, but got error.


q = files() %>%
+      filter(~ cases.project.project_id == 'TCGA-COAD' &
+                 data_type == 'Aligned Reads' &
+                 experimental_strategy == 'RNA-Seq' &
+                 data_format == 'BAM' &
+                 demographic.race == 'white')
manifest(q)
Error: No method asJSON S3 class: formula

Thanks a lot!

q = files() %>%
+     GenomicDataCommons::select(available_fields('files')) %>%
+     filter(~ cases.project.project_id == 'TCGA-GBM' &
+                data_type == 'Aligned Reads' &
+                experimental_strategy == 'RNA-Seq' &
+                data_format == 'BAM')
manifest(q)
Rows: 10 Columns: 5                                                                                                                                                          
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): id, filename, md5, state
dbl (1): size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 5
   id                                   filename                                                  md5                          size state 
 * <chr>                                <chr>                                                     <chr>                       <dbl> <chr> 
 1 dc5dd9bb-8571-48b7-8535-26eeec64faa9 fddd420e-1b94-4b88-993e-355291d8d74a_gdc_realn_rehead.bam 3218d630f398aa953e086…    1.20e10 relea…
 2 7475c0d7-c776-48ae-aba6-b4c38692bf75 fe622e5b-335d-4775-801f-45fd21a18574_gdc_realn_rehead.bam 2d11cfd5873ba9c84de86…    9.82e 9 relea…
 3 31c1ec3f-8fb6-41b8-9c89-59ca3ab2d36c 454c9f5b-ab50-452b-931c-0779c1051ea6_gdc_realn_rehead.bam 2773a569071c454a7a9d4…    1.04e10 relea…
 4 23768895-9695-4f95-a075-d7a52ac3545f 1116895b-c2c7-4e8f-b648-4a86e984696c_gdc_realn_rehead.bam 601d084b69f034dbcfb4e…    1.37e10 relea…
 5 bb39b827-c4f4-4d13-88cb-01f258d06843 cf83d3ed-12b5-4ff0-9f18-af066c54b6f4_gdc_realn_rehead.bam d6be345dd8ea7156f9ba6…    8.95e 9 relea…
 6 7e26b92e-629a-45c1-9c00-f8f0da99c317 959f6dd3-f4c8-4134-8a27-f535cfae096f_gdc_realn_rehead.bam 995c1e0b98a89c34dd197…    1.11e10 relea…
 7 c7c72460-42c0-45dc-900f-7c7e36a5344c 2d43a193-e00f-4502-88ee-1d13f9acb04b_gdc_realn_rehead.bam 5d47613879ece69138e28…    7.84e 9 relea…
 8 a50ad948-b240-491e-b5c1-bfe82cbb881a d9218a68-50c9-4557-a53e-ff1835f1e68f_gdc_realn_rehead.bam 20e2f573a9f25ff00d64c…    8.34e 9 relea…
 9 eb9a1990-3d15-4d78-a118-2ba30df48ac2 67488c82-715d-4f6d-bbe8-43169ba1f4e6_gdc_realn_rehead.bam 056ae2109ad28067d710e…    9.62e 9 relea…
10 aab06154-3863-4249-8904-a332f5306893 8538437d-ae45-4c6d-a861-9f899448ca0f_gdc_realn_rehead.bam 13a398431bdd4a28f6bd7…    1.17e10 relea…
file_ids = q %>% response_all() %>% ids()
file_ids
character(0)

# include your problematic code here with any corresponding output 
# please also include the results of running the following in an R session 

sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicDataCommons_1.14.0 magrittr_2.0.1           

loaded via a namespace (and not attached):
 [1] compiler_4.0.3              pillar_1.6.2                GenomeInfoDb_1.26.7         XVector_0.30.0             
 [5] MatrixGenerics_1.2.1        bitops_1.0-7                tools_4.0.3                 zlibbioc_1.36.0            
 [9] bit_4.0.4                   jsonlite_1.7.2              lifecycle_1.0.0             tibble_3.1.3               
[13] lattice_0.20-44             pkgconfig_2.0.3             rlang_0.4.11                Matrix_1.3-4               
[17] cli_3.0.1                   DelayedArray_0.16.3         DBI_1.1.1                   curl_4.3.2                 
[21] parallel_4.0.3              GenomeInfoDbData_1.2.4      withr_2.4.2                 xml2_1.3.2                 
[25] dplyr_1.0.7                 httr_1.4.2                  hms_1.1.0                   generics_0.1.0             
[29] S4Vectors_0.28.1            vctrs_0.3.8                 IRanges_2.24.1              rappdirs_0.3.3             
[33] bit64_4.0.5                 stats4_4.0.3                grid_4.0.3                  tidyselect_1.1.1           
[37] glue_1.4.2                  Biobase_2.50.0              R6_2.5.0                    fansi_0.5.0                
[41] vroom_1.5.4                 tzdb_0.1.2                  readr_2.0.0                 purrr_0.3.4                
[45] matrixStats_0.60.0          ellipsis_0.3.2              BiocGenerics_0.36.1         GenomicRanges_1.42.0       
[49] assertthat_0.2.1            SummarizedExperiment_1.20.0 utf8_1.2.2                  RCurl_1.98-1.3             
[53] crayon_1.4.1

GenomicDataCommons • 2.3k views

ADD COMMENT • link 4.4 years ago xiaofeiwang18266 ▴ 60

score 1 · Answer 1 · 2021-08-09

library(GenomicDataCommons,quietly = TRUE)

## 
## Attaching package: 'GenomicDataCommons'

## The following object is masked from 'package:stats':
## 
##     filter

I made a change to the filtering expression approach based on changes to lazy evaluation best practices. There is now no need to include the ~ in the filter expression. So for recent versions of GenomicDataCommons, proceed like this:

q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

And get a count of the results:

count(q)

## [1] 521

And the manifest.

manifest(q)

## # A tibble: 521 x 5
##    id                filename                    md5                 size state 
##  * <chr>             <chr>                       <chr>              <dbl> <chr> 
##  1 58bc1f92-80bf-4b… f846cd96-7f2b-475d-9303-9b… 74cb17bc01108c…  5.03e 9 relea…
##  2 3f26333a-9c85-4f… d2759841-bf5f-4fc8-a714-c2… 6a4c257f66cf57…  4.17e 9 relea…
##  3 4de88051-0d80-41… 678b82f0-66bc-44b0-8cec-de… 789f1499f61a2e…  7.25e 9 relea…
##  4 d2a3396b-b2f3-47… 1119cfc0-3447-4d66-abce-91… 2cf397d48e2f01…  4.93e 9 relea…
##  5 2341bc87-50c7-4f… cccefb79-0e9e-43be-a760-7e… 38e155a1badd28…  9.01e 8 relea…
##  6 488b0103-138b-40… 8d225c49-bc80-4e37-8388-5a… 069aaf00339735…  9.91e 9 relea…
##  7 f61b9cd4-288e-4a… 8127208d-6aac-46de-a96a-ad… 4bd8c02061f1a8…  1.01e10 relea…
##  8 a36dda91-00aa-47… 79da1cd2-e9c9-49f0-9d8a-1b… 08fed660c806d0…  1.95e 9 relea…
##  9 ad1cc824-7bbc-48… 453479a2-8b52-4ac7-bc91-44… c50a823483dde9…  2.38e 9 relea…
## 10 739f4d3b-d73e-4c… c4ab7acf-ad5c-4071-a0bc-e6… 2ea057dbf48617…  1.78e 9 relea…
## # … with 511 more rows

Your question about race and ethnicity is a good one.

all_fields = available_fields(files())

And we can grep for race or ethnic to get potential matching fields to look at.

grep('race|ethnic',all_fields,value=TRUE)

## [1] "cases.demographic.ethnicity"                 
## [2] "cases.demographic.race"                      
## [3] "cases.follow_ups.hormonal_contraceptive_type"
## [4] "cases.follow_ups.hormonal_contraceptive_use" 
## [5] "cases.follow_ups.scan_tracer_used"

Now, we can check available values for each field to determine how to complete our filter expressions.

available_values('files',"cases.demographic.ethnicity")

## [1] "not hispanic or latino" "not reported"           "hispanic or latino"    
## [4] "unknown"                "_missing"

available_values('files',"cases.demographic.race")

##  [1] "white"                                    
##  [2] "not reported"                             
##  [3] "black or african american"                
##  [4] "asian"                                    
##  [5] "unknown"                                  
##  [6] "other"                                    
##  [7] "american indian or alaska native"         
##  [8] "native hawaiian or other pacific islander"
##  [9] "not allowed to collect"                   
## [10] "_missing"

We can complete our filter expression now to limit to white race only.

q_white_only = q %>%
  GenomicDataCommons::filter(cases.demographic.race=='white')
count(q_white_only)

## [1] 249

manifest(q_white_only)

## # A tibble: 249 x 5
##    id                filename                    md5                 size state 
##  * <chr>             <chr>                       <chr>              <dbl> <chr> 
##  1 58bc1f92-80bf-4b… f846cd96-7f2b-475d-9303-9b… 74cb17bc01108c…  5.03e 9 relea…
##  2 3f26333a-9c85-4f… d2759841-bf5f-4fc8-a714-c2… 6a4c257f66cf57…  4.17e 9 relea…
##  3 4de88051-0d80-41… 678b82f0-66bc-44b0-8cec-de… 789f1499f61a2e…  7.25e 9 relea…
##  4 d2a3396b-b2f3-47… 1119cfc0-3447-4d66-abce-91… 2cf397d48e2f01…  4.93e 9 relea…
##  5 488b0103-138b-40… 8d225c49-bc80-4e37-8388-5a… 069aaf00339735…  9.91e 9 relea…
##  6 f61b9cd4-288e-4a… 8127208d-6aac-46de-a96a-ad… 4bd8c02061f1a8…  1.01e10 relea…
##  7 ad1cc824-7bbc-48… 453479a2-8b52-4ac7-bc91-44… c50a823483dde9…  2.38e 9 relea…
##  8 69ea4b44-0ae7-42… d68c3434-a04c-4096-9d2c-9b… e5a9c176c10e8c…  6.25e 9 relea…
##  9 b12289f8-8259-42… 81023237-0f6d-43e1-9868-72… a389bb4770a7c0…  8.92e 9 relea…
## 10 fda49a14-df20-4f… 2fa46723-da52-4bab-bd0a-ad… ae0ffffcd54434…  7.19e 9 relea…
## # … with 239 more rows