Dear there,
I am learning GenomicDataCommons to generate a manifest file. I got 2 questions and would really appreciate for any help.
First, here are the code as below I am trying to be familiar with GenomicDataCommons following the manual. Why there is nothing for my "file_ids"? Is it right for the output of manifest(q), only 10 rows?
Second, if I'd like to create a manifest file with filtering of race and ethnicity, how could I reflect that in coding of filter?
I tried use the codes as below, but got error.
q = files() %>%
+ filter(~ cases.project.project_id == 'TCGA-COAD' &
+ data_type == 'Aligned Reads' &
+ experimental_strategy == 'RNA-Seq' &
+ data_format == 'BAM' &
+ demographic.race == 'white')
manifest(q)
Error: No method asJSON S3 class: formula
Thanks a lot!
q = files() %>%
+ GenomicDataCommons::select(available_fields('files')) %>%
+ filter(~ cases.project.project_id == 'TCGA-GBM' &
+ data_type == 'Aligned Reads' &
+ experimental_strategy == 'RNA-Seq' &
+ data_format == 'BAM')
manifest(q)
Rows: 10 Columns: 5
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): id, filename, md5, state
dbl (1): size
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10 × 5
id filename md5 size state
* <chr> <chr> <chr> <dbl> <chr>
1 dc5dd9bb-8571-48b7-8535-26eeec64faa9 fddd420e-1b94-4b88-993e-355291d8d74a_gdc_realn_rehead.bam 3218d630f398aa953e086… 1.20e10 relea…
2 7475c0d7-c776-48ae-aba6-b4c38692bf75 fe622e5b-335d-4775-801f-45fd21a18574_gdc_realn_rehead.bam 2d11cfd5873ba9c84de86… 9.82e 9 relea…
3 31c1ec3f-8fb6-41b8-9c89-59ca3ab2d36c 454c9f5b-ab50-452b-931c-0779c1051ea6_gdc_realn_rehead.bam 2773a569071c454a7a9d4… 1.04e10 relea…
4 23768895-9695-4f95-a075-d7a52ac3545f 1116895b-c2c7-4e8f-b648-4a86e984696c_gdc_realn_rehead.bam 601d084b69f034dbcfb4e… 1.37e10 relea…
5 bb39b827-c4f4-4d13-88cb-01f258d06843 cf83d3ed-12b5-4ff0-9f18-af066c54b6f4_gdc_realn_rehead.bam d6be345dd8ea7156f9ba6… 8.95e 9 relea…
6 7e26b92e-629a-45c1-9c00-f8f0da99c317 959f6dd3-f4c8-4134-8a27-f535cfae096f_gdc_realn_rehead.bam 995c1e0b98a89c34dd197… 1.11e10 relea…
7 c7c72460-42c0-45dc-900f-7c7e36a5344c 2d43a193-e00f-4502-88ee-1d13f9acb04b_gdc_realn_rehead.bam 5d47613879ece69138e28… 7.84e 9 relea…
8 a50ad948-b240-491e-b5c1-bfe82cbb881a d9218a68-50c9-4557-a53e-ff1835f1e68f_gdc_realn_rehead.bam 20e2f573a9f25ff00d64c… 8.34e 9 relea…
9 eb9a1990-3d15-4d78-a118-2ba30df48ac2 67488c82-715d-4f6d-bbe8-43169ba1f4e6_gdc_realn_rehead.bam 056ae2109ad28067d710e… 9.62e 9 relea…
10 aab06154-3863-4249-8904-a332f5306893 8538437d-ae45-4c6d-a861-9f899448ca0f_gdc_realn_rehead.bam 13a398431bdd4a28f6bd7… 1.17e10 relea…
file_ids = q %>% response_all() %>% ids()
file_ids
character(0)
# include your problematic code here with any corresponding output
# please also include the results of running the following in an R session
sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicDataCommons_1.14.0 magrittr_2.0.1
loaded via a namespace (and not attached):
[1] compiler_4.0.3 pillar_1.6.2 GenomeInfoDb_1.26.7 XVector_0.30.0
[5] MatrixGenerics_1.2.1 bitops_1.0-7 tools_4.0.3 zlibbioc_1.36.0
[9] bit_4.0.4 jsonlite_1.7.2 lifecycle_1.0.0 tibble_3.1.3
[13] lattice_0.20-44 pkgconfig_2.0.3 rlang_0.4.11 Matrix_1.3-4
[17] cli_3.0.1 DelayedArray_0.16.3 DBI_1.1.1 curl_4.3.2
[21] parallel_4.0.3 GenomeInfoDbData_1.2.4 withr_2.4.2 xml2_1.3.2
[25] dplyr_1.0.7 httr_1.4.2 hms_1.1.0 generics_0.1.0
[29] S4Vectors_0.28.1 vctrs_0.3.8 IRanges_2.24.1 rappdirs_0.3.3
[33] bit64_4.0.5 stats4_4.0.3 grid_4.0.3 tidyselect_1.1.1
[37] glue_1.4.2 Biobase_2.50.0 R6_2.5.0 fansi_0.5.0
[41] vroom_1.5.4 tzdb_0.1.2 readr_2.0.0 purrr_0.3.4
[45] matrixStats_0.60.0 ellipsis_0.3.2 BiocGenerics_0.36.1 GenomicRanges_1.42.0
[49] assertthat_0.2.1 SummarizedExperiment_1.20.0 utf8_1.2.2 RCurl_1.98-1.3
[53] crayon_1.4.1
Hi Sean Davis ,
Thank you so much, it is very clear to understand for each step. But, I still have a question about the numbers of rows after filtering. Why it is different with the webpage?
There are 461 cases for COAD project, why there are 521 rows for the manifest file?
By the code below, the count is 244 for white but not hispanic or latino. But, from the website, why it is 209 as the screenshot
below
or the link here https://portal.gdc.cancer.gov/exploration?facetTab=clinical&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.demographic.ethnicity%22%2C%22value%22%3A%5B%22not%20hispanic%20or%20latino%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.demographic.race%22%2C%22value%22%3A%5B%22white%22%5D%7D%7D%2C%7B%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-COAD%22%5D%7D%2C%22op%22%3A%22in%22%7D%5D%7D .
For the african american, it is 66 but 59 from https://portal.gdc.cancer.gov/exploration?facetTab=clinical&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.demographic.race%22%2C%22value%22%3A%5B%22black%20or%20african%20american%22%5D%7D%7D%2C%7B%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-COAD%22%5D%7D%2C%22op%22%3A%22in%22%7D%5D%7D .
```q_afr = files() %>% GenomicDataCommons::filter( cases.project.project_id == 'TCGA-COAD' & data_type == 'Aligned Reads' & experimental_strategy == 'RNA-Seq' & data_format == 'BAM' & cases.demographic.race=='black or african american' )
count(q_afr)
1 66
```
If you want a count of unique cases, you can either post-process in R or start with the cases() endpoint. Note that the reason for the difference is that cases can have 0 or more files, so if a case has more than one file associated, the number of files (after filtering by case) will be greater than the number of cases.
Thanks a lot!