Getting number of cases added by date to TCGA project with `GenomicDataCommons`
Ezgi ▴ 60
@ezgi-24130
Last seen 2.1 years ago
United States

I would like to get the number of cases added to the TCGA project over time, by experiment type (any sensible datetime field, such as a created date, would do). I tried to get this with the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases?

Here's my attempt:

First I can check the number of cases by experiment with:

library(tidyverse)
library(GenomicDataCommons)

cases() %>% 
  GenomicDataCommons::filter(~ project.program.name=='TCGA') %>% 
  facet("files.experimental_strategy") %>% 
  aggregations() %>%
  .[[1]]

Giving me:

   doc_count               key
1      10533               WXS
2      11126  Genotyping Array
3      10239           RNA-Seq
4      10250         miRNA-Seq
5      10903      Tissue Slide
6      10943 Methylation Array
7       9641  Diagnostic Slide
8        404          ATAC-Seq
9         99               WGS
10        16          _missing

Now let's try to get the counts by file creation date, using RNA-Seq as an example:

cases() %>% 
  GenomicDataCommons::filter(~ project.program.name=='TCGA' & 
                               files.experimental_strategy=='RNA-Seq') %>% 
  facet(c("files.created_datetime")) %>% 
  aggregations() %>% 
  .[[1]] %>% as_tibble()

Giving me:

   doc_count key                             
       <int> <chr>                           
 1      1092 2018-05-21t16:07:40.645885-05:00
 2       555 2018-05-22t05:20:16.511251-05:00
 3       515 2018-05-21t23:28:30.907184-05:00
 4       501 2018-05-21t20:04:16.087009-05:00
 5     10156 2016-10-27t21:58:12.297090-05:00
 6       502 2018-05-22t04:32:59.083508-05:00
 7       530 2018-05-21t21:03:27.523651-05:00
 8       456 2018-05-21t18:19:54.595818-05:00
 9       512 2018-05-21t22:14:34.066432-05:00
10       501 2018-05-22t00:17:01.373309-05:00

If I compute the cumulative sum of these doc_count values by date, the final number is far greater than the number I get when I simply look at the available cases by experiment type. For example, in the first rows shown above, row 5 alone has a value of 10156, which is already very close to the 10239 RNA-Seq cases. This makes me think these values are the number of files and not cases. How can I get the number of cases instead?
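To make the mismatch concrete, here is a minimal base R sketch of the cumulative sum described above. It uses only the ten buckets printed in the output (the full result has more rows), and it keeps just the calendar date from each key, which ignores the UTC offset. The names `buckets`, `totals`, and `by_date` are made up for this illustration.

```r
# The ten buckets copied from the aggregation output above (not the full result).
buckets <- data.frame(
  doc_count = c(1092, 555, 515, 501, 10156, 502, 530, 456, 512, 501),
  key = c(
    "2018-05-21t16:07:40.645885-05:00",
    "2018-05-22t05:20:16.511251-05:00",
    "2018-05-21t23:28:30.907184-05:00",
    "2018-05-21t20:04:16.087009-05:00",
    "2016-10-27t21:58:12.297090-05:00",
    "2018-05-22t04:32:59.083508-05:00",
    "2018-05-21t21:03:27.523651-05:00",
    "2018-05-21t18:19:54.595818-05:00",
    "2018-05-21t22:14:34.066432-05:00",
    "2018-05-22t00:17:01.373309-05:00"
  ),
  stringsAsFactors = FALSE
)

# Sum doc_count per calendar day. The day is the first 10 characters of the key,
# so the time of day and the UTC offset are deliberately ignored here.
totals <- tapply(buckets$doc_count, substr(buckets$key, 1, 10), sum)

# tapply() orders groups by name; ISO dates sort chronologically as strings.
by_date <- data.frame(
  date = as.Date(names(totals)),
  doc_count = as.vector(totals)
)
by_date$cumulative <- cumsum(by_date$doc_count)

by_date
```

With just these ten rows the cumulative total reaches 15320, already above the 10239 cases reported for RNA-Seq. That is consistent with the buckets not counting distinct cases, although it does not by itself show what they do count.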

Thanks

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.7.10          forcats_0.5.1             stringr_1.4.0            
 [4] dplyr_1.0.5               purrr_0.3.4               readr_1.4.0              
 [7] tidyr_1.1.3               tibble_3.1.0              ggplot2_3.3.3            
[10] tidyverse_1.3.0           GenomicDataCommons_1.14.0 magrittr_2.0.1           
[13] pool_0.1.6                DBI_1.1.1                

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6                  lattice_0.20-41             digest_0.6.27              
 [4] assertthat_0.2.1            utf8_1.2.1                  R6_2.5.0                   
 [7] GenomeInfoDb_1.26.4         cellranger_1.1.0            backports_1.2.1            
[10] reprex_1.0.0                stats4_4.0.4                httr_1.4.2                 
[13] pillar_1.5.1                zlibbioc_1.36.0             rlang_0.4.10               
[16] curl_4.3                    readxl_1.3.1                rstudioapi_0.13            
[19] S4Vectors_0.28.1            Matrix_1.3-2                labeling_0.4.2             
[22] RMySQL_0.10.21              RCurl_1.98-1.3              munsell_0.5.0              
[25] DelayedArray_0.16.2         broom_0.7.5                 compiler_4.0.4             
[28] modelr_0.1.8                pkgconfig_2.0.3             BiocGenerics_0.36.0        
[31] tidyselect_1.1.0            SummarizedExperiment_1.20.0 GenomeInfoDbData_1.2.4     
[34] IRanges_2.24.1              matrixStats_0.58.0          viridisLite_0.3.0          
[37] fansi_0.4.2                 withr_2.4.1                 crayon_1.4.1               
[40] dbplyr_2.1.0                later_1.1.0.1               bitops_1.0-6               
[43] rappdirs_0.3.3              grid_4.0.4                  jsonlite_1.7.2             
[46] gtable_0.3.0                lifecycle_1.0.0             scales_1.1.1               
[49] cli_2.3.1                   stringi_1.5.3               farver_2.1.0               
[52] XVector_0.30.0              renv_0.13.1                 fs_1.5.0                   
[55] xml2_1.3.2                  ellipsis_0.3.1              generics_0.1.0             
[58] vctrs_0.3.6                 tools_4.0.4                 Biobase_2.50.0             
[61] glue_1.4.2                  hms_1.0.0                   MatrixGenerics_1.2.1       
[64] parallel_4.0.4              colorspace_2.0-0            BiocManager_1.30.10        
[67] GenomicRanges_1.42.0        rvest_1.0.0                 haven_2.3.1
TCGA GenomicDataCommons • 800 views
Ezgi ▴ 60
@ezgi-24130
Last seen 2.1 years ago
United States

Actually, changing the facet step to facet("samples.created_datetime") gives the case counts.
But the dates look off: there is one huge spike around 2018-05-17 and almost nothing else.
I suspect this is the date the records were added to the NCI GDC rather than to the original project, since TCGA data has been around since 2009-2010.
I would still appreciate it if anyone could point me to the right resource for this.
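One way to avoid counting a case more than once is to work from per-case records instead of facet buckets: take the earliest timestamp for each case, then count cases per day. The sketch below shows only that counting step in base R. The `records` data frame and the `cases_added_by_date()` helper are hypothetical, and the rows are made-up values, not real GDC records. Building such a table from the GDC (for example by requesting `case_id` and a datetime field with `select()` and `results_all()`, then flattening the nested result) is assumed and not shown.

```r
# Hypothetical input: one row per (case, timestamp) pair.
# These values are invented for illustration only.
records <- data.frame(
  case_id = c("case-A", "case-A", "case-B", "case-C", "case-C"),
  created_datetime = c(
    "2016-10-27t21:58:12.297090-05:00",
    "2018-05-21t16:07:40.645885-05:00",
    "2016-10-27t21:58:12.297090-05:00",
    "2018-05-21t20:04:16.087009-05:00",
    "2018-05-22t05:20:16.511251-05:00"
  ),
  stringsAsFactors = FALSE
)

# Count each case once, on the earliest calendar date it appears.
cases_added_by_date <- function(records) {
  # Keep the calendar date only; the UTC offset is ignored.
  records$date <- as.Date(substr(records$created_datetime, 1, 10))

  # Sort by date so the first row seen for each case is its earliest one.
  records <- records[order(records$date), ]
  first_seen <- records[!duplicated(records$case_id), ]

  # Tabulate the number of newly seen cases per day.
  counts <- as.data.frame(
    table(date = first_seen$date),
    responseName = "new_cases",
    stringsAsFactors = FALSE
  )
  counts$date <- as.Date(counts$date)
  counts <- counts[order(counts$date), ]
  counts$cumulative_cases <- cumsum(counts$new_cases)
  counts
}

result <- cases_added_by_date(records)
result
```

Here the five rows collapse to three distinct cases, so the final cumulative value equals the number of cases rather than the number of rows. Whether a GDC `created_datetime` reflects the original TCGA submission date is a separate question that this sketch does not answer.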
