Getting the number of cases added by date to the TCGA project with `GenomicDataCommons`
I would like to get the number of cases added (or created; any logical datetime would suffice here) to the TCGA project by experiment type. I attempted to get this data via the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases?

Here's my attempt:

First I can check the number of cases by experiment with:


cases() %>% 
  GenomicDataCommons::filter(~'TCGA') %>% 
  facet("files.experimental_strategy") %>% 
  aggregations() %>%
  .[[1]]

Giving me:

   doc_count               key
1      10533               WXS
2      11126  Genotyping Array
3      10239           RNA-Seq
4      10250         miRNA-Seq
5      10903      Tissue Slide
6      10943 Methylation Array
7       9641  Diagnostic Slide
8        404          ATAC-Seq
9         99               WGS
10        16          _missing

Now let's try to get the files created by date, just for RNA-Seq as an example:

cases() %>% 
  GenomicDataCommons::filter(~'TCGA' & 
                               files.experimental_strategy=='RNA-Seq') %>% 
  facet(c("files.created_datetime")) %>% 
  aggregations() %>% 
  .[[1]] %>% as_tibble()

Giving me:

   doc_count key                             
       <int> <chr>                           
 1      1092 2018-05-21t16:07:40.645885-05:00
 2       555 2018-05-22t05:20:16.511251-05:00
 3       515 2018-05-21t23:28:30.907184-05:00
 4       501 2018-05-21t20:04:16.087009-05:00
 5     10156 2016-10-27t21:58:12.297090-05:00
 6       502 2018-05-22t04:32:59.083508-05:00
 7       530 2018-05-21t21:03:27.523651-05:00
 8       456 2018-05-21t18:19:54.595818-05:00
 9       512 2018-05-21t22:14:34.066432-05:00
10       501 2018-05-22t00:17:01.373309-05:00

If I were to compute the cumulative sum of these doc_counts by date, the final number would be far greater than the number I get when I simply look at the available cases by experiment. For example, in the output above, row 5 alone has a value of 10156, which is already very close to 10239. This makes me think that these values are numbers of files, not cases. How can I get the number of cases instead?
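To make the cumulative-sum comparison concrete, here is a minimal base R sketch that uses only the ten rows printed above, copied in by hand so it runs without querying the GDC. It assumes the first ten characters of each key are the calendar date as recorded (the time-of-day and UTC offset are ignored), and the names `aggs` and `daily` are just illustrative.

```r
# doc_count/key pairs copied by hand from the RNA-Seq facet output above
aggs <- data.frame(
  doc_count = c(1092, 555, 515, 501, 10156, 502, 530, 456, 512, 501),
  key = c(
    "2018-05-21t16:07:40.645885-05:00",
    "2018-05-22t05:20:16.511251-05:00",
    "2018-05-21t23:28:30.907184-05:00",
    "2018-05-21t20:04:16.087009-05:00",
    "2016-10-27t21:58:12.297090-05:00",
    "2018-05-22t04:32:59.083508-05:00",
    "2018-05-21t21:03:27.523651-05:00",
    "2018-05-21t18:19:54.595818-05:00",
    "2018-05-21t22:14:34.066432-05:00",
    "2018-05-22t00:17:01.373309-05:00"
  ),
  stringsAsFactors = FALSE
)

# Assumption: characters 1-10 of the key are the date (YYYY-MM-DD);
# the timestamp's time-of-day and offset are dropped
aggs$date <- substr(aggs$key, 1, 10)

# Sum doc_count per day, sort chronologically, then take the running total
daily <- aggregate(doc_count ~ date, data = aggs, FUN = sum)
daily <- daily[order(daily$date), ]
daily$cumulative <- cumsum(daily$doc_count)

print(daily)
```

For these ten rows alone the running total ends at 15320, which already exceeds the 10239 cases reported for RNA-Seq.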


> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.7.10          forcats_0.5.1             stringr_1.4.0            
 [4] dplyr_1.0.5               purrr_0.3.4               readr_1.4.0              
 [7] tidyr_1.1.3               tibble_3.1.0              ggplot2_3.3.3            
[10] tidyverse_1.3.0           GenomicDataCommons_1.14.0 magrittr_2.0.1           
[13] pool_0.1.6                DBI_1.1.1                

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6                  lattice_0.20-41             digest_0.6.27              
 [4] assertthat_0.2.1            utf8_1.2.1                  R6_2.5.0                   
 [7] GenomeInfoDb_1.26.4         cellranger_1.1.0            backports_1.2.1            
[10] reprex_1.0.0                stats4_4.0.4                httr_1.4.2                 
[13] pillar_1.5.1                zlibbioc_1.36.0             rlang_0.4.10               
[16] curl_4.3                    readxl_1.3.1                rstudioapi_0.13            
[19] S4Vectors_0.28.1            Matrix_1.3-2                labeling_0.4.2             
[22] RMySQL_0.10.21              RCurl_1.98-1.3              munsell_0.5.0              
[25] DelayedArray_0.16.2         broom_0.7.5                 compiler_4.0.4             
[28] modelr_0.1.8                pkgconfig_2.0.3             BiocGenerics_0.36.0        
[31] tidyselect_1.1.0            SummarizedExperiment_1.20.0 GenomeInfoDbData_1.2.4     
[34] IRanges_2.24.1              matrixStats_0.58.0          viridisLite_0.3.0          
[37] fansi_0.4.2                 withr_2.4.1                 crayon_1.4.1               
[40] dbplyr_2.1.0                later_1.1.0.1               bitops_1.0-6               
[43] rappdirs_0.3.3              grid_4.0.4                  jsonlite_1.7.2             
[46] gtable_0.3.0                lifecycle_1.0.0             scales_1.1.1               
[49] cli_2.3.1                   stringi_1.5.3               farver_2.1.0               
[52] XVector_0.30.0              renv_0.13.1                 fs_1.5.0                   
[55] xml2_1.3.2                  ellipsis_0.3.1              generics_0.1.0             
[58] vctrs_0.3.6                 tools_4.0.4                 Biobase_2.50.0             
[61] glue_1.4.2                  hms_1.0.0                   MatrixGenerics_1.2.1       
[64] parallel_4.0.4              colorspace_2.0-0            BiocManager_1.30.10        
[67] GenomicRanges_1.42.0        rvest_1.0.0                 haven_2.3.1

Actually, changing the facet step to facet("samples.created_datetime") gives the case counts.
But the dates seem to be quite off: there is only a huge spike around 2018-05-17 and almost nothing else.
I'm also thinking this must be the date the data was added to the NCI GDC, not the date it was added to the actual project? TCGA data has been around since 2009-2010?
I would still appreciate it if anyone could point me to the right resource for this.


