I would like to get the number of cases added to the TCGA project over time, broken down by experiment type (by creation date, though any sensible datetime field would do).
I tried to get this data with the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases.
How can I get the number of cases?
Here's my attempt:
First I can check the number of cases by experiment with:
library(tidyverse)
library(GenomicDataCommons)
cases() %>%
  GenomicDataCommons::filter(~ project.program.name == 'TCGA') %>%
  facet("files.experimental_strategy") %>%
  aggregations() %>%
  .[[1]]
Giving me:
   doc_count               key
1      10533               WXS
2      11126  Genotyping Array
3      10239           RNA-Seq
4      10250         miRNA-Seq
5      10903      Tissue Slide
6      10943 Methylation Array
7       9641  Diagnostic Slide
8        404          ATAC-Seq
9         99               WGS
10        16          _missing
Now let's try to get the counts by file creation date, using RNA-Seq as an example:
cases() %>%
  GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
                               files.experimental_strategy == 'RNA-Seq') %>%
  facet(c("files.created_datetime")) %>%
  aggregations() %>%
  .[[1]] %>%
  as_tibble()
Giving me:
   doc_count key
       <int> <chr>
 1      1092 2018-05-21t16:07:40.645885-05:00
 2       555 2018-05-22t05:20:16.511251-05:00
 3       515 2018-05-21t23:28:30.907184-05:00
 4       501 2018-05-21t20:04:16.087009-05:00
 5     10156 2016-10-27t21:58:12.297090-05:00
 6       502 2018-05-22t04:32:59.083508-05:00
 7       530 2018-05-21t21:03:27.523651-05:00
 8       456 2018-05-21t18:19:54.595818-05:00
 9       512 2018-05-21t22:14:34.066432-05:00
10       501 2018-05-22t00:17:01.373309-05:00
If I were to compute the cumulative sum of these doc_count values by date, the final number would be far greater than the number I get when I simply look at the available cases by experiment. For example, in the output above, row 5 alone has a value of 10156, which is already very close to 10239 (the RNA-Seq count from the first table). This makes me think that these values are numbers of files and not cases. How can I get the number of cases instead?
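To make the discrepancy concrete, here is a minimal sketch that sums only the ten rows printed above. The values are copied by hand from the output, and the variable names (doc_count, rna_seq_cases, total_shown) are just for illustration; it does not query the GDC.

```r
# doc_count values copied by hand from the ten rows of the
# files.created_datetime facet output shown above
doc_count <- c(1092, 555, 515, 501, 10156, 502, 530, 456, 512, 501)

# RNA-Seq count taken from the files.experimental_strategy facet
rna_seq_cases <- 10239

# Total over just the ten rows that were printed
total_shown <- sum(doc_count)
total_shown                  # 15320

# Already larger than the RNA-Seq count from the first table
total_shown > rna_seq_cases  # TRUE
```

So the ten rows shown already add up to more than the 10239 I would expect if each case were counted exactly once.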
Thanks
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] lubridate_1.7.10 forcats_0.5.1 stringr_1.4.0
[4] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0
[7] tidyr_1.1.3 tibble_3.1.0 ggplot2_3.3.3
[10] tidyverse_1.3.0 GenomicDataCommons_1.14.0 magrittr_2.0.1
[13] pool_0.1.6 DBI_1.1.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41 digest_0.6.27
[4] assertthat_0.2.1 utf8_1.2.1 R6_2.5.0
[7] GenomeInfoDb_1.26.4 cellranger_1.1.0 backports_1.2.1
[10] reprex_1.0.0 stats4_4.0.4 httr_1.4.2
[13] pillar_1.5.1 zlibbioc_1.36.0 rlang_0.4.10
[16] curl_4.3 readxl_1.3.1 rstudioapi_0.13
[19] S4Vectors_0.28.1 Matrix_1.3-2 labeling_0.4.2
[22] RMySQL_0.10.21 RCurl_1.98-1.3 munsell_0.5.0
[25] DelayedArray_0.16.2 broom_0.7.5 compiler_4.0.4
[28] modelr_0.1.8 pkgconfig_2.0.3 BiocGenerics_0.36.0
[31] tidyselect_1.1.0 SummarizedExperiment_1.20.0 GenomeInfoDbData_1.2.4
[34] IRanges_2.24.1 matrixStats_0.58.0 viridisLite_0.3.0
[37] fansi_0.4.2 withr_2.4.1 crayon_1.4.1
[40] dbplyr_2.1.0 later_1.1.0.1 bitops_1.0-6
[43] rappdirs_0.3.3 grid_4.0.4 jsonlite_1.7.2
[46] gtable_0.3.0 lifecycle_1.0.0 scales_1.1.1
[49] cli_2.3.1 stringi_1.5.3 farver_2.1.0
[52] XVector_0.30.0 renv_0.13.1 fs_1.5.0
[55] xml2_1.3.2 ellipsis_0.3.1 generics_0.1.0
[58] vctrs_0.3.6 tools_4.0.4 Biobase_2.50.0
[61] glue_1.4.2 hms_1.0.0 MatrixGenerics_1.2.1
[64] parallel_4.0.4 colorspace_2.0-0 BiocManager_1.30.10
[67] GenomicRanges_1.42.0 rvest_1.0.0 haven_2.3.1