Hi everyone,
I'm wondering how to retrieve the most accurate tumor stage at diagnosis for the TCGA-SKCM dataset using TCGAbiolinks. I need the stage information only in the format stage I-IV.
I retrieved the variable 'tumor_stage' as part of the indexed clinical data (using 'GDCquery_clinic', see full reprex below). However, there is also a variable 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' as part of the data provided by the original paper describing the TCGA-SKCM cohort (doi:10.1016/j.cell.2015.05.044), which TCGAbiolinks can pull using the query 'TCGAquery_subtype'.
While these two variables largely concur, there are some cases with a valid entry in 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' and 'not reported' in 'tumor_stage' (e.g., patient 'TCGA-D9-A148'). Should I generate a new variable combining these two or is there a specific reason for this happening?
Thanks very much,
Ben
library(TCGAbiolinks)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

## Download indexed clinical data
clinical <- GDCquery_clinic(project = "TCGA-SKCM", type = "clinical")

## Download curated stage information
curated_stages <- TCGAquery_subtype(tumor = "skcm") %>%
  dplyr::rename(., bcr_patient_barcode = patient)
#> skcm subtype information from:doi:10.1016/j.cell.2015.05.044

## join clinical data from GDCquery_clinic and TCGAquery_subtype retrievals
clinical_joined <- left_join(clinical, curated_stages, by = "bcr_patient_barcode") %>%
  dplyr::select(bcr_patient_barcode, CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE, tumor_stage)

## show specific case with non-matching entries
clinical_joined[clinical_joined$bcr_patient_barcode == "TCGA-D9-A148", ]
#>     bcr_patient_barcode CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE
#> 102        TCGA-D9-A148                                           Stage IV
#>      tumor_stage
#> 102 not reported

sessionInfo()
#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] dplyr_0.7.4         TCGAbiolinks_2.6.12
#>
#> loaded via a namespace (and not attached):
#>   [1] colorspace_1.3-2  selectr_0.4-0
#>   [3] rjson_0.2.15  hwriter_1.3.2
#>   [5] rprojroot_1.3-2  circlize_0.4.3
#>   [7] XVector_0.18.0  GenomicRanges_1.30.3
#>   [9] GlobalOptions_0.0.13  ggpubr_0.1.6
#>  [11] matlab_1.0.2  ggrepel_0.7.0
#>  [13] bit64_0.9-7  AnnotationDbi_1.40.0
#>  [15] xml2_1.2.0  codetools_0.2-15
#>  [17] splines_3.4.4  R.methodsS3_1.7.1
#>  [19] mnormt_1.5-5  doParallel_1.0.11
#>  [21] DESeq_1.30.0  geneplotter_1.56.0
#>  [23] knitr_1.20  jsonlite_1.5
#>  [25] Rsamtools_1.30.0  km.ci_0.5-2
#>  [27] broom_0.4.4  annotate_1.56.2
#>  [29] cluster_2.0.7  R.oo_1.21.0
#>  [31] readr_1.1.1  compiler_3.4.4
#>  [33] httr_1.3.1  backports_1.1.2
#>  [35] assertthat_0.2.0  Matrix_1.2-12
#>  [37] lazyeval_0.2.1  limma_3.34.9
#>  [39] formatR_1.5  htmltools_0.3.6
#>  [41] prettyunits_1.0.2  tools_3.4.4
#>  [43] bindrcpp_0.2.2  gtable_0.2.0
#>  [45] glue_1.2.0  GenomeInfoDbData_1.0.0
#>  [47] reshape2_1.4.3  ggthemes_3.4.0
#>  [49] ShortRead_1.36.1  Rcpp_0.12.16
#>  [51] Biobase_2.38.0  Biostrings_2.46.0
#>  [53] nlme_3.1-131.1  rtracklayer_1.38.3
#>  [55] iterators_1.0.9  psych_1.8.3.3
#>  [57] stringr_1.3.0  rvest_0.3.2
#>  [59] XML_3.98-1.10  edgeR_3.20.9
#>  [61] zoo_1.8-1  zlibbioc_1.24.0
#>  [63] scales_0.5.0  aroma.light_3.8.0
#>  [65] hms_0.4.2  parallel_3.4.4
#>  [67] SummarizedExperiment_1.8.1  RColorBrewer_1.1-2
#>  [69] curl_3.2  ComplexHeatmap_1.17.1
#>  [71] yaml_2.1.18  memoise_1.1.0
#>  [73] gridExtra_2.3  KMsurv_0.1-5
#>  [75] ggplot2_2.2.1  downloader_0.4
#>  [77] biomaRt_2.34.2  latticeExtra_0.6-28
#>  [79] stringi_1.1.7  RSQLite_2.1.0
#>  [81] genefilter_1.60.0  S4Vectors_0.16.0
#>  [83] foreach_1.4.4  RMySQL_0.10.14
#>  [85] GenomicFeatures_1.30.3  BiocGenerics_0.24.0
#>  [87] BiocParallel_1.12.0  shape_1.4.4
#>  [89] GenomeInfoDb_1.14.0  rlang_0.2.0
#>  [91] pkgconfig_2.0.1  matrixStats_0.53.1
#>  [93] bitops_1.0-6  evaluate_0.10.1
#>  [95] lattice_0.20-35  purrr_0.2.4
#>  [97] bindr_0.1.1  cmprsk_2.2-7
#>  [99] GenomicAlignments_1.14.2  bit_1.1-12
#> [101] plyr_1.8.4  magrittr_1.5
#> [103] R6_2.2.2  IRanges_2.12.0
#> [105] DelayedArray_0.4.1  DBI_0.8
#> [107] mgcv_1.8-23  foreign_0.8-69
#> [109] pillar_1.2.1  survival_2.41-3
#> [111] RCurl_1.95-4.10  tibble_1.4.2
#> [113] EDASeq_2.12.0  survMisc_0.5.4
#> [115] rmarkdown_1.9  GetoptLong_0.1.6
#> [117] progress_1.1.2  locfit_1.5-9.1
#> [119] grid_3.4.4  sva_3.26.0
#> [121] data.table_1.10.4-3  blob_1.1.1
#> [123] ConsensusClusterPlus_1.42.0  digest_0.6.15
#> [125] xtable_1.8-2  tidyr_0.8.0
#> [127] R.utils_2.6.0  stats4_3.4.4
#> [129] munsell_0.4.3  survminer_0.4.2
Hi Benjamin,
It would be good to have Tiago Silva's input on this.
It seems like, at least for this case, the curation process considered the TNM stages and coded for Stage IV.
Here are a few other resources that you can use to check these variables (many of which yield NA for pathologic_stage):
Best regards, Marcel
Thanks for your pointers, Marcel - I'll check these and compare them!
TCGAbiolinks has three options for getting the clinical data: XML files, indexed GDC files (which are populated from the XML files), and the curated data from the papers. The curated data can differ from the other two because paper authors are sometimes able to get updated clinical information from the submitting center, whereas the XML and indexed data might be outdated, missing, or sometimes wrong and never fixed.
Also, I believe any other sources should have the same data as GDC. So, if a value is NA in the indexed data and is not in the XML files or the papers either, the only option would be to ask the submitting center whether they have that information.
Here is a report for the pathologic_stage: http://rpubs.com/tiagochst/TCGA-SKCM
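On the original question of combining the two variables, here is a minimal sketch of one way it could be done. It is not a TCGAbiolinks feature. It runs on a small toy table rather than the real download: only the TCGA-D9-A148 row mirrors the reprex output above, and the other rows, the helper name simplify_stage, and the assumed label formats (lowercase labels such as "stage iiic" in tumor_stage) are made up for illustration. Preferring the curated value over the indexed one is also an assumption, based on the point that paper authors may have had updated information.

```r
library(dplyr)

# Toy stand-in for `clinical_joined` from the reprex. Only the first row
# reflects real output from the thread; the other rows are hypothetical.
clinical_joined <- data.frame(
  bcr_patient_barcode = c("TCGA-D9-A148", "HYPOTHETICAL-01", "HYPOTHETICAL-02",
                          "HYPOTHETICAL-03", "HYPOTHETICAL-04"),
  CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE =
    c("Stage IV", "Stage II", NA, "Stage I", NA),
  tumor_stage = c("not reported", "stage iia", "stage iiic", "stage ii",
                  "not reported"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reduce a stage label to "Stage I" .. "Stage IV".
# Anything else ("not reported", NA, unexpected labels) becomes NA.
simplify_stage <- function(x) {
  x <- toupper(trimws(x))
  # Keep the Roman numeral after "STAGE" and drop an optional substage letter.
  numeral <- sub("^STAGE +(IV|I{1,3})[ABC]?$", "\\1", x)
  ifelse(grepl("^(IV|I{1,3})$", numeral), paste("Stage", numeral), NA_character_)
}

combined <- clinical_joined %>%
  mutate(
    stage_gdc      = simplify_stage(tumor_stage),
    stage_curated  = simplify_stage(CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE),
    # Assumption: prefer the curated value, fall back to the indexed GDC value.
    stage_combined = coalesce(stage_curated, stage_gdc),
    # Flag patients where both sources report a stage but disagree.
    stage_conflict = !is.na(stage_gdc) & !is.na(stage_curated) &
      stage_gdc != stage_curated
  )

combined[, c("bcr_patient_barcode", "stage_combined", "stage_conflict")]
```

Keeping the stage_conflict flag alongside the combined column makes it easy to review the disagreeing cases by hand instead of silently trusting one source.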