Retrieve accurate tumor stage data with TCGAbiolinks
@benjamin-ostendorf-15752

Hi everyone, 

I'm wondering how to retrieve the most accurate tumor stage at diagnosis for the TCGA-SKCM dataset using TCGAbiolinks. I only need the stage in the simplified form Stage I-IV.

I retrieved the variable 'tumor_stage' as part of the indexed clinical data (using 'GDCquery_clinic', see full reprex below). However, there is also a variable 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' as part of the data provided by the original paper describing the TCGA-SKCM cohort (doi:10.1016/j.cell.2015.05.044), which TCGAbiolinks can pull using the query 'TCGAquery_subtype'.

While these two variables largely concur, there are some cases with a valid entry in 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' and 'not reported' in 'tumor_stage' (e.g., patient 'TCGA-D9-A148'). Should I generate a new variable combining these two, or is there a specific reason for this discrepancy?

Thanks very much,
Ben

 

library(TCGAbiolinks)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

## Download indexed clinical data
clinical <- GDCquery_clinic(project = "TCGA-SKCM", type = "clinical")

## Download curated stage information
curated_stages <- TCGAquery_subtype(tumor = "skcm") %>% dplyr::rename(., bcr_patient_barcode = patient)
#> skcm subtype information from:doi:10.1016/j.cell.2015.05.044

## join clinical data from GDCquery_clinic and TCGAquery_subtype retrievals
clinical_joined <- left_join(clinical, curated_stages, by = "bcr_patient_barcode") %>% 
  dplyr::select(bcr_patient_barcode, CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE, 
    tumor_stage)

## show specific case with non-matching entries
clinical_joined[clinical_joined$bcr_patient_barcode == "TCGA-D9-A148", ]
#>     bcr_patient_barcode CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE
#> 102        TCGA-D9-A148                                           Stage IV
#>      tumor_stage
#> 102 not reported

sessionInfo()
#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_0.7.4         TCGAbiolinks_2.6.12
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_1.3-2            selectr_0.4-0              
#>   [3] rjson_0.2.15                hwriter_1.3.2              
#>   [5] rprojroot_1.3-2             circlize_0.4.3             
#>   [7] XVector_0.18.0              GenomicRanges_1.30.3       
#>   [9] GlobalOptions_0.0.13        ggpubr_0.1.6               
#>  [11] matlab_1.0.2                ggrepel_0.7.0              
#>  [13] bit64_0.9-7                 AnnotationDbi_1.40.0       
#>  [15] xml2_1.2.0                  codetools_0.2-15           
#>  [17] splines_3.4.4               R.methodsS3_1.7.1          
#>  [19] mnormt_1.5-5                doParallel_1.0.11          
#>  [21] DESeq_1.30.0                geneplotter_1.56.0         
#>  [23] knitr_1.20                  jsonlite_1.5               
#>  [25] Rsamtools_1.30.0            km.ci_0.5-2                
#>  [27] broom_0.4.4                 annotate_1.56.2            
#>  [29] cluster_2.0.7               R.oo_1.21.0                
#>  [31] readr_1.1.1                 compiler_3.4.4             
#>  [33] httr_1.3.1                  backports_1.1.2            
#>  [35] assertthat_0.2.0            Matrix_1.2-12              
#>  [37] lazyeval_0.2.1              limma_3.34.9               
#>  [39] formatR_1.5                 htmltools_0.3.6            
#>  [41] prettyunits_1.0.2           tools_3.4.4                
#>  [43] bindrcpp_0.2.2              gtable_0.2.0               
#>  [45] glue_1.2.0                  GenomeInfoDbData_1.0.0     
#>  [47] reshape2_1.4.3              ggthemes_3.4.0             
#>  [49] ShortRead_1.36.1            Rcpp_0.12.16               
#>  [51] Biobase_2.38.0              Biostrings_2.46.0          
#>  [53] nlme_3.1-131.1              rtracklayer_1.38.3         
#>  [55] iterators_1.0.9             psych_1.8.3.3              
#>  [57] stringr_1.3.0               rvest_0.3.2                
#>  [59] XML_3.98-1.10               edgeR_3.20.9               
#>  [61] zoo_1.8-1                   zlibbioc_1.24.0            
#>  [63] scales_0.5.0                aroma.light_3.8.0          
#>  [65] hms_0.4.2                   parallel_3.4.4             
#>  [67] SummarizedExperiment_1.8.1  RColorBrewer_1.1-2         
#>  [69] curl_3.2                    ComplexHeatmap_1.17.1      
#>  [71] yaml_2.1.18                 memoise_1.1.0              
#>  [73] gridExtra_2.3               KMsurv_0.1-5               
#>  [75] ggplot2_2.2.1               downloader_0.4             
#>  [77] biomaRt_2.34.2              latticeExtra_0.6-28        
#>  [79] stringi_1.1.7               RSQLite_2.1.0              
#>  [81] genefilter_1.60.0           S4Vectors_0.16.0           
#>  [83] foreach_1.4.4               RMySQL_0.10.14             
#>  [85] GenomicFeatures_1.30.3      BiocGenerics_0.24.0        
#>  [87] BiocParallel_1.12.0         shape_1.4.4                
#>  [89] GenomeInfoDb_1.14.0         rlang_0.2.0                
#>  [91] pkgconfig_2.0.1             matrixStats_0.53.1         
#>  [93] bitops_1.0-6                evaluate_0.10.1            
#>  [95] lattice_0.20-35             purrr_0.2.4                
#>  [97] bindr_0.1.1                 cmprsk_2.2-7               
#>  [99] GenomicAlignments_1.14.2    bit_1.1-12                 
#> [101] plyr_1.8.4                  magrittr_1.5               
#> [103] R6_2.2.2                    IRanges_2.12.0             
#> [105] DelayedArray_0.4.1          DBI_0.8                    
#> [107] mgcv_1.8-23                 foreign_0.8-69             
#> [109] pillar_1.2.1                survival_2.41-3            
#> [111] RCurl_1.95-4.10             tibble_1.4.2               
#> [113] EDASeq_2.12.0               survMisc_0.5.4             
#> [115] rmarkdown_1.9               GetoptLong_0.1.6           
#> [117] progress_1.1.2              locfit_1.5-9.1             
#> [119] grid_3.4.4                  sva_3.26.0                 
#> [121] data.table_1.10.4-3         blob_1.1.1                 
#> [123] ConsensusClusterPlus_1.42.0 digest_0.6.15              
#> [125] xtable_1.8-2                tidyr_0.8.0                
#> [127] R.utils_2.6.0               stats4_3.4.4               
#> [129] munsell_0.4.3               survminer_0.4.2

 


Hi Benjamin,

It would be good to have Tiago Silva's input on this.

It seems that, at least for this case, the curation process considered the TNM stages and coded the patient as Stage IV.

Here are a few other resources that you can use to check these variables (many of which yield NA for pathologic_stage):

GenomicDataCommons
RTCGAToolbox
curatedTCGAData

Best regards, Marcel


Thanks for your pointers, Marcel -  I'll check these and compare them! 


TCGAbiolinks has three options for getting the clinical data: the XML files, the indexed GDC files (which are populated from the XML files), and the curated data from the papers. The paper authors are sometimes able to get updated clinical information from the submitter center, so the data in the XML and indexed files might be outdated, missing, or sometimes wrong and never fixed.

Also, I believe any other sources should have the same data as GDC. So, if a value is NA and it is not in the XML files or in the papers, the only way to get it would be to ask the submitter center whether they have that information.

Here is a report for the pathologic_stage:  http://rpubs.com/tiagochst/TCGA-SKCM
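
For the question of combining the two variables, here is a minimal sketch in base R, not a recommended or official procedure. It assumes that the curated value should take precedence and that 'tumor_stage' is only a fallback; that priority is a judgment call you should check against your own data. The toy table, the placeholder barcodes, the value spellings (e.g. lowercase 'stage iiic') and the helper name simplify_stage are all illustrative assumptions; only the TCGA-D9-A148 row mirrors the reprex output above.

```r
# Toy stand-in for the joined table from the reprex. Barcodes other than
# TCGA-D9-A148 and all stage values for them are made up for illustration.
clinical_joined <- data.frame(
  bcr_patient_barcode = c("TCGA-D9-A148", "PATIENT-2", "PATIENT-3", "PATIENT-4"),
  CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE = c("Stage IV", NA, "Stage II", NA),
  tumor_stage = c("not reported", "stage iiic", "stage iia", "not reported"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse a stage label to "Stage I" ... "Stage IV".
# Substage letters (A/B/C) are dropped; anything else ("not reported",
# "Stage 0", NA, ...) becomes NA.
simplify_stage <- function(x) {
  pattern <- "^STAGE\\s+(IV|III|II|I)[ABC]?$"
  x <- toupper(trimws(x))
  matched <- grepl(pattern, x, perl = TRUE)
  numeral <- sub(pattern, "\\1", x, perl = TRUE)
  ifelse(matched, paste("Stage", numeral), NA_character_)
}

curated <- simplify_stage(clinical_joined$CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE)
gdc     <- simplify_stage(clinical_joined$tumor_stage)

# Assumption: prefer the curated stage, fall back to the GDC value.
clinical_joined$stage_combined <- ifelse(is.na(curated), gdc, curated)

# Flag patients where both sources report a stage but disagree,
# so they can be reviewed by hand instead of being silently overwritten.
clinical_joined$stage_conflict <- !is.na(curated) & !is.na(gdc) & curated != gdc
```

Keeping the conflict flag as a separate column makes it easy to list the disagreeing cases before deciding which source to trust for them.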

 

 
