Question: Retrieve accurate tumor stage data with TCGAbiolinks
12 months ago, Benjamin Ostendorf (United States) wrote:

Hi everyone, 

I'm wondering how to retrieve the most accurate information on tumor stage at diagnosis for the TCGA-SKCM dataset using TCGAbiolinks. I only need the stage in the simplified format Stage I-IV.

I retrieved the variable 'tumor_stage' as part of the indexed clinical data (using 'GDCquery_clinic', see full reprex below). However, there is also a variable 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' in the data provided by the original paper describing the TCGA-SKCM cohort (doi:10.1016/j.cell.2015.05.044), which TCGAbiolinks can pull using the function 'TCGAquery_subtype'.

While these two variables largely concur, there are some cases with a valid entry in 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' but 'not reported' in 'tumor_stage' (e.g., patient 'TCGA-D9-A148'). Should I generate a new variable combining the two, or is there a specific reason for this discrepancy?

Thanks very much,
Ben

 

library(TCGAbiolinks)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

## Download indexed clinical data
clinical <- GDCquery_clinic(project = "TCGA-SKCM", type = "clinical")

## Download curated stage information
curated_stages <- TCGAquery_subtype(tumor = "skcm") %>% dplyr::rename(., bcr_patient_barcode = patient)
#> skcm subtype information from:doi:10.1016/j.cell.2015.05.044

## join clinical data from GDCquery_clinic and TCGAquery_subtype retrievals
clinical_joined <- left_join(clinical, curated_stages, by = "bcr_patient_barcode") %>% 
  dplyr::select(bcr_patient_barcode, CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE, 
    tumor_stage)

## show specific case with non-matching entries
clinical_joined[clinical_joined$bcr_patient_barcode == "TCGA-D9-A148", ]
#>     bcr_patient_barcode CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE
#> 102        TCGA-D9-A148                                           Stage IV
#>      tumor_stage
#> 102 not reported

sessionInfo()
#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_0.7.4         TCGAbiolinks_2.6.12
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_1.3-2            selectr_0.4-0              
#>   [3] rjson_0.2.15                hwriter_1.3.2              
#>   [5] rprojroot_1.3-2             circlize_0.4.3             
#>   [7] XVector_0.18.0              GenomicRanges_1.30.3       
#>   [9] GlobalOptions_0.0.13        ggpubr_0.1.6               
#>  [11] matlab_1.0.2                ggrepel_0.7.0              
#>  [13] bit64_0.9-7                 AnnotationDbi_1.40.0       
#>  [15] xml2_1.2.0                  codetools_0.2-15           
#>  [17] splines_3.4.4               R.methodsS3_1.7.1          
#>  [19] mnormt_1.5-5                doParallel_1.0.11          
#>  [21] DESeq_1.30.0                geneplotter_1.56.0         
#>  [23] knitr_1.20                  jsonlite_1.5               
#>  [25] Rsamtools_1.30.0            km.ci_0.5-2                
#>  [27] broom_0.4.4                 annotate_1.56.2            
#>  [29] cluster_2.0.7               R.oo_1.21.0                
#>  [31] readr_1.1.1                 compiler_3.4.4             
#>  [33] httr_1.3.1                  backports_1.1.2            
#>  [35] assertthat_0.2.0            Matrix_1.2-12              
#>  [37] lazyeval_0.2.1              limma_3.34.9               
#>  [39] formatR_1.5                 htmltools_0.3.6            
#>  [41] prettyunits_1.0.2           tools_3.4.4                
#>  [43] bindrcpp_0.2.2              gtable_0.2.0               
#>  [45] glue_1.2.0                  GenomeInfoDbData_1.0.0     
#>  [47] reshape2_1.4.3              ggthemes_3.4.0             
#>  [49] ShortRead_1.36.1            Rcpp_0.12.16               
#>  [51] Biobase_2.38.0              Biostrings_2.46.0          
#>  [53] nlme_3.1-131.1              rtracklayer_1.38.3         
#>  [55] iterators_1.0.9             psych_1.8.3.3              
#>  [57] stringr_1.3.0               rvest_0.3.2                
#>  [59] XML_3.98-1.10               edgeR_3.20.9               
#>  [61] zoo_1.8-1                   zlibbioc_1.24.0            
#>  [63] scales_0.5.0                aroma.light_3.8.0          
#>  [65] hms_0.4.2                   parallel_3.4.4             
#>  [67] SummarizedExperiment_1.8.1  RColorBrewer_1.1-2         
#>  [69] curl_3.2                    ComplexHeatmap_1.17.1      
#>  [71] yaml_2.1.18                 memoise_1.1.0              
#>  [73] gridExtra_2.3               KMsurv_0.1-5               
#>  [75] ggplot2_2.2.1               downloader_0.4             
#>  [77] biomaRt_2.34.2              latticeExtra_0.6-28        
#>  [79] stringi_1.1.7               RSQLite_2.1.0              
#>  [81] genefilter_1.60.0           S4Vectors_0.16.0           
#>  [83] foreach_1.4.4               RMySQL_0.10.14             
#>  [85] GenomicFeatures_1.30.3      BiocGenerics_0.24.0        
#>  [87] BiocParallel_1.12.0         shape_1.4.4                
#>  [89] GenomeInfoDb_1.14.0         rlang_0.2.0                
#>  [91] pkgconfig_2.0.1             matrixStats_0.53.1         
#>  [93] bitops_1.0-6                evaluate_0.10.1            
#>  [95] lattice_0.20-35             purrr_0.2.4                
#>  [97] bindr_0.1.1                 cmprsk_2.2-7               
#>  [99] GenomicAlignments_1.14.2    bit_1.1-12                 
#> [101] plyr_1.8.4                  magrittr_1.5               
#> [103] R6_2.2.2                    IRanges_2.12.0             
#> [105] DelayedArray_0.4.1          DBI_0.8                    
#> [107] mgcv_1.8-23                 foreign_0.8-69             
#> [109] pillar_1.2.1                survival_2.41-3            
#> [111] RCurl_1.95-4.10             tibble_1.4.2               
#> [113] EDASeq_2.12.0               survMisc_0.5.4             
#> [115] rmarkdown_1.9               GetoptLong_0.1.6           
#> [117] progress_1.1.2              locfit_1.5-9.1             
#> [119] grid_3.4.4                  sva_3.26.0                 
#> [121] data.table_1.10.4-3         blob_1.1.1                 
#> [123] ConsensusClusterPlus_1.42.0 digest_0.6.15              
#> [125] xtable_1.8-2                tidyr_0.8.0                
#> [127] R.utils_2.6.0               stats4_3.4.4               
#> [129] munsell_0.4.3               survminer_0.4.2

 

Tags: tcga, tcgabiolinks

Hi Benjamin,

It would be good to have Tiago Silva's input on this.

It seems that, at least in this case, the curation process considered the TNM stages and coded the patient as Stage IV.

Here are a few other resources that you can use to check these variables (many of which yield NA for pathologic_stage):

GenomicDataCommons
RTCGAToolbox
curatedTCGAData

Best regards, Marcel

Written 12 months ago by Marcel Ramos

Thanks for your pointers, Marcel. I'll check these and compare them!

Written 12 months ago by Benjamin Ostendorf

TCGAbiolinks has three options for getting the clinical data: the XML files, the indexed GDC files (which are populated from the XML files), and the curated data from the papers. The authors of a paper may have been able to get updated clinical information from the submitting center, whereas the GDC data might be outdated, missing, or sometimes wrong and never fixed.

I also believe any other source should have the same data as GDC. So if a value is NA and it is in neither the XML files nor the papers, the only way to get it would be to ask the submitting center whether they have that information.

Here is a report for the pathologic_stage:  http://rpubs.com/tiagochst/TCGA-SKCM
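
On the original question of whether to combine the two variables, one possible approach is sketched below. It is only an illustration under stated assumptions, not a TCGAbiolinks feature: the helpers simplify_stage() and combine_stage() are hypothetical names, the value formats ("Stage IV" for the curated column, lowercase strings such as "stage iiic" for 'tumor_stage') are assumed from the reprex output and should be checked against the real data, and preferring the curated value over the indexed one is a choice, not a rule. Only the TCGA-D9-A148 row reflects the thread; the other rows are made up.

```r
# Collapse a stage string to "Stage I" .. "Stage IV"; everything else
# ("not reported", "stage 0", NA, ...) becomes NA.
# Assumption: sub-stages are written as a trailing a/b/c, e.g. "stage iiic".
simplify_stage <- function(x) {
  x <- tolower(trimws(as.character(x)))
  pattern <- "^stage\\s+(iv|iii|ii|i)[a-c]?$"
  roman <- ifelse(grepl(pattern, x, perl = TRUE),
                  sub(pattern, "\\1", x, perl = TRUE),
                  NA_character_)
  ifelse(is.na(roman), NA_character_, paste("Stage", toupper(roman)))
}

# Prefer the curated (paper) value and fall back to the indexed GDC value.
combine_stage <- function(curated, indexed) {
  cur <- simplify_stage(curated)
  idx <- simplify_stage(indexed)
  ifelse(is.na(cur), idx, cur)
}

# Toy data with the same column names as 'clinical_joined' in the reprex.
# The "TOY-..." barcodes and their values are hypothetical.
toy <- data.frame(
  bcr_patient_barcode = c("TCGA-D9-A148", "TOY-0001", "TOY-0002", "TOY-0003"),
  CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE =
    c("Stage IV", "Stage III", NA, "Stage II"),
  tumor_stage = c("not reported", "stage iiic", "stage ib", "not reported"),
  stringsAsFactors = FALSE
)

toy$stage_combined <- combine_stage(
  toy$CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE,
  toy$tumor_stage
)
print(toy[, c("bcr_patient_barcode", "stage_combined")])
```

The same call should work on the real 'clinical_joined' data frame from the question. Before relying on the combined column, it is probably worth tabulating the two simplified variables against each other, e.g. with table(..., useNA = "ifany"), to see how often they actually disagree rather than one simply being missing.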

 

 

Written 11 months ago by Tiago Chedraoui Silva