Question: Retrieve accurate tumor stage data with TCGAbiolinks
19 months ago, Benjamin Ostendorf (United States) wrote:

Hi everyone, 

I'm wondering how to retrieve the most accurate information on tumor stage at diagnosis for the TCGA-SKCM dataset using TCGAbiolinks. I only need the stage in simplified form (stage I-IV).

I retrieved the variable 'tumor_stage' as part of the indexed clinical data (using 'GDCquery_clinic'; see the full reprex below). However, there is also a variable 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' in the data provided by the original paper describing the TCGA-SKCM cohort (doi:10.1016/j.cell.2015.05.044), which TCGAbiolinks can pull using 'TCGAquery_subtype'.

While these two variables largely concur, there are some cases with a valid entry in 'CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE' and 'not reported' in 'tumor_stage' (e.g., patient 'TCGA-D9-A148'). Should I generate a new variable combining the two, or is there a specific reason for the discrepancy?

Thanks very much,
Ben

 

library(TCGAbiolinks)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

## Download indexed clinical data
clinical <- GDCquery_clinic(project = "TCGA-SKCM", type = "clinical")

## Download curated stage information
curated_stages <- TCGAquery_subtype(tumor = "skcm") %>% dplyr::rename(., bcr_patient_barcode = patient)
#> skcm subtype information from:doi:10.1016/j.cell.2015.05.044

## join clinical data from GDCquery_clinic and TCGAquery_subtype retrievals
clinical_joined <- left_join(clinical, curated_stages, by = "bcr_patient_barcode") %>% 
  dplyr::select(bcr_patient_barcode, CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE, 
    tumor_stage)

## show specific case with non-matching entries
clinical_joined[clinical_joined$bcr_patient_barcode == "TCGA-D9-A148", ]
#>     bcr_patient_barcode CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE
#> 102        TCGA-D9-A148                                           Stage IV
#>      tumor_stage
#> 102 not reported

sessionInfo()
#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_0.7.4         TCGAbiolinks_2.6.12
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_1.3-2            selectr_0.4-0              
#>   [3] rjson_0.2.15                hwriter_1.3.2              
#>   [5] rprojroot_1.3-2             circlize_0.4.3             
#>   [7] XVector_0.18.0              GenomicRanges_1.30.3       
#>   [9] GlobalOptions_0.0.13        ggpubr_0.1.6               
#>  [11] matlab_1.0.2                ggrepel_0.7.0              
#>  [13] bit64_0.9-7                 AnnotationDbi_1.40.0       
#>  [15] xml2_1.2.0                  codetools_0.2-15           
#>  [17] splines_3.4.4               R.methodsS3_1.7.1          
#>  [19] mnormt_1.5-5                doParallel_1.0.11          
#>  [21] DESeq_1.30.0                geneplotter_1.56.0         
#>  [23] knitr_1.20                  jsonlite_1.5               
#>  [25] Rsamtools_1.30.0            km.ci_0.5-2                
#>  [27] broom_0.4.4                 annotate_1.56.2            
#>  [29] cluster_2.0.7               R.oo_1.21.0                
#>  [31] readr_1.1.1                 compiler_3.4.4             
#>  [33] httr_1.3.1                  backports_1.1.2            
#>  [35] assertthat_0.2.0            Matrix_1.2-12              
#>  [37] lazyeval_0.2.1              limma_3.34.9               
#>  [39] formatR_1.5                 htmltools_0.3.6            
#>  [41] prettyunits_1.0.2           tools_3.4.4                
#>  [43] bindrcpp_0.2.2              gtable_0.2.0               
#>  [45] glue_1.2.0                  GenomeInfoDbData_1.0.0     
#>  [47] reshape2_1.4.3              ggthemes_3.4.0             
#>  [49] ShortRead_1.36.1            Rcpp_0.12.16               
#>  [51] Biobase_2.38.0              Biostrings_2.46.0          
#>  [53] nlme_3.1-131.1              rtracklayer_1.38.3         
#>  [55] iterators_1.0.9             psych_1.8.3.3              
#>  [57] stringr_1.3.0               rvest_0.3.2                
#>  [59] XML_3.98-1.10               edgeR_3.20.9               
#>  [61] zoo_1.8-1                   zlibbioc_1.24.0            
#>  [63] scales_0.5.0                aroma.light_3.8.0          
#>  [65] hms_0.4.2                   parallel_3.4.4             
#>  [67] SummarizedExperiment_1.8.1  RColorBrewer_1.1-2         
#>  [69] curl_3.2                    ComplexHeatmap_1.17.1      
#>  [71] yaml_2.1.18                 memoise_1.1.0              
#>  [73] gridExtra_2.3               KMsurv_0.1-5               
#>  [75] ggplot2_2.2.1               downloader_0.4             
#>  [77] biomaRt_2.34.2              latticeExtra_0.6-28        
#>  [79] stringi_1.1.7               RSQLite_2.1.0              
#>  [81] genefilter_1.60.0           S4Vectors_0.16.0           
#>  [83] foreach_1.4.4               RMySQL_0.10.14             
#>  [85] GenomicFeatures_1.30.3      BiocGenerics_0.24.0        
#>  [87] BiocParallel_1.12.0         shape_1.4.4                
#>  [89] GenomeInfoDb_1.14.0         rlang_0.2.0                
#>  [91] pkgconfig_2.0.1             matrixStats_0.53.1         
#>  [93] bitops_1.0-6                evaluate_0.10.1            
#>  [95] lattice_0.20-35             purrr_0.2.4                
#>  [97] bindr_0.1.1                 cmprsk_2.2-7               
#>  [99] GenomicAlignments_1.14.2    bit_1.1-12                 
#> [101] plyr_1.8.4                  magrittr_1.5               
#> [103] R6_2.2.2                    IRanges_2.12.0             
#> [105] DelayedArray_0.4.1          DBI_0.8                    
#> [107] mgcv_1.8-23                 foreign_0.8-69             
#> [109] pillar_1.2.1                survival_2.41-3            
#> [111] RCurl_1.95-4.10             tibble_1.4.2               
#> [113] EDASeq_2.12.0               survMisc_0.5.4             
#> [115] rmarkdown_1.9               GetoptLong_0.1.6           
#> [117] progress_1.1.2              locfit_1.5-9.1             
#> [119] grid_3.4.4                  sva_3.26.0                 
#> [121] data.table_1.10.4-3         blob_1.1.1                 
#> [123] ConsensusClusterPlus_1.42.0 digest_0.6.15              
#> [125] xtable_1.8-2                tidyr_0.8.0                
#> [127] R.utils_2.6.0               stats4_3.4.4               
#> [129] munsell_0.4.3               survminer_0.4.2

 


Hi Benjamin,

It would be good to have Tiago Silva's input on this.

It seems that, at least for this case, the curation process considered the TNM stages and coded the case as Stage IV.

Here are a few other resources that you can use to check these variables (many of which yield NA for pathologic_stage):

GenomicDataCommons
RTCGAToolbox
curatedTCGAData
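Before bringing in other sources, a cross-tabulation is a quick way to see how often the two TCGAbiolinks columns differ. The sketch below uses a small made-up table in place of the real 'clinical_joined' object from the question; the barcodes (P1 to P4) and the stage values are illustrative assumptions, not TCGA data.

```r
## Toy stand-in for 'clinical_joined' (made-up barcodes and values)
toy <- data.frame(
  bcr_patient_barcode = c("P1", "P2", "P3", "P4"),
  CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE =
    c("Stage IV", "Stage II", NA, "Stage III"),
  tumor_stage = c("not reported", "stage iia", "stage ib", "not reported"),
  stringsAsFactors = FALSE
)

## Cross-tabulate the two columns, keeping NA as its own category
xtab <- table(
  curated = toy$CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE,
  indexed = toy$tumor_stage,
  useNA = "ifany"
)
print(xtab)

## Patients with a curated stage but 'not reported' in the indexed data
has_curated <- !is.na(toy$CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE)
only_curated <- toy$bcr_patient_barcode[has_curated & toy$tumor_stage == "not reported"]
print(only_curated)
```

Running the same two steps on the real 'clinical_joined' table should list every patient in the same situation as 'TCGA-D9-A148'.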

Best regards, Marcel

written 19 months ago by Marcel Ramos (modified 19 months ago)

Thanks for your pointers, Marcel -  I'll check these and compare them! 

written 19 months ago by Benjamin Ostendorf

TCGAbiolinks has three options for getting the clinical data: the XML files, the indexed GDC files (which are populated from the XML files), and the curated data from the papers. The curated data can differ because the paper authors may have been able to get updated clinical information from the submitting center. If the information is in neither the XML files nor the papers, the data may be outdated, missing, or sometimes wrong and never fixed.

Also, I believe any other source should have the same data as GDC. So, if a value is NA and it is not in the XML files, the only option would be to ask the submitting center whether they have that information.

Here is a report for the pathologic_stage:  http://rpubs.com/tiagochst/TCGA-SKCM
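If the goal is a single stage I-IV variable, one possible approach is to simplify both columns to the same labels and fill gaps in one from the other. The sketch below is only an illustration on made-up data: the helper 'simplify_stage', the toy barcodes, and the assumption that indexed values look like 'stage iiic' are mine, and preferring the curated value over the indexed one is a judgment call, not something TCGAbiolinks prescribes.

```r
## Toy stand-in for 'clinical_joined' (made-up barcodes and values)
toy <- data.frame(
  bcr_patient_barcode = c("P1", "P2", "P3", "P4", "P5"),
  CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE =
    c("Stage IV", "Stage II", NA, "Stage III", "Stage II"),
  tumor_stage =
    c("not reported", "stage iia", "stage ib", "stage iiic", "stage iiia"),
  stringsAsFactors = FALSE
)

## Hypothetical helper: reduce a label to "Stage I".."Stage IV".
## Anything else (e.g. "not reported", "stage 0", NA) becomes NA.
simplify_stage <- function(x) {
  roman <- sub("^stage\\s+(iv|i{1,3})[abc]?$", "\\1", tolower(trimws(x)))
  roman <- toupper(roman)
  ifelse(roman %in% c("I", "II", "III", "IV"),
         paste("Stage", roman), NA_character_)
}

curated <- simplify_stage(toy$CURATED_PATHOLOGIC_STAGE_AJCC7_AT_DIAGNOSIS_SIMPLE)
indexed <- simplify_stage(toy$tumor_stage)

## Assumption: take the curated value when present, else the indexed one
toy$stage_combined <- ifelse(is.na(curated), indexed, curated)

## Flag patients where both sources report a stage but disagree
toy$stage_conflict <- !is.na(curated) & !is.na(indexed) & curated != indexed

print(toy[, c("bcr_patient_barcode", "stage_combined", "stage_conflict")])
```

Rows flagged in 'stage_conflict' are worth checking by hand against the XML files before trusting either value.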

 

 

written 18 months ago by Tiago Chedraoui Silva (modified 18 months ago)