Hi,
This is a follow question related to Getting the metadata for a RNA-seq sample from TCGA when you have the uuid.
I have a list of barcodes and would like to get all the metadata possible for them in a flat table (or maybe a DataFrame with some list columns) from TCGA. The issue is that the _indexed_ data available from TCGA is incomplete so downloading a json file via the GDC api or with `TCGA::GDCquery_clinic()` results in incomplete data. Trying to do so via Firebrowse leaves a few (42) barcodes missing as described in detail at https://groups.google.com/a/broadinstitute.org/forum/?hl=en#!topic/gdac-users/JjH6vMEJHpQ plus tables with different numbers of columns. Actually, 38 of those 42 have no clinical data on the GDC website and matches the results from using `TCGAbiolinks` on those 42 barcodes (see gist.github.com/lcolladotor/ 18c60a7e42b290cabd555aeda3e19ea7 -- I broke up the link so it won't embed the whole thing here). As a note, it would be nice if `GDCquery()` could say that the barcode exists but that there is no data, or that you made a typo on the project id.
library('TCGAbiolinks') projects <- getGDCprojects() clin <- lapply(projects$project_id[!grepl('TARGET', projects$project_id)], GDCquery_clinic, type = "clinical") clin <- do.call(rbind, clin)
The above works for getting a large table of clinical data, but it's the _indexed_ data (using the terms from the TCGAbiolinks vignette). This data is missing information about recurrence as noted by one of our collaborators:
> dim(clin) [1] 11160 52 > summary(clin$days_to_recurrence) Mode NA's logical 11160 > table(clin$progression_or_recurrence, useNA = 'ifany') not reported 11160
Using the objects from the google group link, I can get the barcodes for some samples that do have recurrence information in Firebrowse. For example:
> meta.list$PRAD$tcga_participant_barcode[meta.list$PRAD$biochemical_recurrence == 'yes'][1:6] [1] "TCGA-V1-A9ZI" "TCGA-2A-A8W3" "TCGA-EJ-A65F" "TCGA-EJ-A8FP" NA [6] "TCGA-YL-A8S8"
The first one leads to https://gdc-portal.nci.nih.gov/cases/3d92b872-a76a-49cd-9fd5-7993e78d2fb0 which does not show the recurrence information. However, if you find the clinical xml file https://gdc-portal.nci.nih.gov/files/11e640f2-a7a6-4693-95a3-2137274ef0b1 you'll see it there. With `TCGAbiolinks` I can get this information, which is nice:
> query <- GDCquery(project = 'TCGA-PRAD', 'Clinical', barcode = 'TCGA-V1-A9ZI') Accessing GDC. This might take a while... > GDCdownload(query) GDCdownload will download: 32.875 KB Downloading as: nationwidechildrens.org_clinical.TCGA-V1-A9ZI.xml |===========================================================================================================================================================================| 100%[1] 1 > clinical$biochemical_recurrence [1] YES Levels: YES
From the `TCGAbiolinks` vignette at http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#gdcquery_clinic-and-gdcprepare_clinic-working-with-clinical-data. if you find the example for `clinical.patient.followup` you'll notice that it has 2 rows. That would make it hard to flatten out.
Is there code in `TCGAbiolinks` for creating this flat metadata table from the XML files? If not, do you think it's feasible? Ideally we want all the clinical and biospecimen information.
Note that the `clin` object has 11,160 rows, which is different from the 11,285 we get with our GDC api query but more than the number of unique barcodes we have. That's fine as it turns out that only 38 ids are missing with TGCAquery (those 38 that don't have clinical info) and we could leave them as NAs.
> dim(meta) [1] 11285 222 > length(unique(meta$cases.submitter_id)) [1] 10341 > table(!is.na(match(unique(meta$cases.submitter_id), clin$submitter_id))) FALSE TRUE 38 10303
Thanks,
Leo
> packageVersion('TCGAbiolinks') [1] ‘2.2.5’ > R.version _ platform x86_64-pc-linux-gnu arch x86_64 os linux-gnu system x86_64, linux-gnu status Patched major 3 minor 3.1 year 2016 month 09 day 30 svn rev 71426 language R version.string R version 3.3.1 Patched (2016-09-30 r71426) nickname Bug in Your Hair
I don't know if this helps, but we created an tab-separated version of much of the TCGA clinical data for this project: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
Thanks Stephen! I've seen it before but saw that you required biotab files https://wiki.nci.nih.gov/display/tcga/biotab which I don't think are available anymore. Seeing your code again at https://github.com/srp33/TCGA_RNASeq_Clinical/blob/master/Codes/ProcessClinicalData.R got me thinking that maybe I could get it to work with the XML files with the help of TCGAbiolinks.
I'll wait a bit for now and see what others reply.