Question

Getting a flat metadata table from the XML files in TCGA: is it feasible?

0

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 1 day ago

United States

Hi,

This is a follow question related to Getting the metadata for a RNA-seq sample from TCGA when you have the uuid.

I have a list of barcodes and would like to get all the metadata possible for them in a flat table (or maybe a DataFrame with some list columns) from TCGA. The issue is that the _indexed_ data available from TCGA is incomplete so downloading a json file via the GDC api or with `TCGA::GDCquery_clinic()` results in incomplete data. Trying to do so via Firebrowse leaves a few (42) barcodes missing as described in detail at https://groups.google.com/a/broadinstitute.org/forum/?hl=en#!topic/gdac-users/JjH6vMEJHpQ plus tables with different numbers of columns. Actually, 38 of those 42 have no clinical data on the GDC website and matches the results from using `TCGAbiolinks` on those 42 barcodes (see gist.github.com/lcolladotor/ 18c60a7e42b290cabd555aeda3e19ea7 -- I broke up the link so it won't embed the whole thing here). As a note, it would be nice if `GDCquery()` could say that the barcode exists but that there is no data, or that you made a typo on the project id.

library('TCGAbiolinks')
projects <- getGDCprojects()
clin <- lapply(projects$project_id[!grepl('TARGET', projects$project_id)], GDCquery_clinic, type = "clinical")
clin <- do.call(rbind, clin)

The above works for getting a large table of clinical data, but it's the _indexed_ data (using the terms from the TCGAbiolinks vignette). This data is missing information about recurrence as noted by one of our collaborators:

> dim(clin)
[1] 11160    52
> summary(clin$days_to_recurrence)
   Mode    NA's 
logical   11160 
> table(clin$progression_or_recurrence, useNA = 'ifany')

not reported 
       11160

Using the objects from the google group link, I can get the barcodes for some samples that do have recurrence information in Firebrowse. For example:

> meta.list$PRAD$tcga_participant_barcode[meta.list$PRAD$biochemical_recurrence == 'yes'][1:6]
[1] "TCGA-V1-A9ZI" "TCGA-2A-A8W3" "TCGA-EJ-A65F" "TCGA-EJ-A8FP" NA
[6] "TCGA-YL-A8S8"

The first one leads to https://gdc-portal.nci.nih.gov/cases/3d92b872-a76a-49cd-9fd5-7993e78d2fb0 which does not show the recurrence information. However, if you find the clinical xml file https://gdc-portal.nci.nih.gov/files/11e640f2-a7a6-4693-95a3-2137274ef0b1 you'll see it there. With `TCGAbiolinks` I can get this information, which is nice:

> query <- GDCquery(project = 'TCGA-PRAD', 'Clinical', barcode = 'TCGA-V1-A9ZI')
Accessing GDC. This might take a while...
> GDCdownload(query)
GDCdownload will download: 32.875 KB
Downloading as: nationwidechildrens.org_clinical.TCGA-V1-A9ZI.xml
  |===========================================================================================================================================================================| 100%[1] 1

> clinical$biochemical_recurrence
[1] YES
Levels: YES

From the `TCGAbiolinks` vignette at http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#gdcquery_clinic-and-gdcprepare_clinic-working-with-clinical-data. if you find the example for `clinical.patient.followup` you'll notice that it has 2 rows. That would make it hard to flatten out.

Is there code in `TCGAbiolinks` for creating this flat metadata table from the XML files? If not, do you think it's feasible? Ideally we want all the clinical and biospecimen information.

Note that the `clin` object has 11,160 rows, which is different from the 11,285 we get with our GDC api query but more than the number of unique barcodes we have. That's fine as it turns out that only 38 ids are missing with TGCAquery (those 38 that don't have clinical info) and we could leave them as NAs.

> dim(meta)
[1] 11285   222
> length(unique(meta$cases.submitter_id))
[1] 10341
> table(!is.na(match(unique(meta$cases.submitter_id), clin$submitter_id)))

FALSE  TRUE
   38 10303

Thanks,

Leo

> packageVersion('TCGAbiolinks')
[1] ‘2.2.5’

> R.version
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status         Patched
major          3
minor          3.1
year           2016
month          09
day            30
svn rev        71426
language       R
version.string R version 3.3.1 Patched (2016-09-30 r71426)
nickname       Bug in Your Hair

tcgabiolinks • 4.1k views

ADD COMMENT • link updated 8.1 years ago by Tiago Chedraoui Silva ▴ 260 • written 8.1 years ago by Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

I don't know if this helps, but we created an tab-separated version of much of the TCGA clinical data for this project: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944

ADD REPLY • link 8.1 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Thanks Stephen! I've seen it before but saw that you required biotab files https://wiki.nci.nih.gov/display/tcga/biotab which I don't think are available anymore. Seeing your code again at https://github.com/srp33/TCGA_RNASeq_Clinical/blob/master/Codes/ProcessClinicalData.R got me thinking that maybe I could get it to work with the XML files with the help of TCGAbiolinks.

I'll wait a bit for now and see what others reply.

ADD REPLY • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

score 2 · Accepted Answer · 2016-11-11

2

Entering edit mode

Tiago Chedraoui Silva ▴ 260

@tiago-chedraoui-silva-8877

Last seen 4.3 years ago

Brazil - University of São Paulo/ Los A…

Hi Leo,

I believe TCGA analysis team has the big matrix in synapse (https://www.synapse.org/) the easiest way is to get there (if you have access).

Yes, the indexed clinical data from GDC has some problems as you said.

With TCGAbiolinks you can get what was each txt file from the old TCGA. We decided not to create a big version of the matrix

because the relation is not 1 to 1 for each content. One patient might have multiple follow ups, multiple radiations therapies etc...which is quite difficult to manage in a big flat table (that's why xml and json are better representations).

You could merge the contents duplicating the rows. Something like this:

But I'm not sure this will solve your problem.

ADD COMMENT • link 8.1 years ago Tiago Chedraoui Silva ▴ 260

0

Entering edit mode

Thanks Tiago for your example code. Using `TCGAbiolinks` version 2.3.8 and the following code I was able to get the main (`patient`) clinical information. The code for merging the other types of information will be useful as well, but for our use in `recount` we'll let users decide how to merge.

The code below sometimes fails for a given project, but if you rerun the loop for that project it eventually works. Note that `TCGA-LAML` did not work with the Bioc-release version of `TCGAbiolinks` (2.2.5) due to an issue with the internal variable `i`. My impression is that it did work but failed at the end when a message is printed on how to get the rest of the information. In any case, that bug is not present in Bioc-devel.

library('TCGAbiolinks')
library('plyr')
library('devtools')
projects <- TCGAbiolinks:::getGDCprojects()$project_id
projects <- projects[grepl('^TCGA', projects, perl = TRUE)]

clin <- lapply(projects, function(p) {
    message(paste(Sys.time(), 'processing project', p))
    result <- tryCatch({
        query <- GDCquery(project = p, data.category = 'Clinical')
        GDCdownload(query)
        GDCprepare_clinic(query, clinical.info = 'patient')
    }, error = function(e) {
        message(paste0('Error clinical: ', p))
        return(NULL)
    })
    return(result)
})
names(clin) <- projects

## Merge all
clin_all <- rbind.fill(clin)

## Fix columns that have '' that should be NAs
for(j in seq_len(ncol(clin_all))) {
    i <- which(clin_all[, j] == '')
    if(length(i) > 0) clin_all[i, j] <- NA
}

save(clin_all, file = 'clin_all.Rdata')

write.table(clin_all, file = 'clin_all.tsv', quote = FALSE, row.names = FALSE,
    sep = '\t')

## Reproducibility info
Sys.time()
options(width = 120)
session_info()

ADD REPLY • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Thanks, TCGA-LAML apparently does not have all the clinical information in the XML. I rewrote the code to improve it and also handle if there is a problem in the connection to GDC.