Getting the metadata for an RNA-seq sample from TCGA when you have the UUID
@lcolladotor

Hi,

I have a bunch of uuid's from TCGA RNA-seq samples and would like to get the metadata for them. Apparently you can get some basic info by going to https://gdc-portal.nci.nih.gov/search/c?filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.program.name%22,%22value%22:%5B%22TCGA%22%5D%7D%7D%5D%7D&facetTab=cases and clicking on export, but that has failed for me the past two days. If you follow the links for one sample, you can end up at https://gdc-portal.nci.nih.gov/cases/0004d251-3f70-4395-b175-c94c2f5b1b81 where you can download the clinical info for a sample.

I've looked around at TCGAbiolinks and couldn't find a way to query GDC when I have a uuid such as FFA5FFF7-6301-4CD8-8E63-A4D8294D1B0E. Is there a way to do so with TCGAbiolinks? If not, how do you suggest I should proceed?

Thank you,

Leo

> packageVersion('TCGAbiolinks')
[1] ‘2.2.1’

tcgabiolinks tcgadownload • 4.9k views
@tiago-chedraoui-silva-8877

Hi,

No, we don't have this functionality in the package (we had it with TCGA, but they removed the API that mapped BARCODE/UUID). 

However, I couldn't find the UUID FFA5FFF7-6301-4CD8-8E63-A4D8294D1B0E in either the legacy or the harmonized database. Which field did you look in?

Best regards,

Tiago


Yes, that is the old API. It is not working anymore.

For FFA5FFF7-6301-4CD8-8E63-A4D8294D1B0E, if you have this manifest entry:

https://github.com/nellore/runs/blob/105c86de2ef91846f015f5b8285a7d6e29e0fcfc/tcga/tcga_batch_0.manifest#L236

You can use it to get to the file name (which is the submitter_id) and search for it in GDC.

/Datasets/tcga/TCGA-COAD/28033279-cc74-4775-afdf-2497f6ddb55c/analysis/154aa297-0890-4fde-a8c1-2058a4c65b28/data/UNCID_2212217.4a01323f-408b-4e74-8686-ee6d4d076ee8.110302_UNC6-RDR300211_00066_FC_62J5EAAXX_3.tar.gz 0 FFA5FFF7-6301-4CD8-8E63-A4D8294D1B0E

There is no function to map UUID to barcode in TCGAbiolinks. Since they mapped the UUID to the file id, we could create a table, but I believe that is too much work. Did you send an email to the GDC team (https://gdc.cancer.gov/contact-us)? They might have a solution.

 


I was able to create a function to map UUIDs to barcodes. I hope that helps you.

library(httr)
library(jsonlite)

# Map GDC file UUIDs to TCGA barcodes (aliquot submitter_id) using the GDC API.
# NOTE: barcodes are paired with the input UUIDs by position, so the result is
# only right if the API returns exactly one barcode per file, in input order.
getBarcode <- function(uuid, legacy = TRUE) {
  uuid <- tolower(uuid)  # the query is sent with lower-case UUIDs
  baseURL <- if (legacy) {
    "https://gdc-api.nci.nih.gov/legacy/files/?"
  } else {
    "https://gdc-api.nci.nih.gov/files/?"
  }
  # Query-string options
  options.pretty <- "pretty=true"
  options.expand <- "expand=cases.samples.portions.analytes.aliquots"
  options.field <- "fields=cases.samples.portions.analytes.aliquots.submitter_id"
  option.size <- paste0("size=", length(uuid))
  option.format <- "format=JSON"
  # Filter: files.file_id in (uuid1, uuid2, ...)
  options.filter <- paste0(
    "filters=",
    URLencode('{"op":"and","content":[{"op":"in","content":{"field":"files.file_id","value":['),
    paste0('"', paste(uuid, collapse = '","')),
    URLencode('"]}}]}')
  )
  url <- paste0(baseURL, paste(options.pretty, options.expand, option.size,
                               options.filter, options.field,
                               option.format, sep = "&"))
  # Let jsonlite fetch the URL first; fall back to httr if that fails
  json <- tryCatch(
    fromJSON(url, simplifyDataFrame = TRUE),
    error = function(e) {
      fromJSON(content(GET(url), as = "text", encoding = "UTF-8"), simplifyDataFrame = TRUE)
    }
  )
  # Flatten the hits and keep the values that look like TCGA barcodes
  df <- stack(unlist(json$data$hits))
  barcode <- df[grep("TCGA", df[, 1]), 1]
  data.frame(uuid = uuid, barcode = barcode)
}

getBarcode("ffa5fff7-6301-4cd8-8e63-a4d8294d1b0e", legacy = TRUE)
getBarcode("D04B63DE-03BA-4A63-92CA-D8054C3E238C", legacy = TRUE)
getBarcode(c("D04B63DE-03BA-4A63-92CA-D8054C3E238C", "ffa5fff7-6301-4cd8-8e63-a4d8294d1b0e"), legacy = TRUE)
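
The filter in getBarcode is assembled by pasting JSON fragments together, which is easy to get wrong when quoting. As a minimal sketch, the same filter could be built as an R list and serialized with jsonlite. The helper name buildFileFilter is hypothetical (it is not part of TCGAbiolinks or the GDC tooling), and the example makes no network request:

```r
library(jsonlite)

# Hypothetical helper: build the value of the "filters" query parameter
# for a set of file UUIDs, mirroring the structure used in getBarcode.
buildFileFilter <- function(uuid) {
  filter <- list(
    op = "and",
    content = list(
      list(
        op = "in",
        content = list(
          field = "files.file_id",
          # as.list() keeps "value" a JSON array even for a single UUID
          value = as.list(tolower(uuid))
        )
      )
    )
  )
  json <- as.character(toJSON(filter, auto_unbox = TRUE))
  # reserved = TRUE also escapes characters such as ":" and ","
  URLencode(json, reserved = TRUE)
}

encoded <- buildFileFilter("FFA5FFF7-6301-4CD8-8E63-A4D8294D1B0E")
# Decoding shows the JSON that would be sent
decoded <- URLdecode(encoded)
print(decoded)
```

The encoded string can then be appended as paste0("filters=", encoded) in place of options.filter.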

What type of metadata do you want?


Awesome! Thanks!

I'm not super familiar with TCGA, but well, basically we would like to get all the metadata associated with a given RNA-seq sample. That is, information about the person (clinical?) and the RNA-seq sample itself if there is any. Is there other information you think might be useful?


Actually, you are getting all that is available, but there are some marker papers that have already studied some of the samples. Maybe you can use those.

 


Thanks for the help. Just an update for others who may need this: the baseURL line (line 8 of the original script) has changed to:

baseURL <- ifelse(legacy,"https://api.gdc.cancer.gov/legacy/files/?","https://api.gdc.cancer.gov/files/?")

* EDIT * this code currently does not accurately translate legacy UUIDs to barcodes. I manually checked using the GDC legacy archive. Please use the code explained in Sean Davis' blog (https://seandavi.github.io/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/) for accurate translation of legacy IDs to barcodes.
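
The getBarcode function earlier in this thread pairs the input UUIDs with the returned barcodes by position. A sketch that instead keeps each barcode with the file_id of the hit it came from could look like the following. It runs on a mock response rather than the live API: the nesting of the JSON is an assumption about what the files endpoint returns when file_id is requested alongside the aliquot submitter_id, the UUIDs and barcodes are made up, and hitsToBarcodes is a hypothetical name.

```r
library(jsonlite)

# Mock response. ASSUMPTION: the real API nests aliquots like this when
# "file_id" is added to the requested fields. All IDs here are made up.
# The hits are deliberately in a different order than the query below.
mock <- '{"data":{"hits":[
  {"file_id":"bbbb","cases":[{"samples":[{"portions":[{"analytes":[{"aliquots":[
      {"submitter_id":"TCGA-BB-0002-01A-01R-0002-07"}]}]}]}]}]},
  {"file_id":"aaaa","cases":[{"samples":[{"portions":[{"analytes":[{"aliquots":[
      {"submitter_id":"TCGA-AA-0001-01A-01R-0001-07"}]}]}]}]}]}
]}}'

# Hypothetical helper: one row per (file UUID, barcode) pair, built hit by hit
hitsToBarcodes <- function(json_text) {
  hits <- fromJSON(json_text, simplifyVector = FALSE)$data$hits
  rows <- lapply(hits, function(hit) {
    ids <- unlist(hit$cases, use.names = FALSE)
    barcodes <- grep("^TCGA-", ids, value = TRUE)
    data.frame(uuid = rep(hit$file_id, length(barcodes)),
               barcode = barcodes,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

query <- c("aaaa", "bbbb")
map <- hitsToBarcodes(mock)
# Look barcodes up by UUID instead of relying on the order of the hits
ordered <- map$barcode[match(query, map$uuid)]
print(data.frame(uuid = query, barcode = ordered))
```

Note that match() returns only the first barcode when a file has several, so files with more than one aliquot would need the full map table.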
