Question: TCGA biolinks - Samples with multiple profiles

Good morning,

I am struggling to complete a simple task using TCGA related packages.

I need to obtain a manifest with all TCGA sample IDs (normal and primary tumor) and patient IDs that satisfy the following conditions:

Whole exome sequencing + DNA methylation profile (450k) + Gene expression profile (RNA-seq)

Is it possible to submit a comprehensive query for such a request, without going through a series of single manifests, filtering and merging?

Moreover, the use of case UUIDs and their relationship to the TCGA barcodes adopted in the past is still a little tricky for me, so if anyone can point me to a good resource for understanding this better, that would be great.

Thank you for your attention and help,

Gian

Written 10 months ago by franceschini.gianmarco

The way TCGAbiolinks is structured, it is not possible to do the query you requested. You would need to go through a series of single manifests, filtering and merging. I'm not sure which WXS data you wanted, or which RNA-seq, but the code below can be easily modified.

The barcode was designed to be human-readable and to encode information about the sample (center, TSS, etc.), whereas the UUID carries no such information. So far, I don't know of a trivial transformation from one to the other.
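
As an illustration of the readable layout mentioned above (this is not a query, only a sketch of how the barcode fields line up), a full aliquot barcode can be split in base R. The function name `parse_barcode` and the field labels are my own choices; the example barcode is one that appears later in this thread.

```r
# Split a full 28-character TCGA aliquot barcode into its fields.
# `parse_barcode` and the field names are illustrative, not part of any package.
parse_barcode <- function(barcode) {
    parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
    stopifnot(length(parts) == 7)  # only full aliquot barcodes are handled here
    list(
        project     = parts[1],
        tss         = parts[2],                # tissue source site
        participant = parts[3],
        sample_type = substr(parts[4], 1, 2),  # e.g. "01" = Primary Solid Tumor
        vial        = substr(parts[4], 3, 3),
        portion     = substr(parts[5], 1, 2),
        analyte     = substr(parts[5], 3, 3),  # e.g. "R" = RNA, "D" = DNA
        plate       = parts[6],
        center      = parts[7]
    )
}

fields <- parse_barcode("TCGA-OR-A5J1-01A-11R-A29S-07")
fields$sample_type
fields$analyte
```

The first three fields joined by hyphens (12 characters) give the patient barcode, and the first 15 characters are enough to recover the sample type.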

 

Written 10 months ago by Tiago Chedraoui Silva

I'm not sure if this will be helpful to you, but the "sampleMap" files in ExperimentHub (created for use by curatedTCGAData) contain the aliquot and case barcodes, and the TCGAutils package provides a utility for filtering by sample type. There's a warning in the example below because some barcodes are truncated (these get kept in the sample type filtering):

library(ExperimentHub)
eh <- ExperimentHub()
mapsEH <- query(eh, "sampleMap")
suppressMessages(maps <- lapply(seq_along(mapsEH), function(i) mapsEH[[i]]))
levels(maps[[1]]$assay)
#> [1] "ACC_RNASeq2GeneNorm-20160128"         
#> [2] "ACC_miRNASeqGene-20160128"            
#> [3] "ACC_CNASNP-20160128"                  
#> [4] "ACC_CNVSNP-20160128"                  
#> [5] "ACC_Methylation-20160128"             
#> [6] "ACC_RPPAArray-20160128"               
#> [7] "ACC_Mutation-20160128"                
#> [8] "ACC_GISTIC_AllByGene-20160128"        
#> [9] "ACC_GISTIC_ThresholdedByGene-20160128"

mapdf <- do.call(rbind, maps)
mapdf <- mapdf[grepl("Mutation|Methylation|RNASeq2GeneNorm", mapdf$assay), ]

## TCGAutils to filter by sample type
suppressPackageStartupMessages(library(TCGAutils))
TCGAutils::sampleTypes
#>    Code                                        Definition Short.Letter.Code
#> 1    01                               Primary Solid Tumor                TP
#> 2    02                             Recurrent Solid Tumor                TR
#> 3    03   Primary Blood Derived Cancer - Peripheral Blood                TB
#> 4    04      Recurrent Blood Derived Cancer - Bone Marrow              TRBM
#> 5    05                          Additional - New Primary               TAP
#> 6    06                                        Metastatic                TM
#> 7    07                             Additional Metastatic               TAM
#> 8    08                        Human Tumor Original Cells              THOC
#> 9    09        Primary Blood Derived Cancer - Bone Marrow               TBM
#> 10   10                              Blood Derived Normal                NB
#> 11   11                               Solid Tissue Normal                NT
#> 12   12                                Buccal Cell Normal               NBC
#> 13   13                           EBV Immortalized Normal              NEBV
#> 14   14                                Bone Marrow Normal               NBM
#> 15   15                                    sample type 15              15SH
#> 16   16                                    sample type 16              16SH
#> 17   20                                   Control Analyte             CELLC
#> 18   40 Recurrent Blood Derived Cancer - Peripheral Blood               TRB
#> 19   50                                        Cell Lines              CELL
#> 20   60                          Primary Xenograft Tissue                XP
#> 21   61                Cell Line Derived Xenograft Tissue               XCL
#> 22   99                                    sample type 99              99SH
keepers <- TCGAutils::TCGAsampleSelect(mapdf$colname, c("01", "03", "10", "11"))
#> Warning in TCGAutils::TCGAsampleSelect(mapdf$colname, c("01", "03", "10", :
#> Inconsistent barcode lengths: 28, 15
#> Warning in if (!sampleCode %in% sampleTypes[["Code"]]) stop("'sampleCode'
#> not in look up table"): the condition has length > 1 and only the first
#> element will be used
#> Selecting 'Primary Solid TumorPrimary Blood Derived Cancer - Peripheral BloodBlood Derived NormalSolid Tissue Normal' samples
#> Warning in (function (..., deparse.level = 1) : number of columns of result
#> is not a multiple of vector length (arg 19456)
#> Warning in barcodeSamples == sampleCode: longer object length is not a
#> multiple of shorter object length
mapdf <- mapdf[keepers, ]
head(mapdf)
#> DataFrame with 6 rows and 3 columns
#>                          assay      primary                      colname
#>                       <factor>  <character>                  <character>
#> 1 ACC_RNASeq2GeneNorm-20160128 TCGA-OR-A5J1 TCGA-OR-A5J1-01A-11R-A29S-07
#> 2 ACC_RNASeq2GeneNorm-20160128 TCGA-OR-A5J6 TCGA-OR-A5J6-01A-31R-A29S-07
#> 3 ACC_RNASeq2GeneNorm-20160128 TCGA-OR-A5JA TCGA-OR-A5JA-01A-11R-A29S-07
#> 4 ACC_RNASeq2GeneNorm-20160128 TCGA-OR-A5JE TCGA-OR-A5JE-01A-11R-A29S-07
#> 5 ACC_RNASeq2GeneNorm-20160128 TCGA-OR-A5JJ TCGA-OR-A5JJ-01A-11R-A29S-07
#> 6 ACC_RNASeq2GeneNorm-20160128 TCGA-OR-A5JO TCGA-OR-A5JO-01A-11R-A29S-07
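
A note on the warnings above: "longer object length is not a multiple of shorter object length" suggests that, in the package version used here, the vector of sample codes was compared element by element rather than by membership, so the resulting filter may not be exactly what was intended. Below is a minimal base R sketch of membership-based filtering, followed by the step of keeping only patients that have all three assay types. It uses a made-up table in the same shape as `mapdf`. The barcodes, the `toy_map` name and the `assay_type` column are assumptions for illustration only.

```r
# Toy sample map shaped like `mapdf` (assay, primary, colname).
# All barcodes are invented for illustration.
toy_map <- data.frame(
    assay = c("ACC_Mutation-20160128", "ACC_Methylation-20160128",
              "ACC_RNASeq2GeneNorm-20160128",
              "ACC_Mutation-20160128", "ACC_RNASeq2GeneNorm-20160128",
              "ACC_Mutation-20160128", "ACC_Methylation-20160128",
              "ACC_RNASeq2GeneNorm-20160128"),
    primary = c("TCGA-XX-0001", "TCGA-XX-0001", "TCGA-XX-0001",
                "TCGA-XX-0002", "TCGA-XX-0002",
                "TCGA-XX-0003", "TCGA-XX-0003", "TCGA-XX-0003"),
    colname = c("TCGA-XX-0001-01", "TCGA-XX-0001-01A-11D-0000-05",
                "TCGA-XX-0001-01A-11R-0000-07",
                "TCGA-XX-0002-01", "TCGA-XX-0002-01A-11R-0000-07",
                "TCGA-XX-0003-01", "TCGA-XX-0003-06A-11D-0000-05",
                "TCGA-XX-0003-01A-11R-0000-07"),
    stringsAsFactors = FALSE
)

# Strip the cohort prefix and date suffix, e.g.
# "ACC_RNASeq2GeneNorm-20160128" becomes "RNASeq2GeneNorm"
toy_map$assay_type <- sub("^[^_]+_(.*)-[0-9]+$", "\\1", toy_map$assay)

# Membership filter on the sample-type code (characters 14-15 of the barcode).
# This also works for barcodes truncated to 15 characters.
wanted <- c("01", "03", "10", "11")
toy_map <- toy_map[substr(toy_map$colname, 14, 15) %in% wanted, ]

# Keep only patients that still have every required assay type
required <- c("Mutation", "Methylation", "RNASeq2GeneNorm")
has_all <- vapply(split(toy_map$assay_type, toy_map$primary),
                  function(a) all(required %in% a), logical(1))
complete <- names(has_all)[has_all]
manifest <- toy_map[toy_map$primary %in% complete, ]
manifest
```

In this toy table only `TCGA-XX-0001` survives: `TCGA-XX-0002` has no methylation row, and the only methylation sample of `TCGA-XX-0003` is metastatic (code 06). Note that the matching is done per patient, so it does not guarantee that the same sample was profiled on all three platforms; comparing the first 15 characters of `colname` instead would enforce that.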
Written 10 months ago by Levi Waldron

One more note: TCGAutils also provides simplified utilities for mapping barcode <--> UUID and UUID <--> UUID, including:

UUIDtoBarcode(id_vector, id_type = c("case_id", "file_id"), end_point = "participant", legacy = FALSE) 
UUIDtoUUID(id_vector, to_type = c("case_id", "file_id"), legacy = FALSE) 
barcodeToUUID(barcodes, id_type = c("case_id", "file_id"), legacy = FALSE) 
filenameToBarcode(filenames, legacy = FALSE)

For example (note, these functions use the GDC REST API, and not all kinks have been worked out, so if you try this on all barcodes some will return errors):

barcodeToUUID(head(mapdf$colname), id_type="file_id")
#>   cases.samples.portions.analytes.aliquots.submitter_id
#> 1                          TCGA-OR-A5J1-01A-11R-A29S-07
#> 2                          TCGA-OR-A5J6-01A-31R-A29S-07
#> 3                          TCGA-OR-A5JA-01A-11R-A29S-07
#> 4                          TCGA-OR-A5JE-01A-11R-A29S-07
#> 5                          TCGA-OR-A5JJ-01A-11R-A29S-07
#> 6                          TCGA-OR-A5JO-01A-11R-A29S-07
#>                                file_id
#> 1 35ed7a20-6bff-4a88-9b9e-567a5c34bcce
#> 2 a9cb071e-56e9-4827-8854-0a532822ebc1
#> 3 1d590469-3460-4c4a-8d89-916ab3fe7125
#> 4 5c3fd7e2-7fa1-4a10-82bd-57089e9705f6
#> 5 1df9d660-aebf-4b54-a4d9-d9f271a1f0b8
#> 6 6de4db8e-9b9b-4080-9ccc-e9a92344e944
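
Since some lookups may return errors, one way to run over a long list of barcodes without losing everything to a single failure is to wrap each call in `tryCatch`. The sketch below is base R only. `fake_lookup` is a hypothetical stand-in for a REST-backed function (it is not a TCGAutils function), and `safe_lookup` is a helper name made up for this example.

```r
# Hypothetical stand-in for a REST-backed lookup that fails for some IDs.
fake_lookup <- function(barcode) {
    if (grepl("BAD", barcode, fixed = TRUE)) stop("no record for ", barcode)
    paste0("uuid-for-", barcode)
}

# Apply `lookup` to each ID separately; return NA where the call errors,
# so one failing ID does not abort the whole run.
safe_lookup <- function(ids, lookup) {
    vapply(ids, function(id) {
        tryCatch(lookup(id), error = function(e) NA_character_)
    }, character(1), USE.NAMES = FALSE)
}

ids <- c("TCGA-OR-A5J1-01A-11R-A29S-07",
         "TCGA-BAD",
         "TCGA-OR-A5J6-01A-31R-A29S-07")
result <- safe_lookup(ids, fake_lookup)

# IDs that failed and may need to be retried or checked by hand
failed <- ids[is.na(result)]
failed
```

With a real lookup you would pass a small wrapper that calls the mapping function on a single ID and extracts the column you need. One request per ID is slower than a batched call, so this is probably best kept for retrying the IDs that failed in a batch.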
Written 10 months ago by Levi Waldron (modified 10 months ago)

Thank you very much for both of your answers; that was exactly what I needed!
Best

Written 10 months ago by franceschini.gianmarco (modified 10 months ago)