Querying specific TCGA datasets through the MultiAssayExperiment R package
svlachavas ▴ 780
@svlachavas-7225
Germany/Heidelberg/German Cancer Resear…

Dear Community,

Based on a recent machine learning approach to cancer transcriptomic data, my basic goal is to use multi-omics experiments from a TCGA dataset (for example COAD, later expanding to other cancers) for two main analyses, based on the R packages curatedCRCData and MultiAssayExperiment, which include various types of omics data for each of the TCGA cancers:

A) For a selected signature of 12 genes, to perform an analysis such as a multivariate Cox regression including RNA-seq, copy number, and pathology, and to test each of these genes for significant results, like the EZH2 gene as described in the following tutorial:

B) The second, also very important, part is to perform a correlation analysis of RNA-seq data against copy number variation data, again for this subset of genes, and to rank the genes that show the highest and most important correlation.

Thus:

1) For an initial investigation of the COAD dataset with the R package MultiAssayExperiment, I only found this link:

Or are there any more recent updates?

2) Assuming my approach above is valid, and based on the link above, should I proceed with the following code?

library(MultiAssayExperiment)
library(RaggedExperiment)
library(SummarizedExperiment)

# accCOAD is assumed to have been loaded already from the file at the link above
accCOAD <- updateObject(accCOAD)

3) My final and perhaps most crucial question:

If my understanding is correct, the data included in the above links and repositories for these datasets come from the Firehose repository, right?

And so these are the legacy hg19 data from the original publications, without any updates to the survival rates or the protocols, right?

The reason I ask is that, for my current project, I have already performed an analysis on the hg38 provisional TCGA data in the COAD dataset, but only on the transcriptomics layer, for my aforementioned signature. Thus, the possibility of using MultiAssayExperiment to interrogate different omics layers at the same time would be a great asset for my purpose.

However, how should I interpret the difference in genome build? I mean, if I perform a multivariate Cox analysis for survival on the hg19 data and find significant genes that have already shown survival significance in hg38, would that strengthen my results and illustrate that they are robust regardless of the different protocols/technology used?

Thank you very much for your time and consideration on this matter; I look forward to your comments and suggestions!

Kind Regards,

Efstathios-Iason

@marcel-ramos-7325
United States

Hi Efstathios-Iason,

1) As you may be aware, there are quite a number of resources for accessing TCGA data. I would first point you to the curatedTCGAData Bioconductor package. This will give you access to COAD data as MultiAssayExperiment objects.

You could also use curatedCRCData but you'd have to package the data into a MultiAssayExperiment object yourself. There are other packages available with data in various forms including (but not limited to) GenomicDataCommons, TCGAbiolinks, and RTCGAToolbox but you'd have to build the MultiAssayExperiment data object yourself.

The link that you provided contains alpha builds of the datasets in curatedTCGAData and you should use the data in the package rather than in the link.

2) If you do decide to use curatedTCGAData, your code would look like:

library(curatedTCGAData)
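
A minimal sketch of such a call, assuming the current `curatedTCGAData()` signature. The assay names (`RNASeq2GeneNorm`, `GISTIC_ThresholdedByGene`) and the `version` value are assumptions and not taken from this thread; the dry run lists what is actually available for COAD before anything is downloaded.

```r
library(curatedTCGAData)

# Dry run: list the COAD assays available, without downloading anything.
# The version string is an assumption; check the package help for valid values.
curatedTCGAData(diseaseCode = "COAD", assays = "*",
                version = "2.0.1", dry.run = TRUE)

# Download the selected assays as a single MultiAssayExperiment object.
# The assay names below are assumptions; pick them from the dry-run listing.
coad <- curatedTCGAData(
  diseaseCode = "COAD",
  assays = c("RNASeq2GeneNorm", "GISTIC_ThresholdedByGene"),
  version = "2.0.1",
  dry.run = FALSE
)

# Printing the object summarizes its experiments and sample counts
coad
```

This needs network access and the Bioconductor ExperimentHub cache, so the first call may take a while.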


3) If you are looking for hg38 data you may be better off using the GenomicDataCommons since this package works specifically with the GDC API and can also provide legacy archive data. curatedTCGAData serves data mostly from the Firehose data pipeline along with some curated subtype information.

A & B) A can be done with the subsetByRow function or mae[gene_vector, , ] type of subsetting, as long as all your experiments have rowname annotations. For B, you'd have to work with the assays function on a MultiAssayExperiment to get a List of matrices for doing correlations.
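
As a rough illustration of both points, here is a sketch on a small simulated MultiAssayExperiment rather than real TCGA data. The gene names other than EZH2, the assay names `RNASeq` and `CopyNumber`, and the clinical columns `time` and `status` are all hypothetical.

```r
library(MultiAssayExperiment)
library(survival)

set.seed(1)
sig_genes <- c("EZH2", "KRAS", "TP53")   # hypothetical gene signature
all_genes <- c(sig_genes, "GENE_X")      # one extra gene outside the signature
patients  <- paste0("P", 1:60)

# Simulated copy-number matrix (genes x patients)
cn <- matrix(rnorm(length(all_genes) * length(patients)),
             nrow = length(all_genes),
             dimnames = list(all_genes, patients))

# Simulated expression: each gene tracks its copy number with a different slope
slope <- c(1.5, 0.5, 0.1, 0)
expr <- cn * slope + matrix(rnorm(length(cn), sd = 0.5), nrow = nrow(cn))
dimnames(expr) <- dimnames(cn)

# Simulated clinical data; row names must match the assay column names
clin <- S4Vectors::DataFrame(
  time   = rexp(length(patients), rate = 0.1),
  status = rbinom(length(patients), 1, 0.7),
  row.names = patients
)

mae <- MultiAssayExperiment(
  experiments = list(RNASeq = expr, CopyNumber = cn),
  colData = clin
)

# Keep only the signature genes in every experiment
sig <- mae[sig_genes, , ]

# B) Extract the matrices and correlate expression with copy number per gene
mats <- assays(sig)
e    <- mats[["RNASeq"]]
cnv  <- mats[["CopyNumber"]]
common <- intersect(rownames(e), rownames(cnv))
rho <- vapply(common,
              function(g) cor(e[g, ], cnv[g, ], method = "spearman"),
              numeric(1))
ranked <- sort(rho, decreasing = TRUE)   # genes ranked by correlation
print(ranked)

# A) Cox model for one gene, using its expression and copy number as covariates
clin_df <- as.data.frame(colData(sig))
clin_df$EZH2_expr <- e["EZH2", rownames(clin_df)]
clin_df$EZH2_cn   <- cnv["EZH2", rownames(clin_df)]
fit <- coxph(Surv(time, status) ~ EZH2_expr + EZH2_cn, data = clin_df)
print(summary(fit)$coefficients)
```

In this toy example the assay columns are the patient identifiers themselves and appear in the same order in both matrices. With real TCGA data the columns are sample barcodes that differ between assays, so the samples would first need to be matched per patient before correlating or merging with the clinical data.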

Note. This workflow link may be of interest: https://master.bioconductor.org/packages/release/workflows/vignettes/liftOver/inst/doc/liftov.html

Regards, Marcel


Dear Marcel,

1) Yes, I'm a frequent user of TCGAbiolinks, mainly for analyzing transcriptomic and mutational data. However, there are various discrepancies when moving from hg19 to hg38, such as the absence of GISTIC data that could be used for the correlation analysis with gene expression. That is the main reason I would like to use a "multi-data" container of various omics representations, so as to avoid significant computational time and pitfalls in trying to isolate different molecular layers and then integrate them.

2) Based on your answer, should I avoid using the link and the related code I posted? And based on the code you provided:

is the final coad object actually the required MultiAssayExperiment object, similar to the one described in the tutorial I posted?

3) Thank you also for suggesting liftOver. I'm aware of it, mainly for mutational data, but I'm a bit reluctant to use it because a lot of information is lost.


2) Yes, curatedTCGAData returns a MultiAssayExperiment data object.


Thank you, Marcel, for your updated answer. Just one extra point that I forgot to mention in my reply: concerning the data that the MultiAssayExperiment contains, they are indeed hg19, correct? And for the Cox proportional hazards model, if the hg19 analysis finds the same subset of genes significantly related to survival as the hg38 analysis, in your opinion could I state that they are robust regardless of the technology/genome used?


They're based on the Firehose output, which says in its FAQ that:

Q: What reference genome build are you using?

A: We match the reference genome used in our analyses to the reference used to generate the data as appropriate. Our understanding is that TCGA standards stipulate that OV, COAD/READ, and LAML data are hg18, and all else is hg19. caveat: SNP6 copy number data is available in both hg18 and hg19 for all tumor cohorts, so we use hg19 for copy number analyses in all cases.

Someone correct me if I'm wrong, but if you are using data summarized at the gene level, such as the level 3 RNA-seq data provided by curatedTCGAData, hg19 vs GRCh38 is not going to affect your analysis. Same for the level 3 mutation data which are summarized by genes. The genome build would affect the coordinates of mutations from VCF files, and copy number alterations, for example.