Dear Community,
Based on a recent machine learning approach to cancer transcriptomic data, my basic goal is to use the various multi-omics experiments available for a TCGA dataset (for example COAD, later expanding to other cancers) for two main analyses. These would be based on the R packages curatedCRCData and MultiAssayExperiment, which include various types of omics data for each of the TCGA cancers:
A) For a selected signature of 12 genes, perform an analysis such as a multivariate Cox regression including RNA-seq, copy number, and pathology data, and test whether any of these genes gives a significant result, like the EZH2 gene as described in the following tutorial:
B) The second, also very important, part is to perform a correlation analysis of RNA-seq data against copy number variation data, again for this subset of genes, and to rank the genes that show the highest and most significant correlation.
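As a rough illustration of part B, here is a minimal sketch in base R on simulated data. The gene names (`GENE1` to `GENE12`), the matrices, and the helper `rank_cn_expr` are hypothetical placeholders, not part of any package; in practice the two matrices would be the RNA-seq and gene-level copy-number assays restricted to the signature genes.

```r
set.seed(1)

# Hypothetical placeholders for the 12-gene signature and 80 samples.
genes   <- paste0("GENE", 1:12)
samples <- paste0("S", 1:80)

# Simulated gene x sample matrices (all values are made up).
cn <- matrix(rnorm(12 * 80), nrow = 12, dimnames = list(genes, samples))
# Row i of 'cn' is scaled by strength[i], so later genes track copy number more.
strength <- seq(0, 1.1, length.out = 12)
expr <- cn * strength +
  matrix(rnorm(12 * 80), nrow = 12, dimnames = list(genes, samples))

# Correlate expression with copy number gene by gene, then rank the genes.
rank_cn_expr <- function(expr, cn, method = "spearman") {
  shared_genes   <- intersect(rownames(expr), rownames(cn))
  shared_samples <- intersect(colnames(expr), colnames(cn))
  res <- lapply(shared_genes, function(g) {
    ct <- suppressWarnings(
      cor.test(expr[g, shared_samples], cn[g, shared_samples], method = method)
    )
    data.frame(gene = g, rho = unname(ct$estimate), p.value = ct$p.value)
  })
  res <- do.call(rbind, res)
  res$padj <- p.adjust(res$p.value, method = "BH")  # multiple-testing correction
  res[order(-abs(res$rho)), ]                       # strongest correlation first
}

ranking <- rank_cn_expr(expr, cn)
print(ranking)
```

Spearman correlation is used here only as one reasonable default; Pearson on log-transformed expression would be an equally plausible choice.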
Thus:
1) For an initial investigation of the COAD dataset with the R package MultiAssayExperiment, I only found this link:
https://docs.google.com/spreadsheets/d/1Ih64DDS5mqDlYFzDyCY9HAUnxvI1b6hapKP_akFuNPY/edit#gid=0
So, could I download the colon dataset from there?
Or are there any more recent updates?
2) Assuming my approach above is valid, and based on the link above, should I proceed with the following code?
library(MultiAssayExperiment)
library(RaggedExperiment)
library(SummarizedExperiment)
accCOAD <- readRDS("coadMAEO.rds")
accCOAD <- updateObject(accCOAD)
3) My final and perhaps most crucial question:
If my understanding is correct, the data included in the above links and repositories come from the Firehose repository, right?
And so these are the legacy hg19 data from the original publications, without any updates to the survival rates or the protocols, right?
The reason I ask is that, for my current project, I have already performed an analysis of the hg38 provisional TCGA data for the COAD dataset, but only on the transcriptomics layer, for my aforementioned signature. Thus, the possibility of using MultiAssayExperiment to interrogate different omics layers at the same time would be a great asset for my purpose.
However, how should I interpret the difference in genome build? I mean, if a multivariate Cox survival analysis on the hg19 data finds significant genes that have already shown survival significance in hg38, would that strengthen my results and illustrate that they are robust regardless of the protocols/technology used?
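To make the comparison concrete, here is a minimal sketch that fits the same multivariate Cox model to two versions of one cohort and puts the hazard ratios side by side. It uses only the `survival` package and simulated data; the column names, effect sizes, and the helper `fit_cox` are assumptions for illustration, not taken from TCGA.

```r
library(survival)

set.seed(42)
n <- 200

# Simulated covariates standing in for the hg19-based data (values are made up).
dat19 <- data.frame(
  expr  = rnorm(n),   # expression of one signature gene
  cn    = rnorm(n),   # copy number of the same gene
  stage = factor(sample(c("I", "II", "III", "IV"), n, replace = TRUE))
)

# Simulated survival: higher expression raises the hazard; random censoring.
t_event <- rexp(n, rate = exp(0.5 * dat19$expr) / 1000)
t_cens  <- runif(n, min = 200, max = 3000)
dat19$days   <- pmin(t_event, t_cens)
dat19$status <- as.integer(t_event <= t_cens)

# Second version of the same cohort: same patients and outcomes, with
# expression re-quantified (modelled here as a little extra noise).
dat38 <- dat19
dat38$expr <- dat19$expr + rnorm(n, sd = 0.3)

# Fit the model and return hazard ratios, 95% confidence limits and p-values.
fit_cox <- function(dat) {
  fit <- coxph(Surv(days, status) ~ expr + cn + stage, data = dat)
  cbind(HR = exp(coef(fit)),
        exp(confint(fit)),
        p = summary(fit)$coefficients[, "Pr(>|z|)"])
}

hr19 <- fit_cox(dat19)
hr38 <- fit_cox(dat38)
print(round(hr19, 3))
print(round(hr38, 3))
```

If both versions describe the same patients, agreement between the two tables speaks to robustness against the processing pipeline rather than to independent replication, so it may be worth wording the conclusion accordingly.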
Thank you very much for your time and consideration on this matter. I await your very crucial comments or suggestions!
Kind Regards,
Efstathios-Iason
Dear Marcel,
Thank you for your quick comments and suggestions. Just to recap and summarize some important points based on your answer:
1) Yes, I'm a frequent user of TCGAbiolinks, mainly for analyzing transcriptomic and mutational data. However, there are various discrepancies when moving from hg19 to hg38, such as the absence of GISTIC data that could be used for correlation analysis with gene expression. That is the main reason I would like to use a "multi-data" container of various omics representations: to avoid significant computational time and the pitfalls of isolating different molecular layers and then integrating them.
2) Based on your answer, should I avoid using the link and the related code I posted? And based on the code you provided:
library(curatedTCGAData)
coad <- curatedTCGAData("COAD", "*", FALSE)
is the final coad object actually the needed MultiAssayExperiment object, similar to the one described in the tutorial I posted?
3) Thank you also for mentioning liftOver. I'm aware of it, mainly for mutational data, but I'm a bit reluctant to use it because a lot of information is lost.
2) Yes, `curatedTCGAData` returns a `MultiAssayExperiment` data object.
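For anyone who wants to confirm this on their own machine, a small sketch follows. The wrappers `load_coad` and `describe_mae` are hypothetical helpers written for this post. They assume that curatedTCGAData, MultiAssayExperiment, and SummarizedExperiment are installed and that ExperimentHub is reachable. Recent releases of curatedTCGAData also appear to expect a `version` argument, which can be passed through `...`.

```r
# Sketch only: nothing is downloaded until load_coad() is actually called.
load_coad <- function(assays = "*", dry.run = TRUE, ...) {
  # dry.run = TRUE only lists the matching resources;
  # dry.run = FALSE downloads them and builds the object.
  curatedTCGAData::curatedTCGAData(
    diseaseCode = "COAD", assays = assays, dry.run = dry.run, ...
  )
}

# Check the class of the result and summarize what it contains.
describe_mae <- function(mae) {
  stopifnot(methods::is(mae, "MultiAssayExperiment"))
  list(
    assays     = names(MultiAssayExperiment::experiments(mae)),
    n_patients = nrow(SummarizedExperiment::colData(mae))
  )
}
```

A typical session would call `load_coad()` first to see what is available, then `coad <- load_coad(dry.run = FALSE)` followed by `describe_mae(coad)`.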
Thank you, Marcel, for your updated answer. Just an extra point that I forgot to mention in my reply: the data that the MultiAssayExperiment contains are indeed hg19, correct? And for the Cox hazards model, if the hg19 analysis finds the same subset of genes significantly related to survival as the hg38 analysis, could I state, in your opinion, that they are robust regardless of the technology/genome used?
They're based on the Firehose output, which says in its FAQ that:
Q: What reference genome build are you using?
A: We match the reference genome used in our analyses to the reference used to generate the data as appropriate. Our understanding is that TCGA standards stipulate that OV, COAD/READ, and LAML data are hg18, and all else is hg19. caveat: SNP6 copy number data is available in both hg18 and hg19 for all tumor cohorts, so we use hg19 for copy number analyses in all cases.
Someone correct me if I'm wrong, but if you are using data summarized at the gene level, such as the level 3 RNA-seq data provided by curatedTCGAData, hg19 vs GRCh38 is not going to affect your analysis. Same for the level 3 mutation data which are summarized by genes. The genome build would affect the coordinates of mutations from VCF files, and copy number alterations, for example.