As part of an ongoing project evaluating multi-omics data integration methods for identifying biomarkers and causal pathways in distinct cancer types, we are using the R packages MultiAssayExperiment and curatedTCGAData to fetch and work with specific multi-omics cancer datasets. An illustrative example with the TCGA-COAD cohort:
library(curatedTCGAData)
library(MultiAssayExperiment)
coad.updated <- curatedTCGAData(diseaseCode = "COAD", assays = c("RPPAArray", "Mutation", "RNASeq2GeneNorm", "GISTIC_ThresholdedByGene"), dry.run = FALSE)
My main questions concern the downstream pre-processing and use of MAE objects for specific TCGA projects:
1) My first question is about the data processing steps for specific layers, such as the RNA-Seq expression data, which in this case is called "RNASeq2GeneNorm" and, according to its description, corresponds to “Upper quartile normalized RSEM TPM gene expression values”. Did you apply any further transformation steps to these expression values?
According to the relevant Broad Institute and FireBrowse links (https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844334346/Documentation#Documentation-RNAseqPipelines and http://firebrowse.org/),
the respective RNA-seq pipelines include a step called mRNAseq_preprocessor, which also mentions a log2 transformation. However, checking the expression values below:
    exp.coad <- coad.updated[,,"COAD_RNASeq2GeneNorm-20160128"]
    matrix.exp <- assay(exp.coad)
    head(matrix.exp)
          TCGA-A6-2671-01A-01R-1410-07 TCGA-A6-2672-01A-01R-0826-07
    A1BG                       25.4180                      60.7006
    A1CF                       10.8359                      15.2866
    A2BP1                       6.1920                       0.0000
    A2LD1                     131.2848                     305.9618
    A2ML1                       0.0000                       0.0000
    range(matrix.exp)
    0 1865036
it seems that the values have not been log2-transformed. Do these expression values therefore represent RSEM estimated counts, but in a scaled (normalized) form? Additionally, is there a link to the data pipelines used for each of the legacy data layers?
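For reference, here is a minimal sketch of the transformation I would otherwise apply myself if the values are indeed untransformed. The toy matrix simply reuses the values printed above, and the pseudocount of 1 is my own assumption, not something taken from the Firehose documentation:

```r
# Toy matrix reusing the values printed above (not fetched from TCGA).
matrix.toy <- matrix(
  c(25.4180, 10.8359, 6.1920, 131.2848, 0.0000,
    60.7006, 15.2866, 0.0000, 305.9618, 0.0000),
  ncol = 2,
  dimnames = list(
    c("A1BG", "A1CF", "A2BP1", "A2LD1", "A2ML1"),
    c("TCGA-A6-2671-01A-01R-1410-07", "TCGA-A6-2672-01A-01R-0826-07")
  )
)

# log2 transform with a pseudocount of 1 (my assumption) so that zeros stay at 0.
log.toy <- log2(matrix.toy + 1)

# The row and column names are carried over unchanged.
range(log.toy)
```

If a different offset or no offset is the recommended practice for this layer, I would be glad to know.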
2) The main use of the function qreduceTCGA is to convert the initial mutation data object into a simplified final object with gene symbols and 0/1 values based on the variant classification column, correct? And based on the relevant example (https://bioconductor.org/packages/3.11/bioc/vignettes/TCGAutils/inst/doc/TCGAutils.html#splitassays-separate-the-data-from-different-tissue-types), is a liftover step also necessary so that all the assays use hg19 coordinates?
3) Regarding the very important function mergeReplicates:
in the relevant vignette, you provide the following code chunk example:
Can this function also be applied by default to any MAE object, without first using intersectColumns? And by default, does it take the mean of the replicated samples as a single final value, so that each patient barcode appears only once, without "duplicated" measurements? For example, could one pass a fully custom function to the simplify argument?
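To make clear what I mean by a custom function, here is a base R sketch of the behaviour I am hoping for. It does not call mergeReplicates and is not meant to reflect the package internals; collapse.replicates, the toy values, and the "-repN" column naming are all hypothetical and only for illustration:

```r
# Toy expression matrix: patient "P1" measured twice, "P2" once (made-up values).
toy <- matrix(
  c(1, 4, 3, 8, 10, 20),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("P1-rep1", "P1-rep2", "P2-rep1"))
)

# Patient identifier per column (hypothetical naming scheme).
patient <- sub("-rep[0-9]+$", "", colnames(toy))

# Hypothetical helper: collapse the columns of each patient row by row
# with a user-supplied function (mean by default).
collapse.replicates <- function(mat, groups, simplify = mean) {
  sapply(unique(groups), function(g) {
    apply(mat[, groups == g, drop = FALSE], 1, simplify)
  })
}

# Use the median instead of the mean as the summary of the replicates.
merged.median <- collapse.replicates(toy, patient, simplify = median)
merged.median
```

My question is whether passing something like median (or an arbitrary function) to simplify would give the equivalent result on a real MAE object.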
4) Finally, regarding any external data transformation of the aforementioned layers: suppose I extract the numeric gene expression matrix (via the assay function) and perform dimensionality reduction of the features as extra steps. To then "re-assign" the result to the initial MAE object, would a simple assignment like the following suffice?
coad.updated[["COAD_RNASeq2GeneNorm-20160128"]] <- updated.matrix
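As an illustration of the kind of feature reduction I have in mind, here is a base R sketch on a toy matrix (made-up values, toy object names). My assumption, which may be wrong, is that the replacement must keep the original column names so that it still matches the sample map of the MAE object, so I check that before reassigning:

```r
# Toy stand-in for assay(exp.coad): 4 genes x 3 samples (made-up values).
matrix.exp.toy <- matrix(
  c(5, 5, 5,
    1, 9, 2,
    0, 0, 0,
    3, 7, 11),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3", "g4"), c("s1", "s2", "s3"))
)

# Simple feature reduction: drop genes with zero variance across samples.
gene.var <- apply(matrix.exp.toy, 1, var)
updated.matrix.toy <- matrix.exp.toy[gene.var > 0, , drop = FALSE]

# Only rows were removed, so the sample (column) names are unchanged.
stopifnot(identical(colnames(updated.matrix.toy), colnames(matrix.exp.toy)))
rownames(updated.matrix.toy)
```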