Dear Community,
as part of an ongoing project evaluating multi-omics data integration methodologies for identifying biomarkers and causative pathways in distinct cancer types, we are currently using the R packages MultiAssayExperiment and curatedTCGAData to fetch and work with specific multi-omics cancer datasets. As an illustrative example, for the TCGA-COAD cohort:
coad.updated <- curatedTCGAData(
    diseaseCode = "COAD",
    assays = c(
        "RPPAArray", "Mutation",
        "RNASeq2GeneNorm",
        "GISTIC_ThresholdedByGene"
    ),
    dry.run = FALSE
)
My main questions concern the downstream pre-processing and use of MAE objects for specific TCGA projects:
1) My first question is about the data processing steps for specific layers, such as the RNA-Seq expression data, which in this case is called "RNASeq2GeneNorm" and, according to its description, corresponds to “Upper quartile normalized RSEM TPM gene expression values”. Did you apply any further transformation steps to these expression values?
According to the Broad Institute and FireBrowse documentation: https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844334346/Documentation#Documentation-RNAseqPipelines & http://firebrowse.org/
the respective RNA-seq pipelines include a step called mRNAseq_preprocessor, which also mentions log2 transformation. However, checking the expression values below:
exp.coad <- coad.updated[,,"COAD_RNASeq2GeneNorm-20160128"]
matrix.exp <- assay(exp.coad)
head(matrix.exp)
TCGA-A6-2671-01A-01R-1410-07 TCGA-A6-2672-01A-01R-0826-07
A1BG 25.4180 60.7006
A1CF 10.8359 15.2866
A2BP1 6.1920 0.0000
A2LD1 131.2848 305.9618
A2ML1 0.0000 0.0000
range(matrix.exp)
[1] 0 1865036
it seems that the values have not been log2-transformed. Do these expression values therefore represent RSEM estimated counts in a scaled form? Additionally, is there a link describing the data pipelines used for each of the legacy data layers?
2) Is the main use of the function qreduceTCGA to transform the initial mutation data object into a simplified object with gene symbols and 0/1 values based on the variant classification column? And, based on the relevant example (https://bioconductor.org/packages/3.11/bioc/vignettes/TCGAutils/inst/doc/TCGAutils.html#splitassays-separate-the-data-from-different-tissue-types), is a liftover step also necessary so that all assays use hg19 coordinates?
3) Regarding the very important function mergeReplicates:
in the relevant vignette, you provide the following example code chunk:
mergeReplicates(intersectColumns(myMultiAssay))
Can this function also be applied by default to any MAE object, without intersectColumns? And does it by default use the mean of the replicated samples as the single final value, so that each patient barcode appears without "duplicated" measurements? Also, could a completely custom function be passed to the simplify argument?
4) Finally, regarding external data transformations of any of the aforementioned layers: for example, isolating the numeric gene expression matrix (via the assay function) and performing dimensionality reduction of the features as extra steps. To “re-assign” the result to the initial MAE object, would a simple assignment like the following suffice?
coad.updated[["COAD_RNASeq2GeneNorm-20160128"]] <- updated.matrix
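To make questions 1 and 4 concrete, here is a minimal base R sketch of the kind of external transformation meant above. It reuses the first rows printed earlier as a toy matrix. The pseudocount of 1 and the "drop all-zero genes" filter are assumptions chosen only for illustration, not steps taken from any TCGA pipeline.

```r
# Toy genes-by-samples matrix built from the values printed above
expr <- matrix(
  c(25.4180, 10.8359, 6.1920, 131.2848, 0.0000,
    60.7006, 15.2866, 0.0000, 305.9618, 0.0000),
  ncol = 2,
  dimnames = list(
    c("A1BG", "A1CF", "A2BP1", "A2LD1", "A2ML1"),
    c("TCGA-A6-2671-01A-01R-1410-07", "TCGA-A6-2672-01A-01R-0826-07")
  )
)

# log2 transform with a pseudocount of 1 (assumption) so zeros stay finite
log.expr <- log2(expr + 1)

# Hypothetical feature filter: drop genes that are zero in every sample
keep <- rowSums(expr > 0) > 0
updated.matrix <- log.expr[keep, , drop = FALSE]

# Rows may change, but the column names (sample barcodes) are untouched,
# so the matrix still refers to the same samples as the original assay
stopifnot(identical(colnames(updated.matrix), colnames(expr)))
```

The point of the sketch is that only the features are filtered or transformed, while the sample columns keep their names and order.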
Best,
Efstathios
Dear Marcel,
thank you very much for your detailed feedback and suggestions. Some additional comments on your answers, just to be certain that I understood your explanations correctly:
1) First, thank you very much for the technical details on how you obtain the data and on RTCGAToolbox. For the RNA-Seq data, I understand: so it is not the HiSeq but rather the IlluminaGA expression data. I do not know the exact methodological differences (or whether it is just a different sequencing machine), but I would assume this is not an issue for our purposes. Moreover, given the range of the values, I would assume they are scaled TPM values as mentioned, so normalization and/or further transformation would definitely be pivotal for any downstream analysis.
2) Regarding qreduceTCGA, at first glance I saw it as a necessary step to add gene symbols as row annotations to the mutation RaggedExperiment object and to obtain the binary 0/1 values. So 0 denotes either no mutation event or a silent one?
And for a direct implementation after liftover (in case COAD is used), should the argument keep.assay = TRUE be used? For other datasets like BRCA, whose mutation data do not use hg18 as the reference genome, the same function would be used but without any liftover procedure, correct?
3) Concerning mergeReplicates, perhaps I understood it slightly differently: by technical replicates, do you not mean repeated measurements that have the same patient TCGA barcode but, for example, a different coordination center? For example, TCGA-A7-A26F-01B & TCGA-A7-A26F-01A? Some previous recommendations, such as the one from the Broad Institute, suggested taking the sample with the "highest lexicographical sort value" for the plate number. So, when using mergeReplicates, does it combine samples belonging to the same patient and by default average the values per biological unit?
Interestingly, when combining both as mergeReplicates(intersectColumns(myMultiAssay)), i.e. applying intersectColumns first, the number of samples is not reduced further by mergeReplicates. I suppose this is because these samples do not have any replicated measurements?
4) Finally, regarding the last question on whether to replace the assay: for both the RNA-seq and RPPA data, my goal is to extract the expression values as matrices, perform further normalization and/or feature reduction (especially for the RNA-seq data), and then replace the old instances in my MAE object with these two updated matrices. What is the best way to do this?
Above, you mentioned the function c together with the mapFrom argument. Can this be performed with two assays at once, or do I have to repeat it once for each? And would the MAE object then be updated automatically, along with the relevant sample data information?
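To illustrate the two replicate-handling strategies contrasted in point 3, here is a hand-rolled base R sketch on a toy matrix. It is not how mergeReplicates is implemented. The gene names, the values, and the shortened barcodes are made up, and the lexicographic rule is applied to the whole barcode rather than to the plate field, purely for simplicity.

```r
# Toy assay: two aliquots for patient TCGA-A7-A26F, one for TCGA-A6-2671
mat <- matrix(
  c(2, 4, 6, 8, 10, 12),
  nrow = 2,
  dimnames = list(
    c("geneA", "geneB"),
    c("TCGA-A7-A26F-01A", "TCGA-A7-A26F-01B", "TCGA-A6-2671-01A")
  )
)

# The first 12 characters of a TCGA barcode identify the patient
patient <- substr(colnames(mat), 1, 12)

# Strategy A: average all columns that belong to the same patient
avg <- sapply(unique(patient), function(p) {
  rowMeans(mat[, patient == p, drop = FALSE])
})

# Strategy B: keep, per patient, the column whose barcode sorts highest
pick <- vapply(split(colnames(mat), patient), max, character(1))
picked <- mat[, unname(pick), drop = FALSE]
```

Strategy A yields one averaged column per patient, whereas strategy B keeps a single measured column (here TCGA-A7-A26F-01B) and discards the other aliquot.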
Thank you in advance,
Efstathios
Hi Efstathios,
1 denotes any non-silent mutations and 0 otherwise. keep.assay is optional if you'd like to keep the original datasets in the MultiAssayExperiment.
Repeated measurements with the same patient barcode could mean technical and biological replicates. We mean technical replicates as defined here: https://altogen.com/difference-technical-biological-replicates/#:~:text=The%20basic%20definitions%20of%20technical,the%20non%2Dtreated%20multiple%20times. If these measurements were done on the same sample at different centers, they would qualify as technical replicates AFAIK. mergeReplicates does not have the behavior you describe from the Broad, though that could be done outside of the function. Combining both mergeReplicates and intersectColumns will ensure that each patient has one sample in each assay. mergeReplicates works within assays and intersectColumns works across assays.
Yes, multiple assays are supported as long as the length of the input list is the same as the mapFrom argument.
Best,
Marcel
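As a rough sketch of the last point, the following adds two transformed assays to a toy MultiAssayExperiment in a single c() call, with one mapFrom entry per new assay. The assay names (rna, rppa, rna_log2, rppa_scaled), patient IDs, and values are all made up, and the transformations are placeholders. The call is wrapped in suppressWarnings() only to keep the sketch quiet in case c() warns about the assumed column order.

```r
library(MultiAssayExperiment)

# Toy assays: the same three patients measured on two platforms
rna <- matrix(c(10, 0, 250, 40, 5, 1200), nrow = 2,
              dimnames = list(c("geneA", "geneB"),
                              c("P1-rna", "P2-rna", "P3-rna")))
rppa <- matrix(c(0.1, -0.4, 1.2, 0.3, -0.8, 0.5), nrow = 2,
               dimnames = list(c("protA", "protB"),
                               c("P1-rppa", "P2-rppa", "P3-rppa")))

# Sample map linking assay columns to patients
smap <- listToMap(list(
  rna  = data.frame(primary = c("P1", "P2", "P3"), colname = colnames(rna)),
  rppa = data.frame(primary = c("P1", "P2", "P3"), colname = colnames(rppa))
))
pheno <- data.frame(age = c(60, 55, 70), row.names = c("P1", "P2", "P3"))

mae <- MultiAssayExperiment(
  experiments = ExperimentList(list(rna = rna, rppa = rppa)),
  colData = pheno,
  sampleMap = smap
)

# Placeholder transformations; column order is left unchanged
new.assays <- list(
  rna_log2    = log2(rna + 1),
  rppa_scaled = t(scale(t(rppa)))
)

# One mapFrom entry per new assay: reuse the sample mapping of the
# first and second existing experiments, respectively
mae2 <- suppressWarnings(c(mae, new.assays, mapFrom = c(1L, 2L)))
```

Here the original assays are kept alongside the new ones. If only the transformed versions are wanted, the originals can be dropped afterwards by subsetting on assay names.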