Question

Implementation of MultiAssayExperiment and curatedTCGAData R packages for quering and prosessing multi-omics cancer datasets

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Community,

based on an ongoing project about the evaluation of multi-omics data integration methodologies for the identification of biomarkers and causative pathways in distinct cancer types, we are currently utilizing the R packages MultiAssayExperiment and curatedTCGAData, to fetch and utilize specific multi-omics cancer datasets. Based on an illustrative example with the TCGA-COAD cohort:

coad.updated <- curatedTCGAData(

    diseaseCode = "COAD",

    assays = c(

       "RPPAArray","Mutation",
       "RNASeq2GeneNorm",
       "GISTIC_ThresholdedByGene"

    ),

    dry.run = FALSE

)

My major questions based on the downstream pre-processing and utilization of MAE objects related to specific TCGA projects:

1) My first question is about the data processing steps for specific layers, such as the RNA-Seq expression data, which is this case is called "RNASeq2GeneNorm", and from the relative description, is matched to “Upper quartile normalized RSEM TPM gene expression values”-regarding the expression values, did you implement any further transformation steps ?

From the relative links of broad institute & firebrowse: https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844334346/Documentation#Documentation-RNAseqPipelines & http://firebrowse.org/

in the respective rna-seq pipelines, it has a part called mRNAseq_preprocessor, which also mentions log2-transformation- however, from checking the expression below:

exp.coad <- coad.updated[,,"COAD_RNASeq2GeneNorm-20160128"]
matrix.exp <- assay(exp.coad)
head(matrix.exp)
      TCGA-A6-2671-01A-01R-1410-07 TCGA-A6-2672-01A-01R-0826-07
A1BG                       25.4180                      60.7006
A1CF                       10.8359                      15.2866
A2BP1                       6.1920                       0.0000
A2LD1                     131.2848                     305.9618
A2ML1                       0.0000                       0.0000

range(matrix.exp)
[1]       0 1865036

it seems that the values have not been log2-transformed-thus, these expression values represent only rsem estimated counts but also on a scaled version ? additionally, is there any link for the relative data pipelines used for each of the legacy data layers ?

2) the main use of the function qreduceTCGA is to transform the initial mutational data object into a final object(simplified) with gene symbols, and 0/1s based on the column with variant classification, correct ? and based on the relative example (https://bioconductor.org/packages/3.11/bioc/vignettes/TCGAutils/inst/doc/TCGAutils.html#splitassays-separate-the-data-from-different-tissue-types) a liftover process is necessary to also make all the assays coordinated to hg19 ?

3) Regarding the very important function mergeReplicates:

in the relative vignette, you provide the following code chunk example:

mergeReplicates(intersectColumns(myMultiAssay))

Can this specific function also applied as default in any MAE object, without the use of intersectColumns ? and by default, it utilizes the mean of the replicated samples as one final unit/value, which includes the main patient barcode without "duplicated" measurements ? For example, in the simplify argument could anyone use as an input value a complete custom function ?

4) Finally, for any external data transformation in any aforementioned layers: for example isolate the numeric matrix of gene expression (via assay function), and perform dimensionality reduction of the features, as extra steps-then, to “re-assign” it to the initial MAE object, a simple function like the following would suffice ?

coad.updated[[“COAD_RNASeq2GeneNorm-20160128”]] <- updated.matrix

Best,

Efstathios

MultiAssayExperiment curatedTCGAData TCGAutils • 1.2k views

ADD COMMENT • link updated 3.4 years ago by Marcel Ramos 690 • written 3.5 years ago by svlachavas ▴ 830

score 2 · Answer 1 · 2020-10-14

2

Entering edit mode

Marcel Ramos 690

@marcel-ramos-7325

Last seen 22 hours ago

United States

Hi Efstathios,

Thank you for using our software and taking the time to write out this question. I would like to point you to several resources for investigating the provenance of the data. Unfortunately, due to the complexity and size of the pipeline, we have only been able to provide limited traceability of the data.

1) In general, further transformation steps were not done to the data. The data is served as obtained from the GDAC Firehose pipeline, mainly from RTCGAToolbox. You can obtain this same expression data using:

library(RTCGAToolbox)
co <- getFirehoseData("COAD", RNASeq2GeneNorm=TRUE, clinical = FALSE)
illuminaGA <- biocExtract(co, "RNASeq2GeneNorm")[[1]]
matrix.exp <- assay(illuminaGA)
head(matrix.exp)
range(data.matrix(matrix.exp))

As you can see in the screenshot below, only a couple of links from several listed were taken to be part of the MultiAssayExperiment.TCGA pipeline. We relied on the decisions made by RTCGAToolbox to provide this data.

https://drive.google.com/file/d/1IdiIGOslIYAL0nPBhpE8OXydOM6g2-Fb/view?usp=sharing rnaseq2gene from http://firebrowse.org/?cohort=COAD&download_dialog=true

The biocExtract save some recent implementation improvements generally provides the Bioconductor objects obtained from curatedTCGAData. In this particular case, the function returns a list object of two datasets each corresponding to different platforms (given in the names), illumina ga and illumina hiseq. A forthcoming update to the pipeline aims to provide both these data types in curatedTCGAData.

2) qreduceTCGA makes use of the qreduceAssay function to take gene-level windows and provide a tally of non-silent mutations based on the expected metadata column ('Variant_Classification'). This liftover process is necessary so that the reference data match with the actual data at the tallying stage. The code can be seen here: https://github.com/waldronlab/TCGAutils/blob/master/R/simplifyTCGA.R#L253

3) mergeReplicates provides the user to combine technical replicates into a single measurement. The user should be careful to not combine different sample measuments and risk drowning out any noise. Yes, this function can be applied to any MultiAssayExperiment object and you can provide your own function for summarizing. Usually, both can be applied in either order. Please see ?intersectColumns.

4) This depends on whether you'd like to replace the assay or not. It is recommended that you replace the internal assay matrix in the SummarizedExperiment rather than a plain matrix to avoid losing the metadata (if you're doing replacement). The alternative is to use the c function to add an the transformed data to the MultiAssayExperiment.

c(coad.updated, list(newMatrix = updated.matrix), mapFrom = "RNASeq2GeneNorm")

update

c(coad.updated, list(newMatrix = updated.matrix), mapFrom = "COAD_RNASeq2GeneNorm-20160128")

Best, Marcel

ADD COMMENT • link 3.5 years ago • updated 3.4 years ago Marcel Ramos 690

0

Entering edit mode

Dear Marcel,

thank you very much for your detailed feedback and suggestions-some additional comments to your answers, just to be certain that I understood correctly your explanations:

1) Initially, thank you very much regarding the technical details on how you obtain the data and the RTCGAToolbox For the RNA-Seq data-I totally understand-so, it is not the hiseq rather the illuminaGA expression data-I do not know the exact methodological differences (or it is just a different sequencing machine), but I would assume for our purposes this would not be any issue-moreover, due to the range of the values, I would assume that are scaled TPM values as mentioned, so definitely normalization and/or further transformation would be pivotal for any downstream analysis.

2) Regarding the qreduceTCGA, at a first glance I saw it as a necessary step, to put gene symbols as row annotations to the mutation RaggedExperiment object, as also to have the binary values 0/1-thus, 0 denotes no event of mutation or a silent one ?

And for a direct implementation after liftover (in case COAD is used)-the argument keep.assay=T should be used ? For example, for other datasets like BRCA, which do not have hg18 as a ref genome for the mutational data, the same function would be used but without any liftover procedure, correct ?

3) Concerning mergeReplicates-perhaps I understood slightly differently-but technical replicates, you do not mean repeated measurements that have the same patient TCGA barcode, but perhaps a different coordination center, etc. ? For example, TCGA-A7-A26F-01B & TCGA-A7-A26F-01A ? From some previous recommendations, like Broad Institute that suggested something like taking the sample with the "highest lexicographical sort value" for the plate number-thus, when using mergeReplicates, it concatenates samples belonging to the same patient, and by default averages the biological unit values ?

Interestingly, when combined both: mergeReplicates(intersectColumns(myMultiAssay)) but firstly used intersectColumns, the number of samples is not further reduced with mergeReplicates, as I suppose these samples do not have any replicated measurements ?

cc1 <- mergeReplicates(intersectColumns(coad.updated))
cc1
A MultiAssayExperiment object of 4 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 4:
 [1] COAD_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 100 columns
 [2] COAD_Mutation-20160128: RaggedExperiment with 62530 rows and 100 columns
 [3] COAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 100 columns
 [4] COAD_RPPAArray-20160128: SummarizedExperiment with 208 rows and 100 columns

cc2 <- mergeReplicates(coad.updated)
harmonizing input:
  removing 2 sampleMap rows with 'colname' not in colnames of experiments
Warning messages:
1: In mean.default(baseList, ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(baseList, ...) :
  argument is not numeric or logical: returning NA
cc2
A MultiAssayExperiment object of 4 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 4:
 [1] COAD_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 449 columns
 [2] COAD_Mutation-20160128: RaggedExperiment with 62530 rows and 154 columns
 [3] COAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 191 columns
 [4] COAD_RPPAArray-20160128: SummarizedExperiment with 208 rows and 358 columns

4) Finally, based on the last question for replacing or not the assay-specifically, for both rna-seq and RPPA data, my goal is to isolate both expression values as matrices, perform further normalization and/or feature reduction (especially for the rna-seq) data, and then replace the old instances in my MAE object with these two updated matrices. Thus, which is the best way to do it ?

from above, you mentioned the function c,and then use the argumentMapFrom-this could be performed with two assays ? or I have to repeated them one time each ? and then, automatically the MAE object would be updated, as also the relative sample data information ?

Thank you in advance,

Efstathios

ADD REPLY • link 3.5 years ago svlachavas ▴ 830

1

Entering edit mode

Hi Efstathios,

2) Regarding the qreduceTCGA, at a first glance I saw it as a necessary step, to put gene symbols as row annotations to the mutation RaggedExperiment object, as also to have the binary values 0/1-thus, 0 denotes no event of mutation or a silent one ? And for a direct implementation after liftover (in case COAD is used)-the argument keep.assay=T should be used ? For example, for other datasets like BRCA, which do not have hg18 as a ref genome for the mutational data, the same function would be used but without any liftover procedure, correct ?

Please read the documentation carefully and when in doubt, check the source code. If something in the documentation is not clear, open an issue on the GitHub issues page and we'd be happy to revise it. As stated, 1 denotes any non-silent mutations and 0 otherwise.
keep.assay is optional if you'd like to keep the original datasets in the MultiAssayExperiment.
You don't have to liftover if you're on the same ref genome.

3) Concerning mergeReplicates-perhaps I understood slightly differently-but technical replicates, you do not mean repeated measurements that have the same patient TCGA barcode, but perhaps a different coordination center, etc. ? For example, TCGA-A7-A26F-01B & TCGA-A7-A26F-01A ? From some previous recommendations, like Broad Institute that suggested something like taking the sample with the "highest lexicographical sort value" for the plate number-thus, when using mergeReplicates, it concatenates samples belonging to the same patient, and by default averages the biological unit values ?

Repeated measurements with the same patient barcode could mean technical and biological replicates. We mean technical replicates as defined here: https://altogen.com/difference-technical-biological-replicates/#:~:text=The%20basic%20definitions%20of%20technical,the%20non%2Dtreated%20multiple%20times. If these measurements were done on the same sample at different centers, they would qualify as technical replicates AFAIK. mergeReplicates does not have the behavior you describe from the Broad, though that could be done outside of the function. Combining both mergeReplicates and intersectColumns will ensure that each patient has one sample in each assay. mergeReplicates works within assays and intersectColumns works across assays.

from above, you mentioned the function c,and then use the argumentMapFrom-this could be performed with two assays ? or I have to repeated them one time each ? and then, automatically the MAE object would be updated, as also the relative sample data information ?

Yes, multiple assays are supported as long as the length of the input list is the same as the mapFrom argument.

Best, Marcel

ADD REPLY • link 3.5 years ago Marcel Ramos 690