Perform correlation analysis between miRNA and mRNA gene expression data on the same TCGA dataset based on the curatedTCGAData
svlachavas ▴ 800
@svlachavas-7225
Last seen 8 days ago
Germany/Heidelberg/German Cancer Resear…

Dear Community,

Briefly: starting from a gene signature that I previously identified in a specific type of cancer (through gene expression analysis), I also found, using experimentally validated databases, 4 specific microRNAs (mature miRs) that regulate a subset of my signature (~18 genes). As a final step, I would like to explore, in the TCGA COAD dataset, the expression of these miRs and of these genes in the same patients, to look for significant negative correlations, which would further support my hypothesis.

From a quick search, I found that the curatedTCGAData R package contains various assays for various types of TCGA data, including the cancer of interest. A small query gives:

curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)

Please see the list below for available cohorts and assays
Available Cancer codes:
ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
Available Data Types:
CNACGH CNASeq CNASNP CNVSNP GISTICA GISTICT
Methylation miRNAArray miRNASeqGene mRNAArray
Mutation RNASeq2GeneNorm RNASeqGene RPPAArray



Thus:

A) With COAD, which data types should I select in order to get only the miRNA expression and the RNA-Seq expression data?

I see that there are miRNAArray, miRNASeqGene, RNASeq2GeneNorm, RNASeqGene and mRNAArray. However, I don't know the specific differences, as I have mostly used data from the GDC server. My understanding is that both types of expression data should be normalized and/or transformed in the same way for the correlation analysis to be appropriate.

B) Moreover, how could I subset both assays simultaneously, based on specific miRs and specific gene symbols?

Any suggestions, help or ideas would be greatly appreciated!

@levi-waldron-3429
Last seen 7 weeks ago
CUNY Graduate School of Public Health a…

A) The drill-down process in curatedTCGAData goes something like this. The data are the last snapshot provided by TCGA Firehose, i.e. GDC "legacy" data ( https://confluence.broadinstitute.org/display/GDAC/FAQ ).

> library(curatedTCGAData)
> curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
Please see the list below for available cohorts and assays
Available Cancer codes:
ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
Available Data Types:
CNACGH CNASeq CNASNP CNVSNP GISTICA GISTICT
Methylation miRNAArray miRNASeqGene mRNAArray
Mutation RNASeq2GeneNorm RNASeqGene RPPAArray
> mae <- curatedTCGAData(diseaseCode = "COAD", assays = c("miRNASeqGene", "RNASeq2GeneNorm"), dry.run = FALSE)
>


B) This provides a MultiAssayExperiment object, which you can subset by rownames to select genes and miRNA of interest. The MultiAssayExperiment package has a cheat sheet to help with quick reference for such operations. For example:

> mae
A MultiAssayExperiment object of 2 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 2:
[1] COAD_miRNASeqGene-20160128: SummarizedExperiment with 705 rows and 221 columns
[2] COAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 191 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
$, [, [[ - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
> rownames(mae)
CharacterList of length 2
[["COAD_miRNASeqGene-20160128"]] hsa-let-7a-1 hsa-let-7a-2 hsa-let-7a-3 hsa-let-7b ... hsa-mir-98 hsa-mir-99a hsa-mir-99b
[["COAD_RNASeq2GeneNorm-20160128"]] A1BG A1CF A2BP1 A2LD1 A2ML1 A2M A4GALT ... ZYG11A ZYG11B ZYX ZZEF1 ZZZ3 psiTPTE22 tAKR
> rownames(mae[c("hsa-let-7a-1", "A1BG"), , ])
CharacterList of length 2
>


Note that the TCGAutils package provides a number of other helper functions for MultiAssayExperiment objects coming from curatedTCGAData, for example, adding ranges so that you can subset by GRanges objects instead of by symbols.


Dear Levi,

1) The link that you included above does not work. Is it possible to check which processing steps, and which algorithms, were used to produce the "miRNASeqGene" and "RNASeq2GeneNorm" assays?

I could only find the following link:

It mentions:

*mRNAseq Preprocessor

The mRNAseq preprocessor picks the "scaled_estimate" (RSEM) value from Illumina HiSeq/GA2 mRNAseq level_3 (v2) data set and makes the mRNAseq matrix with log2 transformed for the downstream analysis. If there are overlap samples between two different platforms, samples from illumina hiseq will be selected. The pipeline also creates the matrix with RPKM and log2 transform from HiSeq/GA2 mRNAseq level 3 (v1) data set.*

*miRseq Preprocessor

The miRseq preprocessor picks the "RPM" (reads per million miRNA precursor reads) from the Illumina HiSeq/GA miRseq Level_3 data set and makes the matrix with log2 transformed values.*

Thus, the miRNASeqGene values are log2 RPM values? But what about RNASeq2GeneNorm: are they RSEM values or RPKM? Please excuse me for insisting on this, but it is crucial for deciding whether the two sets of expression values are comparable, so that I can perform a direct correlation analysis, or whether further transformations are necessary.

2) A) Thank you also for the cheat sheet. So the one call that I could use to subset both datasets simultaneously would be:

rownames(mae[c("hsa-let-7a-1", "A1BG"), , ])

with both the gene symbols and the miRs included in the same vector?


B) Moreover, with the assays() function, do you think it is also necessary to subset to only the samples common to both experiments? I assume this is mandatory for my type of correlation analysis.


1) I've fixed the link to the Firehose FAQ above. Your interpretation of the miRNASeqGene values seems correct to me, but I don't want to represent myself as an expert on the Firehose pipeline itself or take any chance of giving you a wrong answer there...
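One point that does not depend on the Firehose details: a rank-based (Spearman) correlation is unchanged by any strictly increasing transformation such as log2(x + 1), whereas a Pearson correlation is not. The sketch below illustrates this on simulated values only; the variable names and the numbers are made up for illustration and are not TCGA data.

```r
# Simulated (hypothetical) expression values for one miRNA and one gene
set.seed(42)
mir_rpm   <- rexp(50, rate = 1 / 100)                    # stand-in for RPM-like values
gene_rsem <- 1000 / (1 + mir_rpm) + rexp(50, rate = 1 / 10)  # decreases as the miRNA increases

# Spearman uses ranks only, so log2(x + 1) leaves it unchanged
rho_raw <- cor(mir_rpm, gene_rsem, method = "spearman")
rho_log <- cor(log2(mir_rpm + 1), log2(gene_rsem + 1), method = "spearman")

# Pearson depends on the scale, so it generally differs before and after log2
r_raw <- cor(mir_rpm, gene_rsem)
r_log <- cor(log2(mir_rpm + 1), log2(gene_rsem + 1))

c(rho_raw = rho_raw, rho_log = rho_log, r_raw = r_raw, r_log = r_log)
```

So if the exact units of the two assays remain uncertain, Spearman is one way to make the result less sensitive to that choice, as long as each assay is on a consistent scale across samples.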

2) You may find wideFormat(), MatchedAssayExperiment(), and assays() from MultiAssayExperiment useful for different kinds of correlation analysis. There are some examples in the workshop I gave last year at BioC2018.
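As a rough illustration of the correlation step itself, here is a minimal base-R sketch that assumes the two assays have already been extracted as plain matrices (features in rows, samples in columns). Everything in it is simulated: the sample barcodes, the object names (mir_mat, mrna_mat, pair_tab) and the values are hypothetical stand-ins. It keeps only patients present in both matrices, runs one Spearman test per miRNA-gene pair, and adjusts the p-values.

```r
# Toy matrices standing in for the two assays; values and barcodes are simulated
set.seed(1)
patients <- sprintf("TCGA-AA-%04d", 1:30)
mirs  <- c("hsa-let-7a-1", "hsa-mir-98")
genes <- c("A1BG", "ZYX", "ZZZ3")

mir_mat  <- matrix(rnorm(length(mirs) * 25), nrow = length(mirs),
                   dimnames = list(mirs, paste0(patients[1:25], "-01A")))
mrna_mat <- matrix(rnorm(length(genes) * 20), nrow = length(genes),
                   dimnames = list(genes, paste0(patients[11:30], "-01A")))

# Reduce sample barcodes to the 12-character patient ID, then keep shared patients
colnames(mir_mat)  <- substr(colnames(mir_mat), 1, 12)
colnames(mrna_mat) <- substr(colnames(mrna_mat), 1, 12)
shared <- intersect(colnames(mir_mat), colnames(mrna_mat))

# One Spearman test per miRNA-gene pair, using the shared patients in the same order
pair_tab <- expand.grid(mir = mirs, gene = genes, stringsAsFactors = FALSE)
tests <- Map(function(m, g) {
  cor.test(mir_mat[m, shared], mrna_mat[g, shared], method = "spearman")
}, pair_tab$mir, pair_tab$gene)

pair_tab$rho  <- vapply(tests, function(x) unname(x$estimate), numeric(1), USE.NAMES = FALSE)
pair_tab$p    <- vapply(tests, function(x) x$p.value, numeric(1), USE.NAMES = FALSE)
pair_tab$padj <- p.adjust(pair_tab$p, method = "BH")  # Benjamini-Hochberg adjustment

# Most negative correlations first
pair_tab[order(pair_tab$rho), ]
```

Indexing both matrices with the same `shared` vector is what guarantees that each pair of values comes from the same patient; with real data you would also want to check for patients with more than one sample per assay before doing this.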


Thank you once more, Levi, for the feedback and information. I fully understand that the choice of normalization is up to me. The workshop link looks great, so I think that, based on your functions and tutorials, subsetting and moving on to downstream analysis will not be a bottleneck. I will create a new post or answer here with any specific questions about MultiAssayExperiment functions.