Question

Error in MAE experiment regarding the replacement of internal assays with different number of features

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

while trying to utilize the MultiAssayExperiment data container, along with a specific queried TCGA dataset, to perform some initial filtering and data cleaning prior downstream analysis, I have encountered the following error when tried to add an additional omics layer to the MAE object:


library(MultiAssayExperiment)
library(curatedTCGAData)
library(TCGAutils)
library(UpSetR)
library(DESeq2)

curatedTCGAData(diseaseCode = "LUAD", assays = "*", dry.run = TRUE)

luad.mae<- curatedTCGAData(

    diseaseCode = "LUAD",

    assays = c(

       "RPPAArray","Mutation",
       "RNASeq2GeneNorm", # here perhaps the raw RSEM counts
       "GISTIC_ThresholdedByGene"

    ),

    dry.run = FALSE

)

rag <- "LUAD_Mutation-20160128"
genome(luad.mae[[rag]]) <- translateBuild(genome(luad.mae[[rag]]))
seqlevelsStyle(luad.mae[[rag]]) <- "UCSC"
genome(luad.mae[[rag]])

luad.updated <- qreduceTCGA(luad.mae, keep.assay = FALSE)

luad.processed <- mergeReplicates(intersectColumns(luad.updated))

tums <- TCGAsampleSelect(colnames(luad.processed), "01")
luad.final <- luad.processed[, tums, ]

luad.final
A MultiAssayExperiment object of 4 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 4:
 [1] LUAD_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 181 columns
 [2] LUAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 146 columns
 [3] LUAD_RPPAArray-20160128: SummarizedExperiment with 223 rows and 181 columns
 [4] LUAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 22929 rows and 181 columns
Features:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample availability DFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices

# Isolate the rna-seq omics assay to perform normalization

rna.seq <- getWithColData(luad.final, 2L)
count.dat <- assay(rna.seq)
pheno.dat <- colData(rna.seq)

dds <- DESeqDataSetFromMatrix(countData = count.dat,
                                 colData = pheno.dat,
                                 design = ~ 1)

dds.norm <- vst(dds)

# a small non-specific "intensity" filtering procedure to remove unexpressed features

NotExpressed <- apply( assay(dds.norm), MARGIN=2, function( z ){
  dens <- density( z )
  expr.cut <- dens$x[ which.max( dens$y )]
    return( z < expr.cut )
} )

expr.ps <- rowSums( !NotExpressed ) > ( ncol( NotExpressed )/2 ) 
dds.filtered <- dds.norm[expr.ps,  ]

dim(dds.filtered)
[1] 17374   146

sum(duplicated(rownames(assay(dds.filtered))))
[1] 0

luad2 <- c(luad.final, list(newMatrix = assay(dds.filtered)), mapFrom = "RNASeq2GeneNorm")

Error: subscript contains invalid rownames
In addition: Warning message:
Assuming column order in the data provided 
 matches the order in 'mapFrom' experiment(s) colnames 

# alternative approach to directly replace the expression matrix instead of adding an additional layer:

luad.final[["LUAD_RNASeq2GeneNorm-20160128"]] <- assay(dds.filtered)
harmonizing input:
  removing 146 sampleMap rows with 'colname' not in colnames of experiments

# but then unfortunately all the relative columns from the RNA-Seq experiment are lost

luad.final
A MultiAssayExperiment object of 4 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 4:
 [1] LUAD_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 181 columns
 [2] LUAD_RNASeq2GeneNorm-20160128: matrix with 17374 rows and 0 columns
 [3] LUAD_RPPAArray-20160128: SummarizedExperiment with 223 rows and 181 columns
 [4] LUAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 22929 rows and 181 columns
Features:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample availability DFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] M3C_1.10.0                              TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [3] GenomicFeatures_1.40.1                  AnnotationDbi_1.50.3                   
 [5] DESeq2_1.28.1                           limma_3.44.3                           
 [7] UpSetR_1.4.0                            TCGAutils_1.8.1                        
 [9] cBioPortalData_2.0.10                   AnVIL_1.0.3                            
[11] dplyr_1.0.2                             curatedTCGAData_1.10.1                 
[13] MultiAssayExperiment_1.14.0             SummarizedExperiment_1.18.2            
[15] DelayedArray_0.14.1                     matrixStats_0.57.0                     
[17] Biobase_2.48.0                          GenomicRanges_1.40.0                   
[19] GenomeInfoDb_1.24.2                     IRanges_2.22.2                         
[21] S4Vectors_0.26.1                        BiocGenerics_0.34.0                    
[23] TCGAbiolinks_2.17.3                     tximport_1.16.1                        
[25] tximportData_1.16.0

Thus:

1) Why there is an error hitting with the c function ? I have also checked and there are not duplicated row names, only a smaller subset of the initial RNA-Seq features-

2) Is there an alternative way of replacing the original RNA-Seq dataset, with the updated transformed one ? without losing any other information like sample phenotype information ?

3) Finally, could a similar replacement be performed for additional assays, like the mutations and the CNAs simultaneously ? providing only additional updated matrices with a reduced number of features ?

Overall, my ultimate goal is to use the final MultiAssayExperiment for multi-omics integration, and one putative alternative is to convert each layer to a long data frame regarding the omics matrices, and keep an additional data frame with all phenotype information...

Best,

Efstathios

MultiAssayExperiment curatedTCGAData TCGAutils • 1.2k views

ADD COMMENT • link updated 3.4 years ago by Marcel Ramos 690 • written 3.4 years ago by svlachavas ▴ 830

score 2 · Accepted Answer · 2020-12-04

2

Entering edit mode

Marcel Ramos 690

@marcel-ramos-7325

Last seen 3 days ago

United States

Hi Efstathios,

Next time, please be kind enough to provide a minimally reproducible example. It makes it easier for anyone to answer your question quickly.

1) You are having an error because the name provided and the name of the assays do not match. Use something like:

c(luad.final, list(newMatrix = assay(dds.filtered)), mapFrom = "LUAD_RNASeq2GeneNorm-20160128")

2) Yes, the alternative is to use the single bracket method ([) in conjunction with a List/list or ExperimentList of the same length as the subsetting vector (i.e., 1:2 in the following example):

luad.final[, , 1:2] <- list(A = matrix(...), B = SummarizedExperiment(...))

3) Yes, #2 will handle multiple replacements.

Best regards,

Marcel

ADD COMMENT • link 3.4 years ago Marcel Ramos 690

0

Entering edit mode

Dear Marcel,

thank you very much for your answer and comments !! and please excuse me for the long code chunk, I wanted just to be certain that I did not omitted any "extra" line that might caused the error !!

I was just confused based on my previous post regarding the MAE utilization (https://support.bioconductor.org/p/134693/) regarding your comment on the c function for adding experiments, and I did not utilized the whole name abbreviation (c(coad.updated, list(newMatrix = updated.matrix), mapFrom = "RNASeq2GeneNorm"))

ADD REPLY • link 3.4 years ago svlachavas ▴ 830

1

Entering edit mode

I've made minor changes to the documentation and examples to make this more clear in MultiAssayExperiment 1.17.2.

ADD REPLY • link 3.4 years ago Marcel Ramos 690