Error following the instructions in cBioPortal R workshop Example 2
Javier • 0
@fd04bde6
Spain

Greetings,

This is my first time working with bioinformatics data and manipulating S4 objects in R, so apologies if this is very elementary, but I am having trouble following the tutorial that cBioPortal provides:

https://cbioportal.github.io/2020-cbioportal-r-workshop/Example2.pdf

(Please note that in my Spanish session, your ’ ’ are " " for me)

When I reach

upsetSamples(LUAD_MAE) on page 6, I get the following error:

Error in .rowNamesDF<-(x, value = value) : duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘TCGA-50-5066’, ‘TCGA-50-5946’

Now after doing LUAD_MAE@colData@rownames == "TCGA-50-5066"

LUAD_MAE@colData@rownames == "TCGA-50-5946"

I can confirm that each comparison returns TRUE twice, in rows directly one after another. So there seem to be duplicated subjects here, perhaps added to the data after the tutorial was uploaded. I figured this was no big deal: I will be careful with overlapping data in the future, but for now I only want to get a feel for the functions on offer. My plan was to rename the duplicates to something made up, so that the rest of the object and its columns stay intact and are simply assigned to a name different from the original.

However, trying

"TCGA-50-5166" <- LUAD_MAE@colData@rownames == "TCGA-50-5066"

and

"TCGA-50-5266" <- LUAD_MAE@colData@rownames == "TCGA-50-5946"

to overwrite them with made-up row names doesn't seem to work. The same error returns when I run upsetSamples again, and the same check with LUAD_MAE@colData@rownames == "TCGA-50-5166" returns all FALSE, which indicates that I didn't overwrite anything. I guess this is where I found out that editing S4 objects isn't as trivial as editing a normal data frame.
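The attempt above has the assignment pointing the wrong way: in R, "name" <- value creates a new variable called name and does not change anything inside value. A minimal base-R sketch of the difference follows, assuming a made-up character vector ids in place of the real row names (the vector contents and the "-dup" suffix are illustrative assumptions, not the tutorial's data):

```r
# Hypothetical stand-in for rownames(colData(LUAD_MAE)); not the real data.
ids <- c("TCGA-05-4244", "TCGA-50-5066", "TCGA-50-5066",
         "TCGA-50-5946", "TCGA-50-5946")

# What the attempt does: a string on the left of `<-` is treated as a
# variable name, so this creates a logical vector named `TCGA-50-5166`
# and leaves `ids` untouched.
"TCGA-50-5166" <- ids == "TCGA-50-5066"
attempt_changed_ids <- "TCGA-50-5166" %in% ids

# To change elements, index the object on the left-hand side instead.
fixed <- ids
dup <- duplicated(fixed)                   # TRUE for repeated occurrences
fixed[dup] <- paste0(fixed[dup], "-dup")   # made-up suffix, for illustration

# make.unique() is a base-R shortcut with a similar effect.
fixed2 <- make.unique(ids)
```

On the real object the accessor form rownames(colData(LUAD_MAE)) is generally preferable to @colData@rownames. Note that renaming patient IDs may leave them out of step with the sample map that links patients to assays, so filtering the duplicated rows is likely the safer route; that caveat is an assumption not verified here.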

I tried to search before asking for help here. SlotOP seems like it would do what I want, if I could just pass the corrected row names as a string into LUAD_MAE. Unfortunately, https://stat.ethz.ch/R-manual/R-devel/library/base/html/slotOp.html has a broken link at 'see base for more details', the Wayback Machine doesn't have it archived, the simplest call, slotOP( LUAD_MAE@coldata@rownames<-LUAD_MAErownamescorrected), doesn't work, and I can't find easily googleable examples of its use.
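For context, slotOp is only the name of the help page for the @ operator; it is not a function that can be called, which would explain why slotOP(...) fails. Slot names are also case-sensitive (colData, not coldata). A small sketch of reading and replacing slots follows, using a toy S4 class (the class Toy and its ids slot are invented for illustration and have nothing to do with MultiAssayExperiment):

```r
library(methods)

# Toy S4 class, invented for this example.
setClass("Toy", representation(ids = "character"))
obj <- new("Toy", ids = c("a", "a", "b"))

# Read a slot with `@` (or equivalently with slot()).
before <- obj@ids

# Replace part of a slot: the object being modified goes on the left of `<-`.
obj@ids[2] <- "a2"
slot(obj, "ids")[3] <- "b2"

# Direct slot assignment does not run the class validity method,
# so check the object by hand afterwards (errors if invalid).
validObject(obj)
```

For Bioconductor classes, the exported accessors (for example colData() and its replacement form) are usually a better choice than direct slot access, because they are the documented interface to the object.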

Please advise why the source data appears to be different now than at the time of the tutorial, and clarify if this is just a mistake in handling S4 objects and editing them that has an obvious solution I'm missing.

The traceback() output is

7: stop("duplicate 'row.names' are not allowed")
6: .rowNamesDF<-(x, value = value)
5: row.names<-.data.frame(*tmp*, value = value)
4: row.names<-(*tmp*, value = value)
3: rownames<-(*tmp*, value = rownames(colData(mae)))
2: rownames<-(*tmp*, value = rownames(colData(mae)))
1: upsetSamples(LUAD_MAE)

As for my session info

sessionInfo( )
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale: [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] UpSetR_1.4.0 ggplot2_3.3.3 stringr_1.4.0
[4] httr_1.4.2 cBioPortalData_2.2.8 MultiAssayExperiment_1.16.0
[7] SummarizedExperiment_1.20.0 Biobase_2.50.0 GenomicRanges_1.42.0
[10] GenomeInfoDb_1.26.7 IRanges_2.24.1 S4Vectors_0.28.1
[13] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 matrixStats_0.58.0
[16] AnVIL_1.2.0 dplyr_1.0.5

loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_4.0.5 progress_1.2.2
[4] rprojroot_2.0.2 GenomicDataCommons_1.14.0 tools_4.0.5
[7] utf8_1.2.1 R6_2.5.0 colorspace_2.0-0
[10] DBI_1.1.1 withr_2.4.1 processx_3.5.1
[13] gridExtra_2.3 tidyselect_1.1.0 prettyunits_1.1.1
[16] TCGAutils_1.10.0 bit_4.0.4 curl_4.3
[19] compiler_4.0.5 cli_2.4.0 rvest_1.0.0
[22] formatR_1.9 xml2_1.3.2 DelayedArray_0.16.3
[31] rapiclient_0.1.3 RCircos_1.2.1 digest_0.6.27
[34] Rsamtools_2.6.0 XVector_0.30.0 pkgconfig_2.0.3
[37] dbplyr_2.1.1 fastmap_1.1.0 limma_3.46.0
[40] rlang_0.4.10 rstudioapi_0.13 RSQLite_2.2.5
[43] generics_0.1.0 jsonlite_1.7.2 BiocParallel_1.24.1
[46] RCurl_1.98-1.3 magrittr_2.0.1 GenomeInfoDbData_1.2.4
[49] futile.logger_1.4.3 Matrix_1.3-2 munsell_0.5.0
[52] Rcpp_1.0.6 fansi_0.4.2 lifecycle_1.0.0
[55] stringi_1.5.3 yaml_2.2.1 RaggedExperiment_1.14.1
[58] RJSONIO_1.3-1.4 zlibbioc_1.36.0 pkgbuild_1.2.0
[61] plyr_1.8.6 BiocFileCache_1.14.0 grid_4.0.5
[64] blob_1.2.1 crayon_1.4.1 lattice_0.20-41
[67] Biostrings_2.58.0 splines_4.0.5 GenomicFeatures_1.42.3
[70] hms_1.0.0 ps_1.6.0 pillar_1.6.0
[73] biomaRt_2.46.3 futile.options_1.0.1 XML_3.99-0.6
[76] glue_1.4.2 remotes_2.3.0 lambda.r_1.2.4
[79] data.table_1.14.0 BiocManager_1.30.12 vctrs_0.3.7
[82] gtable_0.3.0 tidyr_1.1.3 openssl_1.4.3
[85] purrr_0.3.4 assertthat_0.2.1 cachem_1.0.4
[88] xfun_0.22 survival_3.2-10 tibble_3.1.0
[91] RTCGAToolbox_2.20.0 GenomicAlignments_1.26.0 tinytex_0.31
[94] AnnotationDbi_1.52.0 memoise_2.0.0 ellipsis_0.3.1



Lastly, a few other issues have turned up that I suspect are unrelated, but I mention them just in case. I seem to lack 'removeCache', which a previous tutorial recommended applying to imported data. The regular install.packages("Remove Cache") for CRAN doesn't work, nor does BiocManager::install("removeCache"). Googling led me to a package called cacheflow that seemed like a likely source, but after remotes::install_github("alekrutkowski/cacheflow") I am still told that no such library can be found. Along the way I also discovered that my haven is 2.3.1 instead of the 2.4.0 that BiocManager seems to expect, and my rtracklayer is 1.49.5 instead of 1.50.0, but installing them (and RSQLite along the way) doesn't seem to work. I don't mind if this part is ignored, if it is indeed unrelated.

MultiAssayExperiment cBioPortalData • 128 views
@marcel-ramos-7325
United States

Thanks Kevin for attempting to answer.

These are packages in Bioconductor that I maintain.

I would recommend that the OP (Javier) review the documentation on our website so that they can better work with the MultiAssayExperiment interface.

Our collaborator wrote the tutorial for a workshop at cBioPortal, and thus it has some bits of unconventional code. That said, most of the functions there should be working.

In your example, it is likely that the data have different types of samples (tumor and normals) in them. The data do change as the cBioPortal data team makes updates to them. I would recommend that you remove the rows that correspond to the normal samples in the colData, as they may be causing the issue. You can do something like: colData(luad) <- colData(luad)[!endsWith(colData(luad)$SAMPLE_ID, "02"), ] to remove them.
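The subsetting suggested above is ordinary logical row indexing. A base-R sketch of the idea follows, assuming a made-up table clin in place of colData(luad): only the SAMPLE_ID column name comes from the answer, while the PATIENT_ID column and all ID values are illustrative assumptions. For reference, in TCGA barcodes the sample-type code 01 denotes a primary solid tumor, 02 a recurrent solid tumor, and 11 solid tissue normal.

```r
# Made-up clinical table standing in for colData(luad). A plain data.frame
# cannot have duplicate row names, so the patient ID is kept in a column.
clin <- data.frame(
  PATIENT_ID = c("TCGA-05-4244", "TCGA-50-5066", "TCGA-50-5066",
                 "TCGA-50-5946", "TCGA-50-5946"),
  SAMPLE_ID  = c("TCGA-05-4244-01", "TCGA-50-5066-01", "TCGA-50-5066-02",
                 "TCGA-50-5946-01", "TCGA-50-5946-02"),
  stringsAsFactors = FALSE
)

# Patients that appear more than once (the cause of the row-name error).
dup_patients <- unique(clin$PATIENT_ID[duplicated(clin$PATIENT_ID)])

# Keep every row whose sample ID does NOT end in "02".
keep <- !endsWith(clin$SAMPLE_ID, "02")
clin_filtered <- clin[keep, ]

# After filtering, each patient should appear exactly once.
still_duplicated <- anyDuplicated(clin_filtered$PATIENT_ID) > 0
```

Checking for remaining duplicates after the filter is a cheap way to confirm that the rows removed were in fact the ones causing the clash before calling upsetSamples again.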

See sampleTables(luad) and sampleTypes to get an overview of which samples are in the data. You can also use splitAssays to separate tumors from normals, or TCGAprimaryTumors to keep only the primary tumors. These functions are in TCGAutils; see the cheatsheet on the website.

Edit: PS. the cache removal functions have been renamed to removePackCache and removeDataCache.

Best,

Marcel


Sorry about that - I should have checked more in depth.