Hi,
I was processing a scRNA-seq data which has about 1.3M cells with around 53000 genes. When I tried the function perFeatureQCMetrics() to get some gene metrics for filtering genes, it ran over 24 hours and still not finished even with parallelization. I preferred to use all cells with bioconductor packages and not to apply Sketch-based analysis in Seurat. Is there a way to speed it up? Thanks a lot!
Best, JG
# Here is the code I used .
bp.params <- MulticoreParam(workers = 26)
per.feat <- perFeatureQCMetrics(sce, BPPARAM = bp.params)
sessionInfo( )
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Azimuth_0.5.0 shinyBS_0.61.1
[3] patchwork_1.3.0 lubridate_1.9.3
[5] forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1
[11] tibble_3.2.1 tidyverse_2.0.0
[13] AnnotationHub_3.12.0 BiocFileCache_2.10.2
[15] dbplyr_2.5.0 PCAtools_2.16.0
[17] ggrepel_0.9.6 scran_1.32.0
[19] BiocParallel_1.38.0 DropletUtils_1.24.0
[21] scater_1.32.1 ggplot2_3.5.1
[23] scuttle_1.14.0 SingleCellExperiment_1.26.0
[25] SummarizedExperiment_1.34.0 Biobase_2.64.0
[27] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
[29] IRanges_2.38.1 S4Vectors_0.42.1
[31] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
[33] matrixStats_1.4.1
loaded via a namespace (and not attached):
[1] fs_1.6.4 spatstat.sparse_3.1-0
[3] ProtGenerics_1.34.0 bitops_1.0-9
[5] DirichletMultinomial_1.44.0 TFBSTools_1.42.0
[7] httr_1.4.7 RColorBrewer_1.1-3
[9] sctransform_0.4.1 tools_4.4.1
[11] utf8_1.2.4 R6_2.5.1
[13] DT_0.33 HDF5Array_1.32.1
[15] uwot_0.2.2 lazyeval_0.2.2
[17] rhdf5filters_1.16.0 withr_3.0.1
[19] sp_2.1-4 prettyunits_1.2.0
[21] gridExtra_2.3 progressr_0.14.0
[23] cli_3.6.3 spatstat.explore_3.3-2
[25] fastDummies_1.7.4 EnsDb.Hsapiens.v86_2.99.0
[27] shinyjs_2.1.0 Seurat_5.1.0
[29] spatstat.data_3.1-2 ggridges_0.5.6
[31] pbapply_1.7-2 Rsamtools_2.18.0
[33] R.utils_2.12.3 parallelly_1.38.0
[35] BSgenome_1.70.2 limma_3.60.6
[37] RSQLite_2.3.7 generics_0.1.3
[39] BiocIO_1.12.0 gtools_3.9.5
[41] spatstat.random_3.3-2 ica_1.0-3
[43] googlesheets4_1.1.1 GO.db_3.18.0
[45] Matrix_1.7-1 ggbeeswarm_0.7.2
[47] fansi_1.0.6 abind_1.4-8
[49] R.methodsS3_1.8.2 lifecycle_1.0.4
[51] yaml_2.3.10 edgeR_4.2.2
[53] rhdf5_2.48.0 SparseArray_1.4.8
[55] Rtsne_0.17 grid_4.4.1
[57] blob_1.2.4 promises_1.3.0
[59] dqrng_0.4.1 shinydashboard_0.7.2
[61] pwalign_1.0.0 crayon_1.5.3
[63] miniUI_0.1.1.1 lattice_0.22-6
[65] beachmat_2.20.0 cowplot_1.1.3
[67] annotate_1.82.0 GenomicFeatures_1.54.4
[69] KEGGREST_1.44.1 pillar_1.9.0
[71] metapod_1.12.0 rjson_0.2.23
[73] future.apply_1.11.2 codetools_0.2-20
[75] fastmatch_1.1-4 leiden_0.4.3.1
[77] glue_1.8.0 spatstat.univar_3.0-0
[79] data.table_1.16.2 remotes_2.5.0
[81] vctrs_0.6.5 png_0.1-8
[83] spam_2.11-0 cellranger_1.1.0
[85] poweRlaw_0.80.0 gtable_0.3.5
[87] cachem_1.1.0 Signac_1.14.0
[89] S4Arrays_1.4.1 mime_0.12
[91] pracma_2.4.4 survival_3.7-0
[93] gargle_1.5.2 RcppRoll_0.3.1
[95] statmod_1.5.0 bluster_1.14.0
[97] fitdistrplus_1.2-1 ROCR_1.0-11
[99] nlme_3.1-165 bit64_4.5.2
[101] progress_1.2.3 filelock_1.0.3
[103] RcppAnnoy_0.0.22 irlba_2.3.5.1
[105] vipor_0.4.7 KernSmooth_2.23-24
[107] seqLogo_1.68.0 SeuratDisk_0.0.0.9021
[109] colorspace_2.1-1 DBI_1.2.3
[111] tidyselect_1.2.1 processx_3.8.4
[113] bit_4.5.0 compiler_4.4.1
[115] curl_5.2.3 BiocNeighbors_1.22.0
[117] hdf5r_1.3.11 xml2_1.3.6
[119] desc_1.4.3 DelayedArray_0.30.1
[121] plotly_4.10.4 rtracklayer_1.62.0
[123] caTools_1.18.3 scales_1.3.0
[125] lmtest_0.9-40 callr_3.7.6
[127] rappdirs_0.3.3 goftest_1.2-3
[129] digest_0.6.37 presto_1.0.0
[131] spatstat.utils_3.1-0 XVector_0.44.0
[133] htmltools_0.5.8.1 pkgconfig_2.0.3
[135] sparseMatrixStats_1.16.0 fastmap_1.2.0
[137] ensembldb_2.26.0 rlang_1.1.4
[139] htmlwidgets_1.6.4 UCSC.utils_1.0.0
[141] shiny_1.9.1 DelayedMatrixStats_1.26.0
[143] farver_2.1.2 zoo_1.8-12
[145] jsonlite_1.8.9 R.oo_1.26.0
[147] BiocSingular_1.20.0 RCurl_1.98-1.16
[149] magrittr_2.0.3 GenomeInfoDbData_1.2.12
[151] dotCall64_1.2 Rhdf5lib_1.26.0
[153] munsell_0.5.1 Rcpp_1.0.13
[155] viridis_0.6.5 reticulate_1.39.0
[157] stringi_1.8.4 zlibbioc_1.50.0
[159] MASS_7.3-61 plyr_1.8.9
[161] pkgbuild_1.4.4 parallel_4.4.1
[163] listenv_0.9.1 CNEr_1.38.0
[165] deldir_2.0-4 Biostrings_2.72.1
[167] splines_4.4.1 tensor_1.5
[169] hms_1.1.3 locfit_1.5-9.10
[171] BSgenome.Hsapiens.UCSC.hg38_1.4.5 ps_1.7.6
[173] igraph_2.1.1 spatstat.geom_3.3-3
[175] RcppHNSW_0.6.0 reshape2_1.4.4
[177] biomaRt_2.58.2 ScaledMatrix_1.12.0
[179] TFMPvalue_0.0.9 BiocVersion_3.19.1
[181] XML_3.99-0.17 SeuratObject_5.0.2
[183] BiocManager_1.30.23 JASPAR2020_0.99.10
[185] tzdb_0.4.0 httpuv_1.6.15
[187] polyclip_1.10-7 RANN_2.6.2
[189] SeuratData_0.2.2.9001 future_1.34.0
[191] scattermore_1.2 rsvd_1.0.5
[193] xtable_1.8-4 restfulr_0.0.15
[195] AnnotationFilter_1.26.0 RSpectra_0.16-2
[197] later_1.3.2 googledrive_2.1.1
[199] viridisLite_0.4.2 memoise_2.0.1
[201] beeswarm_0.4.0 AnnotationDbi_1.66.0
[203] GenomicAlignments_1.38.2 cluster_2.1.6
[205] timechange_0.3.0 globals_0.16.3
You can examine the structure of a
DelayedMatrix
by calling theshowtree()
function. For example:You can see that, at the core of the object, the data is being stored in a HDF5 file. It only shows up as a
DelayedMatrix
because of all the extra delayed operations we added on top, e.g., delayed dimnames. Regardless of the underlying structure of theDelayedMatrix
, it can be loaded into memory withas(assay(X), "SVT_SparseMatrix")
, after which all functions should be a lot faster. scrapper functions will be the fastest but scran and scuttle should also benefit.For the sake of completeness, here is how you can get beachmat to handle the loading instead of using
as()
. This approach only works with scrapper functions, though.Hi Aaron,
You are right. I checked the structure and it is "TENxMatrix" (HDF5-based sparse matrix ). I tried to convert it to SparseArray, but looks it will take longer than I expected. Later I will test how long it will take. I think the transpose matrix may be a easy solution by now. Thank you so much for your help!
Best,JG
IIRC loading ~million cells into memory took about 3-5 minutes, but that was with a rather slow filesystem (HDD on a HPC), so YMMV depending on the specs of your computing environment.