Question

DEXSeq: summarizeOverlaps() crashes Rstudio with too many samples

0

Entering edit mode

thomas.heigl.ibk • 0

@adde7aac

Last seen 7 days ago

Switzerland

Hi there,

I am trying to use my 42 BAM files (aligned with Subjunc from RSubread package) to prepare a DEXSeqDataSet according to point 2 in the manual (https://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.html#2_Preparations)

I changed BamFileList(bamFiles) from the manual to bamFiles, because we used this function already on the bam files. But also with the original code snippet Rstudio crashed.

Since I have seven condition with each six replicates, I would love to use all BAM files for more complex analyses.

Are there any recommendations how to approach without the need of too much RAM (I have approx 60GB left over)?

Best Thomas

txdb = makeTxDbFromGFF("/home/chuddy/bioinformatics/ref_genomes/mouse_38/annotation/ensembl/Mus_musculus.GRCm38.102.gtf.gz")


flattenedAnnotation = exonicParts(txdb, linked.to.single.gene.only = TRUE)
names(flattenedAnnotation) = sprintf("%s:E%0.3d", flattenedAnnotation$gene_id, flattenedAnnotation$exonic_part)

bamFiles <- list.files("../BAM/subjunc/", full.names = T, include.dirs = F, pattern = "\\.BAM$")
bamFiles <- normalizePath(bamFiles) 

bamFiles = BamFileList(bamFiles)

se = summarizeOverlaps(flattenedAnnotation, 
                       bamFiles, 
                       singleEnd=T,
                       ignore.strand=TRUE)

sessionInfo( )

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Zurich
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3             forcats_1.0.0               stringr_1.5.1               dplyr_1.1.4                
 [5] purrr_1.0.2                 readr_2.1.4                 tidyr_1.3.0                 tibble_3.2.1               
 [9] ggplot2_3.4.4               tidyverse_2.0.0             GenomicAlignments_1.36.0    Rsamtools_2.16.0           
[13] Biostrings_2.68.1           XVector_0.40.0              GenomicFeatures_1.52.2      DEXSeq_1.46.0              
[17] RColorBrewer_1.1-3          AnnotationDbi_1.62.2        DESeq2_1.40.2               SummarizedExperiment_1.30.2
[21] GenomicRanges_1.52.1        GenomeInfoDb_1.36.4         IRanges_2.34.1              S4Vectors_0.38.2           
[25] MatrixGenerics_1.12.3       matrixStats_1.2.0           Biobase_2.60.0              BiocGenerics_0.46.0        
[29] BiocParallel_1.34.2        

loaded via a namespace (and not attached):
 [1] DBI_1.2.0               bitops_1.0-7            biomaRt_2.56.1          rlang_1.1.2             magrittr_2.0.3         
 [6] compiler_4.3.2          RSQLite_2.3.4           png_0.1-8               vctrs_0.6.5             pkgconfig_2.0.3        
[11] crayon_1.5.2            fastmap_1.1.1           dbplyr_2.4.0            utf8_1.2.4              tzdb_0.4.0             
[16] bit_4.0.5               xfun_0.41               zlibbioc_1.46.0         cachem_1.0.8            progress_1.2.3         
[21] blob_1.2.4              DelayedArray_0.26.7     parallel_4.3.2          prettyunits_1.2.0       R6_2.5.1               
[26] stringi_1.8.3           rtracklayer_1.60.1      genefilter_1.82.1       Rcpp_1.0.11             knitr_1.45             
[31] timechange_0.2.0        Matrix_1.6-4            splines_4.3.2           tidyselect_1.2.0        rstudioapi_0.15.0      
[36] abind_1.4-5             yaml_2.3.8              codetools_0.2-19        hwriter_1.3.2.1         curl_5.2.0             
[41] lattice_0.22-5          withr_2.5.2             KEGGREST_1.40.1         survival_3.5-7          BiocFileCache_2.8.0    
[46] xml2_1.3.6              pillar_1.9.0            filelock_1.0.3          generics_0.1.3          RCurl_1.98-1.13        
[51] hms_1.1.3               munsell_0.5.0           scales_1.3.0            xtable_1.8-4            glue_1.6.2             
[56] tools_4.3.2             BiocIO_1.10.0           annotate_1.78.0         locfit_1.5-9.8          XML_3.99-0.16          
[61] grid_4.3.2              colorspace_2.1-0        GenomeInfoDbData_1.2.10 restfulr_0.0.15         cli_3.6.2              
[66] rappdirs_0.3.3          fansi_1.0.6             S4Arrays_1.0.6          gtable_0.3.4            digest_0.6.33          
[71] geneplotter_1.78.0      rjson_0.2.21            memoise_2.0.1           lifecycle_1.0.4         httr_1.4.7             
[76] statmod_1.5.0           bit64_4.0.5

DEXSeq • 567 views

ADD COMMENT • link 5 months ago thomas.heigl.ibk • 0

score 1 · Answer 1 · 2024-02-01

1

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

You can read in chunks, using the 'yieldSize' argument, which may help with the memory situation. An alternative is to use Rsubread, which is likely faster, and might be more memory efficient.

ADD COMMENT • link 5 months ago James W. MacDonald 66k

0

Entering edit mode

thank you for your reply. if I add "yieldSize = 1000000" as a further argument I get Error:

se = summarizeOverlaps(flattenedAnnotation, 
+                        bamFiles, 
+                        singleEnd=T,
+                        ignore.strand=TRUE, yieldSize = 1000000 
+                        )
Error: BiocParallel errors
  1 remote errors, element index: 1
  0 unevaluated and other errors
  first remote error:
Error in FUN(bf, param = param, ...): unused argument (yieldSize = 1e+06)

I would love to use Rsubread to get the counts, since I also use it for DEG analysis, however, I wasn't able to import the Rsubread count matrix into the DEXSeq pipeline, despite there being one tool, which did not work for me with the info or manual I found online. I will try again though and hope that it works with the updated knowledge on objects used in DEXSeq.

ADD REPLY • link 5 months ago thomas.heigl.ibk • 0

1

Entering edit mode

yieldSize is an argument for BamFileList, not summarizeOverlaps

ADD REPLY • link 5 months ago James W. MacDonald 66k

0

Entering edit mode

thank you, this worked.

ADD REPLY • link 5 months ago thomas.heigl.ibk • 0