EventPointer on TCGA RNA-seq
Entering edit mode
Marie • 0
Last seen 8 weeks ago

Hi! I am trying to run EventPointer on a large collection of RNA-seq samples from TCGA. The samples were downloaded from TCGA Data Portal and are in BAM-format. Ideally, I would be able to run this on ~400 samples, but ~250-300 would also be ok.

The first round I tried running EventPointer, the algorithm produced warnings about regions on chromosome "ChrUn_XXXX" and "ChrM_XXXX". The script continued to print warning after warning about it skipping specific regions and ran for ages without seeming able to complete. It seemed EventPointer got "stuck" on these reads from unmapped/mitochondrial chromosome regions. Therefore, I made new BAM-files with command

samtools view -b file.bam chr1 chr2 chr3 chr4 ... > file_24chroms.bam

to extract reads from "normal" chromosomes 1-22 and X and Y. I also have checked that the BAM-files have the "XS"tag mentioned in the EventPointer manual in step 4.1.1.

I ran a test of EventPointer on a couple of samples with reads from exclusively the "normal" chromosomes, and it was successfull. However, when I try running EventPointer on all the samples, it constantly runs out of memory. I am running the script on a HPC in a slurm-script. With my user rights I can ask for allocation of ~200hours and 24GB per CPU with 4 CPUS. When I run the script, it runs through PrepareBam and begins the "Predicting Features from BAMs...". I am again getting warnings of skipped regions, but the script continues to run.

My problem now is that the script never finishes without being killed by out-of-memory handlers. I am wondering if it is possible to calculate exactly how much memory ( maybe also time) needed based on the sizes of the BAM-files, so that I can ask specifically for the CPU needed to finish my process. Any tips on how to run EventPointer for many RNASeq samples on HPCs much appreciated!

Thank you!

This is the message I get when it runs out of memory:

Starting job 3204849 on c3-[31-32] at Mon Aug 2 10:24:30 CEST 2021

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) StdEnv
slurmstepd: error: Detected 47 oom-kill event(s) in StepId=3204849.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Task and CPU usage stats:
       JobID    JobName  AllocCPUS   NTasks     MinCPU MinCPUTask     AveCPU    Elapsed ExitCode
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- --------
3204849              EP          8                                           12-02:00:20      0:0
3204849.bat+      batch          4        1                                  12-02:00:20      0:0
3204849.ext+     extern          8        2                                  12-02:00:20      0:0

Memory usage stats:
       JobID     MaxRSS MaxRSSTask     AveRSS MaxPages   MaxPagesTask   AvePages
------------ ---------- ---------- ---------- -------- -------------- ----------

Disk usage stats:
       JobID  MaxDiskRead MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteTask   AveDiskWrite
------------ ------------ --------------- -------------- ------------ ---------------- --------------

Job 3204849 completed at Sat Aug 14 12:24:50 CEST 2021

This is my EventPointer R script:

# Setup
print("Setting libpath to local directory to be able to load S4Vectors")
print("Done setting libpath.")

print("Loading Eventpointer")
print("Done loading Evenpointer")

# Number of cores to use
RESULTS_DIR <- "/mypath/TCGA-COAD-RNA-seq/results/chroms"

# Run EP
# Read list of BAM filenames from a file
samples <- scan("/mypath/TCGA-COAD-RNA-seq/scripts/ep_script/chroms/TCGA-COAD-RNAseq-400_samples.txt", what="list", sep="\n")
path_to_samples <- "/mypath/TCGA-COAD-RNA-seq/BAM-FILES/chroms/bam_files"
path_to_gtf <- "/mypath/TCGA-COAD-RNA-seq/gencode.v35.annotation.gtf"

print("PrepareBam START")
sg_feature_counts <- PrepareBam_EP(Samples=samples, SamplePath=path_to_samples, Ref_Transc="GTF", fileTrans$
# Save preliminary results
save.image(file.path(RESULTS_DIR, "preparebam_finished_chroms_400.RData"))
print("PrepareBam STOP")

# Run EventDetection function
print("EventDetection START")
# Directory to write the EventsFound_RNASeq.txt file
txt_path <- RESULTS_DIR
AllEvents_RNASeq <- EventDetection(sg_feature_counts, cores=NUM_CORES, Path=txt_path)
# Save preliminary results
save.image(file.path(RESULTS_DIR, "eventdetection_finished_chroms_400.RData"))
print("EventDetection STOP")

> sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /cluster/software/OpenBLAS/0.3.12-GCC-10.2.0/lib/libopenblas_skylakexp-r0.3.12.so

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] EventPointer_2.8.0          Matrix_1.2-18
 [3] SGSeq_1.24.0                SummarizedExperiment_1.20.0
 [5] Biobase_2.50.0              MatrixGenerics_1.2.1
 [7] matrixStats_0.58.0          Rsamtools_2.6.0
 [9] Biostrings_2.58.0           XVector_0.30.0
[11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7
[13] IRanges_2.24.1              S4Vectors_0.28.1
[15] BiocGenerics_0.36.1

loaded via a namespace (and not attached):
 [1] bitops_1.0-6                      bit64_4.0.5
 [3] doParallel_1.0.16                 progress_1.2.2
 [5] httr_1.4.2                        tools_4.0.3
 [7] R6_2.5.0                          DBI_1.1.0
 [9] colorspace_1.4-1                  rhdf5filters_1.2.1
[11] tidyselect_1.1.0                  prettyunits_1.1.1
[13] bit_4.0.4                         curl_4.3
[15] compiler_4.0.3                    graph_1.68.0
[17] quantreg_5.75                     SparseM_1.78
[19] xml2_1.3.2                        DelayedArray_0.16.3
[21] rtracklayer_1.50.0                scales_1.1.1
[23] nnls_1.4                          RBGL_1.66.0
[25] askpass_1.1                       rappdirs_0.3.1
[27] stringr_1.4.0                     digest_0.6.27
[29] pkgconfig_2.0.3                   dbplyr_2.0.0
[31] limma_3.46.0                      BSgenome_1.58.0
[33] rlang_0.4.8                       RSQLite_2.2.1
[35] generics_0.1.0                    BiocParallel_1.24.1
[37] dplyr_1.0.2                       RCurl_1.98-1.2
[39] magrittr_1.5                      GenomeInfoDbData_1.2.4
[41] Rhdf5lib_1.12.1                   Rcpp_1.0.5
[43] munsell_0.5.0                     lifecycle_0.2.0
[45] stringi_1.5.3                     MASS_7.3-53
[47] zlibbioc_1.36.0                   rhdf5_2.34.0
[49] plyr_1.8.6                        qvalue_2.22.0
[51] BiocFileCache_1.14.0              grid_4.0.3
[53] affxparser_1.62.0                 blob_1.2.1
[55] crayon_1.3.4                      lattice_0.20-41
[57] splines_4.0.3                     GenomicFeatures_1.42.3
[59] hms_0.5.3                         BSgenome.Hsapiens.UCSC.hg38_1.4.3
[61] pillar_1.4.6                      igraph_1.2.6
[63] RUnit_0.4.32                      reshape2_1.4.4
[65] codetools_0.2-18                  biomaRt_2.46.3
[67] XML_3.99-0.5                      glue_1.4.2
[69] cobs_1.3-4                        vctrs_0.3.4
[71] foreach_1.5.1                     MatrixModels_0.4-1
[73] gtable_0.3.0                      openssl_1.4.3
[75] purrr_0.3.4                       assertthat_0.2.1
[77] ggplot2_3.3.2                     prodlim_2019.11.13
[79] survival_3.2-7                    tibble_3.0.4
[81] conquer_1.0.2                     iterators_1.0.13
[83] GenomicAlignments_1.26.0          AnnotationDbi_1.52.0
[85] memoise_1.1.0                     lava_1.6.8.1
[87] ellipsis_0.3.1
RNASeq EventPointer • 68 views

Login before adding your answer.

Traffic: 305 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6