Question

Gviz/Biostrings memory issue?

0

Entering edit mode

stianlagstad ▴ 90

@stianlagstad-9723

Last seen 4.1 years ago

I'm running into memory issues trying to plot a region from a .BAM file produced from RACE-seq. The .BAM file is not huge, but the high coverage might be the problem. Here's an example:

# Reproducing the memory problem
# First, source this alternative bam-file import function for use with Gviz::AlignmentsTrack. It will import a bamfile
# with non-UCSC chromosome names (i.e. "chr7" instead of just "7").
importFunctionNonUCSC <- function (file, selection) {
  indNames <- c(sub("\\.bam$", ".bai", file), paste(file, "bai", sep = "."))
  index <- NULL
  for (i in indNames) {
    if (file.exists(i)) {
      index <- i
      break
    }
  }
  if (is.null(index))
    stop("Unable to find index for BAM file '", file, "'. You can build an index using the following command:\n\t",
         "library(Rsamtools)\n\tindexBam(\"", file, "\")")
  pairedEnd <- parent.env(environment())[["._isPaired"]]
  if (is.null(pairedEnd))
    pairedEnd <- TRUE
  bf <- Rsamtools::BamFile(file, index = index, asMates = pairedEnd)
  # Rename chromosome names from UCSD to non-UCSC
  GenomeInfoDb::seqlevels(selection) <- sub(
    "chr",
    "",
    GenomeInfoDb::seqlevels(selection))
  param <- Rsamtools::ScanBamParam(
    which = selection,
    what = Rsamtools::scanBamWhat(),
    tag = "MD",
    flag = Rsamtools::scanBamFlag(isUnmappedQuery = FALSE))
  reads <- if (as.character(seqnames(selection)[1]) %in% names(Rsamtools::scanBamHeader(bf)$targets))
    Rsamtools::scanBam(bf, param = param)[[1]]
  else list()
  md <- if (is.null(reads$tag$MD))
    rep(as.character(NA), length(reads$pos))
  else reads$tag$MD
  if (length(reads$pos)) {
    layed_seq <- GenomicAlignments::sequenceLayer(reads$seq, reads$cigar)
    region <- unlist(Rsamtools::bamWhich(param), use.names = FALSE)
    ans <- stackStrings(
      layed_seq,
      start(region),
      end(region),
      shift = reads$pos - 1L,
      Lpadding.letter = "+",
      Rpadding.letter = "+")
    names(ans) <- seq_along(reads$qname)
  } else {
    ans <- DNAStringSet()
  }
  return(GRanges(
    # Return chromosome name prefixed with "chr"
    seqnames = if (is.null(reads$rname)) character() else paste("chr", reads$rname, sep=""),
    strand = if (is.null(reads$strand)) character() else reads$strand,
    ranges = IRanges(start = reads$pos, width = reads$qwidth),
    id = reads$qname, cigar = reads$cigar, mapq = reads$mapq,
    flag = reads$flag, md = md, seq = ans, isize = reads$isize,
    groupid = if (pairedEnd) reads$groupid else seq_along(reads$pos),
    status = if (pairedEnd) reads$mate_status else rep(factor("unmated", levels = c("mated", "ambiguous", "unmated")), length(reads$pos))))
}

# Create the alignments track
alTrack <- Gviz::AlignmentsTrack(
  range = "ETV1_cut_new.bam",
  isPaired = TRUE,
  genome = "hg19",
  name = "RNA coverage",
  ylim = c(0,1000),
  importFunction = importFunctionNonUCSC)

# Try to plot. More and more memory will be used until the program freezes and the only way out is to shut it down.
Gviz::plotTracks(
  alTrack,
  from = 13978871-50000,
  to = 13978871+50000,
  chromosome = "chr7")

I have not gotten any specific error messages myself, but a college got this:

## Warning in .Call2("XStringSet_xscat", args, PACKAGE = "Biostrings"):
## Reached total allocation of 8075Mb: see help(memory.size)

Which says that the problem might be in Biostrings. Can anyone here help me look into this? The .BAM file used in the example above can be downloaded here, and the .BAI file can be downloaded here.

gviz biostrings • 1.8k views

ADD COMMENT • link 7.4 years ago stianlagstad ▴ 90

0

Entering edit mode

I am not sure that I understand the issue. Are you saying that consecutive draws of the same region fill up the memory? Cause that I can't reproduce:

I only see a sharp increase in the memory profile after the first call. If this is a memory leak one should see a steady increases somewhere (most likely in the char or character types)

ADD REPLY • link 7.4 years ago florian.hahne@novartis.com ★ 1.6k

0

Entering edit mode

Thank you for responding. I'm saying that when I run plotTracks like in my post, my R session stops responding because it is demanding too much memory. The only way out is to shut down the program (killall rstudio). Were you abel to produce the plot I tried to plot?

ADD REPLY • link 7.4 years ago stianlagstad ▴ 90

0

Entering edit mode

Yes, worked without any issues. I may have more memory on my box than you have, though. At a certain point R will realize all these reads from the BAM file region in memory. If you have crazy coverage then you may reach a limit eventually, but there's nothing that can be done about that.

ADD REPLY • link 7.4 years ago florian.hahne@novartis.com ★ 1.6k

0

Entering edit mode

I was hoping something could be done about that, since I'm not interested in showing the reads. Could it not be possible to count the coverage without holding everything in memory?

ADD REPLY • link 7.4 years ago stianlagstad ▴ 90

0

Entering edit mode

You could use type="coverage" for the plot. It would still have to read in all the CIGAR strings in order to figure out gapped alignments, So not sure how much that would reduce the memory footprint. If that is not an option, just calculate the coverage directly from the BAM files (there are multiple documented ways to do this) and feed that in as a DataTrack

ADD REPLY • link 7.4 years ago florian.hahne@novartis.com ★ 1.6k

0

Entering edit mode

Alright. Thanks again!

ADD REPLY • link 7.4 years ago stianlagstad ▴ 90