Memory error while running alignments
Mihir • 0
@6fc2d61b
Last seen 3 months ago
United States

Hello, I am getting this error while aligning sequences. I use Clusterize(), AlignSeqs() and DistanceMatrix(). Clusterize() works fine, but the other two do not. There are about 50,000 sequences, each 50-200 base pairs long.


Aligning Sequences:
================================================================================

Time difference of 17679.85 secs

============
 *** caught segfault ***
address 0x14ba5fc5c040, cause 'memory not mapped'

Traceback:
 1: DistanceMatrix(aligned_seqs, verbose = TRUE, processors = 1)
An irrecoverable exception occurred. R is aborting now ...
/var/spool/uge/e241/job_scripts/20856211: line 16: 4046314 Segmentation fault
DECIPHER Alignment • 371 views

How much memory is available on the computer? A distance matrix of 50k sequences requires at least 50k * 50k / 2 * 8 bytes = 10 GB, assuming the output type = 'dist' in DistanceMatrix(). AlignSeqs() (depending on the parameterization) will need more memory than the distance matrix. Please provide example code if you would like more assistance.

Kevin Blighe ★ 4.0k
@kevin
Last seen 1 hour ago
The Cave, 181 Longwood Avenue, Boston, …

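Your error most likely arises from insufficient memory on your system. A distance matrix for 50,000 sequences requires substantial RAM. With type = "dist" in DistanceMatrix(), which stores only one triangle of the matrix, this needs at least 10 gigabytes, calculated as (50,000 * 50,000 / 2) * 8 bytes. Your traceback shows DistanceMatrix() called without a type argument; as far as I know the default is type = "matrix", which stores the full square matrix and needs roughly twice that, about 20 gigabytes. The AlignSeqs() function may require even more memory, depending on its parameters, such as those controlling guide tree construction and refinement.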
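
To make the arithmetic concrete, here is a rough sketch in base R. The helper name dist_matrix_bytes() is made up for illustration and is not part of DECIPHER; it assumes 8 bytes per stored value (an R double) and ignores any working memory the function needs on top of the result.

```r
# Rough estimate of the memory needed to hold pairwise distances for n sequences.
# Assumption: 8 bytes per value (R double); overhead and temporary copies are ignored.
dist_matrix_bytes <- function(n, type = c("dist", "matrix")) {
  type <- match.arg(type)
  n <- as.numeric(n)  # doubles avoid integer overflow for large n
  cells <- if (type == "dist") n * (n - 1) / 2 else n * n
  cells * 8
}

n <- 50000
gb_dist   <- dist_matrix_bytes(n, "dist") / 1e9    # one triangle only
gb_matrix <- dist_matrix_bytes(n, "matrix") / 1e9  # full square matrix
print(c(dist = gb_dist, matrix = gb_matrix))
```

Treat these numbers as a lower bound: the R session also holds the sequences, the alignment, and any intermediate objects.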

The segmentation fault indicates that R attempted to access unmapped memory, which typically occurs when the process exceeds available RAM. Your system likely cannot handle the full set of 50,000 sequences in one operation.

To resolve this, process your sequences in smaller batches. First, use Clusterize() to group similar sequences, as you mentioned it works. Then, align and compute distances within each cluster separately. This reduces memory usage per operation.

Here is an example workflow in R:

library(DECIPHER)

# Assume 'seqs' is your DNAStringSet with 50,000 sequences
clusters <- Clusterize(seqs, cutoff = 0.05)  # Adjust cutoff as needed

# Loop over unique clusters
unique_clusters <- unique(clusters$cluster)
for (cl in unique_clusters) {
  cluster_seqs <- seqs[clusters$cluster == cl]
  if (length(cluster_seqs) > 1) {  # Skip singletons if desired
    aligned <- AlignSeqs(cluster_seqs, processors = NULL)  # Use all cores
    dist_mat <- DistanceMatrix(aligned, type = "dist", verbose = TRUE)
    # Save or process dist_mat here, e.g., saveRDS(dist_mat, paste0("dist_cluster_", cl, ".rds"))
  }
}
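
Before running the loop above, it may be worth checking how large the biggest cluster is, since one very large cluster can still exhaust memory. The following is a minimal sketch in base R. The small 'clusters' data frame is made-up stand-in data with the same 'cluster' column used above, and budget_bytes is a hypothetical limit you would set for your own machine.

```r
# Stand-in for the data frame returned by Clusterize(); values are made up.
clusters <- data.frame(cluster = c(1, 1, 1, 2, 2, 3))

# Bytes needed for one triangle of an n x n distance matrix (8 bytes per value).
pair_bytes <- function(n) as.numeric(n) * (as.numeric(n) - 1) / 2 * 8

sizes <- table(clusters$cluster)                 # sequences per cluster
est_bytes <- setNames(pair_bytes(as.vector(sizes)), names(sizes))

budget_bytes <- 8e9                              # hypothetical memory budget
oversized <- names(est_bytes)[est_bytes > budget_bytes]

print(est_bytes)
print(oversized)  # clusters to split further, e.g. with a smaller cutoff
```

Any cluster listed in 'oversized' could be re-clustered with a stricter cutoff before aligning it.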

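If possible, increase your system's RAM or run on a high-memory server; your log suggests the job runs under a grid scheduler, so requesting more memory for the job may be enough. Setting processors = NULL in AlignSeqs() and DistanceMatrix() uses all available cores, which helps with speed but does not reduce memory use. Setting verbose = FALSE only silences the progress output and is unlikely to change memory use.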

If this does not work, provide your exact code and system specifications for further diagnosis.

Kevin
