Question

Memory error while running alignments


Mihir • 0

@6fc2d61b

Last seen 5 months ago

United States

Hello, I am getting this error while aligning sequences. I use Clusterize(), AlignSeqs() and DistanceMatrix(). Clusterize() works fine, but the other two do not. There are about 50,000 sequences, each 50-200 base pairs long.


Aligning Sequences:
================================================================================

Time difference of 17679.85 secs

============
 *** caught segfault ***
address 0x14ba5fc5c040, cause 'memory not mapped'

Traceback:
 1: DistanceMatrix(aligned_seqs, verbose = TRUE, processors = 1)
An irrecoverable exception occurred. R is aborting now ...
/var/spool/uge/e241/job_scripts/20856211: line 16: 4046314 Segmentation fault

DECIPHER Alignment • 512 views

ADD COMMENT • link updated 9 weeks ago by Kevin Blighe ★ 4.0k • written 5 months ago by Mihir • 0


How much memory is available on the computer? A distance matrix of 50k sequences requires at least 50k * 50k / 2 * 8 bytes = 10 GB, assuming the output type = 'dist' in DistanceMatrix(). AlignSeqs() (depending on the parameterization) will need more memory than the distance matrix. Please provide example code if you would like more assistance.

ADD REPLY • link 3 months ago Erik Wright ▴ 150

score 0 · Answer 1 · 2025-11-21

Your error most likely comes from running out of memory. A distance matrix for 50,000 sequences requires substantial RAM. With type = 'dist' in DistanceMatrix(), it needs at least 10 GB, calculated as (50,000 * 50,000 / 2) * 8 bytes. Your traceback shows DistanceMatrix() called without a type argument; if that leaves type at 'matrix', the full square matrix needs about twice as much. AlignSeqs() may require even more memory, depending on its parameters, such as those controlling guide tree construction or refinement steps.
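
The 10 GB figure can be checked with a few lines of base R. This is only a rough sketch: the helper name dist_matrix_bytes is hypothetical (it is not part of DECIPHER), and it counts just the 8-byte doubles that hold the distances, ignoring R object overhead and any temporary copies made during the calculation.

```r
# Hypothetical helper: approximate bytes needed to store pairwise distances
# for n sequences as 8-byte doubles.
dist_matrix_bytes <- function(n, type = c("dist", "matrix")) {
  type <- match.arg(type)
  n <- as.numeric(n)  # use doubles so n * n does not overflow integer range
  # 'dist' keeps one triangle without the diagonal; 'matrix' keeps all n * n cells
  cells <- if (type == "dist") n * (n - 1) / 2 else n * n
  cells * 8
}

n <- 50000
gb_dist <- dist_matrix_bytes(n, "dist") / 1e9
gb_full <- dist_matrix_bytes(n, "matrix") / 1e9
cat(sprintf("dist: %.1f GB, full matrix: %.1f GB\n", gb_dist, gb_full))
```

Comparing these numbers with the memory actually granted to the R process (for example, the limit requested for the cluster job) shows quickly whether the full 50,000-sequence matrix can fit at all.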

The segmentation fault indicates that R attempted to access unmapped memory, which here most likely means the process ran past the RAM available to it. Your system probably cannot handle the full set of 50,000 sequences in one operation.

To resolve this, process your sequences in smaller batches. First, use Clusterize() to group similar sequences, as you mentioned it works. Then, align and compute distances within each cluster separately. This reduces memory usage per operation.

Here is an example workflow in R:

library(DECIPHER)

# Assume 'seqs' is your DNAStringSet with 50,000 sequences
clusters <- Clusterize(seqs, cutoff = 0.05)  # Adjust cutoff as needed

# Loop over unique clusters
unique_clusters <- unique(clusters$cluster)
for (cl in unique_clusters) {
  cluster_seqs <- seqs[clusters$cluster == cl]
  if (length(cluster_seqs) > 1) {  # Skip singletons if desired
    aligned <- AlignSeqs(cluster_seqs, processors = NULL)  # Use all cores
    dist_mat <- DistanceMatrix(aligned, type = "dist", verbose = TRUE)
    # Save or process dist_mat here, e.g., saveRDS(dist_mat, paste0("dist_cluster_", cl, ".rds"))
  }
}
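
Because memory grows with the square of the number of sequences in a cluster, it may be worth checking cluster sizes before running the loop. A minimal sketch in base R, assuming cluster_ids stands in for clusters$cluster (simulated here so the snippet runs on its own) and that budget_bytes is a hypothetical memory budget you choose yourself:

```r
set.seed(1)
# Stand-in for clusters$cluster: simulated assignments for 50,000 sequences
cluster_ids <- sample(1:200, 50000, replace = TRUE)

# Number of sequences in each cluster
sizes <- table(cluster_ids)
n_per_cluster <- as.numeric(sizes)

# Approximate bytes for a 'dist' object per cluster (8-byte doubles)
bytes_per_cluster <- n_per_cluster * (n_per_cluster - 1) / 2 * 8

budget_bytes <- 4e9  # hypothetical budget: 4 GB
too_big <- names(sizes)[bytes_per_cluster > budget_bytes]

cat("largest cluster:", max(n_per_cluster), "sequences\n")
cat("clusters over budget:", length(too_big), "\n")
```

Any cluster listed in too_big would need a stricter cutoff in Clusterize() or further splitting before alignment.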

If possible, increase your system's RAM or run the job on a high-memory server. Setting processors = NULL in AlignSeqs() and DistanceMatrix() uses all available cores, which may help with speed but does not reduce memory use. The verbose argument only controls progress output, so turning it off is unlikely to change memory behavior.

If this does not work, provide your exact code and system specifications for further diagnosis.

Kevin