Question: Using readAAMultipleAlignment to parse CMFinder Stockholm format alignments
gravatar for Keith Hughitt
20 months ago by
Keith Hughitt120
United States
Keith Hughitt120 wrote:


I'm currently working on a pipeline which makes use of the downloadable version of the covariance model-based RNA motif detection program, CMFinder.

CMFinder outputs detected motifs in a modified version of the Stockholm output format. (Example:


My goal is to read the motifs back into R and convert them to PWM's for further use.

Attempting to call readRNAMultipleAlignment directly, however, the fails. For example, using the glmS.sto sample output file above:

> readRNAMultipleAlignment('glmS.sto', format='stockholm')
Error in .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$") :
  alignment rows out of order
Calls: readRNAMultipleAlignment ... .read.Stockholm -> chartr -> .read.MultipleAlignment.splitRows


The problematic function (.read.MultipleAlignment.splitRows) is defined as:

debugging in: .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$")

debug: {
    markupLines <- grep(markupPattern, rows, perl = TRUE)
    alnLines <- gaps(as(markupLines, "IRanges"), start = 1, end = length(rows))
    nseq <- unique(width(alnLines))
    if (length(nseq) != 1) 
        stop("missing alignment rows")
    rows <- extractROWS(rows, alnLines)
    spaces <- regexpr("\\s+", rows)
    ids <- substr(rows, 1L, spaces - 1L)
    nsplits <- length(rows)%/%nseq
--->if (!identical(ids,, nseq), nsplits))) 
        stop("alignment rows out of order")
    alns <- substr(rows, spaces + attr(spaces, "match.length"), 
    structure(, c(split(alns, rep(seq_len(nsplits), 
        each = nseq)), sep = "")), names = head(ids, nseq))

The problematic line (indicated by ---> above), fails due to the fact that ids:

> ids
 [1] "AP001507.1/288594-288961" "AB006424.1/2906-3273"    
 [3] "AE017024.1/157480-157835" "AE016998.1/163093-163448"
 [5] "AP004593.1/258808-259176" "AL596166.1/50559-50954"  


Looks nothing like, nseq), nsplits):

>, nseq), nsplits)
 [1] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [3] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [5] "AP001507.1/288594-288961" "AP001507.1/288594-288961"

Which is, in turn, due to the fact the the width of the entries in alnLines is always 1:

Browse[2]> alnLines
IRanges object with 70 ranges and 0 metadata columns:
           start       end     width
       <integer> <integer> <integer>
   [1]        34        34         1
   [2]        36        36         1
   [3]        38        38         1


This is about as far as I've gotten into tracking down the problem.  I think the next step will be to try parsing some Stockholm files from other sources and see if that works, and if so, what the differences are in the files.

I'm hoping that there is something small that I can do either to patch Biostrings, or to modify the input file stream to enable the CMfinder outputs to be read in, but I thought It would be good to check here first in case anyone else has already solved this problem, or has suggestions.

Any help would be greatly appreciated!




System info:

> sessionInfo()
R Under development (unstable) (2017-01-08 r71936)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.43.2   XVector_0.15.0      IRanges_2.9.14     
[4] S4Vectors_0.13.5    BiocGenerics_0.21.1 colorout_1.1-2     

loaded via a namespace (and not attached):
[1] zlibbioc_1.21.0 compiler_3.4.0



ADD COMMENTlink modified 20 months ago • written 20 months ago by Keith Hughitt120
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 409 users visited in the last hour