Using readAAMultipleAlignment to parse CMFinder Stockholm format alignments
0
0
Entering edit mode
Keith Hughitt ▴ 180
@keith-hughitt-6740
Last seen 9 months ago
United States

Greetings,

I'm currently working on a pipeline which makes use of the downloadable version of the covariance model-based RNA motif detection program, CMFinder.

CMFinder outputs detected motifs in a modified version of the Stockholm output format. (Example: http://bio.cs.washington.edu/CMfinder/glmS.sto)

Problem

My goal is to read the motifs back into R and convert them to PWM's for further use.

Attempting to call readRNAMultipleAlignment directly, however, the fails. For example, using the glmS.sto sample output file above:

> readRNAMultipleAlignment('glmS.sto', format='stockholm')
Error in .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$") :
  alignment rows out of order
Calls: readRNAMultipleAlignment ... .read.Stockholm -> chartr -> .read.MultipleAlignment.splitRows

Debugging

The problematic function (.read.MultipleAlignment.splitRows) is defined as:

debugging in: .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$")

debug: {
    markupLines <- grep(markupPattern, rows, perl = TRUE)
    alnLines <- gaps(as(markupLines, "IRanges"), start = 1, end = length(rows))
    nseq <- unique(width(alnLines))
    if (length(nseq) != 1) 
        stop("missing alignment rows")
    rows <- extractROWS(rows, alnLines)
    spaces <- regexpr("\\s+", rows)
    ids <- substr(rows, 1L, spaces - 1L)
    nsplits <- length(rows)%/%nseq
--->if (!identical(ids, rep.int(head(ids, nseq), nsplits))) 
        stop("alignment rows out of order")
    alns <- substr(rows, spaces + attr(spaces, "match.length"), 
        nchar(rows))
    structure(do.call(paste, c(split(alns, rep(seq_len(nsplits), 
        each = nseq)), sep = "")), names = head(ids, nseq))
}

The problematic line (indicated by ---> above), fails due to the fact that ids:

> ids
 [1] "AP001507.1/288594-288961" "AB006424.1/2906-3273"    
 [3] "AE017024.1/157480-157835" "AE016998.1/163093-163448"
 [5] "AP004593.1/258808-259176" "AL596166.1/50559-50954"  

...

Looks nothing like rep.int(head(ids, nseq), nsplits):

> rep.int(head(ids, nseq), nsplits)
 [1] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [3] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [5] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
...

Which is, in turn, due to the fact the the width of the entries in alnLines is always 1:

Browse[2]> alnLines
IRanges object with 70 ranges and 0 metadata columns:
           start       end     width
       <integer> <integer> <integer>
   [1]        34        34         1
   [2]        36        36         1
   [3]        38        38         1

...

This is about as far as I've gotten into tracking down the problem.  I think the next step will be to try parsing some Stockholm files from other sources and see if that works, and if so, what the differences are in the files.

I'm hoping that there is something small that I can do either to patch Biostrings, or to modify the input file stream to enable the CMfinder outputs to be read in, but I thought It would be good to check here first in case anyone else has already solved this problem, or has suggestions.

Any help would be greatly appreciated!

Cheers,

Keith

----

System info:

> sessionInfo()
R Under development (unstable) (2017-01-08 r71936)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.43.2   XVector_0.15.0      IRanges_2.9.14     
[4] S4Vectors_0.13.5    BiocGenerics_0.21.1 colorout_1.1-2     

loaded via a namespace (and not attached):
[1] zlibbioc_1.21.0 compiler_3.4.0

 

 

biostrings stockholm msa cmfinder • 1.4k views
ADD COMMENT

Login before adding your answer.

Traffic: 529 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6