Search
Question: Using readAAMultipleAlignment to parse CMFinder Stockholm format alignments
0
gravatar for Keith Hughitt
10 months ago by
United States
Keith Hughitt90 wrote:

Greetings,

I'm currently working on a pipeline which makes use of the downloadable version of the covariance model-based RNA motif detection program, CMFinder.

CMFinder outputs detected motifs in a modified version of the Stockholm output format. (Example: http://bio.cs.washington.edu/CMfinder/glmS.sto)

Problem

My goal is to read the motifs back into R and convert them to PWM's for further use.

Attempting to call readRNAMultipleAlignment directly, however, the fails. For example, using the glmS.sto sample output file above:

> readRNAMultipleAlignment('glmS.sto', format='stockholm')
Error in .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$") :
  alignment rows out of order
Calls: readRNAMultipleAlignment ... .read.Stockholm -> chartr -> .read.MultipleAlignment.splitRows

Debugging

The problematic function (.read.MultipleAlignment.splitRows) is defined as:

debugging in: .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$")

debug: {
    markupLines <- grep(markupPattern, rows, perl = TRUE)
    alnLines <- gaps(as(markupLines, "IRanges"), start = 1, end = length(rows))
    nseq <- unique(width(alnLines))
    if (length(nseq) != 1) 
        stop("missing alignment rows")
    rows <- extractROWS(rows, alnLines)
    spaces <- regexpr("\\s+", rows)
    ids <- substr(rows, 1L, spaces - 1L)
    nsplits <- length(rows)%/%nseq
--->if (!identical(ids, rep.int(head(ids, nseq), nsplits))) 
        stop("alignment rows out of order")
    alns <- substr(rows, spaces + attr(spaces, "match.length"), 
        nchar(rows))
    structure(do.call(paste, c(split(alns, rep(seq_len(nsplits), 
        each = nseq)), sep = "")), names = head(ids, nseq))
}

The problematic line (indicated by ---> above), fails due to the fact that ids:

> ids
 [1] "AP001507.1/288594-288961" "AB006424.1/2906-3273"    
 [3] "AE017024.1/157480-157835" "AE016998.1/163093-163448"
 [5] "AP004593.1/258808-259176" "AL596166.1/50559-50954"  

...

Looks nothing like rep.int(head(ids, nseq), nsplits):

> rep.int(head(ids, nseq), nsplits)
 [1] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [3] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [5] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
...

Which is, in turn, due to the fact the the width of the entries in alnLines is always 1:

Browse[2]> alnLines
IRanges object with 70 ranges and 0 metadata columns:
           start       end     width
       <integer> <integer> <integer>
   [1]        34        34         1
   [2]        36        36         1
   [3]        38        38         1

...

This is about as far as I've gotten into tracking down the problem.  I think the next step will be to try parsing some Stockholm files from other sources and see if that works, and if so, what the differences are in the files.

I'm hoping that there is something small that I can do either to patch Biostrings, or to modify the input file stream to enable the CMfinder outputs to be read in, but I thought It would be good to check here first in case anyone else has already solved this problem, or has suggestions.

Any help would be greatly appreciated!

Cheers,

Keith

----

System info:

> sessionInfo()
R Under development (unstable) (2017-01-08 r71936)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.43.2   XVector_0.15.0      IRanges_2.9.14     
[4] S4Vectors_0.13.5    BiocGenerics_0.21.1 colorout_1.1-2     

loaded via a namespace (and not attached):
[1] zlibbioc_1.21.0 compiler_3.4.0

 

 

ADD COMMENTlink modified 10 months ago • written 10 months ago by Keith Hughitt90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 146 users visited in the last hour