Question

Using readAAMultipleAlignment to parse CMFinder Stockholm format alignments

0

Entering edit mode

Keith Hughitt ▴ 180

@keith-hughitt-6740

Last seen 8 weeks ago

United States

Greetings,

I'm currently working on a pipeline which makes use of the downloadable version of the covariance model-based RNA motif detection program, CMFinder.

CMFinder outputs detected motifs in a modified version of the Stockholm output format. (Example: http://bio.cs.washington.edu/CMfinder/glmS.sto)

Problem

My goal is to read the motifs back into R and convert them to PWM's for further use.

Attempting to call readRNAMultipleAlignment directly, however, the fails. For example, using the glmS.sto sample output file above:

> readRNAMultipleAlignment('glmS.sto', format='stockholm')
Error in .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$") :
  alignment rows out of order
Calls: readRNAMultipleAlignment ... .read.Stockholm -> chartr -> .read.MultipleAlignment.splitRows

Debugging

The problematic function (.read.MultipleAlignment.splitRows) is defined as:

debugging in: .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$")

debug: {
    markupLines <- grep(markupPattern, rows, perl = TRUE)
    alnLines <- gaps(as(markupLines, "IRanges"), start = 1, end = length(rows))
    nseq <- unique(width(alnLines))
    if (length(nseq) != 1) 
        stop("missing alignment rows")
    rows <- extractROWS(rows, alnLines)
    spaces <- regexpr("\\s+", rows)
    ids <- substr(rows, 1L, spaces - 1L)
    nsplits <- length(rows)%/%nseq
--->if (!identical(ids, rep.int(head(ids, nseq), nsplits))) 
        stop("alignment rows out of order")
    alns <- substr(rows, spaces + attr(spaces, "match.length"), 
        nchar(rows))
    structure(do.call(paste, c(split(alns, rep(seq_len(nsplits), 
        each = nseq)), sep = "")), names = head(ids, nseq))
}

The problematic line (indicated by ---> above), fails due to the fact that ids:

> ids
 [1] "AP001507.1/288594-288961" "AB006424.1/2906-3273"    
 [3] "AE017024.1/157480-157835" "AE016998.1/163093-163448"
 [5] "AP004593.1/258808-259176" "AL596166.1/50559-50954"  

...

Looks nothing like rep.int(head(ids, nseq), nsplits):

> rep.int(head(ids, nseq), nsplits)
 [1] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [3] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
 [5] "AP001507.1/288594-288961" "AP001507.1/288594-288961"
...

Which is, in turn, due to the fact the the width of the entries in alnLines is always 1:

Browse[2]> alnLines
IRanges object with 70 ranges and 0 metadata columns:
           start       end     width
       <integer> <integer> <integer>
   [1]        34        34         1
   [2]        36        36         1
   [3]        38        38         1

...

This is about as far as I've gotten into tracking down the problem. I think the next step will be to try parsing some Stockholm files from other sources and see if that works, and if so, what the differences are in the files.

I'm hoping that there is something small that I can do either to patch Biostrings, or to modify the input file stream to enable the CMfinder outputs to be read in, but I thought It would be good to check here first in case anyone else has already solved this problem, or has suggestions.

Any help would be greatly appreciated!

Cheers,

Keith

----

System info:

> sessionInfo()
R Under development (unstable) (2017-01-08 r71936)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] Biostrings_2.43.2 XVector_0.15.0 IRanges_2.9.14
[4] S4Vectors_0.13.5 BiocGenerics_0.21.1 colorout_1.1-2

loaded via a namespace (and not attached):
[1] zlibbioc_1.21.0 compiler_3.4.0

biostrings stockholm msa cmfinder • 1.2k views

ADD COMMENT • link 7.3 years ago Keith Hughitt ▴ 180