Greetings,
I'm currently working on a pipeline which makes use of the downloadable version of the covariance model-based RNA motif detection program, CMFinder.
CMFinder outputs detected motifs in a modified version of the Stockholm output format. (Example: http://bio.cs.washington.edu/CMfinder/glmS.sto)
Problem
My goal is to read the motifs back into R and convert them to PWM's for further use.
Attempting to call readRNAMultipleAlignment
directly, however, the fails. For example, using the glmS.sto sample output file above:
> readRNAMultipleAlignment('glmS.sto', format='stockholm') Error in .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$") : alignment rows out of order Calls: readRNAMultipleAlignment ... .read.Stockholm -> chartr -> .read.MultipleAlignment.splitRows
Debugging
The problematic function (.read.MultipleAlignment.splitRows
) is defined as:
debugging in: .read.MultipleAlignment.splitRows(rows, "(^\\s*|^#.*|^//\\s*)$")
debug: {
markupLines <- grep(markupPattern, rows, perl = TRUE)
alnLines <- gaps(as(markupLines, "IRanges"), start = 1, end = length(rows))
nseq <- unique(width(alnLines))
if (length(nseq) != 1)
stop("missing alignment rows")
rows <- extractROWS(rows, alnLines)
spaces <- regexpr("\\s+", rows)
ids <- substr(rows, 1L, spaces - 1L)
nsplits <- length(rows)%/%nseq
--->if (!identical(ids, rep.int(head(ids, nseq), nsplits)))
stop("alignment rows out of order")
alns <- substr(rows, spaces + attr(spaces, "match.length"),
nchar(rows))
structure(do.call(paste, c(split(alns, rep(seq_len(nsplits),
each = nseq)), sep = "")), names = head(ids, nseq))
}
The problematic line (indicated by ---> above), fails due to the fact that ids
:
> ids [1] "AP001507.1/288594-288961" "AB006424.1/2906-3273" [3] "AE017024.1/157480-157835" "AE016998.1/163093-163448" [5] "AP004593.1/258808-259176" "AL596166.1/50559-50954" ...
Looks nothing like rep.int(head(ids, nseq), nsplits)
:
> rep.int(head(ids, nseq), nsplits) [1] "AP001507.1/288594-288961" "AP001507.1/288594-288961" [3] "AP001507.1/288594-288961" "AP001507.1/288594-288961" [5] "AP001507.1/288594-288961" "AP001507.1/288594-288961" ...
Which is, in turn, due to the fact the the width of the entries in alnLines
is always 1:
Browse[2]> alnLines IRanges object with 70 ranges and 0 metadata columns: start end width <integer> <integer> <integer> [1] 34 34 1 [2] 36 36 1 [3] 38 38 1 ...
This is about as far as I've gotten into tracking down the problem. I think the next step will be to try parsing some Stockholm files from other sources and see if that works, and if so, what the differences are in the files.
I'm hoping that there is something small that I can do either to patch Biostrings, or to modify the input file stream to enable the CMfinder outputs to be read in, but I thought It would be good to check here first in case anyone else has already solved this problem, or has suggestions.
Any help would be greatly appreciated!
Cheers,
Keith
----
System info:
> sessionInfo()
R Under development (unstable) (2017-01-08 r71936)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] Biostrings_2.43.2 XVector_0.15.0 IRanges_2.9.14
[4] S4Vectors_0.13.5 BiocGenerics_0.21.1 colorout_1.1-2
loaded via a namespace (and not attached):
[1] zlibbioc_1.21.0 compiler_3.4.0