Parallel comparison of two DNAString objects with IUPAC ambiguities
1
0
Entering edit mode
maltethodberg ▴ 180
@maltethodberg-9690
Last seen 1 day ago
Denmark

Is there a way to compare each letter of two DNAString objects of the same length while taking into account IUPAC nucleotide ambiguity codes?

For example:

# Example strings
x <- DNAString("ACTAGCTG")
y <- DNAString("ACYAGCBA")

# == only gives a single value
x == y

# DNAStringSets does not not use ambiguity codes
split(x) == split(y)

I would like following output: c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

The actual DNAString objects I'm looking at are up to millions of letters long.

Biostrings • 574 views
ADD COMMENT
4
Entering edit mode
Aidan ▴ 60
@3efa9cc7
Last seen 19 days ago
United States

I'm not sure if this is the best way to go about it, but I'd probably do something like the following:

compareStrings <- function(dna1, dna2){
  (as.raw(dna1) & as.raw(dna2))  > 0
}

compareStringsNoGaps <- function(dna1, dna2){
  ((as.raw(dna1) & as.raw(dna2)) & as.raw(15))  > 0
}

x <- DNAString("ATGCATGCATGC")
y <- DNAString("ATGGAYYCANNA")
compareStrings(x, y)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE

which(compareStrings(x,y))
# [1]  1  2  3  5  6  8  9 10 11


## your example
x <- DNAString("ACTAGCTG")
y <- DNAString("ACYAGCBA")
compareStrings(x,y)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

## have to AND with 15 to prevent matching to gaps, if that's important
compareStrings(DNAString("ATGC.+-"),DNAString("NNNN.+-"))                                                                    
[1]  TRUE  TRUE  TRUE  TRUE TRUE TRUE TRUE

compareStringsNoGaps(DNAString("ATGC.+-"),DNAString("NNNN.+-"))                                                                    
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

That does require that you have enough space for a raw vector as long as the DNAString, but since you're looking for an output that's logical(length(x)), that shouldn't really be a problem.

The first function returns TRUE if the two characters match (obeying ambiguities). The second returns TRUE if the two characters match AND neither character is a gap character (+-.).

Edit: just wanted to give more detail on why this works.

DNAString objects store their values internally as bits, with A=0b0001, C=0b0010, G=0b0100, T=0b1000. Ambiguity codes are created by bitwise-ORs on the values (so M = (A OR C) = 0b0011). You can see all the values for these with xscodes(DNAString()) (though they're expressed in decimal).

Calling as.raw on a DNAString object will convert it to its raw bit representation. If you then AND the strings together, only bits in common will remain. For example, if one position has M and the other A, then M AND A = 0b0011 & 0b0001 = 0b0001 = A. That means that if they have any bits in common, they'll return a non-zero value, and otherwise it'll be zero.

There's a slight hiccup in that other ambiguities are stored in the higher bits. In addition to the first four bits, the fifth bit denotes -, the sixth +, and the seventh .. If it's important to return FALSE when both strings are the same value and in -+., then you have to remove the bits above the fourth bit, which is done with a bitwise AND with 15 = 0b1111.

ADD COMMENT

Login before adding your answer.

Traffic: 921 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6