2.2 years ago by
nicholasbauer0 wrote:

I'm trying to perform set operations (union, intersect, setdiff) on DNAStringSets, but doing so strips off the names. How can I do set operations while keeping the names intact?

modified 2.2 years ago by Hervé Pagès ♦♦ 13k • written 2.2 years ago by nicholasbauer0

Looking at the source: the SetOperation method calls unique on its arguments, converts them to character vectors, then performs the set operation on those vectors. But as.character strips attributes. I can see the use of this in some cases, but it appears inconsistent with the rest of the class, since no other part of the class seems to assume or enforce that sequences are unique, or that names should be dropped when they could be preserved.

I don't know the proper way to do this in R, but for the purposes of the set operations, could the name be appended to each element of the character vector before the set operation and then extracted afterwards?

I discovered that as.character accepts a use.names argument, but setting it has no effect on the result of the set operation.
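For illustration, the name-stripping can be reproduced with plain named character vectors, which is roughly what the set operation ends up working on after the coercion. This is only a minimal base R sketch, not the Biostrings code itself; the vectors `x` and `y` and their names are made up for the example.

```r
# Hypothetical named character vectors standing in for the sequences
# after coercion with as.character().
x <- c(seq1 = "ACGT", seq2 = "TTGA", seq3 = "ACGT")
y <- c(other = "ACGT", extra = "GGCC")

# The base set operations coerce with as.vector(), which drops the names.
stripped <- intersect(x, y)
names(stripped)   # NULL

# Subsetting x directly keeps the names of the retained elements.
kept <- x[!duplicated(x) & x %in% y]
names(kept)       # "seq1"
```

So the names are lost in the set operation itself, whatever as.character was told to do beforehand.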

2.2 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

The implementation of the set operations for XStringSet objects is a relic from prehistoric times. A better (and more generic) implementation is:

setMethod("union", c("Vector", "Vector"),
    function(x, y) unique(c(x, y))
)
setMethod("intersect", c("Vector", "Vector"),
    function(x, y) unique(x[x %in% y])
)
setMethod("setdiff", c("Vector", "Vector"),
    function(x, y) unique(x[!(x %in% y)])
)

They don't coerce to a character vector internally (so they are more efficient), and they propagate the names and metadata columns of the first argument (x).

Note that right now, if you define the above methods (by copy/pasting the above code into your session), the more specific methods for XStringSet objects will get in the way: dispatch will still select the methods for XStringSet objects. So for now, to work around this, you would need to replace the occurrences of Vector with XStringSet. I'm in the process of adding the above methods to the S4Vectors package (where they belong) and removing the old methods for XStringSet objects from the Biostrings package. I'll let you know when I'm done.
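As a rough sketch of that workaround, the same three method bodies can be registered for XStringSet instead of Vector. It assumes Biostrings is installed and loaded; the example sequences and their names are made up. On versions where the old XStringSet methods are already gone, these definitions should not be needed.

```r
library(Biostrings)

# Same bodies as the Vector methods above, but registered for
# XStringSet so that dispatch selects them over the older methods.
setMethod("union", c("XStringSet", "XStringSet"),
    function(x, y) unique(c(x, y))
)
setMethod("intersect", c("XStringSet", "XStringSet"),
    function(x, y) unique(x[x %in% y])
)
setMethod("setdiff", c("XStringSet", "XStringSet"),
    function(x, y) unique(x[!(x %in% y)])
)

# Hypothetical example data; the sequence names are made up.
x <- DNAStringSet(c(seq1 = "ACGT", seq2 = "TTGA"))
y <- DNAStringSet(c(other = "ACGT", extra = "GGCC"))

# The results are subsets of x, so they should carry names from x.
names(intersect(x, y))
names(setdiff(x, y))
```

Because intersect and setdiff only subset x, any names on y play no role in their results.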

Cheers,

H.

Awesome, thanks!

Done in S4Vectors 0.10.1 and Biostrings 2.40.1. It will take about 48 hours before they become available via biocLite().

Cheers,

H.