Question about parallel manipulation on CharacterList objects
li lilingdu ▴ 450
@li-lilingdu-1884
Last seen 3.3 years ago

Hi, I want to know how to manipulate CharacterList objects from the IRanges package in parallel (element by element).

For example, take a very large list of letters:

ir = successiveIRanges(width=sample(1:26,1000000,replace=TRUE))
dat = relist(sample(letters,sum(width(ir)),replace=TRUE),ir)


For each element of this CharacterList of length 1000000, I want to get the setdiff of the 26 letters and the members of that element. I tried the psetdiff function, but it doesn't work for CompressedCharacterList objects. I also don't know how to combine two CharacterList objects in parallel.

no.letters = psetdiff(letters, dat)  ## psetdiff does not work here
combined = punion(toupper(no.letters), dat)  ## attempt to combine two CharacterList objects


Any suggestions? Thanks.

s4vectors IRanges • 521 views
@martin-morgan-1513
Last seen 9 weeks ago
United States

There is probably something fast for this already, but I did it naively as

CharacterList(lapply(dat, setdiff, x=letters))

It took about 30 seconds to evaluate. To be more clever, and assuming the data isn't too large to hold in memory, I made a logical matrix of TRUE values, where rows represent the elements of dat and columns represent the letters.

m <- matrix(TRUE, nrow=length(dat), ncol=length(letters))

Then I indexed into the matrix and set to FALSE each position corresponding to a letter present in dat:

row = rep(seq_along(dat), lengths(dat))
col = match(unlist(dat), letters)
m[cbind(row, col)] = FALSE
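To make the indexing step concrete, here is a minimal base-R sketch on toy data. The 4-letter `alphabet` and the plain list `toy` are assumptions made up for illustration; they stand in for `letters` and `dat`.

```r
# Toy data: three elements drawn from a hypothetical 4-letter alphabet.
alphabet <- c("a", "b", "c", "d")
toy <- list(c("a", "c"), c("b", "b", "d"), character(0))

# Start with every (element, letter) pair marked as "not seen".
m <- matrix(TRUE, nrow = length(toy), ncol = length(alphabet))

# One (row, col) pair per letter occurrence in the data.
row <- rep(seq_along(toy), lengths(toy))
col <- match(unlist(toy), alphabet)

# A two-column matrix index addresses individual cells, so each
# observed letter clears exactly one cell (duplicates are harmless).
m[cbind(row, col)] <- FALSE
colnames(m) <- alphabet
m
```

Each row of `m` now has TRUE exactly for the letters that element lacks; the empty third element keeps a full row of TRUE.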

Finally, I retrieved the remaining TRUE values and placed them in a list:

row = row(m)[m]
col = letters[col(m)[m]]
splitAsList(col, row)

This takes about 3 seconds to evaluate. One problem: when an element of dat contains all 26 letters, its row of m has no TRUE values left, so splitAsList() creates no character(0) element for it. A solution is to update the original data structure

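## row(m)[m] is in column-major order, so sort the unique rows to
## match the (sorted) group order returned by splitAsList()
row = row(m)[m]; urow = sort(unique(row))
col = letters[col(m)[m]]
dat[urow] = splitAsList(col, row)  ## elements missing at least one letter
## elements containing every letter; setdiff() also works if urow is empty
dat[setdiff(seq_along(dat), urow)] = list(character())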
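A possible variation, not from the answer above: grouping by a factor that has one level per element of the input should keep the "nothing missing" elements as empty groups, so no patch-up step is needed and the input is left unmodified. This is a sketch that assumes splitAsList() keeps unused factor levels (its default drop=FALSE); `missing_letters` is a hypothetical helper name, not an IRanges function.

```r
library(IRanges)

# Hypothetical helper (not part of IRanges): for each element of `x`,
# return the members of `alphabet` that do not occur in it.
missing_letters <- function(x, alphabet = letters) {
    # TRUE marks "letter not seen"; rows are elements, columns are letters.
    m <- matrix(TRUE, nrow = length(x), ncol = length(alphabet))
    row <- rep(seq_along(x), lengths(x))
    col <- match(unlist(x, use.names = FALSE), alphabet)
    keep <- !is.na(col)  # ignore values that are not in the alphabet
    m[cbind(row[keep], col[keep])] <- FALSE

    # One factor level per element, so elements with nothing missing
    # come back as empty groups (assumes the default drop = FALSE).
    f <- factor(row(m)[m], levels = seq_along(x))
    ans <- splitAsList(alphabet[col(m)[m]], f)
    names(ans) <- names(x)
    ans
}

toy <- CharacterList(c("a", "c"), letters, character(0))
res <- missing_letters(toy)
```

On the real data this would be called as `missing_letters(dat)`; the full logical matrix is still built, so memory use should be similar to the approach above.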