Question

DNAStringSet to DNAStringSetList according to pattern in sequence names

s.ghignone ▴ 10

@sghignone-7573

European Union/Italy/Turin/CNR

Given a very simple DNAStringSet, built like this:

afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC"))
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC1_2")

I would like to get a DNAStringSetList in which the list elements are grouped by a pattern in the sequence names.
In this example, I would expect a list of 3 elements (ABC1, ABC2, ABC3), with the first element containing sequences #1 and #4 (ABC1_1 and ABC1_2), and so on.

dnastringset dnastringsetlist seqnames

score 2 · Accepted Answer · 2018-11-05

This code works for the given example (the grouping vector has one entry per sequence, obtained by stripping the trailing "_<number>" from each name):

splitAsList(afastafile, sub("_\\d+$", "", names(afastafile)))
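
A minimal end-to-end sketch of this grouping on the example data, assuming the Biostrings package is installed (it attaches IRanges and S4Vectors, which provide splitAsList). The names groups and dnalist are illustrative only. The result should be a DNAStringSetList with one element per group label:

```r
library(Biostrings)  # also attaches IRanges/S4Vectors, which provide splitAsList()

afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC"))
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC1_2")

# One group label per sequence: drop the trailing "_<digits>" suffix
groups <- sub("_\\d+$", "", names(afastafile))

# Split into a list-like object with one element per group label
dnalist <- splitAsList(afastafile, groups)

names(dnalist)             # group labels
names(dnalist[["ABC1"]])   # names of the sequences placed in the ABC1 group
```

The key point is that the grouping vector has the same length as the DNAStringSet, so each sequence is assigned to a group explicitly.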

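For datasets with a more complex sequence-naming scheme, this is the working code. It groups by everything after the third underscore in each name, and uses "fct_inorder" from the package forcats to keep the groups in order of first appearance: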

( all.cds.list<-splitAsList(all.cds, fct_inorder(sub('(^[^_]+_[^_]+_[^_]+)_(.*)$', "\\2", names(all.cds)))) )
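
To see what that grouping key looks like, here is a small base-R sketch. The sequence names in nms are hypothetical (the real names in all.cds are not shown here), and factor(key, levels = unique(key)) is used as a base-R stand-in that gives the levels in order of first appearance:

```r
# Hypothetical names: three "_"-separated ID fields followed by a group label
nms <- c("sp1_chr1_g001_ligase", "sp1_chr1_g002_kinase", "sp2_chr4_g107_ligase")

# Capture group 2 is everything after the third underscore
key <- sub('(^[^_]+_[^_]+_[^_]+)_(.*)$', "\\2", nms)
print(key)

# Levels in order of first appearance (a default factor would sort them alphabetically)
f <- factor(key, levels = unique(key))
print(levels(f))
```

Passing a factor like f as the second argument of splitAsList keeps the list elements in the order in which the groups first occur in the input, rather than in alphabetical order.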

Hope it helps,

s.-