5 months ago by
The main problem with those codes is that they are not 1-letter codes, so they wouldn't play well with the "1 letter per nucleotide" paradigm that is at the core of nucleotide sequence representation in Biostrings. So the first thing I would recommend is to use 1-letter codes to represent those exotic nucleotide bases.
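To illustrate the re-coding step, here is a minimal sketch in base R. The multi-letter codes (`Psi`, `m1A`) and the single letters chosen for them are hypothetical placeholders, not an established convention; substitute whatever codes your data actually uses.

```r
# Hypothetical lookup table: multi-letter modification codes -> single letters.
# Both the names and the chosen letters are assumptions for illustration only.
code_map <- c(A = "A", C = "C", G = "G", U = "U", Psi = "P", m1A = "1")

# A sequence given as one token per nucleotide
tokens <- c("G", "C", "Psi", "A", "m1A", "U")

# Fail early if a token has no 1-letter equivalent
stopifnot(all(tokens %in% names(code_map)))

# Translate and collapse into a "1 letter per nucleotide" string
one_letter <- paste(code_map[tokens], collapse = "")
one_letter
```

The resulting string has exactly one character per nucleotide, which is what any Biostrings container expects.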
Another problem, as you found out, is that DNAString/DNAStringSet and RNAString/RNAStringSet objects only allow letters that belong to predefined alphabets: DNA_ALPHABET and RNA_ALPHABET, respectively. Even though it would be possible (at least in theory) to extend these alphabets to support new letters, this is not a change to make lightly, so it would need to be considered very carefully and supported by a strong use case. And even then it might not be the right thing to do.
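A quick way to see the restriction in practice (a small sketch; the letter "J" is used only because it is not part of RNA_ALPHABET):

```r
library(Biostrings)

# The predefined alphabets are plain character vectors
DNA_ALPHABET
RNA_ALPHABET

# Constructing an RNAString with a letter outside RNA_ALPHABET fails.
# tryCatch() captures the error message instead of stopping the script.
res <- tryCatch(
    RNAString("ACGUJ"),
    error = function(e) conditionMessage(e)
)
res
```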
An important question is what kind of sequences you are dealing with. If these exotic nucleotides don't show up in DNA or mRNA molecules, then the DNAString/DNAStringSet or RNAString/RNAStringSet classes are probably not the appropriate classes to represent your sequences in the first place. So maybe you have a case where implementing a new specialized XString concrete subclass would be more appropriate (note that XString is a virtual class with currently 4 concrete subclasses: BString, DNAString, RNAString, and AAString). Then you would be free to choose the alphabet you want to support for this new XString subclass. I believe that the Modstrings package (submitted a couple of weeks ago and still pending review, see here) does something like that, i.e. it defines its own XString/XStringSet subclasses, so it is probably a good place to look if you decide to go this route.
Another much simpler option is to just use BString/BStringSet objects. No enforced alphabets for these objects but hey, with character vectors you don't get that kind of enforcement either. At least by using BString/BStringSet objects you can take advantage of the efficient internal representation and fast string matching facilities provided by the Biostrings package.
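For example, a minimal sketch of the BString/BStringSet route. The letters "P" and "D" stand in for modified bases here; that choice is purely an assumption for illustration, and the sequences are made up.

```r
library(Biostrings)

# BStringSet accepts any letters, so custom 1-letter codes work as-is.
# "P" and "D" are hypothetical codes for modified nucleotides.
seqs <- BStringSet(c(seq1 = "GCPAGDPCA", seq2 = "APGDD"))

# Sequence lengths
width(seqs)

# Count occurrences of the "P" code in each sequence
n_p <- vcountPattern("P", seqs)
n_p

# Exact pattern matching on a single sequence
hits <- matchPattern("GD", seqs[[1]])
hits
```

Nothing stops an unexpected letter from slipping in, so if you need validation you would have to check the letters yourself before building the object.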
That being said, if using character vectors does the job for you and performance is reasonable (maybe your sequences are short and you don't deal with hundreds of thousands of them; are you dealing with tRNA?), then you might just want to stick with that. You may have a use case where re-using the Biostrings infrastructure is not worth it, and that's ok.