Hi,
I was wondering if there is a way of specifying an argument to readDNAStringSet to ignore non-DNA characters? I have a set of .fasta files from IMGT that include their preferred gap notation, i.e. sequences that have some DNA strings followed by ..... followed by more DNA.
`>sequence1
GCTTCCACCAAGGGCCCATCGGTCTTCCCCCTGGCGCCCTGCTCCAGGAGCACCTCTGGGGGCACAGC.............GGCCCTGGGCTGCCTGGTCAAGGACTACTTCCCCGAACCGGTGACGGTGTC
So far I have used readDNAStringSet
, followed by as.character
, followed by str_remove
, and finally DNAStringSet
again. See below.
sequence1 <- readDNAStringSet("sequence1.fa")
sequence1 <- sequence1 |>
as.character() |>
str_remove(pattern = "\\.") |>
DNAStringSet() |>
`names<-`(names(sequence1))
Feels rather clunky. Is there a way of just skipping over (ignoring) non-DNA charcters when reading in? I've tried to add the seqtype because in other functions it seems to do what I need
readDNAStringSet("sequence1.fa", seqtype = "DNA")
but just get this error:
Error in readDNAStringSet("sequence1.fa", seqtype = "DNA") : unused argument (seqtype = "DNA")
Thanks, Miha
I would try to avoid converting to and from
character
objects, you'll lose the main point of having objects asXString
/XStringSet
.'.'
is a valid character for DNAString objects (you can see them all by printingDNA_ALPHABET
), so it won't create an error here.This is also discouraged -- I would recommend reading the help pages for functions to see what the arguments are. In this case, you can see (by checking
?readDNAStringSet
) thatreadDNAStringSet
doesn't accept an argument called "seqtype", so trying to pass it will create an error.There isn't currently a way to skip over gap characters when reading in the sequence. You can remove them without converting to
character
usingDECIPHER::RemoveGaps
, e.g.:If you're trying to just skip over any character not in
DNA_ALPHABET
, you won't be able to do that out of the box. Non-valid characters will throw an error, so you'll have to usereadBStringSet
to read it in as aBStringSet
and then work with it from there. If that's your use-case feel free to send more info; there's probably a better way but it'll depend on what you're trying to do. For this example you're sharing, the above should be enough.