Question

readDNAStringSet with non DNA characters

0

Entering edit mode

biomiha ▴ 20

@biomiha-11346

Last seen 4 months ago

UK/Cambridge

Hi,

I was wondering if there is a way of specifying an argument to readDNAStringSet to ignore non-DNA characters? I have a set of .fasta files from IMGT that include their preferred gap notation, i.e. sequences that have some DNA strings followed by ..... followed by more DNA.

`>sequence1 GCTTCCACCAAGGGCCCATCGGTCTTCCCCCTGGCGCCCTGCTCCAGGAGCACCTCTGGGGGCACAGC.............GGCCCTGGGCTGCCTGGTCAAGGACTACTTCCCCGAACCGGTGACGGTGTC

So far I have used readDNAStringSet, followed by as.character, followed by str_remove, and finally DNAStringSet again. See below.


    sequence1 <- readDNAStringSet("sequence1.fa")

    sequence1 <- sequence1 |> 
      as.character() |> 
      str_remove(pattern = "\\.") |> 
      DNAStringSet() |> 
      `names<-`(names(sequence1))

Feels rather clunky. Is there a way of just skipping over (ignoring) non-DNA charcters when reading in? I've tried to add the seqtype because in other functions it seems to do what I need

    readDNAStringSet("sequence1.fa", seqtype = "DNA")

but just get this error:

Error in readDNAStringSet("sequence1.fa", seqtype = "DNA") : unused argument (seqtype = "DNA")

Thanks, Miha

Biostrings • 494 views

ADD COMMENT • link 5 months ago • updated 4 months ago biomiha ▴ 20

1

Entering edit mode

I would try to avoid converting to and from character objects, you'll lose the main point of having objects as XString/XStringSet.

'.' is a valid character for DNAString objects (you can see them all by printing DNA_ALPHABET), so it won't create an error here.

I've tried to add the seqtype because in other functions it seems to do what I need

This is also discouraged -- I would recommend reading the help pages for functions to see what the arguments are. In this case, you can see (by checking ?readDNAStringSet) that readDNAStringSet doesn't accept an argument called "seqtype", so trying to pass it will create an error.

There isn't currently a way to skip over gap characters when reading in the sequence. You can remove them without converting to character using DECIPHER::RemoveGaps, e.g.:

library(Biostrings)
library(DECIPHER)
x <- readDNAStringSet('path/to/fasta.fa')
x <- RemoveGaps(x)
x

If you're trying to just skip over any character not in DNA_ALPHABET, you won't be able to do that out of the box. Non-valid characters will throw an error, so you'll have to use readBStringSet to read it in as a BStringSet and then work with it from there. If that's your use-case feel free to send more info; there's probably a better way but it'll depend on what you're trying to do. For this example you're sharing, the above should be enough.

ADD REPLY • link 5 months ago Aidan ▴ 70

0

Entering edit mode

DECIPHER does the trick, thank you!

ADD REPLY • link 4 months ago biomiha ▴ 20

score 0 · Answer 1 · 2025-09-15

0

Entering edit mode

ATpoint ★ 5.0k

@atpoint-13662

Last seen 1 day ago

Germany

See ?readDNAStringSet which will tell that for DNAStringSet only valid characters are allowed. You can use BStringSet (readBStringSet()) though to read fasta files with any characters.

ADD COMMENT • link 5 months ago ATpoint ★ 5.0k