readDNAStringSet with non DNA characters
1
0
Entering edit mode
biomiha ▴ 20
@biomiha-11346
Last seen 4 hours ago
UK/Cambridge

Hi,

I was wondering if there is a way of specifying an argument to readDNAStringSet to ignore non-DNA characters? I have a set of .fasta files from IMGT that include their preferred gap notation, i.e. sequences that have some DNA strings followed by ..... followed by more DNA.

`>sequence1 GCTTCCACCAAGGGCCCATCGGTCTTCCCCCTGGCGCCCTGCTCCAGGAGCACCTCTGGGGGCACAGC.............GGCCCTGGGCTGCCTGGTCAAGGACTACTTCCCCGAACCGGTGACGGTGTC

So far I have used readDNAStringSet, followed by as.character, followed by str_remove, and finally DNAStringSet again. See below.


    sequence1 <- readDNAStringSet("sequence1.fa")

    sequence1 <- sequence1 |> 
      as.character() |> 
      str_remove(pattern = "\\.") |> 
      DNAStringSet() |> 
      `names<-`(names(sequence1))

Feels rather clunky. Is there a way of just skipping over (ignoring) non-DNA charcters when reading in? I've tried to add the seqtype because in other functions it seems to do what I need

    readDNAStringSet("sequence1.fa", seqtype = "DNA")

but just get this error:

Error in readDNAStringSet("sequence1.fa", seqtype = "DNA") : unused argument (seqtype = "DNA")

Thanks, Miha

Biostrings • 35 views
ADD COMMENT
0
Entering edit mode

I would try to avoid converting to and from character objects, you'll lose the main point of having objects as XString/XStringSet.

'.' is a valid character for DNAString objects (you can see them all by printing DNA_ALPHABET), so it won't create an error here.

I've tried to add the seqtype because in other functions it seems to do what I need

This is also discouraged -- I would recommend reading the help pages for functions to see what the arguments are. In this case, you can see (by checking ?readDNAStringSet) that readDNAStringSet doesn't accept an argument called "seqtype", so trying to pass it will create an error.

There isn't currently a way to skip over gap characters when reading in the sequence. You can remove them without converting to character using DECIPHER::RemoveGaps, e.g.:

library(Biostrings)
library(DECIPHER)
x <- readDNAStringSet('path/to/fasta.fa')
x <- RemoveGaps(x)
x

If you're trying to just skip over any character not in DNA_ALPHABET, you won't be able to do that out of the box. Non-valid characters will throw an error, so you'll have to use readBStringSet to read it in as a BStringSet and then work with it from there. If that's your use-case feel free to send more info; there's probably a better way but it'll depend on what you're trying to do. For this example you're sharing, the above should be enough.

ADD REPLY
0
Entering edit mode
ATpoint ★ 4.9k
@atpoint-13662
Last seen 3 hours ago
Germany

See ?readDNAStringSet which will tell that for DNAStringSet only valid characters are allowed. You can use BStringSet (readBStringSet()) though to read fasta files with any characters.

ADD COMMENT

Login before adding your answer.

Traffic: 894 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6