I just found a problem with rtracklayer's import function when importing amino acid fasta (also the default way of storing amino acid sequences. (E.g. the "Protein-coding transcript translation sequences" at Gencode genes official site)).
Here is an easy to reproduce example:
### Define local output path (here i use ~/Downloads as a tmp dir)
outputPath <- '~/Downloads/'
### Make example data
aaExample <- 'MAAWRAAGLNYINYSNIAARLLRKALKPELRAQAVRRDDSHIKFTKWQNGKPEKAITE'
aaExample <- AAStringSet(aaExample)
names(aaExample) <- 'test'
### Write it to fasta file
writeXStringSet(
aaExample,
filepath = paste0(outputPath, '/aa.fasta')
)
### Verify fasta file
rawInput <- readLines(paste0(outputPath, '/aa.fasta'))
nchar(rawInput[[2]]) == nchar(aaExample)
Which is TRUE whereby we verified the produced fasta file is correct. Now let's import it with rtracklayer
imporAA <- rtracklayer::import(paste0(outputPath, '/aa.fasta'))
width(imporAA)
The resulting with is 41 and the import throws the following warning:
Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
reading FASTA file ~/Downloads//aa.fasta: ignored 17 invalid one-letter sequence codes
Hope this can be fixed as it would make it much easier to work with AA sequences.
Cheers Kristoffer
Is there in practice any difference between that and using Biostrings::readAAStringSet() ? Would you recomend using Biostrings::readDNAStringSet() or rtracklayer::import() for DNA fasta files?
It's just there for completeness (rtracklayer also provides 2bit import) and is a simple wrapper around Biostrings, which is probably the better way to go.