Question

How to run getSeq() on a GRanges object with invalid DNA letters?

Jon Bråte ▴ 270

@jon-brate-6263

Last seen 19 months ago

Norway

Hi,

I have a huge FASTA file that I want to subset, i.e. pull out some of the entries and write them to a new file. I used indexFa() from Rsamtools to index the FASTA file and then made a GRanges object with gr = as(seqinfo(fa), "GRanges"). But when I try to get the sequences for the gr object, it fails because some entries contain invalid DNA letters. I would simply like to write the sequences out exactly as they were originally. I use this code to get the sequences, where x is a vector of integers giving the entries I want to keep: getSeq(fa, gr[x]).

Error message:

Error in value[[3L]](cond) : 
   record 1 (hCoV-19/Slovakia/PKM2021022700521/2021:1-29921) contains invalid DNA letters
  file: GISAID_Download_package_2021.05.25_sequences_cut_names.fasta

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=nb_NO.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nb_NO.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fastmatch_1.1-0      Rsamtools_2.2.3      Biostrings_2.54.0    XVector_0.26.0       GenomicRanges_1.38.0
 [6] GenomeInfoDb_1.22.1  IRanges_2.20.2       S4Vectors_0.24.4     BiocGenerics_0.32.0  lubridate_1.7.10    
[11] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.5          purrr_0.3.4          readr_1.4.0         
[16] tidyr_1.1.3          tibble_3.1.0         ggplot2_3.3.3        tidyverse_1.3.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6             assertthat_0.2.1       utf8_1.2.1             R6_2.5.0              
 [5] cellranger_1.1.0       backports_1.2.1        reprex_2.0.0           httr_1.4.2            
 [9] pillar_1.6.0           zlibbioc_1.32.0        rlang_0.4.10           readxl_1.3.1          
[13] rstudioapi_0.13        BiocParallel_1.20.1    RCurl_1.98-1.3         munsell_0.5.0         
[17] tinytex_0.31           broom_0.7.6            compiler_3.6.3         modelr_0.1.8          
[21] xfun_0.22              pkgconfig_2.0.3        tidyselect_1.1.0       GenomeInfoDbData_1.2.2
[25] fansi_0.4.2            crayon_1.4.1           dbplyr_2.1.1           withr_2.4.1           
[29] bitops_1.0-7           grid_3.6.3             jsonlite_1.7.2         gtable_0.3.0          
[33] lifecycle_1.0.0        DBI_1.1.1              magrittr_2.0.1         scales_1.1.1          
[37] cli_2.4.0              stringi_1.5.3          fs_1.5.0               xml2_1.3.2            
[41] ellipsis_0.3.1         generics_0.1.0         vctrs_0.3.7            tools_3.6.3           
[45] glue_1.4.2             hms_1.0.0              colorspace_2.0-0       rvest_1.0.0           
[49] haven_2.3.1

Rsamtools Biostrings • 2.8k views

ADD COMMENT • link updated 4.6 years ago by Hervé Pagès 16k • written 4.7 years ago by Jon Bråte ▴ 270

Not completely relevant, since you said you want to print exactly what you have, but you could also look at Biostrings::replaceAmbiguities().

ADD REPLY • link 4.7 years ago shepherl 4.3k

score 3 · Accepted Answer · 2021-05-27

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 9 hours ago

United States

You don't have to read in as a DNAStringSet. As an example,

> file.copy( "C:/Users/jmacdon/AppData/Roaming/R/win-library/4.1/Rsamtools/extdata/ce2dict1.fa", "tmp.fa")
## edit things manually - here is the result:
> readLines("tmp.fa")
 [1] ">pattern01"                "GCGAAACTAGGAGAGGCT"       
 [3] ">pattern02"                "CTGTTAGCTAATTTTAAAAATQRST"     <- Notice this one
 [5] ">pattern03"                "ACTACCACCCAAATTTAGATATTC" 
 [7] ">pattern04"                "AAATTTTTTTTGTTGCAAATTTGA" 
 [9] ">pattern05"                "TCTTCTTGGCTTTGGTGGTACTTTT"
> fa <- FaFile("tmp.fa")
> indexFa(fa)
class: FaFile 
path: tmp.fa
index: tmp.fa.fai
gzindex: tmp.fa.gzi
isOpen: FALSE 
yieldSize: NA 
> gr <- as(seqinfo(fa), "GRanges")
> getSeq(fa, gr[2])
Error in value[[3L]](cond) : 
   record 1 (pattern02:1-25) contains invalid DNA letters
  file: tmp.fa
> getSeq(fa, gr[2], as="AAStringSet" )
AAStringSet object of length 1:
    width seq                                               names               
[1]    25 CTGTTAGCTAATTTTAAAAATQRST                         pattern02

ADD COMMENT • link 4.7 years ago James W. MacDonald 68k

That works, thanks! But I don't understand why reading the sequences as an AAStringSet works. I thought the AAString class was only for amino acids?

ADD REPLY • link 4.7 years ago Jon Bråte ▴ 270

Well, that's true. It is only for amino acids. But the amino acid code includes all of the capital letters, so by definition you can have any capital letter and it will be OK:

> all(LETTERS %in% names(AMINO_ACID_CODE))
[1] TRUE

ADD REPLY • link 4.7 years ago James W. MacDonald 68k
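
The flip side of the check above is that the nucleotide alphabet is much smaller than the amino acid one. As a rough base-R sketch, assuming the hand-typed vector iupac_dna below matches the letters a DNAStringSet accepts (gap symbols left out), this lists which capital letters would be rejected as DNA and which letter in the edited example record triggers the error. The names iupac_dna, not_dna, example_seq and bad_letters are made up for illustration.

```r
# IUPAC nucleotide codes, typed out by hand (assumption: this matches the
# letters accepted in a DNAStringSet, ignoring gap symbols).
iupac_dna <- c("A", "C", "G", "T", "M", "R", "W", "S",
               "Y", "K", "V", "H", "D", "B", "N")

# Capital letters that are not valid nucleotide codes.
not_dna <- setdiff(LETTERS, iupac_dna)

# Letters in the manually edited example record that are not nucleotide codes.
example_seq <- "CTGTTAGCTAATTTTAAAAATQRST"
bad_letters <- setdiff(strsplit(example_seq, "")[[1]], iupac_dna)

print(not_dna)
print(bad_letters)
```

Note that R, S and T at the end of the example record are all valid nucleotide codes, so only a single letter in that record is actually a problem for the DNA reader.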

Note that you don't even need to bother using an AAStringSet object, which works here just by luck, when you can use a BStringSet object. BStringSet objects don't restrict letters: they're the analog of character vectors in base R.

H.

ADD REPLY • link 4.7 years ago Hervé Pagès 16k

Except a BStringSet isn't one of the choices for getSeq for a FaFile object. Which is why I chose AAStringSet.

> getSeq(fa, gr[2], as = "BStringSet")
Error in match.arg(as) : 
  'arg' should be one of "DNAStringSet", "RNAStringSet", "AAStringSet"

But maybe there is a better way to do what the OP wants than using getSeq?

ADD REPLY • link 4.7 years ago James W. MacDonald 68k
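
One possible route that avoids getSeq() altogether is to stay within Biostrings. This is only a sketch, assuming a reasonably recent Biostrings in which fasta.index() can index a file with seqtype = "B" and readBStringSet() accepts a subset of that index, which is how I read the Biostrings documentation. The file content, the selection vector x, and the names fasta_path, subset_seqs and out_path are placeholders, not the OP's data.

```r
library(Biostrings)

# Small stand-in for the real FASTA file (hypothetical content).
fasta_path <- tempfile(fileext = ".fa")
writeLines(c(">pattern01", "GCGAAACTAGGAGAGGCT",
             ">pattern02", "CTGTTAGCTAATTTTAAAAATQRST",
             ">pattern03", "ACTACCACCCAAATTTAGATATTC"),
           fasta_path)

# Index the records without loading the sequences; seqtype = "B" means
# no particular alphabet is enforced.
fai <- fasta.index(fasta_path, seqtype = "B")

# x: positions of the records to keep (hypothetical selection).
x <- c(2L, 3L)

# Read only the selected records as a BStringSet, then write them out.
subset_seqs <- readBStringSet(fai[x, ])
out_path <- tempfile(fileext = ".fa")
writeXStringSet(subset_seqs, out_path)
```

Because a BStringSet does not restrict letters, the sequences should come out as they went in, although writeXStringSet() re-wraps the sequence lines to its own default width.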

oops... bummer!

That restriction seems to be coming from Rsamtools::scanFa() which the getSeq() method for FaFile objects is based on.

Doesn't sound like a crazy restriction though, given that "the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences", according to the Wikipedia article on the FASTA format.

So I wonder what the Q letter in the OP's file is supposed to represent. This is the first time I've seen it in a FASTA file containing nucleotide sequences. If this is a legit thing to use in nucleotide sequences, then maybe DNA_ALPHABET in Biostrings should be extended to include this new letter, or at least Rsamtools::scanFa() should support as="BStringSet".

I guess for now we can just wait and see if this new DNA letter causes problems again in the future or if it was a one-time thing.

H.

ADD REPLY • link 4.6 years ago Hervé Pagès 16k
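
To find out which letters in a given file actually fall outside the nucleotide alphabet, one could tabulate them directly. The following is a base-R sketch only: count_non_dna_letters() is a made-up helper, the allowed set is typed by hand (IUPAC codes plus the gap symbols "-", "+" and "."), and lower-case letters are treated as their upper-case equivalents, which is an assumption about how the DNA readers behave. It reads the whole file with readLines(), so a very large file would need chunked reading instead.

```r
# Count characters in a FASTA file that are not IUPAC nucleotide codes.
count_non_dna_letters <- function(path) {
  # Hand-typed allowed set (assumption: IUPAC codes plus gap symbols).
  allowed <- c("A", "C", "G", "T", "M", "R", "W", "S", "Y", "K",
               "V", "H", "D", "B", "N", "-", "+", ".")
  lines <- readLines(path)
  # Drop the header lines; everything else is treated as sequence.
  seq_lines <- lines[!startsWith(lines, ">")]
  chars <- as.character(unlist(strsplit(toupper(seq_lines), "")))
  table(chars[!chars %in% allowed])
}

# Usage on a small stand-in file (hypothetical content).
tmp <- tempfile(fileext = ".fa")
writeLines(c(">pattern01", "GCGAAACTAGGAGAGGCT",
             ">pattern02", "CTGTTAGCTAATTTTAAAAATQRST"),
           tmp)
res <- count_non_dna_letters(tmp)
print(res)
```

An empty table would mean every sequence character is a recognised nucleotide code, in which case the error would have to come from somewhere else.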

The OP might have an amino acid FASTA file. It appears to be something from GISAID, which you need to register to get, and I'm not interested enough to go through whatever the registration process is, so...