Hi,
I have a huge fasta file that I want to subset, i.e. pull out a subset of the entries and write them to a file.
I used indexFa()
from RSamtools
to create an index of the fasta file and then I made a GRanges object with gr = as(seqinfo(fa), "GRanges")
. But when I try to get the sequences from the gr
object it fails because some entries contain invalid DNA letters. But I would like to simply write the sequences as they were originally. I use this code to get the sequences, where x
is a vector of integers representing entries I want to keep: getSeq(fa, gr[x])
.
Error message:
Error in value[[3L]](cond) :
record 1 (hCoV-19/Slovakia/PKM2021022700521/2021:1-29921) contains invalid DNA letters
file: GISAID_Download_package_2021.05.25_sequences_cut_names.fasta
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nb_NO.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=nb_NO.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] fastmatch_1.1-0 Rsamtools_2.2.3 Biostrings_2.54.0 XVector_0.26.0 GenomicRanges_1.38.0
[6] GenomeInfoDb_1.22.1 IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0 lubridate_1.7.10
[11] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4 readr_1.4.0
[16] tidyr_1.1.3 tibble_3.1.0 ggplot2_3.3.3 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 assertthat_0.2.1 utf8_1.2.1 R6_2.5.0
[5] cellranger_1.1.0 backports_1.2.1 reprex_2.0.0 httr_1.4.2
[9] pillar_1.6.0 zlibbioc_1.32.0 rlang_0.4.10 readxl_1.3.1
[13] rstudioapi_0.13 BiocParallel_1.20.1 RCurl_1.98-1.3 munsell_0.5.0
[17] tinytex_0.31 broom_0.7.6 compiler_3.6.3 modelr_0.1.8
[21] xfun_0.22 pkgconfig_2.0.3 tidyselect_1.1.0 GenomeInfoDbData_1.2.2
[25] fansi_0.4.2 crayon_1.4.1 dbplyr_2.1.1 withr_2.4.1
[29] bitops_1.0-7 grid_3.6.3 jsonlite_1.7.2 gtable_0.3.0
[33] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
[37] cli_2.4.0 stringi_1.5.3 fs_1.5.0 xml2_1.3.2
[41] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.7 tools_3.6.3
[45] glue_1.4.2 hms_1.0.0 colorspace_2.0-0 rvest_1.0.0
[49] haven_2.3.1
Not completely relevant since you said you wanted to print exactly what you have but you could also look at
Biostrings::replaceAmbiguities