Dear All,
I have a large fasta.gz file (645,000 sequences) that I need to translate to amino acids (AA). I am having trouble with the translate() function because it does not seem to handle gaps ('---') and returns an error. I also need to remove 'X' from the output, because the program I use downstream does not recognize unidentified AA.
I would like gaps '---' to be replaced with an empty string '' during translation, through an option along the lines of if.fuzzy.codon = 'solve'. Ideally, ambiguous codons would also be translated to '' instead of 'X', since I will ultimately have to remove every unknown AA from my input file.
I have searched for terms such as gap sequences, unknown aa, ambiguous aa, translate() documentation, and DNAString documentation, but have not found a solution. I would appreciate any pointers or tips.
# Read all 645k sequences and their headers into a DNAStringSet
library(Biostrings)
orthologs <- readDNAStringSet('protein_coding_orthologs_dna_cleaned.fasta.gz')
# With no start/end/width, subseq() returns (and prints) the set unchanged
subseq(orthologs)
# Loop to translate DNA > AA and write each AA sequence in the input format for TANGO
# for (i in seq_along(orthologs))  # full run
for (i in 1:50) {
  # Set the name before tryCatch so the error handler reports the current sequence
  name <- names(orthologs)[i]
  tryCatch({
    aa <- Biostrings::translate(orthologs[i], if.fuzzy.codon = "solve")
    cat(name, "N N 7 298 0.1", as.character(aa), "\n", file = 'tangoinput.txt', append = TRUE)
  }, error = function(e) {
    cat("ERROR:", name, conditionMessage(e), "\n")
  })
}
#ERROR: >header_name not a base at pos 2914
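One workaround that might fit, shown only as a minimal sketch: strip the gap characters before calling translate(), then strip any remaining 'X' from the translated strings. The toy sequences and the names seq1 and seq2 are hypothetical stand-ins for the real file, and the sketch assumes that gaps always span whole codons (as in a codon alignment), so removing them does not shift the reading frame.

```r
library(Biostrings)

# Hypothetical toy input standing in for the real 645k-sequence file
dna <- DNAStringSet(c(seq1 = "ATG---GCCNNNTAA", seq2 = "ATGGCN---AAA"))

# Remove gap characters from the plain strings, then rebuild the DNAStringSet
ungapped <- DNAStringSet(gsub("-", "", as.character(dna), fixed = TRUE))
names(ungapped) <- names(dna)  # restore the headers explicitly

# translate() is vectorized over the set, so no per-sequence loop is needed here
aa <- translate(ungapped, if.fuzzy.codon = "solve")

# Drop residues that translate() could not resolve
aa_clean <- gsub("X", "", as.character(aa), fixed = TRUE)
```

Stop codons are still translated to '*' in this sketch and would need similar handling if the downstream program does not accept them. The cleaned strings in aa_clean could then be written out in the TANGO format with cat(), as in the loop above.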
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.60.2 GenomeInfoDb_1.28.1 XVector_0.32.0 IRanges_2.26.0 S4Vectors_0.30.0
[6] BiocGenerics_0.38.0 ggplot2_3.3.5 phylotools_0.2.2 ape_5.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 pillar_1.6.2 compiler_4.1.0 BiocManager_1.30.16
[5] bitops_1.0-7 tools_4.1.0 zlibbioc_1.38.0 lifecycle_1.0.0
[9] tibble_3.1.3 nlme_3.1-152 gtable_0.3.0 lattice_0.20-44
[13] pkgconfig_2.0.3 rlang_0.4.11 rstudioapi_0.13 GenomeInfoDbData_1.2.6
[17] withr_2.4.2 dplyr_1.0.7 generics_0.1.0 vctrs_0.3.8
[21] grid_4.1.0 tidyselect_1.1.1 glue_1.4.2 R6_2.5.0
[25] fansi_0.5.0 purrr_0.3.4 magrittr_2.0.1 scales_1.1.1
[29] ellipsis_0.3.2 colorspace_2.0-2 utf8_1.2.2 RCurl_1.98-1.3
[33] munsell_0.5.0 crayon_1.4.1