Bioconductor: Ambiguous Amino Acids & Gap Sequences
@8abec5f2

Dear All,

I have a large fasta.gz file (645,000 sequences) that I need to translate to amino acids (AA). I am having trouble with the translate() function because it does not seem to handle gaps ('---') and returns an error. I also need to remove 'X', because the downstream program does not recognize unidentified AA.

I would like to replace the gaps '---' with an empty character '' during translation, somewhat similar to if.fuzzy.codon = 'solve'. Ideally I would also replace ambiguous codons with '' instead of 'X' since ultimately I will have to remove any unknown AA from my input file.

I have searched for terms such as gap sequences, unknown AA, ambiguous AA, translate() documentation, and DNAString documentation, but have not been able to find a solution. I would appreciate any pointers or tips.


library(Biostrings)

# Create a DNAStringSet containing all 645k sequences and headers
orthologs = readDNAStringSet('protein_coding_orthologs_dna_cleaned.fasta.gz')
orthologs  # print a summary of the set

# Loop to translate DNA > AA and write each AA sequence in the input format for TANGO
# for (i in seq_along(orthologs))

for (i in 1:50) {
  # Set the name before tryCatch() so the error handler reports the current sequence
  name = names(orthologs)[i]
  tryCatch({
    aa = Biostrings::translate(orthologs[i], if.fuzzy.codon = "solve")
    cat(name, "N N 7 298 0.1", as.character(aa), "\n", file = 'tangoinput.txt', append = TRUE)
  }, error = function(e) {cat("ERROR:", name, conditionMessage(e), "\n")})
}

#ERROR: >header_name not a base at pos 2914 
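
One possible workaround, sketched here under assumptions rather than offered as a confirmed solution: strip the '-' characters before calling translate(), then drop any remaining 'X' from the translated sequences. This only preserves the reading frame if gaps always occur as whole, in-frame codons ('---'), which is assumed below. The helper name translate_ungapped and the toy sequence are hypothetical and used only for illustration.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): strip gaps, translate, drop 'X'.
translate_ungapped <- function(dna) {
  # Assumption: gaps are whole in-frame codons, so deleting '-' keeps the frame.
  ungapped <- DNAStringSet(gsub("-", "", as.character(dna), fixed = TRUE))
  names(ungapped) <- names(dna)

  # "solve" resolves ambiguous codons where possible and returns 'X' otherwise.
  aa <- translate(ungapped, if.fuzzy.codon = "solve")

  # Remove the unresolved residues, since the downstream tool rejects 'X'.
  cleaned <- AAStringSet(gsub("X", "", as.character(aa), fixed = TRUE))
  names(cleaned) <- names(dna)
  cleaned
}

# Toy input standing in for the real file: one gap codon, one solvable
# ambiguous codon (GCN) and one unsolvable codon (NNN).
toy <- DNAStringSet(c(seq1 = "ATG---GCNNNNGGA"))
aa <- translate_ungapped(toy)

# Build one TANGO-style line per sequence, for the whole set at once.
tango_lines <- paste(names(aa), "N N 7 298 0.1", as.character(aa))
writeLines(tango_lines)  # pass a file name as the second argument to write to disk
```

Because the helper works on the whole DNAStringSet in one call, the per-sequence loop and tryCatch() may not be needed for the full 645k sequences. Sequences whose gaps are not in frame would be silently shifted, so it may be worth checking alphabetFrequency(orthologs)[, "-"] %% 3 first.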

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.60.2   GenomeInfoDb_1.28.1 XVector_0.32.0      IRanges_2.26.0      S4Vectors_0.30.0   
[6] BiocGenerics_0.38.0 ggplot2_3.3.5       phylotools_0.2.2    ape_5.5            

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7             pillar_1.6.2           compiler_4.1.0         BiocManager_1.30.16   
 [5] bitops_1.0-7           tools_4.1.0            zlibbioc_1.38.0        lifecycle_1.0.0       
 [9] tibble_3.1.3           nlme_3.1-152           gtable_0.3.0           lattice_0.20-44       
[13] pkgconfig_2.0.3        rlang_0.4.11           rstudioapi_0.13        GenomeInfoDbData_1.2.6
[17] withr_2.4.2            dplyr_1.0.7            generics_0.1.0         vctrs_0.3.8           
[21] grid_4.1.0             tidyselect_1.1.1       glue_1.4.2             R6_2.5.0              
[25] fansi_0.5.0            purrr_0.3.4            magrittr_2.0.1         scales_1.1.1          
[29] ellipsis_0.3.2         colorspace_2.0-2       utf8_1.2.2             RCurl_1.98-1.3        
[33] munsell_0.5.0          crayon_1.4.1