Question

How to reverse complement some elements of a DNAStringSet

0

Entering edit mode

Christine Jones • 0

@254804fa

Last seen 14 months ago

United Kingdom

I have a DNAStringSet from which I need to reverse complement some of the sequences. I found a Biostars response to a very similar question https://www.biostars.org/p/405892/ but that used sub-setting to return only the reverse complemented sequences when I would like to retain all of the sequences. I assumed I could use lapply to vectorise an ifelse function but I can't quite figure out how to make it work.

> # example DNAStringSet
> myFasta <- DNAStringSet(c("CCGAAAACCATGATGGTTGCCAG", "TTACACTCTTTCCCTACACGACGCTC", "TTCCGATCTGTAAAACGACGGCCAGT"))
> # vector of which sequences I would like to be reverse complemented
> myFasta_rc <- c(TRUE, FALSE, TRUE)
> # attempt to lapply ifelse across the DNAStringSet
> lapply(myFasta, function(x) {ifelse(myFasta_rc[] == "TRUE", reverseComplement(myFasta[x]), myFasta[x])})
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'reverseComplement': invalid subscript


> sessionInfo( )
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.70.2                  
 [3] rtracklayer_1.62.0                BiocIO_1.12.0                    
 [5] lubridate_1.9.3                   forcats_1.0.0                    
 [7] stringr_1.5.1                     dplyr_1.1.4                      
 [9] purrr_1.0.2                       readr_2.1.5                      
[11] tidyr_1.3.1                       tibble_3.2.1                     
[13] ggplot2_3.5.1                     tidyverse_2.0.0                  
[15] ShortRead_1.60.0                  GenomicAlignments_1.38.2         
[17] SummarizedExperiment_1.32.0       Biobase_2.62.0                   
[19] MatrixGenerics_1.14.0             matrixStats_1.4.1                
[21] Rsamtools_2.18.0                  GenomicRanges_1.54.1             
[23] Biostrings_2.70.3                 GenomeInfoDb_1.38.8              
[25] XVector_0.42.0                    IRanges_2.36.0                   
[27] S4Vectors_0.40.2                  BiocGenerics_0.48.1              
[29] BiocManager_1.30.25               BiocParallel_1.36.0              

loaded via a namespace (and not attached):
 [1] gtable_0.3.5            rjson_0.2.23            hwriter_1.3.2.1        
 [4] latticeExtra_0.6-30     lattice_0.22-5          tzdb_0.4.0             
 [7] vctrs_0.6.5             tools_4.3.3             bitops_1.0-8           
[10] generics_0.1.3          parallel_4.3.3          fansi_1.0.6            
[13] pkgconfig_2.0.3         Matrix_1.6-5            RColorBrewer_1.1-3     
[16] lifecycle_1.0.4         GenomeInfoDbData_1.2.11 compiler_4.3.3         
[19] deldir_2.0-4            munsell_0.5.1           codetools_0.2-20       
[22] yaml_2.3.10             RCurl_1.98-1.16         pillar_1.9.0           
[25] crayon_1.5.3            DelayedArray_0.28.0     abind_1.4-8            
[28] tidyselect_1.2.1        stringi_1.8.4           restfulr_0.0.15        
[31] grid_4.3.3              colorspace_2.1-1        cli_3.6.3              
[34] SparseArray_1.2.4       magrittr_2.0.3          S4Arrays_1.2.1         
[37] XML_3.99-0.17           utf8_1.2.4              withr_3.0.1            
[40] scales_1.3.0            timechange_0.3.0        jpeg_0.1-10            
[43] interp_1.1-6            png_0.1-8               hms_1.1.3              
[46] rlang_1.1.4             Rcpp_1.0.13             glue_1.8.0             
[49] rstudioapi_0.16.0       R6_2.5.1                zlibbioc_1.48.2

biostr Biostrings • 1.2k views

ADD COMMENT • link written 15 months ago by Christine Jones • 0

score 1 · Accepted Answer · 2024-10-14

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 22 hours ago

United States

You should always try the vectorized method first, rather than looping.

> myFasta[myFasta_rc] <- reverseComplement(myFasta[myFasta_rc])
> myFasta
DNAStringSet object of length 3:
    width seq
[1]    23 CTGGCAACCATCATGGTTTTCGG
[2]    26 TTACACTCTTTCCCTACACGACGCTC
[3]    26 ACTGGCCGTCGTTTTACAGATCGGAA
>