How to reverse complement some elements of a DNAStringSet
1
0
Entering edit mode
@254804fa
Last seen 12 months ago
United Kingdom

I have a DNAStringSet from which I need to reverse complement some of the sequences. I found a Biostars response to a very similar question https://www.biostars.org/p/405892/ but that used sub-setting to return only the reverse complemented sequences when I would like to retain all of the sequences. I assumed I could use lapply to vectorise an ifelse function but I can't quite figure out how to make it work.

> # example DNAStringSet
> myFasta <- DNAStringSet(c("CCGAAAACCATGATGGTTGCCAG", "TTACACTCTTTCCCTACACGACGCTC", "TTCCGATCTGTAAAACGACGGCCAGT"))
> # vector of which sequences I would like to be reverse complemented
> myFasta_rc <- c(TRUE, FALSE, TRUE)
> # attempt to lapply ifelse across the DNAStringSet
> lapply(myFasta, function(x) {ifelse(myFasta_rc[] == "TRUE", reverseComplement(myFasta[x]), myFasta[x])})
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'reverseComplement': invalid subscript


> sessionInfo( )
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.70.2                  
 [3] rtracklayer_1.62.0                BiocIO_1.12.0                    
 [5] lubridate_1.9.3                   forcats_1.0.0                    
 [7] stringr_1.5.1                     dplyr_1.1.4                      
 [9] purrr_1.0.2                       readr_2.1.5                      
[11] tidyr_1.3.1                       tibble_3.2.1                     
[13] ggplot2_3.5.1                     tidyverse_2.0.0                  
[15] ShortRead_1.60.0                  GenomicAlignments_1.38.2         
[17] SummarizedExperiment_1.32.0       Biobase_2.62.0                   
[19] MatrixGenerics_1.14.0             matrixStats_1.4.1                
[21] Rsamtools_2.18.0                  GenomicRanges_1.54.1             
[23] Biostrings_2.70.3                 GenomeInfoDb_1.38.8              
[25] XVector_0.42.0                    IRanges_2.36.0                   
[27] S4Vectors_0.40.2                  BiocGenerics_0.48.1              
[29] BiocManager_1.30.25               BiocParallel_1.36.0              

loaded via a namespace (and not attached):
 [1] gtable_0.3.5            rjson_0.2.23            hwriter_1.3.2.1        
 [4] latticeExtra_0.6-30     lattice_0.22-5          tzdb_0.4.0             
 [7] vctrs_0.6.5             tools_4.3.3             bitops_1.0-8           
[10] generics_0.1.3          parallel_4.3.3          fansi_1.0.6            
[13] pkgconfig_2.0.3         Matrix_1.6-5            RColorBrewer_1.1-3     
[16] lifecycle_1.0.4         GenomeInfoDbData_1.2.11 compiler_4.3.3         
[19] deldir_2.0-4            munsell_0.5.1           codetools_0.2-20       
[22] yaml_2.3.10             RCurl_1.98-1.16         pillar_1.9.0           
[25] crayon_1.5.3            DelayedArray_0.28.0     abind_1.4-8            
[28] tidyselect_1.2.1        stringi_1.8.4           restfulr_0.0.15        
[31] grid_4.3.3              colorspace_2.1-1        cli_3.6.3              
[34] SparseArray_1.2.4       magrittr_2.0.3          S4Arrays_1.2.1         
[37] XML_3.99-0.17           utf8_1.2.4              withr_3.0.1            
[40] scales_1.3.0            timechange_0.3.0        jpeg_0.1-10            
[43] interp_1.1-6            png_0.1-8               hms_1.1.3              
[46] rlang_1.1.4             Rcpp_1.0.13             glue_1.8.0             
[49] rstudioapi_0.16.0       R6_2.5.1                zlibbioc_1.48.2
biostr Biostrings • 913 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

You should always try the vectorized method first, rather than looping.

> myFasta[myFasta_rc] <- reverseComplement(myFasta[myFasta_rc])
> myFasta
DNAStringSet object of length 3:
    width seq
[1]    23 CTGGCAACCATCATGGTTTTCGG
[2]    26 TTACACTCTTTCCCTACACGACGCTC
[3]    26 ACTGGCCGTCGTTTTACAGATCGGAA
>
ADD COMMENT
0
Entering edit mode

Fantastic, thank you

ADD REPLY
0
Entering edit mode

Dealing with DNAStringSet manipulation can be tricky! When selectively reverse complementing sequences, consider how to efficiently retain the original set. Instead of subsetting, perhaps explore indexing within your lapply function. Think of Omegle conversations: sometimes you want to filter responses, sometimes you need the whole thread. How can you apply that "whole thread" principle to maintain your complete DNAStringSet while modifying specific sequences? There might be a simpler vectorization solution hiding in plain sight.

ADD REPLY

Login before adding your answer.

Traffic: 849 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6