How to left_join a tidySummarizedExperiment with a GRanges object by seqnames and start (tidyverse-style)?
Kateřina • 0
@5b0a26b7

Hello, I'm working with a ranged tidySummarizedExperiment and have a separate GRanges object containing metadata that I'd like to integrate. I'd like to left_join() the two, matching by seqnames and start (the start values are identical in both objects).

Is there a tidyverse-native way to do this join (similar to joining two GRanges objects with plyranges), ideally without converting both the tidySummarizedExperiment and the GRanges object to tibbles and back again? I'd love to keep everything within the tidy grammar and not break the abstraction if possible. left_join() doesn't seem to work, even after converting the GRanges object to a tibble (probably because seqnames and start are view-only variables?), but maybe I'm just missing something.


library(SummarizedExperiment)
library(GenomicRanges)
library(tidyomics)
library(dplyr)

gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), width = 50),
  strand = c("+", "-", "+")
)

assay_mat <- matrix(1:9, ncol = 3)
colnames(assay_mat) <- c("Sample1", "Sample2", "Sample3")

se <- SummarizedExperiment(
  assays = list(counts = assay_mat),
  rowRanges = gr
)

gr_annot <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(100, 300), width = 1),
  strand = c("+", "+"),
  gene_name = c("GeneA", "GeneB")
)

gr_annot_tb <- se |> left_join(as_tibble(gr_annot))


Error in `join_function()`:
`by` must be supplied when `x` and `y` have no common variables.
Use `cross_join()` to perform a cross-join.
Run `rlang::last_trace()` to see where the error occurred.

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.1.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Prague
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] nullranges_1.12.0               plyranges_1.26.0               
 [3] tidybulk_1.18.0                 tidyseurat_0.8.0               
 [5] SeuratObject_5.0.2              sp_2.2-0                       
 [7] tidySingleCellExperiment_1.16.0 SingleCellExperiment_1.28.1    
 [9] tidySummarizedExperiment_1.16.0 ttservice_0.4.1                
[11] ggplot2_3.5.1                   tidyr_1.3.1                    
[13] tidyomics_1.2.0                 dplyr_1.1.4                    
[15] SummarizedExperiment_1.36.0     Biobase_2.66.0                 
[17] GenomicRanges_1.58.0            GenomeInfoDb_1.42.3            
[19] IRanges_2.40.1                  S4Vectors_0.44.0               
[21] BiocGenerics_0.52.0             MatrixGenerics_1.18.1          
[23] matrixStats_1.5.0              

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3       rstudioapi_0.17.1        jsonlite_1.9.1          
  [4] magrittr_2.0.3           spatstat.utils_3.1-3     farver_2.1.2            
  [7] BiocIO_1.16.0            zlibbioc_1.52.0          vctrs_0.6.5             
 [10] ROCR_1.0-11              Rsamtools_2.22.0         spatstat.explore_3.3-4  
 [13] RCurl_1.98-1.16          htmltools_0.5.8.1        S4Arrays_1.6.0          
 [16] curl_6.2.1               SparseArray_1.6.2        sctransform_0.4.1       
 [19] parallelly_1.42.0        KernSmooth_2.23-26       htmlwidgets_1.6.4       
 [22] ica_1.0-3                plyr_1.8.9               plotly_4.10.4           
 [25] zoo_1.8-13               GenomicAlignments_1.42.0 igraph_2.1.4            
 [28] mime_0.13                lifecycle_1.0.4          pkgconfig_2.0.3         
 [31] Matrix_1.7-3             R6_2.6.1                 fastmap_1.2.0           
 [34] GenomeInfoDbData_1.2.13  fitdistrplus_1.2-2       future_1.34.0           
 [37] shiny_1.10.0             digest_0.6.37            colorspace_2.1-1        
 [40] patchwork_1.3.0          Seurat_5.2.1             tensor_1.5              
 [43] RSpectra_0.16-2          irlba_2.3.5.1            progressr_0.15.1        
 [46] fansi_1.0.6              spatstat.sparse_3.1-0    httr_1.4.7              
 [49] polyclip_1.10-7          abind_1.4-8              compiler_4.4.2          
 [52] withr_3.0.2              BiocParallel_1.40.0      fastDummies_1.7.5       
 [55] MASS_7.3-65              DelayedArray_0.32.0      rjson_0.2.23            
 [58] tools_4.4.2              lmtest_0.9-40            httpuv_1.6.15           
 [61] future.apply_1.11.3      goftest_1.2-3            glue_1.8.0              
 [64] InteractionSet_1.34.0    restfulr_0.0.15          nlme_3.1-167            
 [67] promises_1.3.2           grid_4.4.2               Rtsne_0.17              
 [70] cluster_2.1.8.1          reshape2_1.4.4           generics_0.1.3          
 [73] gtable_0.3.6             spatstat.data_3.1-6      tzdb_0.5.0              
 [76] preprocessCore_1.68.0    hms_1.1.3                data.table_1.17.0       
 [79] utf8_1.2.4               XVector_0.46.0           spatstat.geom_3.3-5     
 [82] RcppAnnoy_0.0.22         ggrepel_0.9.6            RANN_2.6.2              
 [85] pillar_1.10.1            stringr_1.5.1            spam_2.11-1             
 [88] RcppHNSW_0.6.0           later_1.4.1              splines_4.4.2           
 [91] lattice_0.22-6           rtracklayer_1.66.0       survival_3.8-3          
 [94] deldir_2.0-4             tidyselect_1.2.1         Biostrings_2.74.1       
 [97] miniUI_0.1.1.1           pbapply_1.7-2            gridExtra_2.3           
[100] scattermore_1.2          stringi_1.8.4            UCSC.utils_1.2.0        
[103] yaml_2.3.10              lazyeval_0.2.2           codetools_0.2-20        
[106] tibble_3.2.1             cli_3.6.4                uwot_0.2.3              
[109] xtable_1.8-4             reticulate_1.41.0.1      munsell_0.5.1           
[112] Rcpp_1.0.14              spatstat.random_3.3-2    globals_0.16.3          
[115] png_0.1-8                XML_3.99-0.18            spatstat.univar_3.1-2   
[118] parallel_4.4.2           ellipsis_0.3.2           readr_2.1.5             
[121] dotCall64_1.2            bitops_1.0-9             listenv_0.9.1           
[124] viridisLite_0.4.2        scales_1.3.0             ggridges_0.5.6          
[127] purrr_1.0.4              crayon_1.5.3             rlang_1.1.5             
[130] cowplot_1.1.3
GenomicRanges tidySummarizedExperiment tidyomics • 1.8k views

Have you figured out how to do this using tidy grammar? Using non-tidy grammar is quite simple, and perhaps you already know how to do that. But if not, let us know. I can't help with tidy grammar, but I can help with conventional methods.


Unfortunately, I haven't. Non-tidy grammar is fine, but I was hoping to use tidy-only grammar. Thank you nonetheless!
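
For readers looking for the conventional (non-tidy) route mentioned in this exchange, here is a minimal sketch, not a recommendation from either commenter. It assumes an exact match on seqnames and start is what is wanted, and it builds a plain character key for each object; the names `key_se`, `key_annot` and `idx` are made up for illustration. The toy objects are the ones from the question.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges

# Toy objects copied from the question
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), width = 50),
  strand = c("+", "-", "+")
)
assay_mat <- matrix(1:9, ncol = 3)
colnames(assay_mat) <- c("Sample1", "Sample2", "Sample3")
se <- SummarizedExperiment(assays = list(counts = assay_mat), rowRanges = gr)

gr_annot <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(100, 300), width = 1),
  strand = c("+", "+"),
  gene_name = c("GeneA", "GeneB")
)

# Build a "seqnames:start" key for every row of the SE and of the annotation
# (hypothetical helper names, chosen for this sketch only)
key_se    <- paste(as.character(seqnames(rowRanges(se))), start(rowRanges(se)), sep = ":")
key_annot <- paste(as.character(seqnames(gr_annot)), start(gr_annot), sep = ":")

# match() returns NA where a row has no annotation, which gives left-join behaviour;
# if a key occurs more than once in gr_annot, only its first occurrence is used
idx <- match(key_se, key_annot)
rowData(se)$gene_name <- gr_annot$gene_name[idx]

rowData(se)
```

Strand is ignored here because the question only asks for seqnames and start; add it to the key if it should take part in the match.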

@mikelove

We are still working on adding tidyomics syntax to operate on the rowRanges of a ranged SE without breaking it off and using plyranges.

For now I think the approach would have to be join_overlap_left of the two ranges, followed by subsetting of the SE.

> rowRanges(se) |> anchor_5p() |> mutate(width=1) |> join_overlap_left(gr_annot)

      seqnames    ranges strand |   gene_name
  [1]     chr1       100      + |       GeneA
  [2]     chr1       249      - |        <NA>
  [3]     chr2       300      + |       GeneB

# ...then assign the gene_name column manually to rowData(se)
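
The last step above ("assign the gene_name column manually") could be sketched as follows. This is one possible way to do it under stated assumptions, not necessarily what the answer's author had in mind. The `row_id` column and the `rr` and `joined` variables are hypothetical names added for this sketch. They let the result be matched back to the SE rows, and guard against a range overlapping more than one annotation, in which case `join_overlap_left()` would return more rows than the SE has.

```r
library(SummarizedExperiment)
library(plyranges)

# Toy objects copied from the question
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), width = 50),
  strand = c("+", "-", "+")
)
assay_mat <- matrix(1:9, ncol = 3)
colnames(assay_mat) <- c("Sample1", "Sample2", "Sample3")
se <- SummarizedExperiment(assays = list(counts = assay_mat), rowRanges = gr)

gr_annot <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(100, 300), width = 1),
  strand = c("+", "+"),
  gene_name = c("GeneA", "GeneB")
)

# Tag each SE row with its index (hypothetical column, for this sketch only)
rr <- rowRanges(se)
rr$row_id <- seq_along(rr)

# Same pipeline as in the answer: shrink each range to its 5' position,
# then left-join the annotation by overlap
joined <- rr |>
  anchor_5p() |>
  mutate(width = 1) |>
  join_overlap_left(gr_annot)

# Stop if any SE row matched more than one annotation range
stopifnot(anyDuplicated(joined$row_id) == 0)

# Write gene_name back onto the SE, aligned by row index
rowData(se)$gene_name <- joined$gene_name[match(seq_len(nrow(se)), joined$row_id)]

rowData(se)
```

Note that `anchor_5p()` uses the end coordinate for minus-strand ranges (hence 249 in the output above). If the match should be on start regardless of strand, as the question describes, plyranges' `anchor_start()` could be swapped in.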