Question

How to left_join a tidySummarizedExperiment with a GRanges object by seqnames and start (tidyverse-style)?

Kateřina • 0

@5b0a26b7

Last seen 4 months ago

Czechia

Hello,

I'm working with a ranged tidySummarizedExperiment and have a separate GRanges object containing metadata I'd like to integrate. I want to left_join() the two, matching rows where seqnames and start are identical in both objects.

Is there a tidyverse-native way to do this (similar to joining two GRanges objects with plyranges), ideally without converting both the tidySummarizedExperiment and the GRanges object to tibbles and back again? I'd like to keep everything within the tidy grammar and not break the abstraction if possible.

Calling left_join() directly doesn't seem to work, even after converting the GRanges object to a tibble (probably because seqnames and start are view-only variables?), but maybe I'm just missing something.


library(SummarizedExperiment)
library(GenomicRanges)
library(tidyomics)
library(dplyr)

gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), width = 50),
  strand = c("+", "-", "+")
)

assay_mat <- matrix(1:9, ncol = 3)
colnames(assay_mat) <- c("Sample1", "Sample2", "Sample3")

se <- SummarizedExperiment(
  assays = list(counts = assay_mat),
  rowRanges = gr
)

gr_annot <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(100, 300), width = 1),
  strand = c("+", "+"),
  gene_name = c("GeneA", "GeneB")
)

gr_annot_tb <- se |> left_join(as_tibble(gr_annot))


Error in `join_function()`:
`by` must be supplied when `x` and `y` have no common variables.
Use `cross_join()` to perform a cross-join.
Run `rlang::last_trace()` to see where the error occurred.

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.1.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Prague
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] nullranges_1.12.0               plyranges_1.26.0               
 [3] tidybulk_1.18.0                 tidyseurat_0.8.0               
 [5] SeuratObject_5.0.2              sp_2.2-0                       
 [7] tidySingleCellExperiment_1.16.0 SingleCellExperiment_1.28.1    
 [9] tidySummarizedExperiment_1.16.0 ttservice_0.4.1                
[11] ggplot2_3.5.1                   tidyr_1.3.1                    
[13] tidyomics_1.2.0                 dplyr_1.1.4                    
[15] SummarizedExperiment_1.36.0     Biobase_2.66.0                 
[17] GenomicRanges_1.58.0            GenomeInfoDb_1.42.3            
[19] IRanges_2.40.1                  S4Vectors_0.44.0               
[21] BiocGenerics_0.52.0             MatrixGenerics_1.18.1          
[23] matrixStats_1.5.0              

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3       rstudioapi_0.17.1        jsonlite_1.9.1          
  [4] magrittr_2.0.3           spatstat.utils_3.1-3     farver_2.1.2            
  [7] BiocIO_1.16.0            zlibbioc_1.52.0          vctrs_0.6.5             
 [10] ROCR_1.0-11              Rsamtools_2.22.0         spatstat.explore_3.3-4  
 [13] RCurl_1.98-1.16          htmltools_0.5.8.1        S4Arrays_1.6.0          
 [16] curl_6.2.1               SparseArray_1.6.2        sctransform_0.4.1       
 [19] parallelly_1.42.0        KernSmooth_2.23-26       htmlwidgets_1.6.4       
 [22] ica_1.0-3                plyr_1.8.9               plotly_4.10.4           
 [25] zoo_1.8-13               GenomicAlignments_1.42.0 igraph_2.1.4            
 [28] mime_0.13                lifecycle_1.0.4          pkgconfig_2.0.3         
 [31] Matrix_1.7-3             R6_2.6.1                 fastmap_1.2.0           
 [34] GenomeInfoDbData_1.2.13  fitdistrplus_1.2-2       future_1.34.0           
 [37] shiny_1.10.0             digest_0.6.37            colorspace_2.1-1        
 [40] patchwork_1.3.0          Seurat_5.2.1             tensor_1.5              
 [43] RSpectra_0.16-2          irlba_2.3.5.1            progressr_0.15.1        
 [46] fansi_1.0.6              spatstat.sparse_3.1-0    httr_1.4.7              
 [49] polyclip_1.10-7          abind_1.4-8              compiler_4.4.2          
 [52] withr_3.0.2              BiocParallel_1.40.0      fastDummies_1.7.5       
 [55] MASS_7.3-65              DelayedArray_0.32.0      rjson_0.2.23            
 [58] tools_4.4.2              lmtest_0.9-40            httpuv_1.6.15           
 [61] future.apply_1.11.3      goftest_1.2-3            glue_1.8.0              
 [64] InteractionSet_1.34.0    restfulr_0.0.15          nlme_3.1-167            
 [67] promises_1.3.2           grid_4.4.2               Rtsne_0.17              
 [70] cluster_2.1.8.1          reshape2_1.4.4           generics_0.1.3          
 [73] gtable_0.3.6             spatstat.data_3.1-6      tzdb_0.5.0              
 [76] preprocessCore_1.68.0    hms_1.1.3                data.table_1.17.0       
 [79] utf8_1.2.4               XVector_0.46.0           spatstat.geom_3.3-5     
 [82] RcppAnnoy_0.0.22         ggrepel_0.9.6            RANN_2.6.2              
 [85] pillar_1.10.1            stringr_1.5.1            spam_2.11-1             
 [88] RcppHNSW_0.6.0           later_1.4.1              splines_4.4.2           
 [91] lattice_0.22-6           rtracklayer_1.66.0       survival_3.8-3          
 [94] deldir_2.0-4             tidyselect_1.2.1         Biostrings_2.74.1       
 [97] miniUI_0.1.1.1           pbapply_1.7-2            gridExtra_2.3           
[100] scattermore_1.2          stringi_1.8.4            UCSC.utils_1.2.0        
[103] yaml_2.3.10              lazyeval_0.2.2           codetools_0.2-20        
[106] tibble_3.2.1             cli_3.6.4                uwot_0.2.3              
[109] xtable_1.8-4             reticulate_1.41.0.1      munsell_0.5.1           
[112] Rcpp_1.0.14              spatstat.random_3.3-2    globals_0.16.3          
[115] png_0.1-8                XML_3.99-0.18            spatstat.univar_3.1-2   
[118] parallel_4.4.2           ellipsis_0.3.2           readr_2.1.5             
[121] dotCall64_1.2            bitops_1.0-9             listenv_0.9.1           
[124] viridisLite_0.4.2        scales_1.3.0             ggridges_0.5.6          
[127] purrr_1.0.4              crayon_1.5.3             rlang_1.1.5             
[130] cowplot_1.1.3

GenomicRanges tidySummarizedExperiment tidyomics • 2.2k views

Kateřina • 9 months ago • updated 4 months ago

Have you figured out how to do this using tidy grammar? Using non-tidy grammar is quite simple, and perhaps you already know how to do that. But if not, let us know. I can't help with tidy grammar, but I can help with conventional methods.

James W. MacDonald • 9 months ago

Unfortunately, I haven't. Non-tidy grammar is fine, but I was hoping to stay within tidy grammar only. Thank you nonetheless!

Kateřina • 8 months ago

score 1 · Answer 1 · 2025-05-31

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

We are still working on adding tidyomics syntax to operate on the rowRanges of a ranged SE without breaking it off and using plyranges.

For now, I think the approach has to be a join_overlap_left() of the two sets of ranges with plyranges, followed by assigning the result back to the SE.

> rowRanges(se) |> anchor_5p() |> mutate(width=1) |> join_overlap_left(gr_annot)

      seqnames    ranges strand |   gene_name
  [1]     chr1       100      + |       GeneA
  [2]     chr1       249      - |        <NA>
  [3]     chr2       300      + |       GeneB

# ...then assign the gene_name column manually to rowData(se)
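
The last step could look roughly like the sketch below. It rebuilds `se` and `gr_annot` from the question so that it runs on its own. The `row_id` helper column is an assumption added for illustration, not part of the original answer: it lets the joined result be mapped back to SE rows by key rather than by position, in case a row overlaps more than one annotation range.

```r
library(SummarizedExperiment)
library(plyranges)

# Objects from the question, rebuilt so the sketch is self-contained
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), width = 50),
  strand = c("+", "-", "+")
)
assay_mat <- matrix(1:9, ncol = 3)
colnames(assay_mat) <- c("Sample1", "Sample2", "Sample3")
se <- SummarizedExperiment(assays = list(counts = assay_mat), rowRanges = gr)

gr_annot <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(100, 300), width = 1),
  strand = c("+", "+"),
  gene_name = c("GeneA", "GeneB")
)

# Tag each row range with its row index (hypothetical helper column)
rr <- rowRanges(se)
rr$row_id <- seq_along(rr)

# Shrink each range to its 5' base, then left-join the annotation by overlap
joined <- rr |>
  anchor_5p() |>
  mutate(width = 1) |>
  join_overlap_left(gr_annot)

# Map back by row_id; match() keeps the first hit if a row matched several
# annotation ranges, and unmatched rows keep NA
idx <- match(seq_len(nrow(se)), joined$row_id)
rowData(se)$gene_name <- joined$gene_name[idx]

rowData(se)$gene_name
# "GeneA" NA "GeneB"
```

Note that anchor_5p() uses the end coordinate for minus-strand ranges (hence 249 in the output above). If the match really should be on start regardless of strand, swapping in anchor_start() is probably closer to what the question asks for. join_overlap_left() also ignores strand by default; plyranges provides join_overlap_left_directed() if strand should be respected.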