I have two GRanges objects: (1) TFBS and (2) GENEMODELS
# (1) TFBS: Transcription factor binding sites (for SRF)
bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis',
'/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
TFBS <- rtracklayer::import.bed(bedfile, genome = 'mm10')
TFBS
GRanges object with 1974 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr2 83972890-83972905 + | SRF_MA0083.3 8.61532
[1974] chr4 106237089-106237104 + | SRF_MA0083.3 9.7378
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
# (2) GENEMODELS: TxDb turned GRanges
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
GENEMODELS <- GenomicFeatures::genes(txdb)
GENEMODELS
GRanges object with 39017 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSMUSG00000000001 chr3 108107280-108146146 - | ENSMUSG00000000001
ENSMUSG00000099334 chr4 74635214-74635351 + | ENSMUSG00000099334
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
Left-joining TFBS and GENEMODELS annotates TFBS.
seqInfo
gets messed up, probably due to an additional seqlevel "." being added (how to fix this?)
But apart from that, the left-join itself is successful.
ANNOTATEDTFBS <- plyranges::join_overlap_left(TFBS, GENEMODELS)
ANNOTATEDTFBS
GRanges object with 2040 ranges and 3 metadata columns:
seqnames ranges strand | name score gene_id
<Rle> <IRanges> <Rle> | <character> <numeric> <character>
[1] chr1 4712628-4712643 - | SRF_MA0083.3 10.49542 <NA>
[2040] chrY 89126494-89126509 - | SRF_MA0083.3 4.54393 <NA>
-------
seqinfo: 67 sequences (1 circular) from 2 genomes (NA, mm10)
GenomeInfoDb::seqlevels(ANNOTATEDTFBS)
[1] "." "chr1" "chr2" "chr3" "chr4" etc.
Some ANNOTATEDTFBS ranges get duplicated, since some sites have multiple associated genes.
I want to have dimensions of TFBS and ANNOTATEDTFBS identical, by paste(collapse=';')-ing such cases.
plyranges::group_by
seems suited, but freezes R.
An alternative data.table-based split-apply-combine strategy works well:
# plyranges::group_by freezes R - don't execute
# plyranges::group_by(ANNOTATEDTFBS, seqnames, start, end, strand)
# data.table-based group-apply-combine works well
DT <- data.table::as.data.table(ANNOTATEDTFBS)
DT [ !is.na(gene_id), # select only non-NA gene_ids
gene_id := paste0(gene_id, collapse = ';'), # Paste-collapse
by = c('seqnames', 'start', 'end', 'strand') ] # Per seqnames-start-end-strand group
ANNOTATEDTFBS <- as(unique(DT), 'GRanges')
Feedback requests: (1) Is it possible to fix the seqInfo
messup? (2) Would it be useful for plyranges
to provide a data.table
-based GRanges
split-apply-combine functionality? (discussion on the second point to be continued on bioc-devel or github).
I don't know whether there is any equivalent in the dplyr world, but it might be useful to have a function that rather than expand the left object instead splits the right object and attaches it as a new metadata column. I was unable to load the data file in your example (possible corruption?), but If I were doing this directly in Bioconductor, it would be something like:
Thank you Michael, Just updated my code using this idiom.
Hmm, no, I have to stay with
plyranges::join_overlap_left
since I would like to include the other annotation columns as wellHi Aditya, I've fixed the left overlap join modifying the seqinfo in the current and devel versions of plyranges. I'm currently working on a big overhaul of how plyranges does grouping so will chime back in to give you a proper answer to your query soon!
Thank you, Stuart :-).