Question

Annotating GRanges with TxDb gene models using plyranges

1

Entering edit mode

Aditya ▴ 160

@aditya-7667

Last seen 3.7 years ago

Germany

I have two GRanges objects: (1) TFBS and (2) GENEMODELS

# (1) TFBS: Transcription factor binding sites (for SRF)
    bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis', 
                      '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
    TFBS <- rtracklayer::import.bed(bedfile, genome = 'mm10')
    TFBS
           GRanges object with 1974 ranges and 2 metadata columns:
                 seqnames           ranges       strand |         name     score
                   <Rle>           <IRanges>      <Rle> |  <character> <numeric>
              [1]     chr2   83972890-83972905      +   | SRF_MA0083.3   8.61532
           [1974]     chr4 106237089-106237104      +   | SRF_MA0083.3    9.7378
          -------
          seqinfo: 66 sequences (1 circular) from mm10 genome    

# (2) GENEMODELS: TxDb turned GRanges
    txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
    GENEMODELS <- GenomicFeatures::genes(txdb)
    GENEMODELS
           GRanges object with 39017 ranges and 1 metadata column:
                               seqnames              ranges strand |            gene_id
                                  <Rle>           <IRanges>  <Rle> |        <character>
            ENSMUSG00000000001     chr3 108107280-108146146      - | ENSMUSG00000000001
            ENSMUSG00000099334     chr4   74635214-74635351      + | ENSMUSG00000099334
            -------
            seqinfo: 66 sequences (1 circular) from mm10 genome

Left-joining TFBS and GENEMODELS annotates TFBS. seqInfo gets messed up, probably due to an additional seqlevel "." being added (how to fix this?) But apart from that, the left-join itself is successful.

  ANNOTATEDTFBS <- plyranges::join_overlap_left(TFBS, GENEMODELS)
  ANNOTATEDTFBS
            GRanges object with 2040 ranges and 3 metadata columns:
                   seqnames            ranges strand |         name     score     gene_id
                      <Rle>         <IRanges>  <Rle> |  <character> <numeric> <character>
               [1]     chr1   4712628-4712643      - | SRF_MA0083.3  10.49542        <NA>
            [2040]     chrY 89126494-89126509      - | SRF_MA0083.3   4.54393        <NA>
            -------
            seqinfo: 67 sequences (1 circular) from 2 genomes (NA, mm10)

  GenomeInfoDb::seqlevels(ANNOTATEDTFBS)
   [1] "."    "chr1"    "chr2"    "chr3"    "chr4"    etc.

Some ANNOTATEDTFBS ranges get duplicated, since some sites have multiple associated genes. I want to have dimensions of TFBS and ANNOTATEDTFBS identical, by paste(collapse=';')-ing such cases. plyranges::group_by seems suited, but freezes R. An alternative data.table-based split-apply-combine strategy works well:

# plyranges::group_by freezes R - don't execute
# plyranges::group_by(ANNOTATEDTFBS, seqnames, start, end, strand)

# data.table-based group-apply-combine works well
DT <- data.table::as.data.table(ANNOTATEDTFBS)
DT [ !is.na(gene_id),                                                         # select only non-NA gene_ids
     gene_id := paste0(gene_id, collapse = ';'),                  # Paste-collapse
     by = c('seqnames', 'start', 'end', 'strand') ]                    # Per seqnames-start-end-strand group
ANNOTATEDTFBS <- as(unique(DT), 'GRanges')

Feedback requests: (1) Is it possible to fix the seqInfo messup? (2) Would it be useful for plyranges to provide a data.table-based GRanges split-apply-combine functionality? (discussion on the second point to be continued on bioc-devel or github).

plyranges • 2.8k views

ADD COMMENT • link 6.4 years ago Aditya ▴ 160

0

Entering edit mode

I don't know whether there is any equivalent in the dplyr world, but it might be useful to have a function that rather than expand the left object instead splits the right object and attaches it as a new metadata column. I was unable to load the data file in your example (possible corruption?), but If I were doing this directly in Bioconductor, it would be something like:

hits <- findOverlaps(TFBS, GENEMODELS)
TFBS$gene_id <- unstrsplit(extractList(GENEMODELS$gene_id, as(hits, "List")), ";")

ADD REPLY • link 6.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thank you Michael, Just updated my code using this idiom.

ADD REPLY • link 6.3 years ago Aditya ▴ 160

0

Entering edit mode

Hmm, no, I have to stay with plyranges::join_overlap_left since I would like to include the other annotation columns as well

ADD REPLY • link 6.3 years ago Aditya ▴ 160

0

Entering edit mode

Hi Aditya, I've fixed the left overlap join modifying the seqinfo in the current and devel versions of plyranges. I'm currently working on a big overhaul of how plyranges does grouping so will chime back in to give you a proper answer to your query soon!