Question: Annotating GRanges with TxDb gene models using plyranges
0
gravatar for Aditya
5 weeks ago by
Aditya120
Germany
Aditya120 wrote:

I have two GRanges objects: (1) TFBS and (2) GENEMODELS

# (1) TFBS: Transcription factor binding sites (for SRF)
    bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis', 
                      '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
    TFBS <- rtracklayer::import.bed(bedfile, genome = 'mm10')
    TFBS
           GRanges object with 1974 ranges and 2 metadata columns:
                 seqnames           ranges       strand |         name     score
                   <Rle>           <IRanges>      <Rle> |  <character> <numeric>
              [1]     chr2   83972890-83972905      +   | SRF_MA0083.3   8.61532
           [1974]     chr4 106237089-106237104      +   | SRF_MA0083.3    9.7378
          -------
          seqinfo: 66 sequences (1 circular) from mm10 genome    

# (2) GENEMODELS: TxDb turned GRanges
    txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
    GENEMODELS <- GenomicFeatures::genes(txdb)
    GENEMODELS
           GRanges object with 39017 ranges and 1 metadata column:
                               seqnames              ranges strand |            gene_id
                                  <Rle>           <IRanges>  <Rle> |        <character>
            ENSMUSG00000000001     chr3 108107280-108146146      - | ENSMUSG00000000001
            ENSMUSG00000099334     chr4   74635214-74635351      + | ENSMUSG00000099334
            -------
            seqinfo: 66 sequences (1 circular) from mm10 genome

Left-joining TFBS and GENEMODELS annotates TFBS. seqInfo gets messed up, probably due to an additional seqlevel "." being added (how to fix this?) But apart from that, the left-join itself is successful.

  ANNOTATEDTFBS <- plyranges::join_overlap_left(TFBS, GENEMODELS)
  ANNOTATEDTFBS
            GRanges object with 2040 ranges and 3 metadata columns:
                   seqnames            ranges strand |         name     score     gene_id
                      <Rle>         <IRanges>  <Rle> |  <character> <numeric> <character>
               [1]     chr1   4712628-4712643      - | SRF_MA0083.3  10.49542        <NA>
            [2040]     chrY 89126494-89126509      - | SRF_MA0083.3   4.54393        <NA>
            -------
            seqinfo: 67 sequences (1 circular) from 2 genomes (NA, mm10)

  GenomeInfoDb::seqlevels(ANNOTATEDTFBS)
   [1] "."    "chr1"    "chr2"    "chr3"    "chr4"    etc.

Some ANNOTATEDTFBS ranges get duplicated, since some sites have multiple associated genes. I want to have dimensions of TFBS and ANNOTATEDTFBS identical, by paste(collapse=';')-ing such cases. plyranges::group_by seems suited, but freezes R. An alternative data.table-based split-apply-combine strategy works well:

# plyranges::group_by freezes R - don't execute
# plyranges::group_by(ANNOTATEDTFBS, seqnames, start, end, strand)

# data.table-based group-apply-combine works well
DT <- data.table::as.data.table(ANNOTATEDTFBS)
DT [ !is.na(gene_id),                                                         # select only non-NA gene_ids
     gene_id := paste0(gene_id, collapse = ';'),                  # Paste-collapse
     by = c('seqnames', 'start', 'end', 'strand') ]                    # Per seqnames-start-end-strand group
ANNOTATEDTFBS <- as(unique(DT), 'GRanges')

Feedback requests: (1) Is it possible to fix the seqInfo messup? (2) Would it be useful for plyranges to provide a data.table-based GRanges split-apply-combine functionality? (discussion on the second point to be continued on bioc-devel or github).

plyranges • 169 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Aditya120

I don't know whether there is any equivalent in the dplyr world, but it might be useful to have a function that rather than expand the left object instead splits the right object and attaches it as a new metadata column. I was unable to load the data file in your example (possible corruption?), but If I were doing this directly in Bioconductor, it would be something like:

hits <- findOverlaps(TFBS, GENEMODELS)
TFBS$gene_id <- unstrsplit(extractList(GENEMODELS$gene_id, as(hits, "List")), ";")
ADD REPLYlink written 5 weeks ago by Michael Lawrence11k

Thank you Michael, Just updated my code using this idiom.

ADD REPLYlink written 27 days ago by Aditya120

Hmm, no, I have to stay with plyranges::join_overlap_left since I would like to include the other annotation columns as well

ADD REPLYlink modified 27 days ago • written 27 days ago by Aditya120

Hi Aditya, I've fixed the left overlap join modifying the seqinfo in the current and devel versions of plyranges. I'm currently working on a big overhaul of how plyranges does grouping so will chime back in to give you a proper answer to your query soon!

ADD REPLYlink written 14 days ago by lee.s70

Thank you, Stuart :-).

ADD REPLYlink written 14 days ago by Aditya120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 284 users visited in the last hour