Annotating GRanges with TxDb gene models using plyranges
0
1
Entering edit mode
Aditya ▴ 160
@aditya-7667
Last seen 21 months ago
Germany

I have two GRanges objects: (1) TFBS and (2) GENEMODELS

# (1) TFBS: Transcription factor binding sites (for SRF)
    bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis', 
                      '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
    TFBS <- rtracklayer::import.bed(bedfile, genome = 'mm10')
    TFBS
           GRanges object with 1974 ranges and 2 metadata columns:
                 seqnames           ranges       strand |         name     score
                   <Rle>           <IRanges>      <Rle> |  <character> <numeric>
              [1]     chr2   83972890-83972905      +   | SRF_MA0083.3   8.61532
           [1974]     chr4 106237089-106237104      +   | SRF_MA0083.3    9.7378
          -------
          seqinfo: 66 sequences (1 circular) from mm10 genome    

# (2) GENEMODELS: TxDb turned GRanges
    txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
    GENEMODELS <- GenomicFeatures::genes(txdb)
    GENEMODELS
           GRanges object with 39017 ranges and 1 metadata column:
                               seqnames              ranges strand |            gene_id
                                  <Rle>           <IRanges>  <Rle> |        <character>
            ENSMUSG00000000001     chr3 108107280-108146146      - | ENSMUSG00000000001
            ENSMUSG00000099334     chr4   74635214-74635351      + | ENSMUSG00000099334
            -------
            seqinfo: 66 sequences (1 circular) from mm10 genome

Left-joining TFBS and GENEMODELS annotates TFBS. seqInfo gets messed up, probably due to an additional seqlevel "." being added (how to fix this?) But apart from that, the left-join itself is successful.

  ANNOTATEDTFBS <- plyranges::join_overlap_left(TFBS, GENEMODELS)
  ANNOTATEDTFBS
            GRanges object with 2040 ranges and 3 metadata columns:
                   seqnames            ranges strand |         name     score     gene_id
                      <Rle>         <IRanges>  <Rle> |  <character> <numeric> <character>
               [1]     chr1   4712628-4712643      - | SRF_MA0083.3  10.49542        <NA>
            [2040]     chrY 89126494-89126509      - | SRF_MA0083.3   4.54393        <NA>
            -------
            seqinfo: 67 sequences (1 circular) from 2 genomes (NA, mm10)

  GenomeInfoDb::seqlevels(ANNOTATEDTFBS)
   [1] "."    "chr1"    "chr2"    "chr3"    "chr4"    etc.

Some ANNOTATEDTFBS ranges get duplicated, since some sites have multiple associated genes. I want to have dimensions of TFBS and ANNOTATEDTFBS identical, by paste(collapse=';')-ing such cases. plyranges::group_by seems suited, but freezes R. An alternative data.table-based split-apply-combine strategy works well:

# plyranges::group_by freezes R - don't execute
# plyranges::group_by(ANNOTATEDTFBS, seqnames, start, end, strand)

# data.table-based group-apply-combine works well
DT <- data.table::as.data.table(ANNOTATEDTFBS)
DT [ !is.na(gene_id),                                                         # select only non-NA gene_ids
     gene_id := paste0(gene_id, collapse = ';'),                  # Paste-collapse
     by = c('seqnames', 'start', 'end', 'strand') ]                    # Per seqnames-start-end-strand group
ANNOTATEDTFBS <- as(unique(DT), 'GRanges')

Feedback requests: (1) Is it possible to fix the seqInfo messup? (2) Would it be useful for plyranges to provide a data.table-based GRanges split-apply-combine functionality? (discussion on the second point to be continued on bioc-devel or github).

plyranges • 1.7k views
ADD COMMENT
0
Entering edit mode

I don't know whether there is any equivalent in the dplyr world, but it might be useful to have a function that rather than expand the left object instead splits the right object and attaches it as a new metadata column. I was unable to load the data file in your example (possible corruption?), but If I were doing this directly in Bioconductor, it would be something like:

hits <- findOverlaps(TFBS, GENEMODELS)
TFBS$gene_id <- unstrsplit(extractList(GENEMODELS$gene_id, as(hits, "List")), ";")
ADD REPLY
0
Entering edit mode

Thank you Michael, Just updated my code using this idiom.

ADD REPLY
0
Entering edit mode

Hmm, no, I have to stay with plyranges::join_overlap_left since I would like to include the other annotation columns as well

ADD REPLY
0
Entering edit mode

Hi Aditya, I've fixed the left overlap join modifying the seqinfo in the current and devel versions of plyranges. I'm currently working on a big overhaul of how plyranges does grouping so will chime back in to give you a proper answer to your query soon!

ADD REPLY
0
Entering edit mode

Thank you, Stuart :-).

ADD REPLY

Login before adding your answer.

Traffic: 776 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6