In order to detect head-to-head promoters I am looking for a good way to detect head-to-head genomic ranges, like B and C in the example below.
                         ------C---->  
     <---A---  <---B---          <----D----
They are sometimes also called divergent or bidirectional and are of biological interests. Here are two examples from the littereature, With CAGE data they are found in transcribed enhancer regions (Andersson 2014). In a GRO-seq analysis (Tome 2018), they have been categorised as "divergent pairs" of TSS when they are less than 300 bp apart.
I am tempted to implement head-to-head promoter detection in CAGEr. At the moment I found the way that I describe below, but I wonder if:
- This is already done in a Bioconductor packages that I can depend on ?
- There is a more efficient or elegant way to do it ?
Here is how I would do at the moment.
As a source of GRanges an example, I am taking a GRanges object from the AnnotationHub. Keep only genes, their names and their biotype to simplify example output.
ah = AnnotationHub::AnnotationHub()
gtf <- ah[["AH64858"]]
gr <- gtf[gtf$type == "gene"]
mcols(gr) <- DataFrame(gr$gene_id, gr$gene_name, gr$gene_biotype)
gr$source <- gr$phase <- gr$score <- gr$type <- gr$gene_version <- NULL
Index the ranges by their order.
names(gr) <- 1:length(gr)
Extract their 5-prime ends, copy the index, then sort.
p <- GPos(promoters(gr, 0, 1))
names(p) <- names(gr)
p <- sort(p, ignore.strand = TRUE)
Stitch the next range on the same line as the current one. Remove the pairs that are on different chromosomes.
p$Next <- c(p[-1], p[1])
p <- p[seqnames(p) == seqnames(p$Next)]
Find which pairs have a plus / minus (head to head) orientation.
p$strandStrand <- paste(strand(p), strand(p$Next), sep = "")
plusMinus <- which(p$strandStrand == "-+")
hh <- p[plusMinus]
Use the indexes to retrieve the original ranges.
i1 <- names(hh)
i2 <- names(hh$Next)
sort(c(gr[i1], gr[i2]), ignore.strand=TRUE)
## GRanges object with 9384 ranges and 3 metadata columns:
##           seqnames        ranges strand |         gr.gene_id      gr.gene_name
##              <Rle>     <IRanges>  <Rle> |        <character>       <character>
##       6          1 381289-396363      - | ENSTRUG00000009462             cox10
##       7          1 425325-445021      + | ENSTRUG00000009368 si:ch211-216b21.2
##       8          1 488176-495178      - | ENSTRUG00000009358              <NA>
##      11          1 509902-520076      + | ENSTRUG00000009053              sgca
##      14          1 551027-551347      - | ENSTRUG00000007292              <NA>
##     ...        ...           ...    ... .                ...               ...
##   20921 HE596567.1     3463-3571      + | ENSTRUG00000024526           RF00001
##   21029 HE597694.1        45-776      - | ENSTRUG00000019592              <NA>
##   21030 HE597694.1     1132-2168      + | ENSTRUG00000024912              <NA>
##   18947 HE598556.1     1435-2893      - | ENSTRUG00000001164          prss59.1
##   18948 HE598556.1    7000-11144      + | ENSTRUG00000022925            mrpl17
##         gr.gene_biotype
##             <character>
##       6  protein_coding
##       7  protein_coding
##       8  protein_coding
##      11  protein_coding
##      14  protein_coding
##     ...             ...
##   20921            rRNA
##   21029  protein_coding
##   21030  protein_coding
##   18947  protein_coding
##   18948  protein_coding
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths
Store the head to head pair information in an object that represents span of the head to head pair and provides the coordinates of the gap within. Keep only the pairs that are separated by less than 50 nt.
HH <- GRanges(seqnames(gr[i1]), IRanges(start(gr[i1]), end(gr[i2])))
names(HH) <- names(hh)
HH$interval <- GRanges(seqnames(hh), IRanges(pos(hh), pos(hh$Next)))
within50 <- width(HH$interval) < 50
sort(c(gr[i1][within50], gr[i2][within50]), ignore.strand=TRUE)
## GRanges object with 136 ranges and 3 metadata columns:
##           seqnames            ranges strand |         gr.gene_id
##              <Rle>         <IRanges>  <Rle> |        <character>
##     467          1   7758838-7765115      - | ENSTRUG00000020313
##     468          1   7765148-7774656      + | ENSTRUG00000025597
##     957          1 15182345-15189325      - | ENSTRUG00000019879
##     958          1 15189333-15194992      + | ENSTRUG00000017600
##    1329          1 22606072-22608556      - | ENSTRUG00000020162
##     ...        ...               ...    ... .                ...
##   18436 HE591882.1       41358-43374      + | ENSTRUG00000025239
##   18721 HE591938.1         3285-3398      - | ENSTRUG00000025173
##   18722 HE591938.1         3402-8100      + | ENSTRUG00000024399
##   19233 HE592076.1         6177-9784      - | ENSTRUG00000002645
##   19234 HE592076.1        9831-14047      + | ENSTRUG00000003176
##              gr.gene_name gr.gene_biotype
##               <character>     <character>
##     467            znf668  protein_coding
##     468            znf646  protein_coding
##     957            Scrn3a  protein_coding
##     958              CIR1  protein_coding
##    1329             maip1  protein_coding
##     ...               ...             ...
##   18436              lsm1  protein_coding
##   18721           RF00020           snRNA
##   18722 NXPH4 (1 of many)  protein_coding
##   19233              <NA>  protein_coding
##   19234              tlr2  protein_coding
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths

Hi Charles, can you please define what you mean by "head to head ranges"? I don't want to have to go thru the algorithm you're showing above to try and guess. I've learned that it's a lot more productive to spend some time defining exactly "what" we want to achieve before we work on the "how". Thanks!
Thanks ! I edited the question to add background information.