Question: Obtain coordinates of 5' upstream region up to closest coding region
gravatar for eggrandio
3 months ago by
eggrandio0 wrote:


I am doing an analysis of transcription factor binding and I need to retrieve the promoter region of all the genes in the genome. I can obtain 2kb upstream easily from a TxDB object by:

gnflanks = flank(genes(TxDb.Athaliana.BioMart.plantsmart28), width=2000)

However, some regions overlap with coding regions upstream (other genes).

I would like to obtain 2kb of the upstream sequence up to the nearest coding region (2kb if no overlaping gene).

I know how to subtract the coding regions from the promoters, but if there is still some "promoter" region upstream of the upstream gene, it will be retained.


upstream region:


overlapping gene:


If I remove it, I am left with:

=====      =============(TSS)gene

And I want to retrieve only:


Thanks in advance!

granges R promoter • 146 views
ADD COMMENTlink modified 3 months ago by Hervé Pagès ♦♦ 14k • written 3 months ago by eggrandio0
Answer: Obtain coordinates of 5' upstream region up to closest coding region
gravatar for Hervé Pagès
3 months ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:


So IIUC, for each gene you want the longest upstream region that is <= 2kb and does not overlap with any known CDS. Or, said otherwise, you want to minimally trim (on the 5' side) the upstream regions in gnflanks so that they are CDS-free. Note that this trimming will sometimes result in an empty upstream region. This will happen for genes for which the upstream position immediately adjacent to the gene is already in a known CDS (supposedly from another gene). Maybe that's a very rare and/or unrealistic occurrence but for the sake of robustness our code should handle it. The way it will handle it is by shrinking the upstream region to a zero-width range.

A naive but not very efficient way to do the above is:

cds <- cds(TxDb.Athaliana.BioMart.plantsmart28)

## This lapply() loop takes about 40 min on my laptop!
gnflanks2 <- lapply(seq_along(gnflanks),
    function(i) {
        gnflank <- gnflanks[i]
        upstream_ranges <- setdiff(gnflank, cds)
        if (as.logical(strand(gnflank) == "+")) {
            ## We're on the plus strand
            upstream_range <- tail(upstream_ranges, n=1)
            if (length(upstream_range) == 0 ||
                end(upstream_range) != end(gnflank))
                ## Shrink to zero-width range
                upstream_range <- gnflank
                start(upstream_range) <- end(upstream_range) + 1
        } else {
            ## We're on the minus strand
            upstream_range <- head(upstream_ranges, n=1)
            if (length(upstream_range) == 0 ||
                start(upstream_range) != start(gnflank))
                ## Shrink to zero-width range
                upstream_range <- gnflank
                end(upstream_range) <- start(upstream_range) - 1

gnflanks2 <-"c", gnflanks2)
names(gnflanks2) <- names(gnflanks)  # propagate the gene ids

gnflanks2 is a GRanges object parallel to GRanges object gnflanks, that is, the 2 objects have the same length and the i-th range in one object corresponds to the i-th range in the other. Furthermore, each range in gnflanks2 is either the same as the corresponding range in gnflanks (if no CDS got in the way), or a trimmed version of it (if one or more CDS got in the way).

Lightly tested only. Hope this helps.


ADD COMMENTlink modified 3 months ago • written 3 months ago by Hervé Pagès ♦♦ 14k

And FWIW here is the fast version of it:

trimFivePrimeSide <- function(gr, remove)
    stopifnot(is(gr, "GenomicRanges"), is(remove, "GenomicRanges"))

    mapping <- as(findOverlaps(gr, remove), "IntegerList")

    new_start_on_plus_strand <- pmin(max(extractList(end(remove),
                                     end(gr)) + 1L
    new_end_on_minus_strand <- pmax(min(extractList(start(remove),
                                    start(gr)) - 1L

    on_plus_strand <- as.logical(strand(gr) == "+")
    on_minus_strand <- as.logical(strand(gr) == "-")
    has_hit <- lengths(mapping) != 0L

    new_start <- ifelse(on_plus_strand & has_hit,
    new_end <- ifelse(on_minus_strand & has_hit,
    ranges(gr) <- IRanges(new_start, new_end, names=names(gr))


gnflanks2 <- trimFivePrimeSide(gnflanks, cds)

This version is semantically equivalent to my previous "naive" version but only takes 0.062 s on my laptop so is about 38000 times faster ;-)



ADD REPLYlink modified 3 months ago • written 3 months ago by Hervé Pagès ♦♦ 14k

Wow! thank you so much for your help! it worked perfectly!

ADD REPLYlink written 3 months ago by eggrandio0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 169 users visited in the last hour