Obtain coordinates of 5' upstream region up to closest coding region
1
0
Entering edit mode
eggrandio • 0
@eggrandio-20085
Last seen 5.1 years ago

Hi,

I am doing an analysis of transcription factor binding and I need to retrieve the promoter region of all the genes in the genome. I can obtain 2kb upstream easily from a TxDB object by:

library(GenomicRanges)
library(TxDb.Athaliana.BioMart.plantsmart28)
gnflanks = flank(genes(TxDb.Athaliana.BioMart.plantsmart28), width=2000)

However, some regions overlap with coding regions upstream (other genes).

I would like to obtain 2kb of the upstream sequence up to the nearest coding region (2kb if no overlaping gene).

I know how to subtract the coding regions from the promoters, but if there is still some "promoter" region upstream of the upstream gene, it will be retained.

example:

upstream region:

========================(TSS)gene

overlapping gene:

     ======
========================(TSS)gene

If I remove it, I am left with:

=====      =============(TSS)gene

And I want to retrieve only:

           =============

Thanks in advance!

r granges promoter • 1.2k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 22 hours ago
Seattle, WA, United States

Hi,

So IIUC, for each gene you want the longest upstream region that is <= 2kb and does not overlap with any known CDS. Or, said otherwise, you want to minimally trim (on the 5' side) the upstream regions in gnflanks so that they are CDS-free. Note that this trimming will sometimes result in an empty upstream region. This will happen for genes for which the upstream position immediately adjacent to the gene is already in a known CDS (supposedly from another gene). Maybe that's a very rare and/or unrealistic occurrence but for the sake of robustness our code should handle it. The way it will handle it is by shrinking the upstream region to a zero-width range.

A naive but not very efficient way to do the above is:

cds <- cds(TxDb.Athaliana.BioMart.plantsmart28)

## This lapply() loop takes about 40 min on my laptop!
gnflanks2 <- lapply(seq_along(gnflanks),
    function(i) {
        gnflank <- gnflanks[i]
        upstream_ranges <- setdiff(gnflank, cds)
        if (as.logical(strand(gnflank) == "+")) {
            ## We're on the plus strand
            upstream_range <- tail(upstream_ranges, n=1)
            if (length(upstream_range) == 0 ||
                end(upstream_range) != end(gnflank))
            {
                ## Shrink to zero-width range
                upstream_range <- gnflank
                start(upstream_range) <- end(upstream_range) + 1
            }
        } else {
            ## We're on the minus strand
            upstream_range <- head(upstream_ranges, n=1)
            if (length(upstream_range) == 0 ||
                start(upstream_range) != start(gnflank))
            {
                ## Shrink to zero-width range
                upstream_range <- gnflank
                end(upstream_range) <- start(upstream_range) - 1
            }
        }
        upstream_range
    })

gnflanks2 <- do.call("c", gnflanks2)
names(gnflanks2) <- names(gnflanks)  # propagate the gene ids

gnflanks2 is a GRanges object parallel to GRanges object gnflanks, that is, the 2 objects have the same length and the i-th range in one object corresponds to the i-th range in the other. Furthermore, each range in gnflanks2 is either the same as the corresponding range in gnflanks (if no CDS got in the way), or a trimmed version of it (if one or more CDS got in the way).

Lightly tested only. Hope this helps.

H.

ADD COMMENT
1
Entering edit mode

And FWIW here is the fast version of it:

trimFivePrimeSide <- function(gr, remove)
{
    stopifnot(is(gr, "GenomicRanges"), is(remove, "GenomicRanges"))

    mapping <- as(findOverlaps(gr, remove), "IntegerList")

    new_start_on_plus_strand <- pmin(max(extractList(end(remove),
                                                     mapping)),
                                     end(gr)) + 1L
    new_end_on_minus_strand <- pmax(min(extractList(start(remove),
                                                    mapping)),
                                    start(gr)) - 1L

    on_plus_strand <- as.logical(strand(gr) == "+")
    on_minus_strand <- as.logical(strand(gr) == "-")
    has_hit <- lengths(mapping) != 0L

    new_start <- ifelse(on_plus_strand & has_hit,
                        new_start_on_plus_strand,
                        start(gr))
    new_end <- ifelse(on_minus_strand & has_hit,
                      new_end_on_minus_strand,
                      end(gr))
    ranges(gr) <- IRanges(new_start, new_end, names=names(gr))

    gr
}

gnflanks2 <- trimFivePrimeSide(gnflanks, cds)

This version is semantically equivalent to my previous "naive" version but only takes 0.062 s on my laptop so is about 38000 times faster ;-)

Cheers,

H.

ADD REPLY
0
Entering edit mode

Wow! thank you so much for your help! it worked perfectly!

ADD REPLY

Login before adding your answer.

Traffic: 718 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6