Obtain coordinates of 5' upstream region up to closest coding region
1
0
Entering edit mode
eggrandio • 0
@eggrandio-20085
Last seen 2.6 years ago

Hi,

I am doing an analysis of transcription factor binding and I need to retrieve the promoter region of all the genes in the genome. I can obtain 2kb upstream easily from a TxDB object by:

library(GenomicRanges)
library(TxDb.Athaliana.BioMart.plantsmart28)
gnflanks = flank(genes(TxDb.Athaliana.BioMart.plantsmart28), width=2000)


However, some regions overlap with coding regions upstream (other genes).

I would like to obtain 2kb of the upstream sequence up to the nearest coding region (2kb if no overlaping gene).

I know how to subtract the coding regions from the promoters, but if there is still some "promoter" region upstream of the upstream gene, it will be retained.

example:

upstream region:

========================(TSS)gene


overlapping gene:

     ======
========================(TSS)gene


If I remove it, I am left with:

=====      =============(TSS)gene


And I want to retrieve only:

           =============


r granges promoter • 422 views
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi,

So IIUC, for each gene you want the longest upstream region that is <= 2kb and does not overlap with any known CDS. Or, said otherwise, you want to minimally trim (on the 5' side) the upstream regions in gnflanks so that they are CDS-free. Note that this trimming will sometimes result in an empty upstream region. This will happen for genes for which the upstream position immediately adjacent to the gene is already in a known CDS (supposedly from another gene). Maybe that's a very rare and/or unrealistic occurrence but for the sake of robustness our code should handle it. The way it will handle it is by shrinking the upstream region to a zero-width range.

A naive but not very efficient way to do the above is:

cds <- cds(TxDb.Athaliana.BioMart.plantsmart28)

## This lapply() loop takes about 40 min on my laptop!
gnflanks2 <- lapply(seq_along(gnflanks),
function(i) {
gnflank <- gnflanks[i]
upstream_ranges <- setdiff(gnflank, cds)
if (as.logical(strand(gnflank) == "+")) {
## We're on the plus strand
upstream_range <- tail(upstream_ranges, n=1)
if (length(upstream_range) == 0 ||
end(upstream_range) != end(gnflank))
{
## Shrink to zero-width range
upstream_range <- gnflank
start(upstream_range) <- end(upstream_range) + 1
}
} else {
## We're on the minus strand
if (length(upstream_range) == 0 ||
start(upstream_range) != start(gnflank))
{
## Shrink to zero-width range
upstream_range <- gnflank
end(upstream_range) <- start(upstream_range) - 1
}
}
upstream_range
})

gnflanks2 <- do.call("c", gnflanks2)
names(gnflanks2) <- names(gnflanks)  # propagate the gene ids


gnflanks2 is a GRanges object parallel to GRanges object gnflanks, that is, the 2 objects have the same length and the i-th range in one object corresponds to the i-th range in the other. Furthermore, each range in gnflanks2 is either the same as the corresponding range in gnflanks (if no CDS got in the way), or a trimmed version of it (if one or more CDS got in the way).

Lightly tested only. Hope this helps.

H.

1
Entering edit mode

And FWIW here is the fast version of it:

trimFivePrimeSide <- function(gr, remove)
{
stopifnot(is(gr, "GenomicRanges"), is(remove, "GenomicRanges"))

mapping <- as(findOverlaps(gr, remove), "IntegerList")

new_start_on_plus_strand <- pmin(max(extractList(end(remove),
mapping)),
end(gr)) + 1L
new_end_on_minus_strand <- pmax(min(extractList(start(remove),
mapping)),
start(gr)) - 1L

on_plus_strand <- as.logical(strand(gr) == "+")
on_minus_strand <- as.logical(strand(gr) == "-")
has_hit <- lengths(mapping) != 0L

new_start <- ifelse(on_plus_strand & has_hit,
new_start_on_plus_strand,
start(gr))
new_end <- ifelse(on_minus_strand & has_hit,
new_end_on_minus_strand,
end(gr))
ranges(gr) <- IRanges(new_start, new_end, names=names(gr))

gr
}

gnflanks2 <- trimFivePrimeSide(gnflanks, cds)


This version is semantically equivalent to my previous "naive" version but only takes 0.062 s on my laptop so is about 38000 times faster ;-)

Cheers,

H.

0
Entering edit mode

Wow! thank you so much for your help! it worked perfectly!