Question: Calculating and plotting AT content in a tiling window across the genome with GRanges
0
12 days ago by
atotalamateur0 wrote:

Hi, Very new to R here. I have been trying to plot the relationship between AT content and proteins binding to a particualr stretch of DNA, as in a heat map or the figures seen in this paper: https://www.ncbi.nlm.nih.gov/pubmed/29267285. However, I have been having a lot of confusion with how exactly to go about doing this. I understand how to tile the genome into windows with tileGenome, this outputs a GRanges List, not an object. I also understand how to calcualte the GC content of specified ranges, such as "gcContent(Hsapiens[["chr1"]])". I don't understand how to go about merging these two approaches to calculate the AT content (1-GC content) of each interval in the GRanges List, as if I had a single GRanges object I could use something like:

> windowViews <- Views(BSgenome.Mmusculus.UCSC.mm10, windowRanges) gcFrequency <- letterFrequency(windowViews, letters="GC", as.prob=TRUE)

or

letterFrequency2 <- function(x, letters, OR="|",
as.prob=FALSE, ...)   {
stopifnot(is(x, "BSgenomeViews"))
chunksize <- 500000L
chunks <- breakInChunks(length(x), chunksize)
chunks <- as(chunks, "IRanges")
ans_chunks <- lapply(seq_along(chunks),
function(i) {
x_chunk <- extractROWS(x, chunks[i])
letterFrequency(x_chunk, letters, OR=OR,
as.prob=as.prob)
})
do.call(rbind, ans_chunks)


}

But I clearly am confused on what arguments to pass to View, how to store the AT content calculation for each range, and how to plot this (at all, or against the binding preferences of a specific protein). In short, I am not sure how to calculate and store the GC/AT contents for windows of arbitrary size across the genome. Any advice you can provide on how to do this would be great, sorry if this is extremely trivial!

Also, the above code was take from H. Page in a previous question, full credit.

modified 7 days ago by Hervé Pagès ♦♦ 14k • written 12 days ago by atotalamateur0
Answer: Calculating and plotting AT content in a tiling window across the genome with GR
0
7 days ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi,

From ?tileGenome:

Value:

If ‘cut.last.tile.in.chrom’ is ‘FALSE’ (the default), a
GRangesList object with one list element per tile, each
of them containing a number of genomic ranges equal
to the number of chromosomes it overlaps with. Note
that when the tiles are small (i.e.  much smaller than the
chromosomes), most of them only overlap with a single
chromosome.

If ‘cut.last.tile.in.chrom’ is ‘TRUE’, a GRanges object
with one element (i.e. one genomic range) per tile.


So:

library(BSgenome.Mmusculus.UCSC.mm10)
library(GenomicRanges)

tiles <- tileGenome(seqinfo(BSgenome.Mmusculus.UCSC.mm10), tilewidth=1e6,
cut.last.tile.in.chrom=TRUE)
genome_views <- Views(BSgenome.Mmusculus.UCSC.mm10, tiles)

### Loads the entire genome (3Gb) in memory at once.
AT_frequency <- letterFrequency(genome_views, letters="AT")


Takes about 24 sec on my laptop.

If memory usage is an issue, letterFrequency2() can be used instead of the letterFrequency(). It reduces memory usage by processing the views by batch so loading only the part of the genome that corresponds to the current batch. Here is a parallelized version of letterFrequency2():

library(BiocParallel)
letterFrequency2p <- function(x, letters, OR="|", as.prob=FALSE, ...,
BPPARAM=NULL)
{
stopifnot(is(x, "BSgenomeViews"))
nchunk <- ceiling(sum(width(x)) / 5e8)
if (!is.null(BPPARAM)) {
stopifnot(is(BPPARAM, "BiocParallelParam"))
nchunk <- nchunk * bpnworkers(BPPARAM)
}
chunks <- breakInChunks(length(x), nchunk=nchunk)
chunks <- as(chunks, "IRanges")
process_chunk <- function(i) {
x_chunk <- extractROWS(x, chunks[i])
letterFrequency(x_chunk, letters, OR=OR, as.prob=as.prob, ...)
}
if (is.null(BPPARAM)) {
ans_chunks <- lapply(seq_along(chunks), process_chunk)
} else {
ans_chunks <- bplapply(seq_along(chunks), process_chunk,
BPPARAM=BPPARAM)
}
do.call(rbind, ans_chunks)
}


letterFrequency2p(genome_views, letters="AT", BPPARAM=MulticoreParam(6)) takes 7 sec on my laptop.

H.