Entering edit mode
Eva Benito Garagorri
▴
70
@eva-benito-garagorri-4263
Last seen 10.3 years ago
Dear list,
Hi again.
Thank you once more for your input about the coverage plot. It has
taken a bit but I finally got to try the idea Hervé proposed for
calculating the coverage around the TSS. I know it's probably not very
efficient since it uses a pretty long 'for' loop, but it works and it
seems to do what it should. Any ideas to make the code faster or
better are welcome.
I am, however, a bit puzzled about the aspect of the final plot. I was
expecting to have a certain accumulation of reads downstream of the
TSS, but I seem to be getting the inverted
tendency(http://img199.imageshack.us/i/profileplottest.png/) (Sorry
about the ads in the page, just looked for a free hosting site). We
ChIPed an acetylated histone mark. I may be missing something really
obvious about genes on the + or the - strand, I did review the code
once and again and I can't find the mistake
So I am providing the code with an example and if anyone can spot a
mistake in it, I'd be grateful if they could point it out.
Thanks a lot for your help.
Here's the code:
library(GenomicRanges)
library(GenomicFeatures)
gr = GRanges(seqnames=Rle(c('chr1','chr2','chr3'),c(4,5,1)),ranges=IRa
nges(1000:1009,end=1034:1043),strand=Rle(strand(c('+','-','+','-')),c(
3,1,2,4)))
gr_extend = resize(gr, 200) ## Extend the reads to approx. fragment
size. 'resize' seems to take into account the strand
refseq = GRanges(seqnames=Rle(c('chr1','chr2','chr3'),c(2,6,2)),ranges
=IRanges(1000:1009,end=2000:2009),strand=Rle(strand(c('+','-','+','-')
),c(1,2,4,3))) ##Fictious refseq object
#############################################
########## Downstream windows
#############################################
downstreamRefseq = refseq
## First window 50bp downstream of the TSS
idx = strand(downstreamRefseq) == '+'
end(downstreamRefseq[idx]) = start(downstreamRefseq[idx])+49
start(downstreamRefseq[!idx]) = end(downstreamRefseq[!idx]) -49
downstreamWindow1 = countOverlaps(downstreamRefseq, gr_extend)
### Now generate remaining windows
downstreamRefseqs = GRangesList()
for (i in 1:99)
{
downstreamRefseqs[[i]] = shift(downstreamRefseq,
as.integer(ifelse(strand(downstreamRefseq) =='+',i*50,-
i*50)))
} ## This loop doesn't take too long to run, generates all the
windows downstream of the TSS
downstreamOlaps = lapply(downstreamRefseqs, function(x)
countOverlaps(x,gr_extend))
downstreamOlaps1 = lapply(downstreamOlaps, mean) ## Calculate the mean
number of reads per window
downstreamCoverage =
c(mean(downstreamWindow1),unlist(downstreamOlaps1))/50 ## Incorporate
window 1 and calculate the coverage per 50bp
#############################################
########## Upstream windows
#############################################
upstreamRefseq = refseq
end(upstreamRefseq[idx]) = start(upstreamRefseq[idx])
start(upstreamRefseq[idx]) = start(upstreamRefseq[idx])-49
start(upstreamRefseq[!idx]) = end(upstreamRefseq[!idx])
end(upstreamRefseq[!idx]) = end(upstreamRefseq[!idx])+49
upstreamWindow1 = countOverlaps(downstreamRefseq, gr_extend)
upstreamRefseqs = GRangesList()
for (i in 1:99)
{
upstreamRefseqs[[i]] = shift(upstreamRefseq,
as.integer(ifelse(strand(upstreamRefseq) =='+',-i*50,+
i*50)))
}
upstreamOlaps = lapply(upstreamRefseqs, function(x)
countOverlaps(x,gr_extend))
upstreamOlaps1 = lapply(upstreamOlaps, mean)
upstreamCoverage = c(mean(upstreamWindow1),unlist(upstreamOlaps1))/50
#############################################
########## Plotting
#############################################
## The plot looks horrible, it was just as an example...
x = seq(from = 1, to = 5000,by=50)
y = seq(from = -1, to = -5000,by=-50)
plot(x,downstreamCoverage,pch=20,xlim=c(-5100,5100))
lines(x, downstreamCoverage)
points(y,upstreamCoverage,pch=20)
lines(y,upstreamCoverage)
> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9
loaded via a namespace (and not attached):
[1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.4
BSgenome_1.18.3
[5] DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-4
rtracklayer_1.10.6
[9] tools_2.12.0 XML_3.2-0
On 30/03/2011, at 22:21, Hervé Pagès wrote:
> Hi Eva,
>
> One solution that involves a little bit of programming from your
part
> is to make those bins "by hand" by using things like
> 'start(gr) <- value' and 'end(gr) <- value' to adjust the starts
> and ends of your GRanges object. For example:
>
> refseqGR1 <- refseqGR
> ## Adjust 'end(refseqGR1)' so that each range in it corresponds to
> ## the region of length 1000 starting at 'start(refseqGR))' (let's
> ## call this region the "1st bin"):
> end(refseqGR1) <- start(refseqGR1) + 999
> olaps1 <- findOverlaps(refseqGR1, simTags)
>
> ## Then for the "2nd bin":
> refseqGR2 <- shift(refseqGR1, 1000)
> olaps2 <- findOverlaps(refseqGR2, simTags)
>
> ## etc...
>
> You might want to put this in a loop, decide how many loops you want
> to do, or maybe define some criteria to exit the loop prematurely...
>
> An important thing to be aware of though is that 'start(refseqGR)'
is
> always the position of the left-most base of the transcript, that
is,
> it is its 5' end if it's on the + strand and its 3' end if it's on
the
> - strand. So you will probably need to adapt the code above to do
> something like:
>
> ## 1st bin:
> refseqGR1 <- refseqGR
> idx <- strand(refseqGR) == "+"
> end(refseqGR1[idx]) <- start(refseqGR1[idx]) + 999
> start(refseqGR1[!idx]) <- end(refseqGR1[!idx]) - 999
>
> ## 2nd bin:
> refseqGR2 <- shift(refseqGR1, as.vector(ifelse(strand(refseqGR1) ==
"+", 1000, -1000)))
>
> Cheers,
> H.
>
>
> On 03/30/2011 08:37 AM, Eva Benito Garagorri wrote:
>> Dear list,
>>
>> I am trying to generate a tag density plot from a ChIP-Seq
experiment for regions around the TSS of known transcripts. I suppose
this is not too complicated, but I am having difficulties approaching
this issue. I think maybe the most straightforward way is to use the
GRanges capabilities. I generated a GRanges object by downloading the
mouse refseq database from UCSC (see code below). This gives me a
GRanges object containing the start and end coordinate of each refseq.
I could cross this with my tag file with "findOverlaps", but that
would give me the overlap of my tags with the entire span of the
transcript and I would like to just keep (and make bins around) the
region at either side of the TSS. I couldn't find a way to generate a
sliding window in the refseq database to calculate the overlap between
each bin and my tag file.
>>
>> If anyone could point me to some package/functionality/reading
material or maybe to a previous thread (I did search the mailing list
but maybe I missed something) or else help with the strategy, I would
be most grateful.
>>
>> Thank you very much in advance.
>>
>> Best regards,
>>
>> Eva
>>
>>
>>
>> library(GenomicRanges)
>> library(GenomicFeatures)
>>
>> refseq = makeTranscriptDbFromUCSC('mm9', tablename='refGene')
>>
>> refseqGR = transcripts(refseq)
>>
>> head(refseqGR)
>>
>> GRanges with 6 ranges and 2 elementMetadata values
>> seqnames ranges strand | tx_id tx_name
>> <rle> <iranges> <rle> |<integer> <character>
>> [1] chr1 [4797974, 4836816] + | 490 NM_008866
>> [2] chr1 [4847775, 4887990] + | 73 NM_011541
>> [3] chr1 [4847775, 4887990] + | 78 NM_001159750
>> [4] chr1 [4848409, 4887990] + | 75 NM_001159751
>> [5] chr1 [5073254, 5152630] + | 77 NM_133826
>> [6] chr1 [5578574, 5596214] + | 494 NM_001204371
>>
>> seqlengths
>> chr1 chr2 chr3 chr4 chr5
chr6 ... chr7_random chr8_random chr9_random chrUn_random
chrX_random chrY_random
>> 197195432 181748087 159599783 155630120 152537259
149517037 ... 362490 849593 449403 5900358
1785075 58682461
>>
>> ### Simulate a tag file as a sample of the above
>>
>> simTags = sample(refseqGR, 1000)
>>
>>
>> #### It would now be possible to get the overlap between the two by
doing:
>>
>> olaps = findOverlaps(refseqGR, simTags)
>>
>> ### But how do I divide the refseqGR into bins of equal size around
the start coordinate?
>>
>>
>> sessionInfo()
>>
>>
>> R version 2.12.0 (2010-10-15)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods
base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.0
BSgenome_1.18.3 DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-3
rtracklayer_1.10.6
>> [9] tools_2.12.0 XML_3.2-0
>>
>> ----------
>> Eva Benito Garagorri
>> PhD program in Neurosciences
>> Institute for Neurosciences in Alicante
>> UMH-CSIC
>> San Juan de Alicante
>> 03550
>> Spain
>> ebenito@umh.es
>> (34) 965 91 92 33
>>
>>
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
----------
Eva Benito Garagorri
PhD program in Neurosciences
Institute for Neurosciences in Alicante
UMH-CSIC
San Juan de Alicante
03550
Spain
ebenito@umh.es
(34) 965 91 92 33
[[alternative HTML version deleted]]