Chip-Seq tag density plot

0

Entering edit mode

Eva Benito Garagorri ▴ 70

@eva-benito-garagorri-4263

Last seen 9.6 years ago

Dear list, Hi again. Thank you once more for your input about the coverage plot. It has taken a bit but I finally got to try the idea Hervé proposed for calculating the coverage around the TSS. I know it's probably not very efficient since it uses a pretty long 'for' loop, but it works and it seems to do what it should. Any ideas to make the code faster or better are welcome. I am, however, a bit puzzled about the aspect of the final plot. I was expecting to have a certain accumulation of reads downstream of the TSS, but I seem to be getting the inverted tendency(http://img199.imageshack.us/i/profileplottest.png/) (Sorry about the ads in the page, just looked for a free hosting site). We ChIPed an acetylated histone mark. I may be missing something really obvious about genes on the + or the - strand, I did review the code once and again and I can't find the mistake So I am providing the code with an example and if anyone can spot a mistake in it, I'd be grateful if they could point it out. Thanks a lot for your help. Here's the code: library(GenomicRanges) library(GenomicFeatures) gr = GRanges(seqnames=Rle(c('chr1','chr2','chr3'),c(4,5,1)),ranges=IRa nges(1000:1009,end=1034:1043),strand=Rle(strand(c('+','-','+','-')),c( 3,1,2,4))) gr_extend = resize(gr, 200) ## Extend the reads to approx. fragment size. 'resize' seems to take into account the strand refseq = GRanges(seqnames=Rle(c('chr1','chr2','chr3'),c(2,6,2)),ranges =IRanges(1000:1009,end=2000:2009),strand=Rle(strand(c('+','-','+','-') ),c(1,2,4,3))) ##Fictious refseq object ############################################# ########## Downstream windows ############################################# downstreamRefseq = refseq ## First window 50bp downstream of the TSS idx = strand(downstreamRefseq) == '+' end(downstreamRefseq[idx]) = start(downstreamRefseq[idx])+49 start(downstreamRefseq[!idx]) = end(downstreamRefseq[!idx]) -49 downstreamWindow1 = countOverlaps(downstreamRefseq, gr_extend) ### Now generate remaining windows downstreamRefseqs = GRangesList() for (i in 1:99) { downstreamRefseqs[[i]] = shift(downstreamRefseq, as.integer(ifelse(strand(downstreamRefseq) =='+',i*50,- i*50))) } ## This loop doesn't take too long to run, generates all the windows downstream of the TSS downstreamOlaps = lapply(downstreamRefseqs, function(x) countOverlaps(x,gr_extend)) downstreamOlaps1 = lapply(downstreamOlaps, mean) ## Calculate the mean number of reads per window downstreamCoverage = c(mean(downstreamWindow1),unlist(downstreamOlaps1))/50 ## Incorporate window 1 and calculate the coverage per 50bp ############################################# ########## Upstream windows ############################################# upstreamRefseq = refseq end(upstreamRefseq[idx]) = start(upstreamRefseq[idx]) start(upstreamRefseq[idx]) = start(upstreamRefseq[idx])-49 start(upstreamRefseq[!idx]) = end(upstreamRefseq[!idx]) end(upstreamRefseq[!idx]) = end(upstreamRefseq[!idx])+49 upstreamWindow1 = countOverlaps(downstreamRefseq, gr_extend) upstreamRefseqs = GRangesList() for (i in 1:99) { upstreamRefseqs[[i]] = shift(upstreamRefseq, as.integer(ifelse(strand(upstreamRefseq) =='+',-i*50,+ i*50))) } upstreamOlaps = lapply(upstreamRefseqs, function(x) countOverlaps(x,gr_extend)) upstreamOlaps1 = lapply(upstreamOlaps, mean) upstreamCoverage = c(mean(upstreamWindow1),unlist(upstreamOlaps1))/50 ############################################# ########## Plotting ############################################# ## The plot looks horrible, it was just as an example... x = seq(from = 1, to = 5000,by=50) y = seq(from = -1, to = -5000,by=-50) plot(x,downstreamCoverage,pch=20,xlim=c(-5100,5100)) lines(x, downstreamCoverage) points(y,upstreamCoverage,pch=20) lines(y,upstreamCoverage) > sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9 loaded via a namespace (and not attached): [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.4 BSgenome_1.18.3 [5] DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-4 rtracklayer_1.10.6 [9] tools_2.12.0 XML_3.2-0 On 30/03/2011, at 22:21, Hervé Pagès wrote: > Hi Eva, > > One solution that involves a little bit of programming from your part > is to make those bins "by hand" by using things like > 'start(gr) <- value' and 'end(gr) <- value' to adjust the starts > and ends of your GRanges object. For example: > > refseqGR1 <- refseqGR > ## Adjust 'end(refseqGR1)' so that each range in it corresponds to > ## the region of length 1000 starting at 'start(refseqGR))' (let's > ## call this region the "1st bin"): > end(refseqGR1) <- start(refseqGR1) + 999 > olaps1 <- findOverlaps(refseqGR1, simTags) > > ## Then for the "2nd bin": > refseqGR2 <- shift(refseqGR1, 1000) > olaps2 <- findOverlaps(refseqGR2, simTags) > > ## etc... > > You might want to put this in a loop, decide how many loops you want > to do, or maybe define some criteria to exit the loop prematurely... > > An important thing to be aware of though is that 'start(refseqGR)' is > always the position of the left-most base of the transcript, that is, > it is its 5' end if it's on the + strand and its 3' end if it's on the > - strand. So you will probably need to adapt the code above to do > something like: > > ## 1st bin: > refseqGR1 <- refseqGR > idx <- strand(refseqGR) == "+" > end(refseqGR1[idx]) <- start(refseqGR1[idx]) + 999 > start(refseqGR1[!idx]) <- end(refseqGR1[!idx]) - 999 > > ## 2nd bin: > refseqGR2 <- shift(refseqGR1, as.vector(ifelse(strand(refseqGR1) == "+", 1000, -1000))) > > Cheers, > H. > > > On 03/30/2011 08:37 AM, Eva Benito Garagorri wrote: >> Dear list, >> >> I am trying to generate a tag density plot from a ChIP-Seq experiment for regions around the TSS of known transcripts. I suppose this is not too complicated, but I am having difficulties approaching this issue. I think maybe the most straightforward way is to use the GRanges capabilities. I generated a GRanges object by downloading the mouse refseq database from UCSC (see code below). This gives me a GRanges object containing the start and end coordinate of each refseq. I could cross this with my tag file with "findOverlaps", but that would give me the overlap of my tags with the entire span of the transcript and I would like to just keep (and make bins around) the region at either side of the TSS. I couldn't find a way to generate a sliding window in the refseq database to calculate the overlap between each bin and my tag file. >> >> If anyone could point me to some package/functionality/reading material or maybe to a previous thread (I did search the mailing list but maybe I missed something) or else help with the strategy, I would be most grateful. >> >> Thank you very much in advance. >> >> Best regards, >> >> Eva >> >> >> >> library(GenomicRanges) >> library(GenomicFeatures) >> >> refseq = makeTranscriptDbFromUCSC('mm9', tablename='refGene') >> >> refseqGR = transcripts(refseq) >> >> head(refseqGR) >> >> GRanges with 6 ranges and 2 elementMetadata values >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> |<integer> <character> >> [1] chr1 [4797974, 4836816] + | 490 NM_008866 >> [2] chr1 [4847775, 4887990] + | 73 NM_011541 >> [3] chr1 [4847775, 4887990] + | 78 NM_001159750 >> [4] chr1 [4848409, 4887990] + | 75 NM_001159751 >> [5] chr1 [5073254, 5152630] + | 77 NM_133826 >> [6] chr1 [5578574, 5596214] + | 494 NM_001204371 >> >> seqlengths >> chr1 chr2 chr3 chr4 chr5 chr6 ... chr7_random chr8_random chr9_random chrUn_random chrX_random chrY_random >> 197195432 181748087 159599783 155630120 152537259 149517037 ... 362490 849593 449403 5900358 1785075 58682461 >> >> ### Simulate a tag file as a sample of the above >> >> simTags = sample(refseqGR, 1000) >> >> >> #### It would now be possible to get the overlap between the two by doing: >> >> olaps = findOverlaps(refseqGR, simTags) >> >> ### But how do I divide the refseqGR into bins of equal size around the start coordinate? >> >> >> sessionInfo() >> >> >> R version 2.12.0 (2010-10-15) >> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >> >> locale: >> [1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.0 BSgenome_1.18.3 DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-3 rtracklayer_1.10.6 >> [9] tools_2.12.0 XML_3.2-0 >> >> ---------- >> Eva Benito Garagorri >> PhD program in Neurosciences >> Institute for Neurosciences in Alicante >> UMH-CSIC >> San Juan de Alicante >> 03550 >> Spain >> ebenito@umh.es >> (34) 965 91 92 33 >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 ---------- Eva Benito Garagorri PhD program in Neurosciences Institute for Neurosciences in Alicante UMH-CSIC San Juan de Alicante 03550 Spain ebenito@umh.es (34) 965 91 92 33 [[alternative HTML version deleted]]

Coverage Cancer Coverage Cancer • 1.7k views

ADD COMMENT • link 13.1 years ago Eva Benito Garagorri ▴ 70

Login before adding your answer.