Chip-Seq tag density plot

0

Entering edit mode

Eva Benito Garagorri ▴ 70

@eva-benito-garagorri-4263

Last seen 9.6 years ago

Dear list, I am trying to generate a tag density plot from a ChIP-Seq experiment for regions around the TSS of known transcripts. I suppose this is not too complicated, but I am having difficulties approaching this issue. I think maybe the most straightforward way is to use the GRanges capabilities. I generated a GRanges object by downloading the mouse refseq database from UCSC (see code below). This gives me a GRanges object containing the start and end coordinate of each refseq. I could cross this with my tag file with "findOverlaps", but that would give me the overlap of my tags with the entire span of the transcript and I would like to just keep (and make bins around) the region at either side of the TSS. I couldn't find a way to generate a sliding window in the refseq database to calculate the overlap between each bin and my tag file. If anyone could point me to some package/functionality/reading material or maybe to a previous thread (I did search the mailing list but maybe I missed something) or else help with the strategy, I would be most grateful. Thank you very much in advance. Best regards, Eva library(GenomicRanges) library(GenomicFeatures) refseq = makeTranscriptDbFromUCSC('mm9', tablename='refGene') refseqGR = transcripts(refseq) head(refseqGR) GRanges with 6 ranges and 2 elementMetadata values seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr1 [4797974, 4836816] + | 490 NM_008866 [2] chr1 [4847775, 4887990] + | 73 NM_011541 [3] chr1 [4847775, 4887990] + | 78 NM_001159750 [4] chr1 [4848409, 4887990] + | 75 NM_001159751 [5] chr1 [5073254, 5152630] + | 77 NM_133826 [6] chr1 [5578574, 5596214] + | 494 NM_001204371 seqlengths chr1 chr2 chr3 chr4 chr5 chr6 ... chr7_random chr8_random chr9_random chrUn_random chrX_random chrY_random 197195432 181748087 159599783 155630120 152537259 149517037 ... 362490 849593 449403 5900358 1785075 58682461 ### Simulate a tag file as a sample of the above simTags = sample(refseqGR, 1000) #### It would now be possible to get the overlap between the two by doing: olaps = findOverlaps(refseqGR, simTags) ### But how do I divide the refseqGR into bins of equal size around the start coordinate? sessionInfo() R version 2.12.0 (2010-10-15) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9 loaded via a namespace (and not attached): [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.0 BSgenome_1.18.3 DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-3 rtracklayer_1.10.6 [9] tools_2.12.0 XML_3.2-0 ---------- Eva Benito Garagorri PhD program in Neurosciences Institute for Neurosciences in Alicante UMH-CSIC San Juan de Alicante 03550 Spain ebenito@umh.es (34) 965 91 92 33 [[alternative HTML version deleted]]

• 2.1k views

ADD COMMENT • link updated 13.1 years ago by Hervé Pagès 16k • written 13.1 years ago by Eva Benito Garagorri ▴ 70

0

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 5 weeks ago

United States

code for computing a set of disjoint bins of specified sizes about all ranges in a specific form of GRanges list is in cisProxScores function of devel version of GGtools. it is not particularly pretty and probably should be factored out. if you study the computations I+n and I-n where I is a set of ranges and n is a number you can do these sorts of things very concisely. the *Ranges developers will likely have more insight. On Wed, Mar 30, 2011 at 11:37 AM, Eva Benito Garagorri <ebenito@umh.es>wrote: > Dear list, > > I am trying to generate a tag density plot from a ChIP-Seq experiment for > regions around the TSS of known transcripts. I suppose this is not too > complicated, but I am having difficulties approaching this issue. I think > maybe the most straightforward way is to use the GRanges capabilities. I > generated a GRanges object by downloading the mouse refseq database from > UCSC (see code below). This gives me a GRanges object containing the start > and end coordinate of each refseq. I could cross this with my tag file with > "findOverlaps", but that would give me the overlap of my tags with the > entire span of the transcript and I would like to just keep (and make bins > around) the region at either side of the TSS. I couldn't find a way to > generate a sliding window in the refseq database to calculate the overlap > between each bin and my tag file. > > If anyone could point me to some package/functionality/reading material or > maybe to a previous thread (I did search the mailing list but maybe I missed > something) or else help with the strategy, I would be most grateful. > > Thank you very much in advance. > > Best regards, > > Eva > > > > library(GenomicRanges) > library(GenomicFeatures) > > refseq = makeTranscriptDbFromUCSC('mm9', tablename='refGene') > > refseqGR = transcripts(refseq) > > head(refseqGR) > > GRanges with 6 ranges and 2 elementMetadata values > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr1 [4797974, 4836816] + | 490 NM_008866 > [2] chr1 [4847775, 4887990] + | 73 NM_011541 > [3] chr1 [4847775, 4887990] + | 78 NM_001159750 > [4] chr1 [4848409, 4887990] + | 75 NM_001159751 > [5] chr1 [5073254, 5152630] + | 77 NM_133826 > [6] chr1 [5578574, 5596214] + | 494 NM_001204371 > > seqlengths > chr1 chr2 chr3 chr4 chr5 > chr6 ... chr7_random chr8_random chr9_random chrUn_random chrX_random > chrY_random > 197195432 181748087 159599783 155630120 152537259 > 149517037 ... 362490 849593 449403 5900358 > 1785075 58682461 > > ### Simulate a tag file as a sample of the above > > simTags = sample(refseqGR, 1000) > > > #### It would now be possible to get the overlap between the two by doing: > > olaps = findOverlaps(refseqGR, simTags) > > ### But how do I divide the refseqGR into bins of equal size around the > start coordinate? > > > sessionInfo() > > > R version 2.12.0 (2010-10-15) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9 > > loaded via a namespace (and not attached): > [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.0 > BSgenome_1.18.3 DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-3 > rtracklayer_1.10.6 > [9] tools_2.12.0 XML_3.2-0 > > ---------- > Eva Benito Garagorri > PhD program in Neurosciences > Institute for Neurosciences in Alicante > UMH-CSIC > San Juan de Alicante > 03550 > Spain > ebenito@umh.es > (34) 965 91 92 33 > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 13.1 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 52 minutes ago

Seattle, WA, United States

Hi Eva, One solution that involves a little bit of programming from your part is to make those bins "by hand" by using things like 'start(gr) <- value' and 'end(gr) <- value' to adjust the starts and ends of your GRanges object. For example: refseqGR1 <- refseqGR ## Adjust 'end(refseqGR1)' so that each range in it corresponds to ## the region of length 1000 starting at 'start(refseqGR))' (let's ## call this region the "1st bin"): end(refseqGR1) <- start(refseqGR1) + 999 olaps1 <- findOverlaps(refseqGR1, simTags) ## Then for the "2nd bin": refseqGR2 <- shift(refseqGR1, 1000) olaps2 <- findOverlaps(refseqGR2, simTags) ## etc... You might want to put this in a loop, decide how many loops you want to do, or maybe define some criteria to exit the loop prematurely... An important thing to be aware of though is that 'start(refseqGR)' is always the position of the left-most base of the transcript, that is, it is its 5' end if it's on the + strand and its 3' end if it's on the - strand. So you will probably need to adapt the code above to do something like: ## 1st bin: refseqGR1 <- refseqGR idx <- strand(refseqGR) == "+" end(refseqGR1[idx]) <- start(refseqGR1[idx]) + 999 start(refseqGR1[!idx]) <- end(refseqGR1[!idx]) - 999 ## 2nd bin: refseqGR2 <- shift(refseqGR1, as.vector(ifelse(strand(refseqGR1) == "+", 1000, -1000))) Cheers, H. On 03/30/2011 08:37 AM, Eva Benito Garagorri wrote: > Dear list, > > I am trying to generate a tag density plot from a ChIP-Seq experiment for regions around the TSS of known transcripts. I suppose this is not too complicated, but I am having difficulties approaching this issue. I think maybe the most straightforward way is to use the GRanges capabilities. I generated a GRanges object by downloading the mouse refseq database from UCSC (see code below). This gives me a GRanges object containing the start and end coordinate of each refseq. I could cross this with my tag file with "findOverlaps", but that would give me the overlap of my tags with the entire span of the transcript and I would like to just keep (and make bins around) the region at either side of the TSS. I couldn't find a way to generate a sliding window in the refseq database to calculate the overlap between each bin and my tag file. > > If anyone could point me to some package/functionality/reading material or maybe to a previous thread (I did search the mailing list but maybe I missed something) or else help with the strategy, I would be most grateful. > > Thank you very much in advance. > > Best regards, > > Eva > > > > library(GenomicRanges) > library(GenomicFeatures) > > refseq = makeTranscriptDbFromUCSC('mm9', tablename='refGene') > > refseqGR = transcripts(refseq) > > head(refseqGR) > > GRanges with 6 ranges and 2 elementMetadata values > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> |<integer> <character> > [1] chr1 [4797974, 4836816] + | 490 NM_008866 > [2] chr1 [4847775, 4887990] + | 73 NM_011541 > [3] chr1 [4847775, 4887990] + | 78 NM_001159750 > [4] chr1 [4848409, 4887990] + | 75 NM_001159751 > [5] chr1 [5073254, 5152630] + | 77 NM_133826 > [6] chr1 [5578574, 5596214] + | 494 NM_001204371 > > seqlengths > chr1 chr2 chr3 chr4 chr5 chr6 ... chr7_random chr8_random chr9_random chrUn_random chrX_random chrY_random > 197195432 181748087 159599783 155630120 152537259 149517037 ... 362490 849593 449403 5900358 1785075 58682461 > > ### Simulate a tag file as a sample of the above > > simTags = sample(refseqGR, 1000) > > > #### It would now be possible to get the overlap between the two by doing: > > olaps = findOverlaps(refseqGR, simTags) > > ### But how do I divide the refseqGR into bins of equal size around the start coordinate? > > > sessionInfo() > > > R version 2.12.0 (2010-10-15) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9 > > loaded via a namespace (and not attached): > [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.0 BSgenome_1.18.3 DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-3 rtracklayer_1.10.6 > [9] tools_2.12.0 XML_3.2-0 > > ---------- > Eva Benito Garagorri > PhD program in Neurosciences > Institute for Neurosciences in Alicante > UMH-CSIC > San Juan de Alicante > 03550 > Spain > ebenito at umh.es > (34) 965 91 92 33 > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.1 years ago Hervé Pagès 16k

0

Entering edit mode

Hervé, Vince, Thanks for your answers, I will give it a go and see how far I get. Best regards, Eva On 30/03/2011, at 22:21, Hervé Pagès wrote: > Hi Eva, > > One solution that involves a little bit of programming from your part > is to make those bins "by hand" by using things like > 'start(gr) <- value' and 'end(gr) <- value' to adjust the starts > and ends of your GRanges object. For example: > > refseqGR1 <- refseqGR > ## Adjust 'end(refseqGR1)' so that each range in it corresponds to > ## the region of length 1000 starting at 'start(refseqGR))' (let's > ## call this region the "1st bin"): > end(refseqGR1) <- start(refseqGR1) + 999 > olaps1 <- findOverlaps(refseqGR1, simTags) > > ## Then for the "2nd bin": > refseqGR2 <- shift(refseqGR1, 1000) > olaps2 <- findOverlaps(refseqGR2, simTags) > > ## etc... > > You might want to put this in a loop, decide how many loops you want > to do, or maybe define some criteria to exit the loop prematurely... > > An important thing to be aware of though is that 'start(refseqGR)' is > always the position of the left-most base of the transcript, that is, > it is its 5' end if it's on the + strand and its 3' end if it's on the > - strand. So you will probably need to adapt the code above to do > something like: > > ## 1st bin: > refseqGR1 <- refseqGR > idx <- strand(refseqGR) == "+" > end(refseqGR1[idx]) <- start(refseqGR1[idx]) + 999 > start(refseqGR1[!idx]) <- end(refseqGR1[!idx]) - 999 > > ## 2nd bin: > refseqGR2 <- shift(refseqGR1, as.vector(ifelse(strand(refseqGR1) == "+", 1000, -1000))) > > Cheers, > H. > > > On 03/30/2011 08:37 AM, Eva Benito Garagorri wrote: >> Dear list, >> >> I am trying to generate a tag density plot from a ChIP-Seq experiment for regions around the TSS of known transcripts. I suppose this is not too complicated, but I am having difficulties approaching this issue. I think maybe the most straightforward way is to use the GRanges capabilities. I generated a GRanges object by downloading the mouse refseq database from UCSC (see code below). This gives me a GRanges object containing the start and end coordinate of each refseq. I could cross this with my tag file with "findOverlaps", but that would give me the overlap of my tags with the entire span of the transcript and I would like to just keep (and make bins around) the region at either side of the TSS. I couldn't find a way to generate a sliding window in the refseq database to calculate the overlap between each bin and my tag file. >> >> If anyone could point me to some package/functionality/reading material or maybe to a previous thread (I did search the mailing list but maybe I missed something) or else help with the strategy, I would be most grateful. >> >> Thank you very much in advance. >> >> Best regards, >> >> Eva >> >> >> >> library(GenomicRanges) >> library(GenomicFeatures) >> >> refseq = makeTranscriptDbFromUCSC('mm9', tablename='refGene') >> >> refseqGR = transcripts(refseq) >> >> head(refseqGR) >> >> GRanges with 6 ranges and 2 elementMetadata values >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> |<integer> <character> >> [1] chr1 [4797974, 4836816] + | 490 NM_008866 >> [2] chr1 [4847775, 4887990] + | 73 NM_011541 >> [3] chr1 [4847775, 4887990] + | 78 NM_001159750 >> [4] chr1 [4848409, 4887990] + | 75 NM_001159751 >> [5] chr1 [5073254, 5152630] + | 77 NM_133826 >> [6] chr1 [5578574, 5596214] + | 494 NM_001204371 >> >> seqlengths >> chr1 chr2 chr3 chr4 chr5 chr6 ... chr7_random chr8_random chr9_random chrUn_random chrX_random chrY_random >> 197195432 181748087 159599783 155630120 152537259 149517037 ... 362490 849593 449403 5900358 1785075 58682461 >> >> ### Simulate a tag file as a sample of the above >> >> simTags = sample(refseqGR, 1000) >> >> >> #### It would now be possible to get the overlap between the two by doing: >> >> olaps = findOverlaps(refseqGR, simTags) >> >> ### But how do I divide the refseqGR into bins of equal size around the start coordinate? >> >> >> sessionInfo() >> >> >> R version 2.12.0 (2010-10-15) >> Platform: i386-apple-darwin9.8.0/i386 (32-bit) >> >> locale: >> [1] es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.0 BSgenome_1.18.3 DBI_0.2-5 RCurl_1.4-3 RSQLite_0.9-3 rtracklayer_1.10.6 >> [9] tools_2.12.0 XML_3.2-0 >> >> ---------- >> Eva Benito Garagorri >> PhD program in Neurosciences >> Institute for Neurosciences in Alicante >> UMH-CSIC >> San Juan de Alicante >> 03550 >> Spain >> ebenito@umh.es >> (34) 965 91 92 33 >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 ---------- Eva Benito Garagorri PhD program in Neurosciences Institute for Neurosciences in Alicante UMH-CSIC San Juan de Alicante 03550 Spain ebenito@umh.es (34) 965 91 92 33 [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Eva Benito Garagorri ▴ 70

Login before adding your answer.