ChIPpeakAnno Strandedness and distance calculation
4
0
Entering edit mode
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 3 months ago
United States
Hi Dario, You can create the annotation with strand = c(?+?). For example, AnnotationRangedData = RangedData(IRanges(start = c(967659, 2010898, 2496700, 3075866, + 3123260), end = c(967869, 2011108, 2496920, 3076166, 3123470), names = c("t1", + "t2", "t3", "t4", "t5")), space = c("1", "2", "3", "1", "2"), strand =c("+")) Please take a look at the examples given on the paper just published on BMC Bioinformatics http://www.biomedcentral.com/1471-2105/11/237. In case you could not open the link, I also attached the pdf file. Regarding your other question about distance calculation, I suggest to create your AnnotationRangedData and PeakRangedData with start=midpoint to get the distance between midpoints. The distance is calculated differently for features in plus strand and minus strand. For example, to calculate the distance between peak and TSS, the distance is calculated as the distance between the start of the binding site and the TSS, which is the gene start for genes located on the forward strand and the gene end for genes located on the reverse strand. Therefore, adding another parameter would mean to overwrite the way how the distance is calculated based on strandedness. After you tried the above suggested way and still prefer having a new parameter, I will be happy to add it to the next release. Best regards, Julie ******************************************* Lihua Julie Zhu, Ph.D Research Associate Professor Program in Gene Function and Expression Program in Molecular Medicine University of Massachusetts Medical School 364 Plantation Street, Room 613 Worcester, MA 01605 508-856-5256 http://www.umassmed.edu/pgfe/faculty/zhu.cfm On 5/13/10 9:00 PM, "Dario Strbenac" <d.strbenac at="" garvan.org.au=""> wrote: > Hello again, > > Just one more question. When we are looking at DNA methtylation, we don't have > the strand of the peak (because the reverse complement of CG is CG). It seems > that it might not be possible to do this with ChipPeakAnno ? > > e.g. > >> > head(peaksT) > chr start end > 1 chr13 83351701 83352000 > 2 chr13 83351401 83351700 > 3 chr20 25011901 25012200 > 4 chr13 83352001 83352300 > 5 chr8 143402101 143402400 > 6 chr2 238246801 238247100 > >> > head(featTable) > name chr strand start end > 1 7896759 chr1 + 781253 783614 > 2 7896761 chr1 + 850983 869824 > 3 7896779 chr1 + 885829 890958 > 4 7896798 chr1 + 891739 900345 > 5 7896817 chr1 + 938709 939782 > 6 7896822 chr1 + 945365 981355 > > Also, sometimes our feature table is a table of CpG islands, which don't have > a strand associated with them. > > e.g. > >> > head(featTable2) > chr start end CpG Island Name > 1 chr1 18598 19673 CpG:_116 > 2 chr1 124987 125426 CpG:_30 > 3 chr1 317653 318092 CpG:_29 > 4 chr1 427014 428027 CpG:_84 > 5 chr1 439136 440407 CpG:_99 > 6 chr1 523082 523977 CpG:_94 > > Is it possible to do this annotation with ChipPeakAnno ? Currently, the > annotatePeakInBatch function gives me an error when I don't give it strand > information when I create my RangedData object. > > Thanks, > Dario. > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia > > On 5/13/10 8:10 PM, "Dario Strbenac" <d.strbenac at="" garvan.org.au=""> wrote: > Hello, > > Firstly, thank you for making this package. It seems so useful ! We were > thinking of writing something like this ourselves, until I saw your package, > because we do a lot of ChIP-Seq here. > > I just have a small feature request. In your distance calculation, you do > start of peak - start of feature. Would it be possible to allow the user to > choose if they want the distance calculation to use the start or the middle of > the feature (and also for the peak) ? This is because we do a lot of > methylation studies, and for CpG island features, we like to use the midpoint > as the position of our feature. It would also be nice to be able to use the > midpoint of the peak as the peak's position, since this is usually where the > signal is strongest. > > Thanks, > Dario. > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia
0
Entering edit mode
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 3 months ago
United States
Hi Dario, Thanks for the clarification! So you still want the nearest feature and insideFeature column to be calculated using start (+ strand) and end (- strand). You only need an extra column with distance from midpoint to midpoint to be included? Best regards, Julie On 5/14/10 12:20 AM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: Oh, actually I thought of a problem with making start = midpoint. If you modify the start position to be the average, then you might get the wrong values in insideFeature column. e.g. Peak Real coordinates -------------------- chr | start | end | ------------------- chr1 | 5000 | 5500 | Peak modified coordinates (start is midpoint) -------------------- chr | start | end | ------------------- chr1 | 5250 | 5500 | A gene ----------------------------- chr | start | end | strand | ----------------------------- chr1 | 1000 | 5100 | - | So, instead of being 'overlapStart', it is called as 'upstream'. It would be good if the package worked out an extra column for the tables, called 'position' and used the position for the distances, and the real start and end positions for the overlapping. e.g. Something like this if("strand" %in% colnames(table)) { table$position = ifelse(table$strand == '+', table$start, table$end) } else { table$position = round((table$start + table$end) / 2) } - Dario. ---- Original message ---- >Date: Thu, 13 May 2010 22:40:46 -0400 >From: "Zhu, Julie" <julie.zhu@umassmed.edu> >Subject: Re: ChIPpeakAnno Strandedness and distance calculation >To: "D.Strbenac@garvan.org.au" <d.strbenac@garvan.org.au> >Cc: "bioconductor" <bioconductor@stat.math.ethz.ch> > >Hi Dario, > >You can create the annotation with strand = c("+"). For example, > >AnnotationRangedData = RangedData(IRanges(start = c(967659, 2010898, >2496700, 3075866, >+ 3123260), end = c(967869, 2011108, 2496920, 3076166, 3123470), names = >c("t1", >+ "t2", "t3", "t4", "t5")), space = c("1", "2", "3", "1", "2"), strand >=c("+")) > >Please take a look at the examples given on the paper just published on BMC >Bioinformatics >http://www.biomedcentral.com/1471-2105/11/237. In case you could not open >the link, I also attached the pdf file. > >Regarding your other question about distance calculation, I suggest to >create your AnnotationRangedData and PeakRangedData with start=midpoint to >get the distance between midpoints. The distance is calculated differently >for features in plus strand and minus strand. For example, to calculate the >distance between peak and TSS, the distance is calculated as the distance >between the start of the binding site and the TSS, which is the gene start >for genes located on the forward strand and the gene end for genes located >on the reverse strand. Therefore, adding another parameter would mean to >overwrite the way how the distance is calculated based on strandedness. >After you tried the above suggested way and still prefer having a new >parameter, I will be happy to add it to the next release. > >Best regards, > >Julie > > >******************************************* >Lihua Julie Zhu, Ph.D >Research Associate Professor >Program in Gene Function and Expression >Program in Molecular Medicine >University of Massachusetts Medical School >364 Plantation Street, Room 613 >Worcester, MA 01605 >508-856-5256 >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: > >> Hello again, >> >> Just one more question. When we are looking at DNA methtylation, we don't have >> the strand of the peak (because the reverse complement of CG is CG). It seems >> that it might not be possible to do this with ChipPeakAnno ? >> >> e.g. >> >>> > head(peaksT) >> chr start end >> 1 chr13 83351701 83352000 >> 2 chr13 83351401 83351700 >> 3 chr20 25011901 25012200 >> 4 chr13 83352001 83352300 >> 5 chr8 143402101 143402400 >> 6 chr2 238246801 238247100 >> >>> > head(featTable) >> name chr strand start end >> 1 7896759 chr1 + 781253 783614 >> 2 7896761 chr1 + 850983 869824 >> 3 7896779 chr1 + 885829 890958 >> 4 7896798 chr1 + 891739 900345 >> 5 7896817 chr1 + 938709 939782 >> 6 7896822 chr1 + 945365 981355 >> >> Also, sometimes our feature table is a table of CpG islands, which don't have >> a strand associated with them. >> >> e.g. >> >>> > head(featTable2) >> chr start end CpG Island Name >> 1 chr1 18598 19673 CpG:_116 >> 2 chr1 124987 125426 CpG:_30 >> 3 chr1 317653 318092 CpG:_29 >> 4 chr1 427014 428027 CpG:_84 >> 5 chr1 439136 440407 CpG:_99 >> 6 chr1 523082 523977 CpG:_94 >> >> Is it possible to do this annotation with ChipPeakAnno ? Currently, the >> annotatePeakInBatch function gives me an error when I don't give it strand >> information when I create my RangedData object. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia >> >> > >On 5/13/10 8:10 PM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: > >> Hello, >> >> Firstly, thank you for making this package. It seems so useful ! We were >> thinking of writing something like this ourselves, until I saw your package, >> because we do a lot of ChIP-Seq here. >> >> I just have a small feature request. In your distance calculation, you do >> start of peak - start of feature. Would it be possible to allow the user to >> choose if they want the distance calculation to use the start or the middle of >> the feature (and also for the peak) ? This is because we do a lot of >> methylation studies, and for CpG island features, we like to use the midpoint >> as the position of our feature. It would also be nice to be able to use the >> midpoint of the peak as the peak's position, since this is usually where the >> signal is strongest. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia > > -------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia [[alternative HTML version deleted]] ADD COMMENT 0 Entering edit mode Not quite. I mean that midpoint to midpoint distance would be good when the feature didn't have a strand. If the feature does have strand information, then it'd be nice to have midpoint of peak to start (start for + strand, end for - strand) of feature. Thanks, Dario. ---- Original message ---- >Date: Fri, 14 May 2010 07:43:38 -0400 >From: "Zhu, Julie" <julie.zhu at="" umassmed.edu=""> >Subject: Re: ChIPpeakAnno Strandedness and distance calculation >To: "D.Strbenac at garvan.org.au" <d.strbenac at="" garvan.org.au=""> >Cc: "bioconductor" <bioconductor at="" stat.math.ethz.ch=""> > > Hi Dario, > > Thanks for the clarification! So you still want the > nearest feature and insideFeature column to be > calculated using start (+ strand) and end (- > strand). You only need an extra column with distance > from midpoint to midpoint to be included? > > Best regards, > > Julie > > On 5/14/10 12:20 AM, "Dario Strbenac" > <d.strbenac at="" garvan.org.au=""> wrote: > > Oh, actually I thought of a problem with making > start = midpoint. If you modify the start position > to be the average, then you might get the wrong > values in insideFeature column. > > e.g. > > Peak > Real coordinates > -------------------- > chr | start | end | > ------------------- > chr1 | 5000 | 5500 | > > Peak > modified coordinates (start is midpoint) > -------------------- > chr | start | end | > ------------------- > chr1 | 5250 | 5500 | > > A gene > ----------------------------- > chr | start | end | strand | > ----------------------------- > chr1 | 1000 | 5100 | - | > > So, instead of being 'overlapStart', it is called > as 'upstream'. > > It would be good if the package worked out an > extra column for the tables, called 'position' and > used the position for the distances, and the real > start and end positions for the overlapping. > > e.g. Something like this > > if("strand" %in% colnames(table)) > { > table$position = ifelse(table$strand == '+', > table$start, table$end) > } else { > table$position = round((table$start + table$end) > / 2) > } > > - Dario. > ---- Original message ---- > >Date: Thu, 13 May 2010 22:40:46 -0400 > >From: "Zhu, Julie" <julie.zhu at="" umassmed.edu=""> > >Subject: Re: ChIPpeakAnno Strandedness and > distance calculation > >To: "D.Strbenac at garvan.org.au" > <d.strbenac at="" garvan.org.au=""> > >Cc: "bioconductor" > <bioconductor at="" stat.math.ethz.ch=""> > > > >Hi Dario, > > > >You can create the annotation with strand = > c("+"). For example, > > > >AnnotationRangedData = RangedData(IRanges(start = > c(967659, 2010898, > >2496700, 3075866, > >+ 3123260), end = c(967869, 2011108, 2496920, > 3076166, 3123470), names = > >c("t1", > >+ "t2", "t3", "t4", "t5")), space = c("1", "2", > "3", "1", "2"), strand > >=c("+")) > > > >Please take a look at the examples given on the > paper just published on BMC > >Bioinformatics > >http://www.biomedcentral.com/1471-2105/11/237. In > case you could not open > >the link, I also attached the pdf file. > > > >Regarding your other question about distance > calculation, I suggest to > >create your AnnotationRangedData and > PeakRangedData with start=midpoint to > >get the distance between midpoints. The distance > is calculated differently > >for features in plus strand and minus strand. For > example, to calculate the > >distance between peak and TSS, the distance is > calculated as the distance > >between the start of the binding site and the > TSS, which is the gene start > >for genes located on the forward strand and the > gene end for genes located > >on the reverse strand. Therefore, adding another > parameter would mean to > >overwrite the way how the distance is calculated > based on strandedness. > >After you tried the above suggested way and still > prefer having a new > >parameter, I will be happy to add it to the next > release. > > > >Best regards, > > > >Julie > > > > > >******************************************* > >Lihua Julie Zhu, Ph.D > >Research Associate Professor > >Program in Gene Function and Expression > >Program in Molecular Medicine > >University of Massachusetts Medical School > >364 Plantation Street, Room 613 > >Worcester, MA 01605 > >508-856-5256 > >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > > > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" > <d.strbenac at="" garvan.org.au=""> wrote: > > > >> Hello again, > >> > >> Just one more question. When we are looking at > DNA methtylation, we don't have > >> the strand of the peak (because the reverse > complement of CG is CG). It seems > >> that it might not be possible to do this with > ChipPeakAnno ? > >> > >> e.g. > >> > >>> > head(peaksT) > >> chr start end > >> 1 chr13 83351701 83352000 > >> 2 chr13 83351401 83351700 > >> 3 chr20 25011901 25012200 > >> 4 chr13 83352001 83352300 > >> 5 chr8 143402101 143402400 > >> 6 chr2 238246801 238247100 > >> > >>> > head(featTable) > >> name chr strand start end > >> 1 7896759 chr1 + 781253 783614 > >> 2 7896761 chr1 + 850983 869824 > >> 3 7896779 chr1 + 885829 890958 > >> 4 7896798 chr1 + 891739 900345 > >> 5 7896817 chr1 + 938709 939782 > >> 6 7896822 chr1 + 945365 981355 > >> > >> Also, sometimes our feature table is a table of > CpG islands, which don't have > >> a strand associated with them. > >> > >> e.g. > >> > >>> > head(featTable2) > >> chr start end CpG Island Name > >> 1 chr1 18598 19673 CpG:_116 > >> 2 chr1 124987 125426 CpG:_30 > >> 3 chr1 317653 318092 CpG:_29 > >> 4 chr1 427014 428027 CpG:_84 > >> 5 chr1 439136 440407 CpG:_99 > >> 6 chr1 523082 523977 CpG:_94 > >> > >> Is it possible to do this annotation with > ChipPeakAnno ? Currently, the > >> annotatePeakInBatch function gives me an error > when I don't give it strand > >> information when I create my RangedData object. > >> > >> Thanks, > >> Dario. > >> > >> -------------------------------------- > >> Dario Strbenac > >> Research Assistant > >> Cancer Epigenetics > >> Garvan Institute of Medical Research > >> Darlinghurst NSW 2010 > >> Australia > >> > >> > > > >On 5/13/10 8:10 PM, "Dario Strbenac" > <d.strbenac at="" garvan.org.au=""> wrote: > > > >> Hello, > >> > >> Firstly, thank you for making this package. It > seems so useful ! We were > >> thinking of writing something like this > ourselves, until I saw your package, > >> because we do a lot of ChIP-Seq here. > >> > >> I just have a small feature request. In your > distance calculation, you do > >> start of peak - start of feature. Would it be > possible to allow the user to > >> choose if they want the distance calculation to > use the start or the middle of > >> the feature (and also for the peak) ? This is > because we do a lot of > >> methylation studies, and for CpG island > features, we like to use the midpoint > >> as the position of our feature. It would also > be nice to be able to use the > >> midpoint of the peak as the peak's position, > since this is usually where the > >> signal is strongest. > >> > >> Thanks, > >> Dario. > >> > >> -------------------------------------- > >> Dario Strbenac > >> Research Assistant > >> Cancer Epigenetics > >> Garvan Institute of Medical Research > >> Darlinghurst NSW 2010 > >> Australia > > > > > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia
0
Entering edit mode
Hi Dario, Since this is a feature request, I will add it to the dev branch for you to install. Best, Julie On 5/16/10 7:04 PM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: Not quite. I mean that midpoint to midpoint distance would be good when the feature didn't have a strand. If the feature does have strand information, then it'd be nice to have midpoint of peak to start (start for + strand, end for - strand) of feature. Thanks, Dario. ---- Original message ---- >Date: Fri, 14 May 2010 07:43:38 -0400 >From: "Zhu, Julie" <julie.zhu@umassmed.edu> >Subject: Re: ChIPpeakAnno Strandedness and distance calculation >To: "D.Strbenac@garvan.org.au" <d.strbenac@garvan.org.au> >Cc: "bioconductor" <bioconductor@stat.math.ethz.ch> > > Hi Dario, > > Thanks for the clarification! So you still want the > nearest feature and insideFeature column to be > calculated using start (+ strand) and end (- > strand). You only need an extra column with distance > from midpoint to midpoint to be included? > > Best regards, > > Julie > > On 5/14/10 12:20 AM, "Dario Strbenac" > <d.strbenac@garvan.org.au> wrote: > > Oh, actually I thought of a problem with making > start = midpoint. If you modify the start position > to be the average, then you might get the wrong > values in insideFeature column. > > e.g. > > Peak > Real coordinates > -------------------- > chr | start | end | > ------------------- > chr1 | 5000 | 5500 | > > Peak > modified coordinates (start is midpoint) > -------------------- > chr | start | end | > ------------------- > chr1 | 5250 | 5500 | > > A gene > ----------------------------- > chr | start | end | strand | > ----------------------------- > chr1 | 1000 | 5100 | - | > > So, instead of being 'overlapStart', it is called > as 'upstream'. > > It would be good if the package worked out an > extra column for the tables, called 'position' and > used the position for the distances, and the real > start and end positions for the overlapping. > > e.g. Something like this > > if("strand" %in% colnames(table)) > { > table$position = ifelse(table$strand == '+', > table$start, table$end) > } else { > table$position = round((table$start + table$end) > / 2) > } > > - Dario. > ---- Original message ---- > >Date: Thu, 13 May 2010 22:40:46 -0400 > >From: "Zhu, Julie" <julie.zhu@umassmed.edu> > >Subject: Re: ChIPpeakAnno Strandedness and > distance calculation > >To: "D.Strbenac@garvan.org.au" > <d.strbenac@garvan.org.au> > >Cc: "bioconductor" > <bioconductor@stat.math.ethz.ch> > > > >Hi Dario, > > > >You can create the annotation with strand = > c("+"). For example, > > > >AnnotationRangedData = RangedData(IRanges(start = > c(967659, 2010898, > >2496700, 3075866, > >+ 3123260), end = c(967869, 2011108, 2496920, > 3076166, 3123470), names = > >c("t1", > >+ "t2", "t3", "t4", "t5")), space = c("1", "2", > "3", "1", "2"), strand > >=c("+")) > > > >Please take a look at the examples given on the > paper just published on BMC > >Bioinformatics > >http://www.biomedcentral.com/1471-2105/11/237. In > case you could not open > >the link, I also attached the pdf file. > > > >Regarding your other question about distance > calculation, I suggest to > >create your AnnotationRangedData and > PeakRangedData with start=midpoint to > >get the distance between midpoints. The distance > is calculated differently > >for features in plus strand and minus strand. For > example, to calculate the > >distance between peak and TSS, the distance is > calculated as the distance > >between the start of the binding site and the > TSS, which is the gene start > >for genes located on the forward strand and the > gene end for genes located > >on the reverse strand. Therefore, adding another > parameter would mean to > >overwrite the way how the distance is calculated > based on strandedness. > >After you tried the above suggested way and still > prefer having a new > >parameter, I will be happy to add it to the next > release. > > > >Best regards, > > > >Julie > > > > > >******************************************* > >Lihua Julie Zhu, Ph.D > >Research Associate Professor > >Program in Gene Function and Expression > >Program in Molecular Medicine > >University of Massachusetts Medical School > >364 Plantation Street, Room 613 > >Worcester, MA 01605 > >508-856-5256 > >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > > > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" > <d.strbenac@garvan.org.au> wrote: > > > >> Hello again, > >> > >> Just one more question. When we are looking at > DNA methtylation, we don't have > >> the strand of the peak (because the reverse > complement of CG is CG). It seems > >> that it might not be possible to do this with > ChipPeakAnno ? > >> > >> e.g. > >> > >>> > head(peaksT) > >> chr start end > >> 1 chr13 83351701 83352000 > >> 2 chr13 83351401 83351700 > >> 3 chr20 25011901 25012200 > >> 4 chr13 83352001 83352300 > >> 5 chr8 143402101 143402400 > >> 6 chr2 238246801 238247100 > >> > >>> > head(featTable) > >> name chr strand start end > >> 1 7896759 chr1 + 781253 783614 > >> 2 7896761 chr1 + 850983 869824 > >> 3 7896779 chr1 + 885829 890958 > >> 4 7896798 chr1 + 891739 900345 > >> 5 7896817 chr1 + 938709 939782 > >> 6 7896822 chr1 + 945365 981355 > >> > >> Also, sometimes our feature table is a table of > CpG islands, which don't have > >> a strand associated with them. > >> > >> e.g. > >> > >>> > head(featTable2) > >> chr start end CpG Island Name > >> 1 chr1 18598 19673 CpG:_116 > >> 2 chr1 124987 125426 CpG:_30 > >> 3 chr1 317653 318092 CpG:_29 > >> 4 chr1 427014 428027 CpG:_84 > >> 5 chr1 439136 440407 CpG:_99 > >> 6 chr1 523082 523977 CpG:_94 > >> > >> Is it possible to do this annotation with > ChipPeakAnno ? Currently, the > >> annotatePeakInBatch function gives me an error > when I don't give it strand > >> information when I create my RangedData object. > >> > >> Thanks, > >> Dario. > >> > >> -------------------------------------- > >> Dario Strbenac > >> Research Assistant > >> Cancer Epigenetics > >> Garvan Institute of Medical Research > >> Darlinghurst NSW 2010 > >> Australia > >> > >> > > > >On 5/13/10 8:10 PM, "Dario Strbenac" > <d.strbenac@garvan.org.au> wrote: > > > >> Hello, > >> > >> Firstly, thank you for making this package. It > seems so useful ! We were > >> thinking of writing something like this > ourselves, until I saw your package, > >> because we do a lot of ChIP-Seq here. > >> > >> I just have a small feature request. In your > distance calculation, you do > >> start of peak - start of feature. Would it be > possible to allow the user to > >> choose if they want the distance calculation to > use the start or the middle of > >> the feature (and also for the peak) ? This is > because we do a lot of > >> methylation studies, and for CpG island > features, we like to use the midpoint > >> as the position of our feature. It would also > be nice to be able to use the > >> midpoint of the peak as the peak's position, > since this is usually where the > >> signal is strongest. > >> > >> Thanks, > >> Dario. > >> > >> -------------------------------------- > >> Dario Strbenac > >> Research Assistant > >> Cancer Epigenetics > >> Garvan Institute of Medical Research > >> Darlinghurst NSW 2010 > >> Australia > > > > > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia [[alternative HTML version deleted]] ADD REPLY 0 Entering edit mode Hi Dario, I have added the new feature you requested into dev version 1.5.2 and fixed the strandness issue (strand now is optional) in the dev version. Here is the code snippet using your example to test the distance calculations for a few common scenarios. Please let me know if you have problems. Thank you very much for the input! peaksT <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(1000, 2000, 3000), end = c(2000, 3000, 4000)) featuresT <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(1500, 2500, 3500), end = c(2500, 3500, 4500), strand = c("-","+","+")) myPeakList <- RangedData(space = peaksT$chr, ranges = IRanges(start = peaksT$start, end = peaksT$end)) TSS.ordered <- RangedData(space = featuresT$chr, strand = featuresT$strand, ranges = IRanges(start = featuresT$start, end = featuresT$end, names=c("t1","t2","t3"))) library(ChIPpeakAnno) #To be compatible with old versions, by default distance is calculated as peak start to TSS (start for + strand and end for - strand) annotatePeakInBatch(myPeakList, AnnotationData=TSS.ordered) #By specifying PeakLocForDistance to "m" or "middle", the distance now is calculated as middle of the peak to TSS annotatePeakInBatch(myPeakList, AnnotationData=TSS.ordered, PeakLocForDistance="m") #By specifying PeakLocForDistance="m" and FeatureLocForDistance="m", the distance is calculated as middle of the peak to middle of the feature. annotatePeakInBatch(myPeakList, AnnotationData=TSS.ordered, PeakLocForDistance="m",FeatureLocForDistance="m") Best regards, Julie ******************************************* Lihua Julie Zhu, Ph.D Research Associate Professor Program in Gene Function and Expression Program in Molecular Medicine University of Massachusetts Medical School 364 Plantation Street, Room 613 Worcester, MA 01605 508-856-5256 http://www.umassmed.edu/pgfe/faculty/zhu.cfm On 5/16/10 7:04 PM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: Not quite. I mean that midpoint to midpoint distance would be good when the feature didn't have a strand. If the feature does have strand information, then it'd be nice to have midpoint of peak to start (start for + strand, end for - strand) of feature. Thanks, Dario. ---- Original message ---- >Date: Fri, 14 May 2010 07:43:38 -0400 >From: "Zhu, Julie" <julie.zhu@umassmed.edu> >Subject: Re: ChIPpeakAnno Strandedness and distance calculation >To: "D.Strbenac@garvan.org.au" <d.strbenac@garvan.org.au> >Cc: "bioconductor" <bioconductor@stat.math.ethz.ch> > > Hi Dario, > > Thanks for the clarification! So you still want the > nearest feature and insideFeature column to be > calculated using start (+ strand) and end (- > strand). You only need an extra column with distance > from midpoint to midpoint to be included? > > Best regards, > > Julie > > On 5/14/10 12:20 AM, "Dario Strbenac" > <d.strbenac@garvan.org.au> wrote: > > Oh, actually I thought of a problem with making > start = midpoint. If you modify the start position > to be the average, then you might get the wrong > values in insideFeature column. > > e.g. > > Peak > Real coordinates > -------------------- > chr | start | end | > ------------------- > chr1 | 5000 | 5500 | > > Peak > modified coordinates (start is midpoint) > -------------------- > chr | start | end | > ------------------- > chr1 | 5250 | 5500 | > > A gene > ----------------------------- > chr | start | end | strand | > ----------------------------- > chr1 | 1000 | 5100 | - | > > So, instead of being 'overlapStart', it is called > as 'upstream'. > > It would be good if the package worked out an > extra column for the tables, called 'position' and > used the position for the distances, and the real > start and end positions for the overlapping. > > e.g. Something like this > > if("strand" %in% colnames(table)) > { > table$position = ifelse(table$strand == '+', > table$start, table$end) > } else { > table$position = round((table$start + table$end) > / 2) > } > > - Dario. > ---- Original message ---- > >Date: Thu, 13 May 2010 22:40:46 -0400 > >From: "Zhu, Julie" <julie.zhu@umassmed.edu> > >Subject: Re: ChIPpeakAnno Strandedness and > distance calculation > >To: "D.Strbenac@garvan.org.au" > <d.strbenac@garvan.org.au> > >Cc: "bioconductor" > <bioconductor@stat.math.ethz.ch> > > > >Hi Dario, > > > >You can create the annotation with strand = > c("+"). For example, > > > >AnnotationRangedData = RangedData(IRanges(start = > c(967659, 2010898, > >2496700, 3075866, > >+ 3123260), end = c(967869, 2011108, 2496920, > 3076166, 3123470), names = > >c("t1", > >+ "t2", "t3", "t4", "t5")), space = c("1", "2", > "3", "1", "2"), strand > >=c("+")) > > > >Please take a look at the examples given on the > paper just published on BMC > >Bioinformatics > >http://www.biomedcentral.com/1471-2105/11/237. In > case you could not open > >the link, I also attached the pdf file. > > > >Regarding your other question about distance > calculation, I suggest to > >create your AnnotationRangedData and > PeakRangedData with start=midpoint to > >get the distance between midpoints. The distance > is calculated differently > >for features in plus strand and minus strand. For > example, to calculate the > >distance between peak and TSS, the distance is > calculated as the distance > >between the start of the binding site and the > TSS, which is the gene start > >for genes located on the forward strand and the > gene end for genes located > >on the reverse strand. Therefore, adding another > parameter would mean to > >overwrite the way how the distance is calculated > based on strandedness. > >After you tried the above suggested way and still > prefer having a new > >parameter, I will be happy to add it to the next > release. > > > >Best regards, > > > >Julie > > > > > >******************************************* > >Lihua Julie Zhu, Ph.D > >Research Associate Professor > >Program in Gene Function and Expression > >Program in Molecular Medicine > >University of Massachusetts Medical School > >364 Plantation Street, Room 613 > >Worcester, MA 01605 > >508-856-5256 > >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > > > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" > <d.strbenac@garvan.org.au> wrote: > > > >> Hello again, > >> > >> Just one more question. When we are looking at > DNA methtylation, we don't have > >> the strand of the peak (because the reverse > complement of CG is CG). It seems > >> that it might not be possible to do this with > ChipPeakAnno ? > >> > >> e.g. > >> > >>> > head(peaksT) > >> chr start end > >> 1 chr13 83351701 83352000 > >> 2 chr13 83351401 83351700 > >> 3 chr20 25011901 25012200 > >> 4 chr13 83352001 83352300 > >> 5 chr8 143402101 143402400 > >> 6 chr2 238246801 238247100 > >> > >>> > head(featTable) > >> name chr strand start end > >> 1 7896759 chr1 + 781253 783614 > >> 2 7896761 chr1 + 850983 869824 > >> 3 7896779 chr1 + 885829 890958 > >> 4 7896798 chr1 + 891739 900345 > >> 5 7896817 chr1 + 938709 939782 > >> 6 7896822 chr1 + 945365 981355 > >> > >> Also, sometimes our feature table is a table of > CpG islands, which don't have > >> a strand associated with them. > >> > >> e.g. > >> > >>> > head(featTable2) > >> chr start end CpG Island Name > >> 1 chr1 18598 19673 CpG:_116 > >> 2 chr1 124987 125426 CpG:_30 > >> 3 chr1 317653 318092 CpG:_29 > >> 4 chr1 427014 428027 CpG:_84 > >> 5 chr1 439136 440407 CpG:_99 > >> 6 chr1 523082 523977 CpG:_94 > >> > >> Is it possible to do this annotation with > ChipPeakAnno ? Currently, the > >> annotatePeakInBatch function gives me an error > when I don't give it strand > >> information when I create my RangedData object. > >> > >> Thanks, > >> Dario. > >> > >> -------------------------------------- > >> Dario Strbenac > >> Research Assistant > >> Cancer Epigenetics > >> Garvan Institute of Medical Research > >> Darlinghurst NSW 2010 > >> Australia > >> > >> > > > >On 5/13/10 8:10 PM, "Dario Strbenac" > <d.strbenac@garvan.org.au> wrote: > > > >> Hello, > >> > >> Firstly, thank you for making this package. It > seems so useful ! We were > >> thinking of writing something like this > ourselves, until I saw your package, > >> because we do a lot of ChIP-Seq here. > >> > >> I just have a small feature request. In your > distance calculation, you do > >> start of peak - start of feature. Would it be > possible to allow the user to > >> choose if they want the distance calculation to > use the start or the middle of > >> the feature (and also for the peak) ? This is > because we do a lot of > >> methylation studies, and for CpG island > features, we like to use the midpoint > >> as the position of our feature. It would also > be nice to be able to use the > >> midpoint of the peak as the peak's position, > since this is usually where the > >> signal is strongest. > >> > >> Thanks, > >> Dario. > >> > >> -------------------------------------- > >> Dario Strbenac > >> Research Assistant > >> Cancer Epigenetics > >> Garvan Institute of Medical Research > >> Darlinghurst NSW 2010 > >> Australia > > > > > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia [[alternative HTML version deleted]] ADD REPLY 0 Entering edit mode Thanks, this looks great ! ---- Original message ---- >Date: Tue, 18 May 2010 17:19:17 -0400 >From: "Zhu, Julie" <julie.zhu at="" umassmed.edu=""> >Subject: Re: ChIPpeakAnno: New Features: distance calculation based on user selection of location of the peak and location of feature >To: "D.Strbenac at garvan.org.au" <d.strbenac at="" garvan.org.au=""> >Cc: "bioconductor" <bioconductor at="" stat.math.ethz.ch=""> > > Hi Dario, > > I have added the new feature you requested into dev > version 1.5.2 and fixed the strandness issue (strand > now is optional) in the dev version. > > Here is the code snippet using your example to test > the distance calculations for a few common > scenarios. Please let me know if you have problems. > Thank you very much for the input! > > peaksT <- data.frame(chr = c("chr1", "chr1", > "chr1"), start = c(1000, 2000, 3000), end = c(2000, > 3000, 4000)) > featuresT <- data.frame(chr = c("chr1", "chr1", > "chr1"), start = c(1500, 2500, 3500), end = c(2500, > 3500, 4500), strand = c("-","+","+")) > myPeakList <- RangedData(space = peaksT$chr, ranges > = IRanges(start = peaksT$start, end = peaksT$end)) > TSS.ordered <- RangedData(space = featuresT$chr, > strand = featuresT$strand, ranges = IRanges(start = > featuresT$start, end = featuresT$end, > names=c("t1","t2","t3"))) > > library(ChIPpeakAnno) > #To be compatible with old versions, by default > distance is calculated as peak start to TSS (start > for + strand and end for - strand) > annotatePeakInBatch(myPeakList, > AnnotationData=TSS.ordered) > > #By specifying PeakLocForDistance to "m" or > "middle", the distance now is calculated as middle > of the peak to TSS > annotatePeakInBatch(myPeakList, > AnnotationData=TSS.ordered, PeakLocForDistance="m") > > #By specifying PeakLocForDistance="m" and > FeatureLocForDistance="m", the distance is > calculated as middle of the peak to middle of the > feature. > annotatePeakInBatch(myPeakList, > AnnotationData=TSS.ordered, > PeakLocForDistance="m",FeatureLocForDistance="m") > > Best regards, > > Julie > > ******************************************* > Lihua Julie Zhu, Ph.D > Research Associate Professor > Program in Gene Function and Expression > Program in Molecular Medicine > University of Massachusetts Medical School > 364 Plantation Street, Room 613 > Worcester, MA 01605 > 508-856-5256 > http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > On 5/16/10 7:04 PM, "Dario Strbenac" > <d.strbenac at="" garvan.org.au=""> wrote: > > Not quite. I mean that midpoint to midpoint > distance would be good when the feature didn't > have a strand. If the feature does have strand > information, then it'd be nice to have midpoint of > peak to start (start for + strand, end for - > strand) of feature. > > Thanks, > Dario. > > ---- Original message ---- > >Date: Fri, 14 May 2010 07:43:38 -0400 > >From: "Zhu, Julie" <julie.zhu at="" umassmed.edu=""> > >Subject: Re: ChIPpeakAnno Strandedness and > distance calculation > >To: "D.Strbenac at garvan.org.au" > <d.strbenac at="" garvan.org.au=""> > >Cc: "bioconductor" > <bioconductor at="" stat.math.ethz.ch=""> > > > > Hi Dario, > > > > Thanks for the clarification! So you still > want the > > nearest feature and insideFeature column to > be > > calculated using start (+ strand) and end (- > > strand). You only need an extra column with > distance > > from midpoint to midpoint to be included? > > > > Best regards, > > > > Julie > > > > On 5/14/10 12:20 AM, "Dario Strbenac" > > <d.strbenac at="" garvan.org.au=""> wrote: > > > > Oh, actually I thought of a problem with > making > > start = midpoint. If you modify the start > position > > to be the average, then you might get the > wrong > > values in insideFeature column. > > > > e.g. > > > > Peak > > Real coordinates > > -------------------- > > chr | start | end | > > ------------------- > > chr1 | 5000 | 5500 | > > > > Peak > > modified coordinates (start is midpoint) > > -------------------- > > chr | start | end | > > ------------------- > > chr1 | 5250 | 5500 | > > > > A gene > > ----------------------------- > > chr | start | end | strand | > > ----------------------------- > > chr1 | 1000 | 5100 | - | > > > > So, instead of being 'overlapStart', it is > called > > as 'upstream'. > > > > It would be good if the package worked out > an > > extra column for the tables, called > 'position' and > > used the position for the distances, and the > real > > start and end positions for the overlapping. > > > > e.g. Something like this > > > > if("strand" %in% colnames(table)) > > { > > table$position = ifelse(table$strand == > '+', > > table$start, table$end) > > } else { > > table$position = round((table$start + > table$end) > > / 2) > > } > > > > - Dario. > > ---- Original message ---- > > >Date: Thu, 13 May 2010 22:40:46 -0400 > > >From: "Zhu, Julie" <julie.zhu at="" umassmed.edu=""> > > >Subject: Re: ChIPpeakAnno Strandedness and > > distance calculation > > >To: "D.Strbenac at garvan.org.au" > > <d.strbenac at="" garvan.org.au=""> > > >Cc: "bioconductor" > > <bioconductor at="" stat.math.ethz.ch=""> > > > > > >Hi Dario, > > > > > >You can create the annotation with strand = > > c("+"). For example, > > > > > >AnnotationRangedData = > RangedData(IRanges(start = > > c(967659, 2010898, > > >2496700, 3075866, > > >+ 3123260), end = c(967869, 2011108, > 2496920, > > 3076166, 3123470), names = > > >c("t1", > > >+ "t2", "t3", "t4", "t5")), space = c("1", > "2", > > "3", "1", "2"), strand > > >=c("+")) > > > > > >Please take a look at the examples given on > the > > paper just published on BMC > > >Bioinformatics > > > >http://www.biomedcentral.com/1471-2105/11/237. > In > > case you could not open > > >the link, I also attached the pdf file. > > > > > >Regarding your other question about > distance > > calculation, I suggest to > > >create your AnnotationRangedData and > > PeakRangedData with start=midpoint to > > >get the distance between midpoints. The > distance > > is calculated differently > > >for features in plus strand and minus > strand. For > > example, to calculate the > > >distance between peak and TSS, the distance > is > > calculated as the distance > > >between the start of the binding site and > the > > TSS, which is the gene start > > >for genes located on the forward strand and > the > > gene end for genes located > > >on the reverse strand. Therefore, adding > another > > parameter would mean to > > >overwrite the way how the distance is > calculated > > based on strandedness. > > >After you tried the above suggested way and > still > > prefer having a new > > >parameter, I will be happy to add it to the > next > > release. > > > > > >Best regards, > > > > > >Julie > > > > > > > > >******************************************* > > >Lihua Julie Zhu, Ph.D > > >Research Associate Professor > > >Program in Gene Function and Expression > > >Program in Molecular Medicine > > >University of Massachusetts Medical School > > >364 Plantation Street, Room 613 > > >Worcester, MA 01605 > > >508-856-5256 > > > >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > > > > > > > > > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" > > <d.strbenac at="" garvan.org.au=""> wrote: > > > > > >> Hello again, > > >> > > >> Just one more question. When we are > looking at > > DNA methtylation, we don't have > > >> the strand of the peak (because the > reverse > > complement of CG is CG). It seems > > >> that it might not be possible to do this > with > > ChipPeakAnno ? > > >> > > >> e.g. > > >> > > >>> > head(peaksT) > > >> chr start end > > >> 1 chr13 83351701 83352000 > > >> 2 chr13 83351401 83351700 > > >> 3 chr20 25011901 25012200 > > >> 4 chr13 83352001 83352300 > > >> 5 chr8 143402101 143402400 > > >> 6 chr2 238246801 238247100 > > >> > > >>> > head(featTable) > > >> name chr strand start end > > >> 1 7896759 chr1 + 781253 783614 > > >> 2 7896761 chr1 + 850983 869824 > > >> 3 7896779 chr1 + 885829 890958 > > >> 4 7896798 chr1 + 891739 900345 > > >> 5 7896817 chr1 + 938709 939782 > > >> 6 7896822 chr1 + 945365 981355 > > >> > > >> Also, sometimes our feature table is a > table of > > CpG islands, which don't have > > >> a strand associated with them. > > >> > > >> e.g. > > >> > > >>> > head(featTable2) > > >> chr start end CpG Island Name > > >> 1 chr1 18598 19673 CpG:_116 > > >> 2 chr1 124987 125426 CpG:_30 > > >> 3 chr1 317653 318092 CpG:_29 > > >> 4 chr1 427014 428027 CpG:_84 > > >> 5 chr1 439136 440407 CpG:_99 > > >> 6 chr1 523082 523977 CpG:_94 > > >> > > >> Is it possible to do this annotation with > > ChipPeakAnno ? Currently, the > > >> annotatePeakInBatch function gives me an > error > > when I don't give it strand > > >> information when I create my RangedData > object. > > >> > > >> Thanks, > > >> Dario. > > >> > > >> -------------------------------------- > > >> Dario Strbenac > > >> Research Assistant > > >> Cancer Epigenetics > > >> Garvan Institute of Medical Research > > >> Darlinghurst NSW 2010 > > >> Australia > > >> > > >> > > > > > >On 5/13/10 8:10 PM, "Dario Strbenac" > > <d.strbenac at="" garvan.org.au=""> wrote: > > > > > >> Hello, > > >> > > >> Firstly, thank you for making this > package. It > > seems so useful ! We were > > >> thinking of writing something like this > > ourselves, until I saw your package, > > >> because we do a lot of ChIP-Seq here. > > >> > > >> I just have a small feature request. In > your > > distance calculation, you do > > >> start of peak - start of feature. Would > it be > > possible to allow the user to > > >> choose if they want the distance > calculation to > > use the start or the middle of > > >> the feature (and also for the peak) ? > This is > > because we do a lot of > > >> methylation studies, and for CpG island > > features, we like to use the midpoint > > >> as the position of our feature. It would > also > > be nice to be able to use the > > >> midpoint of the peak as the peak's > position, > > since this is usually where the > > >> signal is strongest. > > >> > > >> Thanks, > > >> Dario. > > >> > > >> -------------------------------------- > > >> Dario Strbenac > > >> Research Assistant > > >> Cancer Epigenetics > > >> Garvan Institute of Medical Research > > >> Darlinghurst NSW 2010 > > >> Australia > > > > > > > > > > -------------------------------------- > > Dario Strbenac > > Research Assistant > > Cancer Epigenetics > > Garvan Institute of Medical Research > > Darlinghurst NSW 2010 > > Australia -------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia ADD REPLY 0 Entering edit mode Julie Zhu ★ 4.3k @julie-zhu-3596 Last seen 3 months ago United States Hi Dario, Thanks for the clarification! Use case 1 is the default behavior in annotatePeakInBatch. Are you using the current version of ChIPpeakAnno? Could you please share your code snippet and session info? Thanks! Best regards, Julie ******************************************* Lihua Julie Zhu, Ph.D Research Associate Professor Program in Gene Function and Expression Program in Molecular Medicine University of Massachusetts Medical School 364 Plantation Street, Room 613 Worcester, MA 01605 508-856-5256 http://www.umassmed.edu/pgfe/faculty/zhu.cfm On 5/13/10 11:26 PM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: Hello, Sorry, I didn't make the use case clear. So, when I have this example : > head(peaksT) chr start end 1 chr1 2000010 2000310 2 chr1 19000000 19000300 3 chr1 30000000 30000300 4 chr2 300 600 5 chr2 5500 5800 6 chr2 100000 100300 > head(featuresT) chr start end strand 1 chr1 1000000 2000000 + 2 chr1 10000000 20000000 - 3 chr1 15000000 22000000 + 4 chr2 1000 5000 + 5 chr2 6000 7000 - 6 chr2 10000 15000 + I'd like to calculate the distance from the midpoint of the table which has no strand, to the start position of the table with the strand. Changing all of featuresT$strand to be '+' and making a strand = '+' column for peaksT will be bad because distances would get calculated wrong (the start for -ve strand is actually featuresT$end). Is it possible to make the package handle the use case when one or both of the tables don't have the strand column ? The use case of only the peakTable having no strand would be relating DNA methylation regions to genes' TSSs. The biological use case of both tables not having a strand column would be relating DNA methylation regions to CpG islands. Use case 2 could easily be done with making midpoint = start, and strand = '+' in both tables, but this won't cover use case 1. Thanks, Dario. ---- Original message ---- >Date: Thu, 13 May 2010 22:40:46 -0400 >From: "Zhu, Julie" <julie.zhu@umassmed.edu> >Subject: Re: ChIPpeakAnno Strandedness and distance calculation >To: "D.Strbenac@garvan.org.au" <d.strbenac@garvan.org.au> >Cc: "bioconductor" <bioconductor@stat.math.ethz.ch> > >Hi Dario, > >You can create the annotation with strand = c("+"). For example, > >AnnotationRangedData = RangedData(IRanges(start = c(967659, 2010898, >2496700, 3075866, >+ 3123260), end = c(967869, 2011108, 2496920, 3076166, 3123470), names = >c("t1", >+ "t2", "t3", "t4", "t5")), space = c("1", "2", "3", "1", "2"), strand >=c("+")) > >Please take a look at the examples given on the paper just published on BMC >Bioinformatics >http://www.biomedcentral.com/1471-2105/11/237. In case you could not open >the link, I also attached the pdf file. > >Regarding your other question about distance calculation, I suggest to >create your AnnotationRangedData and PeakRangedData with start=midpoint to >get the distance between midpoints. The distance is calculated differently >for features in plus strand and minus strand. For example, to calculate the >distance between peak and TSS, the distance is calculated as the distance >between the start of the binding site and the TSS, which is the gene start >for genes located on the forward strand and the gene end for genes located >on the reverse strand. Therefore, adding another parameter would mean to >overwrite the way how the distance is calculated based on strandedness. >After you tried the above suggested way and still prefer having a new >parameter, I will be happy to add it to the next release. > >Best regards, > >Julie > > >******************************************* >Lihua Julie Zhu, Ph.D >Research Associate Professor >Program in Gene Function and Expression >Program in Molecular Medicine >University of Massachusetts Medical School >364 Plantation Street, Room 613 >Worcester, MA 01605 >508-856-5256 >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: > >> Hello again, >> >> Just one more question. When we are looking at DNA methtylation, we don't have >> the strand of the peak (because the reverse complement of CG is CG). It seems >> that it might not be possible to do this with ChipPeakAnno ? >> >> e.g. >> >>> > head(peaksT) >> chr start end >> 1 chr13 83351701 83352000 >> 2 chr13 83351401 83351700 >> 3 chr20 25011901 25012200 >> 4 chr13 83352001 83352300 >> 5 chr8 143402101 143402400 >> 6 chr2 238246801 238247100 >> >>> > head(featTable) >> name chr strand start end >> 1 7896759 chr1 + 781253 783614 >> 2 7896761 chr1 + 850983 869824 >> 3 7896779 chr1 + 885829 890958 >> 4 7896798 chr1 + 891739 900345 >> 5 7896817 chr1 + 938709 939782 >> 6 7896822 chr1 + 945365 981355 >> >> Also, sometimes our feature table is a table of CpG islands, which don't have >> a strand associated with them. >> >> e.g. >> >>> > head(featTable2) >> chr start end CpG Island Name >> 1 chr1 18598 19673 CpG:_116 >> 2 chr1 124987 125426 CpG:_30 >> 3 chr1 317653 318092 CpG:_29 >> 4 chr1 427014 428027 CpG:_84 >> 5 chr1 439136 440407 CpG:_99 >> 6 chr1 523082 523977 CpG:_94 >> >> Is it possible to do this annotation with ChipPeakAnno ? Currently, the >> annotatePeakInBatch function gives me an error when I don't give it strand >> information when I create my RangedData object. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia >> >> > >On 5/13/10 8:10 PM, "Dario Strbenac" <d.strbenac@garvan.org.au> wrote: > >> Hello, >> >> Firstly, thank you for making this package. It seems so useful ! We were >> thinking of writing something like this ourselves, until I saw your package, >> because we do a lot of ChIP-Seq here. >> >> I just have a small feature request. In your distance calculation, you do >> start of peak - start of feature. Would it be possible to allow the user to >> choose if they want the distance calculation to use the start or the middle of >> the feature (and also for the peak) ? This is because we do a lot of >> methylation studies, and for CpG island features, we like to use the midpoint >> as the position of our feature. It would also be nice to be able to use the >> midpoint of the peak as the peak's position, since this is usually where the >> signal is strongest. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia > > -------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia [[alternative HTML version deleted]] ADD COMMENT 0 Entering edit mode Dario Strbenac ★ 1.5k @dario-strbenac-5916 Last seen 10 hours ago Australia Hello, Sorry, I didn't make the use case clear. So, when I have this example : > head(peaksT) chr start end 1 chr1 2000010 2000310 2 chr1 19000000 19000300 3 chr1 30000000 30000300 4 chr2 300 600 5 chr2 5500 5800 6 chr2 100000 100300 > head(featuresT) chr start end strand 1 chr1 1000000 2000000 + 2 chr1 10000000 20000000 - 3 chr1 15000000 22000000 + 4 chr2 1000 5000 + 5 chr2 6000 7000 - 6 chr2 10000 15000 + I'd like to calculate the distance from the midpoint of the table which has no strand, to the start position of the table with the strand. Changing all of featuresT$strand to be '+' and making a strand = '+' column for peaksT will be bad because distances would get calculated wrong (the start for -ve strand is actually featuresT$end). Is it possible to make the package handle the use case when one or both of the tables don't have the strand column ? The use case of only the peakTable having no strand would be relating DNA methylation regions to genes' TSSs. The biological use case of both tables not having a strand column would be relating DNA methylation regions to CpG islands. Use case 2 could easily be done with making midpoint = start, and strand = '+' in both tables, but this won't cover use case 1. Thanks, Dario. ---- Original message ---- >Date: Thu, 13 May 2010 22:40:46 -0400 >From: "Zhu, Julie" <julie.zhu at="" umassmed.edu=""> >Subject: Re: ChIPpeakAnno Strandedness and distance calculation >To: "D.Strbenac at garvan.org.au" <d.strbenac at="" garvan.org.au=""> >Cc: "bioconductor" <bioconductor at="" stat.math.ethz.ch=""> > >Hi Dario, > >You can create the annotation with strand = c(?+?). For example, > >AnnotationRangedData = RangedData(IRanges(start = c(967659, 2010898, >2496700, 3075866, >+ 3123260), end = c(967869, 2011108, 2496920, 3076166, 3123470), names = >c("t1", >+ "t2", "t3", "t4", "t5")), space = c("1", "2", "3", "1", "2"), strand >=c("+")) > >Please take a look at the examples given on the paper just published on BMC >Bioinformatics >http://www.biomedcentral.com/1471-2105/11/237. In case you could not open >the link, I also attached the pdf file. > >Regarding your other question about distance calculation, I suggest to >create your AnnotationRangedData and PeakRangedData with start=midpoint to >get the distance between midpoints. The distance is calculated differently >for features in plus strand and minus strand. For example, to calculate the >distance between peak and TSS, the distance is calculated as the distance >between the start of the binding site and the TSS, which is the gene start >for genes located on the forward strand and the gene end for genes located >on the reverse strand. Therefore, adding another parameter would mean to >overwrite the way how the distance is calculated based on strandedness. >After you tried the above suggested way and still prefer having a new >parameter, I will be happy to add it to the next release. > >Best regards, > >Julie > > >******************************************* >Lihua Julie Zhu, Ph.D >Research Associate Professor >Program in Gene Function and Expression >Program in Molecular Medicine >University of Massachusetts Medical School >364 Plantation Street, Room 613 >Worcester, MA 01605 >508-856-5256 >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" <d.strbenac at="" garvan.org.au=""> wrote: > >> Hello again, >> >> Just one more question. When we are looking at DNA methtylation, we don't have >> the strand of the peak (because the reverse complement of CG is CG). It seems >> that it might not be possible to do this with ChipPeakAnno ? >> >> e.g. >> >>> > head(peaksT) >> chr start end >> 1 chr13 83351701 83352000 >> 2 chr13 83351401 83351700 >> 3 chr20 25011901 25012200 >> 4 chr13 83352001 83352300 >> 5 chr8 143402101 143402400 >> 6 chr2 238246801 238247100 >> >>> > head(featTable) >> name chr strand start end >> 1 7896759 chr1 + 781253 783614 >> 2 7896761 chr1 + 850983 869824 >> 3 7896779 chr1 + 885829 890958 >> 4 7896798 chr1 + 891739 900345 >> 5 7896817 chr1 + 938709 939782 >> 6 7896822 chr1 + 945365 981355 >> >> Also, sometimes our feature table is a table of CpG islands, which don't have >> a strand associated with them. >> >> e.g. >> >>> > head(featTable2) >> chr start end CpG Island Name >> 1 chr1 18598 19673 CpG:_116 >> 2 chr1 124987 125426 CpG:_30 >> 3 chr1 317653 318092 CpG:_29 >> 4 chr1 427014 428027 CpG:_84 >> 5 chr1 439136 440407 CpG:_99 >> 6 chr1 523082 523977 CpG:_94 >> >> Is it possible to do this annotation with ChipPeakAnno ? Currently, the >> annotatePeakInBatch function gives me an error when I don't give it strand >> information when I create my RangedData object. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia >> >> > >On 5/13/10 8:10 PM, "Dario Strbenac" <d.strbenac at="" garvan.org.au=""> wrote: > >> Hello, >> >> Firstly, thank you for making this package. It seems so useful ! We were >> thinking of writing something like this ourselves, until I saw your package, >> because we do a lot of ChIP-Seq here. >> >> I just have a small feature request. In your distance calculation, you do >> start of peak - start of feature. Would it be possible to allow the user to >> choose if they want the distance calculation to use the start or the middle of >> the feature (and also for the peak) ? This is because we do a lot of >> methylation studies, and for CpG island features, we like to use the midpoint >> as the position of our feature. It would also be nice to be able to use the >> midpoint of the peak as the peak's position, since this is usually where the >> signal is strongest. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia > > -------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia ADD COMMENT 0 Entering edit mode Dario Strbenac ★ 1.5k @dario-strbenac-5916 Last seen 10 hours ago Australia Oh, actually I thought of a problem with making start = midpoint. If you modify the start position to be the average, then you might get the wrong values in insideFeature column. e.g. Peak Real coordinates -------------------- chr | start | end | ------------------- chr1 | 5000 | 5500 | Peak modified coordinates (start is midpoint) -------------------- chr | start | end | ------------------- chr1 | 5250 | 5500 | A gene ----------------------------- chr | start | end | strand | ----------------------------- chr1 | 1000 | 5100 | - | So, instead of being 'overlapStart', it is called as 'upstream'. It would be good if the package worked out an extra column for the tables, called 'position' and used the position for the distances, and the real start and end positions for the overlapping. e.g. Something like this if("strand" %in% colnames(table)) { table$position = ifelse(table$strand == '+', table$start, table$end) } else { table$position = round((table$start + table$end) / 2) } - Dario. ---- Original message ---- >Date: Thu, 13 May 2010 22:40:46 -0400 >From: "Zhu, Julie" <julie.zhu at="" umassmed.edu=""> >Subject: Re: ChIPpeakAnno Strandedness and distance calculation >To: "D.Strbenac at garvan.org.au" <d.strbenac at="" garvan.org.au=""> >Cc: "bioconductor" <bioconductor at="" stat.math.ethz.ch=""> > >Hi Dario, > >You can create the annotation with strand = c(?+?). For example, > >AnnotationRangedData = RangedData(IRanges(start = c(967659, 2010898, >2496700, 3075866, >+ 3123260), end = c(967869, 2011108, 2496920, 3076166, 3123470), names = >c("t1", >+ "t2", "t3", "t4", "t5")), space = c("1", "2", "3", "1", "2"), strand >=c("+")) > >Please take a look at the examples given on the paper just published on BMC >Bioinformatics >http://www.biomedcentral.com/1471-2105/11/237. In case you could not open >the link, I also attached the pdf file. > >Regarding your other question about distance calculation, I suggest to >create your AnnotationRangedData and PeakRangedData with start=midpoint to >get the distance between midpoints. The distance is calculated differently >for features in plus strand and minus strand. For example, to calculate the >distance between peak and TSS, the distance is calculated as the distance >between the start of the binding site and the TSS, which is the gene start >for genes located on the forward strand and the gene end for genes located >on the reverse strand. Therefore, adding another parameter would mean to >overwrite the way how the distance is calculated based on strandedness. >After you tried the above suggested way and still prefer having a new >parameter, I will be happy to add it to the next release. > >Best regards, > >Julie > > >******************************************* >Lihua Julie Zhu, Ph.D >Research Associate Professor >Program in Gene Function and Expression >Program in Molecular Medicine >University of Massachusetts Medical School >364 Plantation Street, Room 613 >Worcester, MA 01605 >508-856-5256 >http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > > >On 5/13/10 9:00 PM, "Dario Strbenac" <d.strbenac at="" garvan.org.au=""> wrote: > >> Hello again, >> >> Just one more question. When we are looking at DNA methtylation, we don't have >> the strand of the peak (because the reverse complement of CG is CG). It seems >> that it might not be possible to do this with ChipPeakAnno ? >> >> e.g. >> >>> > head(peaksT) >> chr start end >> 1 chr13 83351701 83352000 >> 2 chr13 83351401 83351700 >> 3 chr20 25011901 25012200 >> 4 chr13 83352001 83352300 >> 5 chr8 143402101 143402400 >> 6 chr2 238246801 238247100 >> >>> > head(featTable) >> name chr strand start end >> 1 7896759 chr1 + 781253 783614 >> 2 7896761 chr1 + 850983 869824 >> 3 7896779 chr1 + 885829 890958 >> 4 7896798 chr1 + 891739 900345 >> 5 7896817 chr1 + 938709 939782 >> 6 7896822 chr1 + 945365 981355 >> >> Also, sometimes our feature table is a table of CpG islands, which don't have >> a strand associated with them. >> >> e.g. >> >>> > head(featTable2) >> chr start end CpG Island Name >> 1 chr1 18598 19673 CpG:_116 >> 2 chr1 124987 125426 CpG:_30 >> 3 chr1 317653 318092 CpG:_29 >> 4 chr1 427014 428027 CpG:_84 >> 5 chr1 439136 440407 CpG:_99 >> 6 chr1 523082 523977 CpG:_94 >> >> Is it possible to do this annotation with ChipPeakAnno ? Currently, the >> annotatePeakInBatch function gives me an error when I don't give it strand >> information when I create my RangedData object. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia >> >> > >On 5/13/10 8:10 PM, "Dario Strbenac" <d.strbenac at="" garvan.org.au=""> wrote: > >> Hello, >> >> Firstly, thank you for making this package. It seems so useful ! We were >> thinking of writing something like this ourselves, until I saw your package, >> because we do a lot of ChIP-Seq here. >> >> I just have a small feature request. In your distance calculation, you do >> start of peak - start of feature. Would it be possible to allow the user to >> choose if they want the distance calculation to use the start or the middle of >> the feature (and also for the peak) ? This is because we do a lot of >> methylation studies, and for CpG island features, we like to use the midpoint >> as the position of our feature. It would also be nice to be able to use the >> midpoint of the peak as the peak's position, since this is usually where the >> signal is strongest. >> >> Thanks, >> Dario. >> >> -------------------------------------- >> Dario Strbenac >> Research Assistant >> Cancer Epigenetics >> Garvan Institute of Medical Research >> Darlinghurst NSW 2010 >> Australia > > -------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia