[Bioc-sig-seq] ChIPpeakAnno annotatePeakInBatch function

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 5 months ago

United States

Hi Amy, Thank you very much for the feedback! The inside feature is TRUE if the peakSummit (stored as Start in the RangedData) is inside the nearest annotated feature, FALSE otherwise. The feature you are proposing is useful for finding overlapping fragments between query peak ranges and target ranges. The timing of your suggestion cannot be better. We plan to add a function called FindOverlappingPeaks that will be added to the dev version, your ideas will be incorporated. Please let me know if you and others think that it is better to incorporate your ideas into annotatePeakInBatch. Thanks again for your help making this package more useful. I have a favor to ask you and those who have used or are aware of the ChIPpeakAnno package. We submitted a paper describing this package and just got the reviewers' comments back. One question comes up is that what this package offers that the existing annotation tools such as Cisgenome and CEAS do not. I thought it might be useful to get the feedback from the users of this package. If possible, could you please send me your thoughts on this, especially the reasons you chose using this package? Thanks a lot for your time and help! Best regards, Julie ******************************************* Lihua Julie Zhu, Ph.D Research Associate Professor Program Gene Function and Expression University of Massachusetts Medical School 364 Plantation Street, Room 613 Worcester, MA 01605 508-856-5256 http://www.umassmed.edu/pgfe/faculty/zhu.cfm On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth@gsk.com> wrote: Firstly I'd like to thank the authors of the very useful package ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno annotatePeakInBatch function results that other users may or may not be aware of. I also propose improvements to compensate. The resulting insideFeature column reports TRUE if the query peak is either contained within an annotated feature, and also reports TRUE if it overlaps the end of an annotated feature. I think its worth noting that it reports FALSE if the peak overlaps the beginning of an annotated feature, and also reports FALSE if the peak overlaps in entirety an annotated feature(s). So, perhaps the insideFeature column (or additional new column called overlappingFeature) could report five options: ("false","inside","overlapStart","overlapEnd","super"). I haven't looked into the effects on how distanceToFeature should/could be called for each different scenario. Apologies if this has already been addressed, or if others do not consider this useful. Details with dummy example are described below. Many thanks, Amy. ##### In the dummy example below, p1 is bigger than f1 and consequently p1 overlaps it in entirety. It would be nice if ChIPpeakAnno could report this - although I accept it may overlap more than one feature, so would need to consider how to deal with that. And another example from below, p3 in fact overlaps with the start of f3, but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno could report it as OverlapStart. p4 is called as insideFeature = TRUE for overlapping with f4, but it would be nice if ChIPpeakAnno could report it as OverlapEnd or something similar. And correctly p2 is called as insideFeature = TRUE for overlap with f2, in this case p2 ranges are within the f2 ranges as you would expect. library(ChIPpeakAnno) peaks = RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167 889600),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1", "p2","p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) features = RangedData(IRanges(start=c(1549800,1554400,1565000,1569400 ,167888600),end=c(1550599,1560799,1565399,1571199,167888999),names=c(" f1","f2","f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) as.data.frame(annoPeaks) space start end width names strand feature start_position 1 5 167889600 167893599 4000 p5 1 f5 167888600 2 6 1543200 1555199 12000 p1 1 f1 1549800 3 6 1557200 1560599 3400 p2 1 f2 1554400 4 6 1563000 1565199 2200 p3 1 f3 1565000 5 6 1569800 1573799 4000 p4 1 f4 1569400 end_position insideFeature distancetoFeature 1 167888999 FALSE 1000 2 1550599 FALSE -6600 3 1560799 TRUE 2800 4 1565399 FALSE -2000 5 1571199 TRUE 400 > sessionInfo() R version 2.10.0 (2009-10-26) x86_64-unknown-linux-gnu locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C [5] LC_MONETARY=C LC_MESSAGES=C [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ChIPpeakAnno_1.3.0 org.Hs.eg.db_2.3.6 [3] GO.db_2.3.5 RSQLite_0.7-3 [5] DBI_0.2-4 AnnotationDbi_1.8.0 [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 [9] Biostrings_2.14.2 IRanges_1.5.18 [11] multtest_2.2.0 Biobase_2.6.0 [13] biomaRt_2.3.0 loaded via a namespace (and not attached): [1] MASS_7.3-3 RCurl_1.3-0 XML_2.6-0 splines_2.10.0 [5] survival_2.35-7 ----------------------------------------------------------- This e-mail was sent by GlaxoSmithKline Services Unlimited (registered in England and Wales No. 1047315), which is a member of the GlaxoSmithKline group of companies. The registered address of GlaxoSmithKline Services Unlimited is 980 Great West Road, Brentford, Middlesex TW8 9GS. ----------------------------------------------------------- [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing [[alternative HTML version deleted]]

Annotation GO BSgenome BSgenome ChIPpeakAnno Annotation GO BSgenome BSgenome ChIPpeakAnno • 1.4k views

ADD COMMENT • link updated 14.1 years ago by Ivan Gregoretti ▴ 310 • written 14.1 years ago by Julie Zhu ★ 4.3k

0

Entering edit mode

Ivan Gregoretti ▴ 310

@ivan-gregoretti-3975

Last seen 9.6 years ago

Canada

I haven't used Cisgenome or CEAS so I can't speak about them. I use ChIPpeakAnno because, while working in Bioconductor, I get the enriched GO terms in three lines of code. (Four lines, if you count library(ChIPpeakAnno)). By the way, since you are in the process of upgrading the package, would you consider this fixable problem?: say Ard and Srd are RangedData instances. When I do AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd) the order of rows in AannotatedRd is different from the order of rows in Ard. Currently, I fix this by hand doing AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] Would you mind incorporating this fix of row shuffling in the next upgrade? Both Amy and me use the stable release of R and Bioc, not the devel version. I would highly appreciate if you could push the upgrade for both stable and devel but I won't hold you accountable if you don't. Have a great day! Ivan Ivan Gregoretti, PhD National Institute of Diabetes and Digestive and Kidney Diseases National Institutes of Health 5 Memorial Dr, Building 5, Room 205. Bethesda, MD 20892. USA. Phone: 1-301-496-1592 Fax: 1-301-496-9878 On Tue, Mar 9, 2010 at 10:02 AM, Zhu, Julie <julie.zhu at="" umassmed.edu=""> wrote: > Hi Amy, > > Thank you very much for the feedback! > > The inside feature is TRUE if the peakSummit (stored as Start in the > RangedData) is inside the nearest annotated feature, FALSE otherwise. ?The > feature you are proposing is useful for finding overlapping fragments > between query peak ranges and target ranges. The timing of your suggestion > cannot be better. We plan to add ?a function called FindOverlappingPeaks > that will be added to the dev version, your ideas will be incorporated. > Please let me know if you and others think that it is better to incorporate > your ideas into annotatePeakInBatch. Thanks again for your help making this > package more useful. > > I have a favor to ask you and those who have used or are aware of the > ChIPpeakAnno package. We submitted a paper describing this package and just > got the reviewers? comments back. One question comes up is that what this > package offers that the existing annotation tools such as Cisgenome and CEAS > do not. I thought it might be useful to get the feedback from the users of > this package. If possible, could you please send me your thoughts on this, > especially the reasons you chose using this package? Thanks a lot for your > time and help! > > Best regards, > > Julie > > > ******************************************* > Lihua Julie Zhu, Ph.D > Research Associate Professor > Program Gene Function and Expression > University of Massachusetts Medical School > 364 Plantation Street, Room 613 > Worcester, MA 01605 > 508-856-5256 > http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth at="" gsk.com=""> wrote: > > > > Firstly I'd like to thank the authors of the very useful package > ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno > annotatePeakInBatch function results that other users may or may not be > aware of. I also propose improvements to compensate. > > The resulting insideFeature column reports TRUE if the query peak is either > contained within an annotated feature, and also reports TRUE if it overlaps > the end of an annotated feature. > > I think its worth noting that it reports FALSE if the peak overlaps the > beginning of an annotated feature, and also reports FALSE if the peak > overlaps in entirety an annotated feature(s). > > So, perhaps the insideFeature column (or additional new column called > overlappingFeature) could report five options: > ("false","inside","overlapStart","overlapEnd","super"). I haven't looked > into the effects on how distanceToFeature should/could be called for each > different scenario. > > Apologies if this has already been addressed, or if others do not consider > this useful. > > Details with dummy example are described below. > > Many thanks, > Amy. > > ##### > > In the dummy example below, p1 is bigger than f1 and consequently p1 > overlaps it in entirety. It would be nice if ChIPpeakAnno could report this > - although I accept it may overlap more than one feature, > so would need to consider how to deal with that. > > And another example from below, p3 in fact overlaps with the start of f3, > but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno could > report it as OverlapStart. > > p4 is called as insideFeature = TRUE for overlapping with f4, but it would > be nice if ChIPpeakAnno could report it as OverlapEnd or something similar. > > And correctly p2 is called as insideFeature = TRUE for overlap with f2, in > this case p2 ranges are within the f2 ranges as you would expect. > > > library(ChIPpeakAnno) > peaks = > RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167889600 ),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1","p2"," p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) > features = > ?RangedData(IRanges(start=c(1549800,1554400,1565000,1569400,16788860 0),end=c(1550599,1560799,1565399,1571199,167888999),names=c("f1","f2", "f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) > > annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) > > as.data.frame(annoPeaks) > > ??space ????start ??????end width names strand feature start_position > 1 ????5 167889600 167893599 ?4000 ???p5 ?????1 ?????f5 ?????167888600 > 2 ????6 ??1543200 ??1555199 12000 ???p1 ?????1 ?????f1 ???????1549800 > 3 ????6 ??1557200 ??1560599 ?3400 ???p2 ?????1 ?????f2 ???????1554400 > 4 ????6 ??1563000 ??1565199 ?2200 ???p3 ?????1 ?????f3 ???????1565000 > 5 ????6 ??1569800 ??1573799 ?4000 ???p4 ?????1 ?????f4 ???????1569400 > ??end_position insideFeature distancetoFeature > 1 ???167888999 ????????FALSE ?????????????1000 > 2 ?????1550599 ????????FALSE ????????????-6600 > 3 ?????1560799 ?????????TRUE ?????????????2800 > 4 ?????1565399 ????????FALSE ????????????-2000 > 5 ?????1571199 ?????????TRUE ??????????????400 > > >> sessionInfo() > R version 2.10.0 (2009-10-26) > x86_64-unknown-linux-gnu > > locale: > ?[1] LC_CTYPE=en_GB.UTF-8 ??????LC_NUMERIC=C > ?[3] LC_TIME=en_GB.UTF-8 ???????LC_COLLATE=C > ?[5] LC_MONETARY=C ?????????????LC_MESSAGES=C > ?[7] LC_PAPER=en_GB.UTF-8 ??????LC_NAME=C > ?[9] LC_ADDRESS=C ??????????????LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats ????graphics ?grDevices utils ????datasets ?methods ??base > > other attached packages: > ?[1] ChIPpeakAnno_1.3.0 ?????????????????org.Hs.eg.db_2.3.6 > ?[3] GO.db_2.3.5 ????????????????????????RSQLite_0.7-3 > ?[5] DBI_0.2-4 ??????????????????????????AnnotationDbi_1.8.0 > ?[7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 > ?[9] Biostrings_2.14.2 ??????????????????IRanges_1.5.18 > [11] multtest_2.2.0 ?????????????????????Biobase_2.6.0 > [13] biomaRt_2.3.0 > > loaded via a namespace (and not attached): > [1] MASS_7.3-3 ?????RCurl_1.3-0 ????XML_2.6-0 ??????splines_2.10.0 > [5] survival_2.35-7 > > > ----------------------------------------------------------- > This e-mail was sent by GlaxoSmithKline Services Unlimited > (registered in England and Wales No. 1047315), which is a > member of the GlaxoSmithKline group of companies. The > registered address of GlaxoSmithKline Services Unlimited > is 980 Great West Road, Brentford, Middlesex TW8 9GS. > ----------------------------------------------------------- > > ????????[[alternative HTML version deleted]] > > _______________________________________________ > Bioc-sig-sequencing mailing list > Bioc-sig-sequencing at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > >

ADD COMMENT • link 14.1 years ago Ivan Gregoretti ▴ 310

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 5 months ago

United States

Hi Ivan, Thank you very much for the feedback! Sure, I will add AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] inside the function. Thanks! Best regards, Julie On 3/9/10 10:24 AM, "Ivan Gregoretti" <ivangreg@gmail.com> wrote: I haven't used Cisgenome or CEAS so I can't speak about them. I use ChIPpeakAnno because, while working in Bioconductor, I get the enriched GO terms in three lines of code. (Four lines, if you count library(ChIPpeakAnno)). By the way, since you are in the process of upgrading the package, would you consider this fixable problem?: say Ard and Srd are RangedData instances. When I do AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd) the order of rows in AannotatedRd is different from the order of rows in Ard. Currently, I fix this by hand doing AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] Would you mind incorporating this fix of row shuffling in the next upgrade? Both Amy and me use the stable release of R and Bioc, not the devel version. I would highly appreciate if you could push the upgrade for both stable and devel but I won't hold you accountable if you don't. Have a great day! Ivan Ivan Gregoretti, PhD National Institute of Diabetes and Digestive and Kidney Diseases National Institutes of Health 5 Memorial Dr, Building 5, Room 205. Bethesda, MD 20892. USA. Phone: 1-301-496-1592 Fax: 1-301-496-9878 On Tue, Mar 9, 2010 at 10:02 AM, Zhu, Julie <julie.zhu@umassmed.edu> wrote: > Hi Amy, > > Thank you very much for the feedback! > > The inside feature is TRUE if the peakSummit (stored as Start in the > RangedData) is inside the nearest annotated feature, FALSE otherwise. The > feature you are proposing is useful for finding overlapping fragments > between query peak ranges and target ranges. The timing of your suggestion > cannot be better. We plan to add a function called FindOverlappingPeaks > that will be added to the dev version, your ideas will be incorporated. > Please let me know if you and others think that it is better to incorporate > your ideas into annotatePeakInBatch. Thanks again for your help making this > package more useful. > > I have a favor to ask you and those who have used or are aware of the > ChIPpeakAnno package. We submitted a paper describing this package and just > got the reviewers' comments back. One question comes up is that what this > package offers that the existing annotation tools such as Cisgenome and CEAS > do not. I thought it might be useful to get the feedback from the users of > this package. If possible, could you please send me your thoughts on this, > especially the reasons you chose using this package? Thanks a lot for your > time and help! > > Best regards, > > Julie > > > ******************************************* > Lihua Julie Zhu, Ph.D > Research Associate Professor > Program Gene Function and Expression > University of Massachusetts Medical School > 364 Plantation Street, Room 613 > Worcester, MA 01605 > 508-856-5256 > http://www.umassmed.edu/pgfe/faculty/zhu.cfm > > > > On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth@gsk.com> wrote: > > > > Firstly I'd like to thank the authors of the very useful package > ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno > annotatePeakInBatch function results that other users may or may not be > aware of. I also propose improvements to compensate. > > The resulting insideFeature column reports TRUE if the query peak is either > contained within an annotated feature, and also reports TRUE if it overlaps > the end of an annotated feature. > > I think its worth noting that it reports FALSE if the peak overlaps the > beginning of an annotated feature, and also reports FALSE if the peak > overlaps in entirety an annotated feature(s). > > So, perhaps the insideFeature column (or additional new column called > overlappingFeature) could report five options: > ("false","inside","overlapStart","overlapEnd","super"). I haven't looked > into the effects on how distanceToFeature should/could be called for each > different scenario. > > Apologies if this has already been addressed, or if others do not consider > this useful. > > Details with dummy example are described below. > > Many thanks, > Amy. > > ##### > > In the dummy example below, p1 is bigger than f1 and consequently p1 > overlaps it in entirety. It would be nice if ChIPpeakAnno could report this > - although I accept it may overlap more than one feature, > so would need to consider how to deal with that. > > And another example from below, p3 in fact overlaps with the start of f3, > but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno could > report it as OverlapStart. > > p4 is called as insideFeature = TRUE for overlapping with f4, but it would > be nice if ChIPpeakAnno could report it as OverlapEnd or something similar. > > And correctly p2 is called as insideFeature = TRUE for overlap with f2, in > this case p2 ranges are within the f2 ranges as you would expect. > > > library(ChIPpeakAnno) > peaks = > RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167889600 ),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1","p2"," p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) > features = > RangedData(IRanges(start=c(1549800,1554400,1565000,1569400,16788860 0),end=c(1550599,1560799,1565399,1571199,167888999),names=c("f1","f2", "f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) > > annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) > > as.data.frame(annoPeaks) > > space start end width names strand feature start_position > 1 5 167889600 167893599 4000 p5 1 f5 167888600 > 2 6 1543200 1555199 12000 p1 1 f1 1549800 > 3 6 1557200 1560599 3400 p2 1 f2 1554400 > 4 6 1563000 1565199 2200 p3 1 f3 1565000 > 5 6 1569800 1573799 4000 p4 1 f4 1569400 > end_position insideFeature distancetoFeature > 1 167888999 FALSE 1000 > 2 1550599 FALSE -6600 > 3 1560799 TRUE 2800 > 4 1565399 FALSE -2000 > 5 1571199 TRUE 400 > > >> sessionInfo() > R version 2.10.0 (2009-10-26) > x86_64-unknown-linux-gnu > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C > [5] LC_MONETARY=C LC_MESSAGES=C > [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ChIPpeakAnno_1.3.0 org.Hs.eg.db_2.3.6 > [3] GO.db_2.3.5 RSQLite_0.7-3 > [5] DBI_0.2-4 AnnotationDbi_1.8.0 > [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 > [9] Biostrings_2.14.2 IRanges_1.5.18 > [11] multtest_2.2.0 Biobase_2.6.0 > [13] biomaRt_2.3.0 > > loaded via a namespace (and not attached): > [1] MASS_7.3-3 RCurl_1.3-0 XML_2.6-0 splines_2.10.0 > [5] survival_2.35-7 > > > ----------------------------------------------------------- > This e-mail was sent by GlaxoSmithKline Services Unlimited > (registered in England and Wales No. 1047315), which is a > member of the GlaxoSmithKline group of companies. The > registered address of GlaxoSmithKline Services Unlimited > is 980 Great West Road, Brentford, Middlesex TW8 9GS. > ----------------------------------------------------------- > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-sig-sequencing mailing list > Bioc-sig-sequencing@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > > [[alternative HTML version deleted]]

ADD COMMENT • link 14.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

It seems that we are getting quite excited about ChIPpeakAnno. Good! Two additions to this discussion ########## # Topic 1 ########## Just a comment about AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] That statement will work ONLY if rownames have been defined in AannotatedRd. That means that AannotatedRd must inherit the Ard's rownames. In the simplest case, when you import a BED file into a RangedData instance, the row names will not be defined. My solution, currently in use: So as to preserve the rows in their original order then, what I do is name the rows with numbers as soon as I import my BED. Like this library(rtracklayer) Ard <- import('mypeaks.bed', 'bed') rownames(Ard) <- formatC(1:dim(Ard)[1], width=nchar(dim(Ard)[1]), flag='0') Now AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd) AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] will work like a charm. ########## # Topic 2 ########## I do not know how the AnnotationData (Srd above) that ships with the package is built but this is how I build it on the fly library(RMySQL) con <- dbConnect(MySQL(), user='genome', dbname='mm9', host='genome-mysql.cse.ucsc.edu') Srd <- as(dbGetQuery(con, "SELECT a.chrom AS space, a.txStart+1 AS start, a.txEnd AS end, a.strand AS strand FROM mm9.ensGene AS a GROUP BY a.name2"), "RangedData") rownames(Srd) <- as.list(dbGetQuery(con, "SELECT a.name2 AS name FROM mm9.ensGene AS a GROUP BY a.name2"))[[1]] Srd <- Srd[order(as.data.frame(Srd)$space, start(Srd)), ] dbDisconnect(con) Notice that I get ONE txStart per gene while in fact there may be multiple transcripts per single gene, with different start sites. So, Srd is suboptimal but I need it because ChIPpeakAnno needs gene names. In the ideal case, ChIPpeakAnno should be able to link transcripts to GO terms. That means that, instead of using Srd, we should be using Trd, which you can obtain on the fly like this: con <- dbConnect(MySQL(), user='genome', dbname='mm9', host='genome-mysql.cse.ucsc.edu') Trd <- as(dbGetQuery(con, "SELECT a.chrom AS space, a.txStart+1 AS start, a.txEnd AS end, a.strand AS strand FROM mm9.ensGene AS a GROUP BY a.name"), "RangedData") rownames(Trd) <- as.list(dbGetQuery(con, "SELECT a.name AS name FROM mm9.ensGene AS a GROUP BY a.name"))[[1]] Trd <- Trd[order(as.data.frame(Trd)$space, start(Trd)), ] dbDisconnect(con) I hope this helps improve the package. It's worth it. Ivan Ivan Gregoretti, PhD National Institute of Diabetes and Digestive and Kidney Diseases National Institutes of Health 5 Memorial Dr, Building 5, Room 205. Bethesda, MD 20892. USA. Phone: 1-301-496-1592 Fax: 1-301-496-9878 On Tue, Mar 9, 2010 at 10:44 AM, Zhu, Julie <julie.zhu at="" umassmed.edu=""> wrote: > Hi Ivan, > > Thank you very much for the feedback! Sure, I will add ?AannotatedRd <- > AannotatedRd[order(rownames(AannotatedRd)),] inside the function. Thanks! > > Best regards, > > Julie > > > On 3/9/10 10:24 AM, "Ivan Gregoretti" <ivangreg at="" gmail.com=""> wrote: > > I haven't used Cisgenome or CEAS so I can't speak about them. > > I use ChIPpeakAnno because, while working in Bioconductor, I get the > enriched GO terms in three lines of code. (Four lines, if you count > library(ChIPpeakAnno)). > > By the way, since you are in the process of upgrading the package, > would you consider this fixable problem?: > > say Ard and Srd are RangedData instances. When I do > > AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd) > > the order of rows in AannotatedRd is different from the order of rows > in Ard. Currently, I fix this by hand doing > > AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] > > Would you mind incorporating this fix of row shuffling in the next upgrade? > > Both Amy and me use the stable release of R and Bioc, not the devel > version. I would highly appreciate if you could push the upgrade for > both stable and devel but I won't hold you accountable if you don't. > > Have a great day! > > Ivan > > > Ivan Gregoretti, PhD > National Institute of Diabetes and Digestive and Kidney Diseases > National Institutes of Health > 5 Memorial Dr, Building 5, Room 205. > Bethesda, MD 20892. USA. > Phone: 1-301-496-1592 > Fax: 1-301-496-9878 > > > > On Tue, Mar 9, 2010 at 10:02 AM, Zhu, Julie <julie.zhu at="" umassmed.edu=""> wrote: >> Hi Amy, >> >> Thank you very much for the feedback! >> >> The inside feature is TRUE if the peakSummit (stored as Start in the >> RangedData) is inside the nearest annotated feature, FALSE otherwise. ?The >> feature you are proposing is useful for finding overlapping fragments >> between query peak ranges and target ranges. The timing of your suggestion >> cannot be better. We plan to add ?a function called FindOverlappingPeaks >> that will be added to the dev version, your ideas will be incorporated. >> Please let me know if you and others think that it is better to >> incorporate >> your ideas into annotatePeakInBatch. Thanks again for your help making >> this >> package more useful. >> >> I have a favor to ask you and those who have used or are aware of the >> ChIPpeakAnno package. We submitted a paper describing this package and >> just >> got the reviewers? comments back. One question comes up is that what this >> package offers that the existing annotation tools such as Cisgenome and >> CEAS >> do not. I thought it might be useful to get the feedback from the users of >> this package. If possible, could you please send me your thoughts on this, >> especially the reasons you chose using this package? Thanks a lot for your >> time and help! >> >> Best regards, >> >> Julie >> >> >> ******************************************* >> Lihua Julie Zhu, Ph.D >> Research Associate Professor >> Program Gene Function and Expression >> University of Massachusetts Medical School >> 364 Plantation Street, Room 613 >> Worcester, MA 01605 >> 508-856-5256 >> http://www.umassmed.edu/pgfe/faculty/zhu.cfm >> >> >> >> On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth at="" gsk.com=""> wrote: >> >> >> >> Firstly I'd like to thank the authors of the very useful package >> ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno >> annotatePeakInBatch function results that other users may or may not be >> aware of. I also propose improvements to compensate. >> >> The resulting insideFeature column reports TRUE if the query peak is >> either >> contained within an annotated feature, and also reports TRUE if it >> overlaps >> the end of an annotated feature. >> >> I think its worth noting that it reports FALSE if the peak overlaps the >> beginning of an annotated feature, and also reports FALSE if the peak >> overlaps in entirety an annotated feature(s). >> >> So, perhaps the insideFeature column (or additional new column called >> overlappingFeature) could report five options: >> ("false","inside","overlapStart","overlapEnd","super"). I haven't looked >> into the effects on how distanceToFeature should/could be called for each >> different scenario. >> >> Apologies if this has already been addressed, or if others do not consider >> this useful. >> >> Details with dummy example are described below. >> >> Many thanks, >> Amy. >> >> ##### >> >> In the dummy example below, p1 is bigger than f1 and consequently p1 >> overlaps it in entirety. It would be nice if ChIPpeakAnno could report >> this >> - although I accept it may overlap more than one feature, >> so would need to consider how to deal with that. >> >> And another example from below, p3 in fact overlaps with the start of f3, >> but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno >> could >> report it as OverlapStart. >> >> p4 is called as insideFeature = TRUE for overlapping with f4, but it would >> be nice if ChIPpeakAnno could report it as OverlapEnd or something >> similar. >> >> And correctly p2 is called as insideFeature = TRUE for overlap with f2, in >> this case p2 ranges are within the f2 ranges as you would expect. >> >> >> library(ChIPpeakAnno) >> peaks = >> >> RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,16788960 0),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1","p2", "p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) >> features = >> >> ?RangedData(IRanges(start=c(1549800,1554400,1565000,1569400,1678886 00),end=c(1550599,1560799,1565399,1571199,167888999),names=c("f1","f2" ,"f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) >> >> annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) >> >> as.data.frame(annoPeaks) >> >> ??space ????start ??????end width names strand feature start_position >> 1 ????5 167889600 167893599 ?4000 ???p5 ?????1 ?????f5 ?????167888600 >> 2 ????6 ??1543200 ??1555199 12000 ???p1 ?????1 ?????f1 ???????1549800 >> 3 ????6 ??1557200 ??1560599 ?3400 ???p2 ?????1 ?????f2 ???????1554400 >> 4 ????6 ??1563000 ??1565199 ?2200 ???p3 ?????1 ?????f3 ???????1565000 >> 5 ????6 ??1569800 ??1573799 ?4000 ???p4 ?????1 ?????f4 ???????1569400 >> ??end_position insideFeature distancetoFeature >> 1 ???167888999 ????????FALSE ?????????????1000 >> 2 ?????1550599 ????????FALSE ????????????-6600 >> 3 ?????1560799 ?????????TRUE ?????????????2800 >> 4 ?????1565399 ????????FALSE ????????????-2000 >> 5 ?????1571199 ?????????TRUE ??????????????400 >> >> >>> sessionInfo() >> R version 2.10.0 (2009-10-26) >> x86_64-unknown-linux-gnu >> >> locale: >> ?[1] LC_CTYPE=en_GB.UTF-8 ??????LC_NUMERIC=C >> ?[3] LC_TIME=en_GB.UTF-8 ???????LC_COLLATE=C >> ?[5] LC_MONETARY=C ?????????????LC_MESSAGES=C >> ?[7] LC_PAPER=en_GB.UTF-8 ??????LC_NAME=C >> ?[9] LC_ADDRESS=C ??????????????LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats ????graphics ?grDevices utils ????datasets ?methods ??base >> >> other attached packages: >> ?[1] ChIPpeakAnno_1.3.0 ?????????????????org.Hs.eg.db_2.3.6 >> ?[3] GO.db_2.3.5 ????????????????????????RSQLite_0.7-3 >> ?[5] DBI_0.2-4 ??????????????????????????AnnotationDbi_1.8.0 >> ?[7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 >> ?[9] Biostrings_2.14.2 ??????????????????IRanges_1.5.18 >> [11] multtest_2.2.0 ?????????????????????Biobase_2.6.0 >> [13] biomaRt_2.3.0 >> >> loaded via a namespace (and not attached): >> [1] MASS_7.3-3 ?????RCurl_1.3-0 ????XML_2.6-0 ??????splines_2.10.0 >> [5] survival_2.35-7 >> >> >> ----------------------------------------------------------- >> This e-mail was sent by GlaxoSmithKline Services Unlimited >> (registered in England and Wales No. 1047315), which is a >> member of the GlaxoSmithKline group of companies. The >> registered address of GlaxoSmithKline Services Unlimited >> is 980 Great West Road, Brentford, Middlesex TW8 9GS. >> ----------------------------------------------------------- >> >> ????????[[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> Bioc-sig-sequencing at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> >> > > >

ADD REPLY • link 14.1 years ago Ivan Gregoretti ▴ 310

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 5 months ago

United States

Hi Amy, Thank you so much for sharing your detailed comparisons! Yes, FindOverLappingPeaks will work in batch as well. Best regards, Julie On 3/9/10 11:00 AM, "Amy Molesworth" <amy.m.molesworth@gsk.com> wrote: Hi Julie, CisGenome is only available for non-commercial users. I have used CEAS, but the main advantages of ChIPpeakAnno are the ability/flexibility to plug in with the other fast moving deep- sequencing analysis capabilities and infrastructure in bioconductor. But I guess the main differentiating point from CEAS is that we can compare overlap between any annotation feature objects that you like; for example comparing with CpG islands (or other annotated features not captured by CEAS), or comparing between two sets of peaks etc. The main function I tend to use is AnnotatePeakInBatch because it does it in batch. It would be time-consuming to iterate over thousands of peaks to find overlapping peaks, hence I like the batch functionality. If your proposed FindOverlappingPeaks function works in batch that will be great. Although we would also still want to know the insideFeature status as well. Thanks for your efforts on this package. Amy. From: Zhu, Julie [mailto:Julie.Zhu@umassmed.edu] Sent: 09 March 2010 15:03 To: Amy Molesworth; Khademul Islam; bioc-sig-sequencing@r-project.org Cc: Ivan Gregoretti; Eric Bremer; bioconductor Subject: Re: [Bioc-sig-seq] ChIPpeakAnno annotatePeakInBatch function Hi Amy, Thank you very much for the feedback! The inside feature is TRUE if the peakSummit (stored as Start in the RangedData) is inside the nearest annotated feature, FALSE otherwise. The feature you are proposing is useful for finding overlapping fragments between query peak ranges and target ranges. The timing of your suggestion cannot be better. We plan to add a function called FindOverlappingPeaks that will be added to the dev version, your ideas will be incorporated. Please let me know if you and others think that it is better to incorporate your ideas into annotatePeakInBatch. Thanks again for your help making this package more useful. I have a favor to ask you and those who have used or are aware of the ChIPpeakAnno package. We submitted a paper describing this package and just got the reviewers' comments back. One question comes up is that what this package offers that the existing annotation tools such as Cisgenome and CEAS do not. I thought it might be useful to get the feedback from the users of this package. If possible, could you please send me your thoughts on this, especially the reasons you chose using this package? Thanks a lot for your time and help! Best regards, Julie ******************************************* Lihua Julie Zhu, Ph.D Research Associate Professor Program Gene Function and Expression University of Massachusetts Medical School 364 Plantation Street, Room 613 Worcester, MA 01605 508-856-5256 http://www.umassmed.edu/pgfe/faculty/zhu.cfm On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth@gsk.com> wrote: Firstly I'd like to thank the authors of the very useful package ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno annotatePeakInBatch function results that other users may or may not be aware of. I also propose improvements to compensate. The resulting insideFeature column reports TRUE if the query peak is either contained within an annotated feature, and also reports TRUE if it overlaps the end of an annotated feature. I think its worth noting that it reports FALSE if the peak overlaps the beginning of an annotated feature, and also reports FALSE if the peak overlaps in entirety an annotated feature(s). So, perhaps the insideFeature column (or additional new column called overlappingFeature) could report five options: ("false","inside","overlapStart","overlapEnd","super"). I haven't looked into the effects on how distanceToFeature should/could be called for each different scenario. Apologies if this has already been addressed, or if others do not consider this useful. Details with dummy example are described below. Many thanks, Amy. ##### In the dummy example below, p1 is bigger than f1 and consequently p1 overlaps it in entirety. It would be nice if ChIPpeakAnno could report this - although I accept it may overlap more than one feature, so would need to consider how to deal with that. And another example from below, p3 in fact overlaps with the start of f3, but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno could report it as OverlapStart. p4 is called as insideFeature = TRUE for overlapping with f4, but it would be nice if ChIPpeakAnno could report it as OverlapEnd or something similar. And correctly p2 is called as insideFeature = TRUE for overlap with f2, in this case p2 ranges are within the f2 ranges as you would expect. library(ChIPpeakAnno) peaks = RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167 889600),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1", "p2","p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) features = RangedData(IRanges(start=c(1549800,1554400,1565000,1569400 ,167888600),end=c(1550599,1560799,1565399,1571199,167888999),names=c(" f1","f2","f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) as.data.frame(annoPeaks) space start end width names strand feature start_position 1 5 167889600 167893599 4000 p5 1 f5 167888600 2 6 1543200 1555199 12000 p1 1 f1 1549800 3 6 1557200 1560599 3400 p2 1 f2 1554400 4 6 1563000 1565199 2200 p3 1 f3 1565000 5 6 1569800 1573799 4000 p4 1 f4 1569400 end_position insideFeature distancetoFeature 1 167888999 FALSE 1000 2 1550599 FALSE -6600 3 1560799 TRUE 2800 4 1565399 FALSE -2000 5 1571199 TRUE 400 > sessionInfo() R version 2.10.0 (2009-10-26) x86_64-unknown-linux-gnu locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C [5] LC_MONETARY=C LC_MESSAGES=C [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ChIPpeakAnno_1.3.0 org.Hs.eg.db_2.3.6 [3] GO.db_2.3.5 RSQLite_0.7-3 [5] DBI_0.2-4 AnnotationDbi_1.8.0 [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 [9] Biostrings_2.14.2 IRanges_1.5.18 [11] multtest_2.2.0 Biobase_2.6.0 [13] biomaRt_2.3.0 loaded via a namespace (and not attached): [1] MASS_7.3-3 RCurl_1.3-0 XML_2.6-0 splines_2.10.0 [5] survival_2.35-7 ----------------------------------------------------------- This e-mail was sent by GlaxoSmithKline Services Unlimited (registered in England and Wales No. 1047315), which is a member of the GlaxoSmithKline group of companies. The registered address of GlaxoSmithKline Services Unlimited is 980 Great West Road, Brentford, Middlesex TW8 9GS. ----------------------------------------------------------- [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing ----------------------------------------------------------- This e-mail was sent by GlaxoSmithKline Services Unlimited (registered in England and Wales No. 1047315), which is a member of the GlaxoSmithKline group of companies. The registered address of GlaxoSmithKline Services Unlimited is 980 Great West Road, Brentford, Middlesex TW8 9GS. ----------------------------------------------------------- [[alternative HTML version deleted]]

ADD COMMENT • link 14.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 5 months ago

United States

Hi Khademul, Thank you so much for sharing your comparison between ChIPpeakAnno and CEAS, and the great suggestions! We will incorporate your ideas into the next release. The functionality of finding overlapping feature is exactly what we are planning to add. Best regards, Julie On 3/9/10 1:40 PM, "Khademul Islam" <khademul.islam@gmail.com> wrote: hi Julie, Thanks for your mail. I use CEAS but not CiSGenome. I liked ChIPpeakAnno because # its in R, and fast # We can download peak region sequence for motif finding, or flanking region for primer design # We can do GO enrichment # CEAS offer distance of peak from ALL RefSeq Known genes (out put in excel file), but we usually want to find ONLY those genes which is closest. and We can use Ensembl gene here............we can use Biomart here ............. # We can compare two bed files and if peaks from two TF (in two bed file) are overlapping or not; if not, how far are they or what is the closest TF......... even how far two closest TFs....... we can know (we dont get it in CEAS). ## CEAS generates a pie chart of graphical distribution of peaks in different region (intron, exon, promoter, distal intergenic, downstream, upstream.......etc)...... but does not say actual distance............ that we can get from ChIPpeakAnno......... not all but at least distance from TSS, TTS, so, how far in promoter easily. I have some suggestions: ------------------------------------- # For example, there is one big size gene and its contains a miRNA gene inside it. I have observed that my peak is close the start side of the miRNA gene but actually it resides inside the big gene. So, ChIPpeakAnno decide to report the miRNA gene. However, in one analysis, I wanted to know only peaks that are residing within a gene. In this case the feature "TRUE" did not work. So, I used another tool called "BedTools" .......... then I realize that although ChIPpeakAnno saying the peak is "FALSE", but it is actually residing within a gene which is not reported. So, in some cases it would be nice to see "overlapping" feature within one tool ChIPpeakAnno. ## in another case: my peak: 24,72,191 -- 24,172,382 position ChipPeakAnno Found closest gene (TSS-TTS): 24,134,709 -- 24,511,317 at -ve strand However, the peak is actually residing in one gene (TSS-TTS): +ve strand 23,961,810--23,996,240 So, in that case, ChIPpeakAnno has right calculation that for negative strand, the gene closest to the peak it detect is correct. But this peak is still far far far away from TSS of the -ve stranded gene. So, in this case it would be more interesting to know that peak is actually residing inside a gene (+ve strand gene) although this gene's TSS is more far away than the -ve strand gene. So, considering this, another column of feature, if it is overlapping with any other gene or not would be interesting to know. Because, although peak is far from TSS, but it can work on gene splicing events etc other function ......................... ### You may extent the feature of the tool by making venn diagram of overlapping peaks. Currently I use BedTools to find overlap between two bed files, but using another tools "Cistrome" to generate venn diagram as BedTools does not do that or give account of overlapping peak directly (although we can count easily.........) ### may be some graphical presentation can also increase strength of ChIPpeakAnno. ### I guess there would be updated tutorial with some example code when you update it or include some more features. However, I would like to to thank for writing such nice and helpful package........... it's so useful in my daily ChIPseq data analysis. Whish you all the best with publication and looking fwd to see published version. thank you, Khademul Visiting Scholar, University of illinois at Chicago, USA [ PhD Student, Barcelona Biomedical Research Park Barcelona, Spain ] ====================================================================== ====== ============================================================ On Tue, Mar 9, 2010 at 9:02 AM, Zhu, Julie <julie.zhu@umassmed.edu> wrote: Hi Amy, Thank you very much for the feedback! The inside feature is TRUE if the peakSummit (stored as Start in the RangedData) is inside the nearest annotated feature, FALSE otherwise. The feature you are proposing is useful for finding overlapping fragments between query peak ranges and target ranges. The timing of your suggestion cannot be better. We plan to add a function called FindOverlappingPeaks that will be added to the dev version, your ideas will be incorporated. Please let me know if you and others think that it is better to incorporate your ideas into annotatePeakInBatch. Thanks again for your help making this package more useful. I have a favor to ask you and those who have used or are aware of the ChIPpeakAnno package. We submitted a paper describing this package and just got the reviewers' comments back. One question comes up is that what this package offers that the existing annotation tools such as Cisgenome and CEAS do not. I thought it might be useful to get the feedback from the users of this package. If possible, could you please send me your thoughts on this, especially the reasons you chose using this package? Thanks a lot for your time and help! Best regards, Julie ******************************************* Lihua Julie Zhu, Ph.D Research Associate Professor Program Gene Function and Expression University of Massachusetts Medical School 364 Plantation Street, Room 613 Worcester, MA 01605 508-856-5256 http://www.umassmed.edu/pgfe/faculty/zhu.cfm On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth@gsk.com <http:="" amy.m.molesworth@gsk.com=""> > wrote: Firstly I'd like to thank the authors of the very useful package ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno annotatePeakInBatch function results that other users may or may not be aware of. I also propose improvements to compensate. The resulting insideFeature column reports TRUE if the query peak is either contained within an annotated feature, and also reports TRUE if it overlaps the end of an annotated feature. I think its worth noting that it reports FALSE if the peak overlaps the beginning of an annotated feature, and also reports FALSE if the peak overlaps in entirety an annotated feature(s). So, perhaps the insideFeature column (or additional new column called overlappingFeature) could report five options: ("false","inside","overlapStart","overlapEnd","super"). I haven't looked into the effects on how distanceToFeature should/could be called for each different scenario. Apologies if this has already been addressed, or if others do not consider this useful. Details with dummy example are described below. Many thanks, Amy. ##### In the dummy example below, p1 is bigger than f1 and consequently p1 overlaps it in entirety. It would be nice if ChIPpeakAnno could report this - although I accept it may overlap more than one feature, so would need to consider how to deal with that. And another example from below, p3 in fact overlaps with the start of f3, but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno could report it as OverlapStart. p4 is called as insideFeature = TRUE for overlapping with f4, but it would be nice if ChIPpeakAnno could report it as OverlapEnd or something similar. And correctly p2 is called as insideFeature = TRUE for overlap with f2, in this case p2 ranges are within the f2 ranges as you would expect. library(ChIPpeakAnno) peaks = RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167 889600),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1", "p2","p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) features = RangedData(IRanges(start=c(1549800,1554400,1565000,1569400 ,167888600),end=c(1550599,1560799,1565399,1571199,167888999),names=c(" f1","f2","f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) as.data.frame(annoPeaks) space start end width names strand feature start_position 1 5 167889600 167893599 4000 p5 1 f5 167888600 2 6 1543200 1555199 12000 p1 1 f1 1549800 3 6 1557200 1560599 3400 p2 1 f2 1554400 4 6 1563000 1565199 2200 p3 1 f3 1565000 5 6 1569800 1573799 4000 p4 1 f4 1569400 end_position insideFeature distancetoFeature 1 167888999 FALSE 1000 2 1550599 FALSE -6600 3 1560799 TRUE 2800 4 1565399 FALSE -2000 5 1571199 TRUE 400 > sessionInfo() R version 2.10.0 (2009-10-26) x86_64-unknown-linux-gnu locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C [5] LC_MONETARY=C LC_MESSAGES=C [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ChIPpeakAnno_1.3.0 org.Hs.eg.db_2.3.6 [3] GO.db_2.3.5 RSQLite_0.7-3 [5] DBI_0.2-4 AnnotationDbi_1.8.0 [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 [9] Biostrings_2.14.2 IRanges_1.5.18 [11] multtest_2.2.0 Biobase_2.6.0 [13] biomaRt_2.3.0 loaded via a namespace (and not attached): [1] MASS_7.3-3 RCurl_1.3-0 XML_2.6-0 splines_2.10.0 [5] survival_2.35-7 ----------------------------------------------------------- This e-mail was sent by GlaxoSmithKline Services Unlimited (registered in England and Wales No. 1047315), which is a member of the GlaxoSmithKline group of companies. The registered address of GlaxoSmithKline Services Unlimited is 980 Great West Road, Brentford, Middlesex TW8 9GS. ----------------------------------------------------------- [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org <http: bioc-sig-="" sequencing@r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing [[alternative HTML version deleted]]

ADD COMMENT • link 14.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 5 months ago

United States

Hi Ivan, Thanks so much for sharing your code snippets and great ideas! Topic 1 Currently, BED2RangedData, GFF2RangedData as well as annotatePeakInBatch auto generate the name for each row using chromosome name, the start and the end fields if the name field is not defined. To preserve the input order, your approach of using just numbers for names would work better. Topic 2 AnnotationData that ships with the package were built using getAnnotation function. Thank you all so much for all these great ideas! Best regards, Julie On 3/9/10 2:48 PM, "Ivan Gregoretti" <ivangreg@gmail.com> wrote: It seems that we are getting quite excited about ChIPpeakAnno. Good! Two additions to this discussion ########## # Topic 1 ########## Just a comment about AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] That statement will work ONLY if rownames have been defined in AannotatedRd. That means that AannotatedRd must inherit the Ard's rownames. In the simplest case, when you import a BED file into a RangedData instance, the row names will not be defined. My solution, currently in use: So as to preserve the rows in their original order then, what I do is name the rows with numbers as soon as I import my BED. Like this library(rtracklayer) Ard <- import('mypeaks.bed', 'bed') rownames(Ard) <- formatC(1:dim(Ard)[1], width=nchar(dim(Ard)[1]), flag='0') Now AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd) AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] will work like a charm. ########## # Topic 2 ########## I do not know how the AnnotationData (Srd above) that ships with the package is built but this is how I build it on the fly library(RMySQL) con <- dbConnect(MySQL(), user='genome', dbname='mm9', host='genome-mysql.cse.ucsc.edu') Srd <- as(dbGetQuery(con, "SELECT a.chrom AS space, a.txStart+1 AS start, a.txEnd AS end, a.strand AS strand FROM mm9.ensGene AS a GROUP BY a.name2"), "RangedData") rownames(Srd) <- as.list(dbGetQuery(con, "SELECT a.name2 AS name FROM mm9.ensGene AS a GROUP BY a.name2"))[[1]] Srd <- Srd[order(as.data.frame(Srd)$space, start(Srd)), ] dbDisconnect(con) Notice that I get ONE txStart per gene while in fact there may be multiple transcripts per single gene, with different start sites. So, Srd is suboptimal but I need it because ChIPpeakAnno needs gene names. In the ideal case, ChIPpeakAnno should be able to link transcripts to GO terms. That means that, instead of using Srd, we should be using Trd, which you can obtain on the fly like this: con <- dbConnect(MySQL(), user='genome', dbname='mm9', host='genome-mysql.cse.ucsc.edu') Trd <- as(dbGetQuery(con, "SELECT a.chrom AS space, a.txStart+1 AS start, a.txEnd AS end, a.strand AS strand FROM mm9.ensGene AS a GROUP BY a.name"), "RangedData") rownames(Trd) <- as.list(dbGetQuery(con, "SELECT a.name AS name FROM mm9.ensGene AS a GROUP BY a.name"))[[1]] Trd <- Trd[order(as.data.frame(Trd)$space, start(Trd)), ] dbDisconnect(con) I hope this helps improve the package. It's worth it. Ivan Ivan Gregoretti, PhD National Institute of Diabetes and Digestive and Kidney Diseases National Institutes of Health 5 Memorial Dr, Building 5, Room 205. Bethesda, MD 20892. USA. Phone: 1-301-496-1592 Fax: 1-301-496-9878 On Tue, Mar 9, 2010 at 10:44 AM, Zhu, Julie <julie.zhu@umassmed.edu> wrote: > Hi Ivan, > > Thank you very much for the feedback! Sure, I will add AannotatedRd <- > AannotatedRd[order(rownames(AannotatedRd)),] inside the function. Thanks! > > Best regards, > > Julie > > > On 3/9/10 10:24 AM, "Ivan Gregoretti" <ivangreg@gmail.com> wrote: > > I haven't used Cisgenome or CEAS so I can't speak about them. > > I use ChIPpeakAnno because, while working in Bioconductor, I get the > enriched GO terms in three lines of code. (Four lines, if you count > library(ChIPpeakAnno)). > > By the way, since you are in the process of upgrading the package, > would you consider this fixable problem?: > > say Ard and Srd are RangedData instances. When I do > > AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd) > > the order of rows in AannotatedRd is different from the order of rows > in Ard. Currently, I fix this by hand doing > > AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),] > > Would you mind incorporating this fix of row shuffling in the next upgrade? > > Both Amy and me use the stable release of R and Bioc, not the devel > version. I would highly appreciate if you could push the upgrade for > both stable and devel but I won't hold you accountable if you don't. > > Have a great day! > > Ivan > > > Ivan Gregoretti, PhD > National Institute of Diabetes and Digestive and Kidney Diseases > National Institutes of Health > 5 Memorial Dr, Building 5, Room 205. > Bethesda, MD 20892. USA. > Phone: 1-301-496-1592 > Fax: 1-301-496-9878 > > > > On Tue, Mar 9, 2010 at 10:02 AM, Zhu, Julie <julie.zhu@umassmed.edu> wrote: >> Hi Amy, >> >> Thank you very much for the feedback! >> >> The inside feature is TRUE if the peakSummit (stored as Start in the >> RangedData) is inside the nearest annotated feature, FALSE otherwise. The >> feature you are proposing is useful for finding overlapping fragments >> between query peak ranges and target ranges. The timing of your suggestion >> cannot be better. We plan to add a function called FindOverlappingPeaks >> that will be added to the dev version, your ideas will be incorporated. >> Please let me know if you and others think that it is better to >> incorporate >> your ideas into annotatePeakInBatch. Thanks again for your help making >> this >> package more useful. >> >> I have a favor to ask you and those who have used or are aware of the >> ChIPpeakAnno package. We submitted a paper describing this package and >> just >> got the reviewers' comments back. One question comes up is that what this >> package offers that the existing annotation tools such as Cisgenome and >> CEAS >> do not. I thought it might be useful to get the feedback from the users of >> this package. If possible, could you please send me your thoughts on this, >> especially the reasons you chose using this package? Thanks a lot for your >> time and help! >> >> Best regards, >> >> Julie >> >> >> ******************************************* >> Lihua Julie Zhu, Ph.D >> Research Associate Professor >> Program Gene Function and Expression >> University of Massachusetts Medical School >> 364 Plantation Street, Room 613 >> Worcester, MA 01605 >> 508-856-5256 >> http://www.umassmed.edu/pgfe/faculty/zhu.cfm >> >> >> >> On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth@gsk.com> wrote: >> >> >> >> Firstly I'd like to thank the authors of the very useful package >> ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno >> annotatePeakInBatch function results that other users may or may not be >> aware of. I also propose improvements to compensate. >> >> The resulting insideFeature column reports TRUE if the query peak is >> either >> contained within an annotated feature, and also reports TRUE if it >> overlaps >> the end of an annotated feature. >> >> I think its worth noting that it reports FALSE if the peak overlaps the >> beginning of an annotated feature, and also reports FALSE if the peak >> overlaps in entirety an annotated feature(s). >> >> So, perhaps the insideFeature column (or additional new column called >> overlappingFeature) could report five options: >> ("false","inside","overlapStart","overlapEnd","super"). I haven't looked >> into the effects on how distanceToFeature should/could be called for each >> different scenario. >> >> Apologies if this has already been addressed, or if others do not consider >> this useful. >> >> Details with dummy example are described below. >> >> Many thanks, >> Amy. >> >> ##### >> >> In the dummy example below, p1 is bigger than f1 and consequently p1 >> overlaps it in entirety. It would be nice if ChIPpeakAnno could report >> this >> - although I accept it may overlap more than one feature, >> so would need to consider how to deal with that. >> >> And another example from below, p3 in fact overlaps with the start of f3, >> but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno >> could >> report it as OverlapStart. >> >> p4 is called as insideFeature = TRUE for overlapping with f4, but it would >> be nice if ChIPpeakAnno could report it as OverlapEnd or something >> similar. >> >> And correctly p2 is called as insideFeature = TRUE for overlap with f2, in >> this case p2 ranges are within the f2 ranges as you would expect. >> >> >> library(ChIPpeakAnno) >> peaks = >> >> RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,16788960 0),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1","p2", "p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) >> features = >> >> RangedData(IRanges(start=c(1549800,1554400,1565000,1569400,1678886 00),end=c(1550599,1560799,1565399,1571199,167888999),names=c("f1","f2" ,"f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) >> >> annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) >> >> as.data.frame(annoPeaks) >> >> space start end width names strand feature start_position >> 1 5 167889600 167893599 4000 p5 1 f5 167888600 >> 2 6 1543200 1555199 12000 p1 1 f1 1549800 >> 3 6 1557200 1560599 3400 p2 1 f2 1554400 >> 4 6 1563000 1565199 2200 p3 1 f3 1565000 >> 5 6 1569800 1573799 4000 p4 1 f4 1569400 >> end_position insideFeature distancetoFeature >> 1 167888999 FALSE 1000 >> 2 1550599 FALSE -6600 >> 3 1560799 TRUE 2800 >> 4 1565399 FALSE -2000 >> 5 1571199 TRUE 400 >> >> >>> sessionInfo() >> R version 2.10.0 (2009-10-26) >> x86_64-unknown-linux-gnu >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C >> [5] LC_MONETARY=C LC_MESSAGES=C >> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ChIPpeakAnno_1.3.0 org.Hs.eg.db_2.3.6 >> [3] GO.db_2.3.5 RSQLite_0.7-3 >> [5] DBI_0.2-4 AnnotationDbi_1.8.0 >> [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 >> [9] Biostrings_2.14.2 IRanges_1.5.18 >> [11] multtest_2.2.0 Biobase_2.6.0 >> [13] biomaRt_2.3.0 >> >> loaded via a namespace (and not attached): >> [1] MASS_7.3-3 RCurl_1.3-0 XML_2.6-0 splines_2.10.0 >> [5] survival_2.35-7 >> >> >> ----------------------------------------------------------- >> This e-mail was sent by GlaxoSmithKline Services Unlimited >> (registered in England and Wales No. 1047315), which is a >> member of the GlaxoSmithKline group of companies. The >> registered address of GlaxoSmithKline Services Unlimited >> is 980 Great West Road, Brentford, Middlesex TW8 9GS. >> ----------------------------------------------------------- >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> Bioc-sig-sequencing@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> >> > > > [[alternative HTML version deleted]]

ADD COMMENT • link 14.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 5 months ago

United States

Dear Amy, This is to let you know that the insideFeature column has been fixed ( Version: 1.2.6). If you run your example below, you should get the correct classification. Please let me know if you encounter any problems. Thank you so much for the helpful feedback! BTW, do you mind if I use your example below in the help manual? Thanks! Best regards, Julie ******************************************* Lihua Julie Zhu, Ph.D Research Associate Professor Program Gene Function and Expression University of Massachusetts Medical School 364 Plantation Street, Room 613 Worcester, MA 01605 508-856-5256 http://www.umassmed.edu/pgfe/faculty/zhu.cfm ******************************************* On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth@gsk.com> wrote: Firstly I'd like to thank the authors of the very useful package ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno annotatePeakInBatch function results that other users may or may not be aware of. I also propose improvements to compensate. The resulting insideFeature column reports TRUE if the query peak is either contained within an annotated feature, and also reports TRUE if it overlaps the end of an annotated feature. I think its worth noting that it reports FALSE if the peak overlaps the beginning of an annotated feature, and also reports FALSE if the peak overlaps in entirety an annotated feature(s). So, perhaps the insideFeature column (or additional new column called overlappingFeature) could report five options: ("false","inside","overlapStart","overlapEnd","super"). I haven't looked into the effects on how distanceToFeature should/could be called for each different scenario. Apologies if this has already been addressed, or if others do not consider this useful. Details with dummy example are described below. Many thanks, Amy. ##### In the dummy example below, p1 is bigger than f1 and consequently p1 overlaps it in entirety. It would be nice if ChIPpeakAnno could report this - although I accept it may overlap more than one feature, so would need to consider how to deal with that. And another example from below, p3 in fact overlaps with the start of f3, but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno could report it as OverlapStart. p4 is called as insideFeature = TRUE for overlapping with f4, but it would be nice if ChIPpeakAnno could report it as OverlapEnd or something similar. And correctly p2 is called as insideFeature = TRUE for overlap with f2, in this case p2 ranges are within the f2 ranges as you would expect. library(ChIPpeakAnno) peaks = RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167 889600),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1", "p2","p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5)) features = RangedData(IRanges(start=c(1549800,1554400,1565000,1569400 ,167888600),end=c(1550599,1560799,1565399,1571199,167888999),names=c(" f1","f2","f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5)) annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features) as.data.frame(annoPeaks) space start end width names strand feature start_position 1 5 167889600 167893599 4000 p5 1 f5 167888600 2 6 1543200 1555199 12000 p1 1 f1 1549800 3 6 1557200 1560599 3400 p2 1 f2 1554400 4 6 1563000 1565199 2200 p3 1 f3 1565000 5 6 1569800 1573799 4000 p4 1 f4 1569400 end_position insideFeature distancetoFeature 1 167888999 FALSE 1000 2 1550599 FALSE -6600 3 1560799 TRUE 2800 4 1565399 FALSE -2000 5 1571199 TRUE 400 > sessionInfo() R version 2.10.0 (2009-10-26) x86_64-unknown-linux-gnu locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C [5] LC_MONETARY=C LC_MESSAGES=C [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ChIPpeakAnno_1.3.0 org.Hs.eg.db_2.3.6 [3] GO.db_2.3.5 RSQLite_0.7-3 [5] DBI_0.2-4 AnnotationDbi_1.8.0 [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0 [9] Biostrings_2.14.2 IRanges_1.5.18 [11] multtest_2.2.0 Biobase_2.6.0 [13] biomaRt_2.3.0 loaded via a namespace (and not attached): [1] MASS_7.3-3 RCurl_1.3-0 XML_2.6-0 splines_2.10.0 [5] survival_2.35-7 ----------------------------------------------------------- This e-mail was sent by GlaxoSmithKline Services Unlimited (registered in England and Wales No. 1047315), which is a member of the GlaxoSmithKline group of companies. The registered address of GlaxoSmithKline Services Unlimited is 980 Great West Road, Brentford, Middlesex TW8 9GS. ----------------------------------------------------------- [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing [[alternative HTML version deleted]]

ADD COMMENT • link 14.1 years ago Julie Zhu ★ 4.3k

Login before adding your answer.