Question

Query on ChIPpeakAnno: annotatePeakInBatch input

Julie Zhu ★ 4.3k

@julie-zhu-3596

Parthav,

Could you please send us the code snippets, a test BED file, and the sessionInfo? Thanks!

Best regards,
Julie

On 12/3/13 9:43 AM, "Jailwala, Parthav (NIH/NCI) [C]" <parthav.jailwala at nih.gov> wrote:

> Hi Julie,
>
> I have a strand issue with the annotatePeakInBatch function in the
> ChIPpeakAnno package, and I am reaching out to see whether you can help me
> figure out what is wrong.
>
> I am trying to find the distance to the TSS for a set of lincRNAs. To do
> this, I am using my own BED file as the 'background', or annotation. The BED
> file looks like this:
>
> Y 597158 623056 Ddx3y -
> Y 346986 365290 Eif2s3y +
> Y 2118049 2129045 Gm10256 +
> Y 2156899 2168120 Gm10352 +
> Y 1976249 1976584 Gm16501 -
> Y 2390390 2398856 Gm3376 +
>
> As you can see, there is no header row for the column names, and the
> fifth column is the strand of the feature.
>
> Now, when I run the command, the 'strand' column in the output file is always
> +ve (always +, even though the feature is on the -ve strand).
>
> Here is a sample from the output file:
>
> "","space","start","end","width","names","peak","strand","feature","start_position","end_position","insideFeature","distancetoFeature","shortestDistance","fromOverlappingOrNearest"
> "1","1",9708702,9782003,73302,"0001 23152","0001","+","23152",9708703,9738463,"includeFeature",-1,1,"NearestStart"
> "2","1",134088012,134153958,65947,"0002 22624","0002","+","22624",134088013,134153958,"overlapStart",-1,0,"NearestStart"
> "3","1",171899539,172040632,141094,"0003 22283","0003","+","22283",171902439,172040632,"overlapStart",-2900,0,"NearestStart"
> "4","1",195333431,195335997,2567,"0004 22164","0004","+","22164",195172540,195196491,"downstream",160891,136940,"NearestStart"
>
> I would really appreciate it if you could tell me what is wrong with my inputs.
>
> Thanks,
> Parthav Jailwala
>
> Parthav Jailwala [Contractor]
> Bioinformatics Analyst, CCRIFX Bioinformatics Core
> Advanced Biomedical Computing Center (ABCC)
> Information Systems Program
> Leidos Biomedical Research, Inc. (formerly SAIC-Frederick, Inc.)
> Frederick National Laboratory for Cancer Research (FNLCR)
> P. O. Box B, Frederick, MD 21702
> Building 41-B620, NIH, Bethesda, MD
> E-mail: parthav.jailwala at nih.gov
> Bethesda: 301.451.3455
> Frederick: 301.846.5664
> Fax (Bethesda): 301.480.0391
> http://ccrifx.cancer.gov

Annotation Cancer ChIPpeakAnno • 1.4k views

updated 10.4 years ago by Jailwala, Parthav NIH/NCI [C] ▴ 30 • written 10.4 years ago by Julie Zhu ★ 4.3k

score 0 · Answer 1 · 2013-12-03

Julie,

Thanks for your response. Attached are my input file of 'peaks' (2070_lincRNA_mergedGTF.txt) and the feature annotation file that I am using (23188_PCGgroup_EnsemblGTFwithstrand.txt; it has strand information coded as +,-).

Also attached is the output file, which shows the strand information as all positive (2070lincRNAmergedGTF.annout).

Here is the sessionInfo():

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ChIPpeakAnno_2.10.0                 GenomicFeatures_1.14.2
 [3] limma_3.18.3                        org.Hs.eg.db_2.10.1
 [5] GO.db_2.10.1                        RSQLite_0.11.4
 [7] DBI_0.2-7                           AnnotationDbi_1.24.0
 [9] BSgenome.Ecoli.NCBI.20080805_1.3.17 BSgenome_1.30.0
[11] GenomicRanges_1.14.3                Biostrings_2.30.1
[13] XVector_0.2.0                       IRanges_1.20.6
[15] multtest_2.18.0                     Biobase_2.22.0
[17] biomaRt_2.18.0                      BiocGenerics_0.8.0
[19] VennDiagram_1.6.5

loaded via a namespace (and not attached):
 [1] MASS_7.3-29        RCurl_1.95-4.1     Rsamtools_1.14.2   XML_3.98-1.1
 [5] bitops_1.0-6       rtracklayer_1.22.0 splines_3.0.2      stats4_3.0.2
 [9] survival_2.37-4    tools_3.0.2        zlibbioc_1.8.0

Attachments:
23188_PCGgroup_EnsemblGTFwithstrand.txt: https://stat.ethz.ch/pipermail/bioconductor/attachments/20131203/1d2601d4/attachment-0002.txt
2070_lincRNA_mergedGTF.txt: https://stat.ethz.ch/pipermail/bioconductor/attachments/20131203/1d2601d4/attachment-0003.txt

score 0 · Answer 2 · 2013-12-03

Parthav,

Your annotation file is not in BED format: the strand information needs to be in the 6th column (http://genome.ucsc.edu/FAQ/FAQformat#format1). You can fix it by adding a score as the 5th column.

Please let me know if you still have a problem after fixing the annotation file. Thanks!

Best regards,
Julie
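The fix described in this answer (adding a score column so that the strand lands in column 6) can be sketched outside R as follows. This is a minimal illustration only, assuming whitespace-delimited rows with exactly five fields (chromosome, start, end, name, strand) and no header; the function name `add_score_column` and the placeholder score of 0 are hypothetical choices, not part of ChIPpeakAnno.

```python
def add_score_column(lines, score="0"):
    """Turn 5-field rows (chrom, start, end, name, strand) into 6-field BED rows.

    Rows that do not look like the 5-field layout are passed through unchanged.
    """
    fixed = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 5 and fields[4] in {"+", "-"}:
            # Insert a placeholder score so that strand moves to column 6.
            fields.insert(4, score)
        fixed.append("\t".join(fields))
    return fixed


# Two rows taken from the annotation sample in the question.
rows = [
    "Y 597158 623056 Ddx3y -",
    "Y 346986 365290 Eif2s3y +",
]
for row in add_score_column(rows):
    print(row)
```

The score value itself is arbitrary here; the point is only that the strand ends up in the sixth field, where BED readers expect it.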

score 0 · Answer 3 · 2013-12-03

Parthav,

Great to know that you get the correct strand information now.

To understand the meaning of each output variable, please type help(annotatePeakInBatch) in R. Under the Value section, you will see a description of each output variable. For example, distancetoFeature is described as "distance to the nearest feature such as transcription start site. By default, the distance is calculated as the distance between the start of the binding site and the TSS that is the gene start for genes located on the forward strand and the gene end for genes located on the reverse strand."

Please see additional inline comments below.

Best regards,
Julie

On 12/3/13 11:37 AM, "Jailwala, Parthav (NIH/NCI) [C]" <parthav.jailwala at nih.gov> wrote:

> Hi Julie,
>
> Thanks!
> I fixed the strand information in the annotation file, and now I do get
> correct strand information in the output.
>
> However, when looking at the output, I am still confused about the
> 'upstream/downstream' determination for features that are on the -ve strand.
> My understanding is that for genes on the reverse strand, the Start = 3'
> end of the gene and the End = 5' end of the gene. Hence, when I chose 'TSS'
> as the option, all distances should have been calculated from the TSS,
> that is, the 'End' coordinate for that gene.

Correct.

> Also, for features on the negative strand, if the start of the peak is
> higher than the TSS of the feature, then the peak is actually 'Upstream' of
> the feature. However, in the output, for features on the -ve strand, when the
> start of the peak is higher than the TSS of the feature, the peak is
> determined to be 'Downstream' of the feature.

Could you please send me an example output row? Also, which version of ChIPpeakAnno did you use? Please type sessionInfo() in R and copy the output.

> I would really appreciate it if you could advise whether my understanding is
> incorrect.
>
> Thanks
> Parthav
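The TSS convention quoted from help(annotatePeakInBatch) in this answer, together with the upstream/downstream reading described in the quoted message, can be sketched as follows. This is an illustration of the strand logic only and is not ChIPpeakAnno's implementation: the function names are hypothetical, and the sign convention (negative means upstream of the TSS, positive means downstream) is an assumption chosen to match the thread's discussion. The coordinates come from row 4 of the output sample in the question.

```python
def tss(feature_start, feature_end, strand):
    """Return the TSS: gene start on the forward strand, gene end on the reverse strand."""
    return feature_start if strand == "+" else feature_end


def distance_to_tss(peak_start, feature_start, feature_end, strand):
    """Signed distance from the peak start to the TSS, in the gene's own orientation.

    Assumed convention: negative values are upstream, positive values downstream.
    """
    site = tss(feature_start, feature_end, strand)
    if strand == "+":
        return peak_start - site
    # On the reverse strand, higher coordinates lie upstream of the TSS.
    return site - peak_start


def relative_position(distance):
    """Label a signed distance under the assumed convention."""
    if distance < 0:
        return "upstream"
    if distance > 0:
        return "downstream"
    return "at TSS"


# Same peak and feature coordinates, read on each strand in turn.
for strand in ("+", "-"):
    d = distance_to_tss(195333431, 195172540, 195196491, strand)
    print(strand, d, relative_position(d))
```

Read on the forward strand, this gives 160891 (downstream), which agrees with the row in the question; read on the reverse strand, the same peak lies 136940 bases upstream of the TSS, which is the behaviour expected in the quoted message.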