Difficulties with tileHMM's gff output

0

Entering edit mode

Rayna ▴ 40

@rayna-4236

Last seen 11.3 years ago

Dear List, I use tileHMM to assess for bound/unbound regions in my ChIP-chip data. It comes from E. coli, so it is not epigenomics stuff :) Here is the code which gives me the gff file where only the enriched probes are kept and associated to regions: R> gff <- reg2gff(regions.clean, post.enriched, data.frame(chromosome=layout$probe.id, position=layout$pos)) R> regions <- data.frame(chr=gff$chr,start=gff$start, end=gff$end, score=gff$score) R> l <- list(regions=regions, probe.state=probe.state) R> l What I obtain is: ##gff-version 3 ##Wed Sep 22 17:33:09 2010 NC_000913 nimble region 1 1 1 + Name=region_region1 NC_000913 nimble region 1000016 1000016 1 + Name=region_region2 NC_000913 nimble region 100017 100017 1 + Name=region_region3 NC_000913 nimble region 1000184 1000184 1 + Name=region_region4 NC_000913 nimble region 100041 100041 0.99 + Name=region_region5 NC_000913 nimble region 1000424 1000424 1 + Name=region_region6 [...] NC_000913 nimble region 101337 1013223 0,96 + Name=region_region89 NC_000913 nimble region 101361 1013391 1 + Name=region_region90 Which is weird for me, for several reasons. First here is that I have a region start = 1 and a region end = 1 (for region 1, for example). I checked the layout (a merge of the .ndf and of the .pos files). This is the number of the probe as it is described in the layout: R> head(layout) probe.id sequence x y 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580 820 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296 920 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711 919 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478 918 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106 716 chr pos length 1 NC_000913 1 50 2 NC_000913 1 50 3 NC_000913 1000016 50 4 NC_000913 1000016 50 5 NC_000913 1000040 50 6 NC_000913 1000040 50 Therefore, the problem comes apparently from here where the value in the column "pos" is the probe's number. I don't know how to think about a region with an excellent score (in the case of the region 1, it is the best score one may obtain) which begins at position 1 and ends at position 1, with a probe size of 50 bp. By the way, I have nowhere a correspondence between the probe ID and the gene it matches. So, somehow, blasting all of the probes against the genome seems a bit tedious and not really optimal... Second, when I was looking further in the gff, there are things such as the example of the region 90 I pasted above. Here, the region is very big. I checked the probes and so, the start position 101361 corresponds to the probe ID 101361 which lays in the interval 101361-101410 as listed in the ndf file. Moreover, the end position 1013391 corresponds to a probe ID 1013391 which covers the interval 1013391-1013440 according to the ndf file. I'm really confused what to think about this stuff and would be very grateful in case someone could explain me how I'm supposed to read this gff. Thanks a lot in advance :) Best, Rayna -- "Change l'ordre du monde plutÃ´t que tes dÃ©sirs." Mon blog perso/My personal blog : http://hatewasabi.wordpress.com/ Relectrice LinuxFr.org (http://linuxfr.org/~Malicia/<http: linuxfr.org="" %7emalicia=""/> ) PhD Student "Molecular Evolution and Bioinformatics" Ludwig-Maximilians University (LMU) of Munich [[alternative HTML version deleted]]

probe probe • 1.7k views

ADD COMMENT • link updated 15.3 years ago by Peter Humburg ▴ 30 • written 15.3 years ago by Rayna ▴ 40

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 months ago

United States

On 09/23/2010 07:47 AM, Rayna wrote: > Dear List, > > I use tileHMM to assess for bound/unbound regions in my ChIP-chip data. It > comes from E. coli, so it is not epigenomics stuff :) Hi Rayna -- tileHMM is not a Bioconductor package; someone here might respond with something useful, but probably you want to contact the package maintainer (cc'd on this message) > packageDescription('tileHMM')$Maintainer [1] "Peter Humburg ... Martin > > Here is the code which gives me the gff file where only the enriched probes > are kept and associated to regions: > > R> gff <- reg2gff(regions.clean, post.enriched, > data.frame(chromosome=layout$probe.id, > position=layout$pos)) > R> regions <- data.frame(chr=gff$chr,start=gff$start, end=gff$end, > score=gff$score) > R> l <- list(regions=regions, probe.state=probe.state) > R> l > > What I obtain is: > > ##gff-version 3 > ##Wed Sep 22 17:33:09 2010 > NC_000913 nimble region 1 1 1 + > Name=region_region1 > NC_000913 nimble region 1000016 1000016 1 + > Name=region_region2 > NC_000913 nimble region 100017 100017 1 + > Name=region_region3 > NC_000913 nimble region 1000184 1000184 1 + > Name=region_region4 > NC_000913 nimble region 100041 100041 0.99 + > Name=region_region5 > NC_000913 nimble region 1000424 1000424 1 + > Name=region_region6 > [...] > NC_000913 nimble region 101337 1013223 0,96 + > Name=region_region89 > NC_000913 nimble region 101361 1013391 1 + > Name=region_region90 > > > Which is weird for me, for several reasons. > First here is that I have a region start = 1 and a region end = 1 (for > region 1, for example). I checked the layout (a merge of the .ndf and of the > .pos files). This is the number of the probe as it is described in the > layout: > > R> head(layout) > probe.id sequence x y > 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023 > 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580 820 > 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296 920 > 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711 919 > 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478 918 > 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106 716 > chr pos length > 1 NC_000913 1 50 > 2 NC_000913 1 50 > 3 NC_000913 1000016 50 > 4 NC_000913 1000016 50 > 5 NC_000913 1000040 50 > 6 NC_000913 1000040 50 > > > Therefore, the problem comes apparently from here where the value in the > column "pos" is the probe's number. I don't know how to think about a region > with an excellent score (in the case of the region 1, it is the best score > one may obtain) which begins at position 1 and ends at position 1, with a > probe size of 50 bp. By the way, I have nowhere a correspondence between the > probe ID and the gene it matches. So, somehow, blasting all of the probes > against the genome seems a bit tedious and not really optimal... > > Second, when I was looking further in the gff, there are things such as the > example of the region 90 I pasted above. Here, the region is very big. I > checked the probes and so, the start position 101361 corresponds to the > probe ID 101361 which lays in the interval 101361-101410 as listed in the > ndf file. Moreover, the end position 1013391 corresponds to a probe ID > 1013391 which covers the interval 1013391-1013440 according to the ndf file. > > I'm really confused what to think about this stuff and would be very > grateful in case someone could explain me how I'm supposed to read this gff. > > Thanks a lot in advance :) > > Best, > Rayna > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 15.3 years ago Martin Morgan 25k

0

Entering edit mode

Peter Humburg ▴ 30

@peter-humburg-4272

Last seen 10.4 years ago

United Kingdom

Dear Rayna, The tileHMM package is not part of Bioconductor so this is somewhat off-topic here. See comments inline below. On Thu, 2010-09-23 at 16:47 +0200, Rayna wrote: > Dear List, > > I use tileHMM to assess for bound/unbound regions in my ChIP-chip data. It > comes from E. coli, so it is not epigenomics stuff :) > > Here is the code which gives me the gff file where only the enriched probes > are kept and associated to regions: > > R> gff <- reg2gff(regions.clean, post.enriched, > data.frame(chromosome=layout$probe.id, > position=layout$pos)) > R> regions <- data.frame(chr=gff$chr,start=gff$start, end=gff$end, > score=gff$score) > R> l <- list(regions=regions, probe.state=probe.state) > R> l > > What I obtain is: > > ##gff-version 3 > ##Wed Sep 22 17:33:09 2010 > NC_000913 nimble region 1 1 1 + > Name=region_region1 > NC_000913 nimble region 1000016 1000016 1 + > Name=region_region2 > NC_000913 nimble region 100017 100017 1 + > Name=region_region3 > NC_000913 nimble region 1000184 1000184 1 + > Name=region_region4 > NC_000913 nimble region 100041 100041 0.99 + > Name=region_region5 > NC_000913 nimble region 1000424 1000424 1 + > Name=region_region6 > [...] > NC_000913 nimble region 101337 1013223 0,96 + > Name=region_region89 > NC_000913 nimble region 101361 1013391 1 + > Name=region_region90 > > > Which is weird for me, for several reasons. > First here is that I have a region start = 1 and a region end = 1 (for > region 1, for example). I checked the layout (a merge of the .ndf and of the > .pos files). This is the number of the probe as it is described in the > layout: > > R> head(layout) > probe.id sequence x y > 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023 > 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580 820 > 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296 920 > 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711 919 > 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478 918 > 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106 716 > chr pos length > 1 NC_000913 1 50 > 2 NC_000913 1 50 > 3 NC_000913 1000016 50 > 4 NC_000913 1000016 50 > 5 NC_000913 1000040 50 > 6 NC_000913 1000040 50 > Therefore, the problem comes apparently from here where the value in the > column "pos" is the probe's number. I don't know how to think about a region > with an excellent score (in the case of the region 1, it is the best score > one may obtain) which begins at position 1 and ends at position 1, with a > probe size of 50 bp. So there seems to be a problem with the data.frame containing the probe positions. The pos column should contain the start position of the probe in the genome. It looks like the number in the probe name may actually the position, so that should be fine. But note that there are duplicates that you will need to deal with. Looking at the probe sequences these look like forward and reverse strand probes. TileHMM expects a single probe per start position. If you are not looking for strand specific information you probably should summarize the two probe intensities in an appropriate way. > By the way, I have nowhere a correspondence between the > probe ID and the gene it matches. So, somehow, blasting all of the probes > against the genome seems a bit tedious and not really optimal... If you have the probe location in the genome there is no need for blasting. You can just get annotations for that position (eg with biomaRt). > > Second, when I was looking further in the gff, there are things such as the > example of the region 90 I pasted above. Here, the region is very big. I > checked the probes and so, the start position 101361 corresponds to the > probe ID 101361 which lays in the interval 101361-101410 as listed in the > ndf file. Moreover, the end position 1013391 corresponds to a probe ID > 1013391 which covers the interval 1013391-1013440 according to the ndf file. > > I'm really confused what to think about this stuff and would be very > grateful in case someone could explain me how I'm supposed to read this gff. I take it that is much longer than you are expecting for the protein you are studying. Could you confirm that the probes are sorted properly? You'll also want to check how many probes there are between the beginning and the end of that region. If there are unusually large gaps in a region of the genome you want to split the probe sequence to exclude these gaps. Best wishes, Peter > > Thanks a lot in advance :) > > Best, > Rayna > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 15.3 years ago Peter Humburg ▴ 30

0

Entering edit mode

Dear List, Martin and Peter, I'm new to R and Bioconductor, so I was unaware of the fact tileHMM is not a Bioconductor package... Thanks for pointing it out and sorry :/ Given that Peter answered here, I'll keep on discussing it on the list, this may interest other people :) 2010/9/23 Peter Humburg <peter.humburg@gmail.com> > [...] > So there seems to be a problem with the data.frame containing the probe > positions. The pos column should contain the start position of the probe > in the genome. It does, at least in the .pos file. Here is how it looks like: SEQ_ID CHROMOSOME PROBE_ID POSITION LENGTH COUNT NC_000913 NC_000913 ECOLIP1 1 50 1 NC_000913 NC_000913 ECOLIP25 25 50 1 NC_000913 NC_000913 ECOLIP49 49 50 1 NC_000913 NC_000913 ECOLIP73 73 50 1 NC_000913 NC_000913 ECOLIP97 97 50 1 It looks like the number in the probe name may actually > the position, so that should be fine. But note that there are duplicates > that you will need to deal with. Looking at the probe sequences these > look like forward and reverse strand probes. One single probe is described as FORWARD and REVERSE in the ndf file, but the pos file counts it once. At the end, the layout object contains both of the probes. TileHMM expects a single > probe per start position. If you are not looking for strand specific > information you probably should summarize the two probe intensities in > an appropriate way. > When you say "summarize the probe intensities in an appropriate way", you mean only keep the probe IDs where the probes showing a meaningful signal after normalization or something like this? The thing is that I'm searching for binding events wherever they happen, on whatever strain they happen... > > > > By the way, I have nowhere a correspondence between the > > probe ID and the gene it matches. So, somehow, blasting all of the probes > > against the genome seems a bit tedious and not really optimal... > > If you have the probe location in the genome there is no need for > blasting. You can just get annotations for that position (eg with > biomaRt). > Yes, I already did this before. The remark was because of other arrays I was dealing with before where all this information was in the description file. But it is out of scope here, sorry for mentionning it. > > > > > Second, when I was looking further in the gff, there are things such as > the > > example of the region 90 I pasted above. Here, the region is very big. I > > checked the probes and so, the start position 101361 corresponds to the > > probe ID 101361 which lays in the interval 101361-101410 as listed in the > > ndf file. Moreover, the end position 1013391 corresponds to a probe ID > > 1013391 which covers the interval 1013391-1013440 according to the ndf > file. > > > > I'm really confused what to think about this stuff and would be very > > grateful in case someone could explain me how I'm supposed to read this > gff. > > I take it that is much longer than you are expecting for the protein you > are studying. Could you confirm that the probes are sorted properly? > I think so: R> head(layout) probe.id sequence x y 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580 820 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296 920 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711 919 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478 918 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106 716 chr pos length 1 NC_000913 1 50 2 NC_000913 1 50 3 NC_000913 1000016 50 4 NC_000913 1000016 50 5 NC_000913 1000040 50 6 NC_000913 1000040 50 Similar sorting fashion in the object containing the signals. > You'll also want to check how many probes there are between the > beginning and the end of that region. If there are unusually large gaps > in a region of the genome you want to split the probe sequence to > exclude these gaps. > I used a fragment size of 500 and a max.gap of 300 (the sonication produces fragments of at about 500 bp). I'll check the number of probes of a given region showing this extension as I exampled previously. Thanks for your help :) > > Best wishes, > Peter > Best, Rayna -- "Change l'ordre du monde plutÃ´t que tes dÃ©sirs." Mon blog perso/My personal blog : http://hatewasabi.wordpress.com/ Relectrice LinuxFr.org (http://linuxfr.org/~Malicia/<http: linuxfr.org="" %7emalicia=""/> ) PhD Student "Molecular Evolution and Bioinformatics" Ludwig-Maximilians University (LMU) of Munich [[alternative HTML version deleted]]

ADD REPLY • link 15.3 years ago Rayna ▴ 40

0

Entering edit mode

Hi Rayna, On Thu, 2010-09-23 at 18:52 +0200, Rayna wrote: > Dear List, Martin and Peter, > > I'm new to R and Bioconductor, so I was unaware of the fact tileHMM is > not a Bioconductor package... Thanks for pointing it out and sorry :/ > > Given that Peter answered here, I'll keep on discussing it on the > list, this may interest other people :) > > 2010/9/23 Peter Humburg <peter.humburg at="" gmail.com=""> > [...] > So there seems to be a problem with the data.frame containing > the probe > positions. The pos column should contain the start position of > the probe > in the genome. > > It does, at least in the .pos file. Here is how it looks like: > > SEQ_ID CHROMOSOME PROBE_ID POSITION LENGTH COUNT > NC_000913 NC_000913 ECOLIP1 1 50 1 > NC_000913 NC_000913 ECOLIP25 25 50 1 > NC_000913 NC_000913 ECOLIP49 49 50 1 > NC_000913 NC_000913 ECOLIP73 73 50 1 > NC_000913 NC_000913 ECOLIP97 97 50 1 > Yes, that looks fine. > > It looks like the number in the probe name may actually > the position, so that should be fine. But note that there are > duplicates > that you will need to deal with. Looking at the probe > sequences these > look like forward and reverse strand probes. > > One single probe is described as FORWARD and REVERSE in the ndf file, > but the pos file counts it once. At the end, the layout object > contains both of the probes. > > > > TileHMM expects a single > probe per start position. If you are not looking for strand > specific > information you probably should summarize the two probe > intensities in > an appropriate way. > > When you say "summarize the probe intensities in an appropriate way", > you mean only keep the probe IDs where the probes showing a meaningful > signal after normalization or something like this? > If you are familiar with gene expression arrays you could think of these two probes as a probe set and then compute a probe set summary statistic. Of course two probes are a pretty small sample so you may want to be a bit careful about how you do this. Of the top of my head I am not sure what the best approach is but others here may have some insights. > The thing is that I'm searching for binding events wherever they > happen, on whatever strain they happen... > > > > > By the way, I have nowhere a correspondence between the > > probe ID and the gene it matches. So, somehow, blasting all > of the probes > > against the genome seems a bit tedious and not really > optimal... > > > If you have the probe location in the genome there is no need > for > blasting. You can just get annotations for that position (eg > with > biomaRt). > > Yes, I already did this before. The remark was because of other arrays > I was dealing with before where all this information was in the > description file. But it is out of scope here, sorry for mentionning > it. > > > > > > Second, when I was looking further in the gff, there are > things such as the > > example of the region 90 I pasted above. Here, the region is > very big. I > > checked the probes and so, the start position 101361 > corresponds to the > > probe ID 101361 which lays in the interval 101361-101410 as > listed in the > > ndf file. Moreover, the end position 1013391 corresponds to > a probe ID > > 1013391 which covers the interval 1013391-1013440 according > to the ndf file. > > > > I'm really confused what to think about this stuff and would > be very > > grateful in case someone could explain me how I'm supposed > to read this gff. > > > I take it that is much longer than you are expecting for the > protein you > are studying. Could you confirm that the probes are sorted > properly? > > I think so: > R> > head(layout) > probe.id sequence x > y > 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 > 1023 > 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580 > 820 > 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296 > 920 > 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711 > 919 > 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478 > 918 > 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106 > 716 > chr pos length > 1 NC_000913 1 50 > 2 NC_000913 1 50 > 3 NC_000913 1000016 50 > 4 NC_000913 1000016 50 > 5 NC_000913 1000040 50 > 6 NC_000913 1000040 50 > > Similar sorting fashion in the object containing the signals. > This does look wrong to me. There are no probes listed between 1 and 1000016 but in the table above there clearly are probes at 25, 49, 73 and 97. Please try the following (assuming your data.frame is called 'pos'): idx <- order(as.numeric(pos$pos) ) pos <- pos[idx,] head(pos) > > > You'll also want to check how many probes there are between > the > beginning and the end of that region. If there are unusually > large gaps > in a region of the genome you want to split the probe sequence > to > exclude these gaps. > > I used a fragment size of 500 and a max.gap of 300 (the sonication > produces fragments of at about 500 bp). I'll check the number of > probes of a given region showing this extension as I exampled > previously. > That sounds reasonable. I think the issue is the ordering. Don't forget to reorder the probe intensities as well. HTH, Peter > Thanks for your help :) > > > Best wishes, > Peter > > Best, > Rayna > > -- > "Change l'ordre du monde plut?t que tes d?sirs." > > Mon blog perso/My personal blog : http://hatewasabi.wordpress.com/ > Relectrice LinuxFr.org (http://linuxfr.org/~Malicia/) > > PhD Student > "Molecular Evolution and Bioinformatics" > Ludwig-Maximilians University (LMU) of Munich

ADD REPLY • link 15.3 years ago Peter Humburg ▴ 30

Login before adding your answer.