Rayna
Dear List,
I use tileHMM to identify bound and unbound regions in my ChIP-chip
data. The data come from E. coli, so it is not epigenomics stuff :)
Here is the code that produces the GFF file, in which only the enriched
probes are kept and assigned to regions:
R> gff <- reg2gff(regions.clean, post.enriched,
                  data.frame(chromosome=layout$probe.id,
                             position=layout$pos))
R> regions <- data.frame(chr=gff$chr, start=gff$start, end=gff$end,
                         score=gff$score)
R> l <- list(regions=regions, probe.state=probe.state)
R> l
What I obtain is:
##gff-version 3
##Wed Sep 22 17:33:09 2010
NC_000913 nimble region 1 1 1 + Name=region_region1
NC_000913 nimble region 1000016 1000016 1 + Name=region_region2
NC_000913 nimble region 100017 100017 1 + Name=region_region3
NC_000913 nimble region 1000184 1000184 1 + Name=region_region4
NC_000913 nimble region 100041 100041 0.99 + Name=region_region5
NC_000913 nimble region 1000424 1000424 1 + Name=region_region6
[...]
NC_000913 nimble region 101337 1013223 0,96 + Name=region_region89
NC_000913 nimble region 101361 1013391 1 + Name=region_region90
This looks odd to me, for several reasons.
First, I have regions whose start and end are both 1 (region 1, for
example). I checked the layout (a merge of the .ndf and .pos files),
and this value is the probe number as it appears in the layout:
R> head(layout)
  probe.id      sequence                                             x    y
1 ECOLIP1       TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023
2 ECOLIP1       AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580  820
3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296  920
4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711  919
5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478  918
6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106  716
  chr       pos     length
1 NC_000913 1       50
2 NC_000913 1       50
3 NC_000913 1000016 50
4 NC_000913 1000016 50
5 NC_000913 1000040 50
6 NC_000913 1000040 50
So the problem apparently comes from here: the value in the "pos"
column is the probe number. I don't know how to interpret a region
with an excellent score (region 1 has the best score one can obtain)
that begins at position 1 and ends at position 1 when the probes are
50 bp long. By the way, I have no mapping anywhere between probe IDs
and the genes they match, and blasting all of the probes against the
genome seems tedious and far from optimal...
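A minimal sketch of what the layout would imply if "pos" is read as the
1-based genomic start of each probe, which is an assumption here (the
interval 101361-101410 reported in the ndf file for probe 101361 is
consistent with it). The toy data frame copies a few rows of
head(layout) with the sequences omitted; the "start" and "end" columns
are added for illustration and are not part of the original layout.

```r
# Toy layout mirroring the head(layout) output (sequences omitted).
layout <- data.frame(
  probe.id = c("ECOLIP1", "ECOLIP1", "ECOLIP1000016", "ECOLIP1000016"),
  chr      = "NC_000913",
  pos      = c(1, 1, 1000016, 1000016),
  length   = 50,
  stringsAsFactors = FALSE
)

# Assumption: pos is the 1-based start coordinate of the probe on the genome.
layout$start <- layout$pos
layout$end   <- layout$pos + layout$length - 1

# Two probes share each pos in this layout, so keep one row per distinct interval.
intervals <- unique(layout[, c("chr", "start", "end")])
print(intervals)
```

Under that reading, a region reported with start 1 and end 1 would
correspond to a single probe position spanning bases 1 to 50.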
Second, looking further into the GFF, there are entries such as
region 90, pasted above. Here the region is very large. I checked the
probes: the start position 101361 corresponds to probe ID 101361, which
lies in the interval 101361-101410 as listed in the ndf file, and the
end position 1013391 corresponds to probe ID 1013391, which covers the
interval 1013391-1013440 according to the ndf file.
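One thing that may be worth checking, offered as a hypothesis rather
than a diagnosis: the start positions in the GFF listing (1, 1000016,
100017, 1000184, 100041, 1000424) are in text order rather than numeric
order. If the probes were ordered that way before the regions were
built, probes that are neighbours in the data would not be neighbours
on the genome, which could produce regions such as 101361 to 1013391.
The sketch below only demonstrates the ordering effect on those six
values; it makes no claim about what tileHMM does internally, and
"toy_layout" is a hypothetical stand-in for the real layout.

```r
# Start positions copied from the GFF listing, in the order they appear.
gff_starts <- c("1", "1000016", "100017", "1000184", "100041", "1000424")

# Sorting as text (character by character, C locale) reproduces that order.
text_sorted <- sort(gff_starts, method = "radix")
print(identical(text_sorted, gff_starts))

# Sorting as numbers gives the genomic order instead.
numeric_sorted <- sort(as.numeric(gff_starts))
print(numeric_sorted)

# Hypothetical fix: reorder a layout-like data frame by numeric position.
toy_layout <- data.frame(chr = "NC_000913", pos = as.numeric(gff_starts))
toy_layout <- toy_layout[order(toy_layout$chr, toy_layout$pos), ]
print(toy_layout)
```

If the real layout turns out to be ordered as text, sorting it by
chromosome and numeric position before running the analysis would be
the first thing to try.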
I'm really confused about how to interpret this and would be very
grateful if someone could explain to me how I'm supposed to read
this GFF.
Thanks a lot in advance :)
Best,
Rayna
--
"Change the order of the world rather than your desires."
My personal blog: http://hatewasabi.wordpress.com/
Proofreader at LinuxFr.org (http://linuxfr.org/~Malicia/)
PhD Student
"Molecular Evolution and Bioinformatics"
Ludwig-Maximilians University (LMU) of Munich