Hi Gowthaman,
Here is one approach,
(1) create the GRanges or GRangesList from gff
Using a gff file in GenomicFeatures as an example,
> gffFile <-
system.file("extdata","a.gff3",package="GenomicFeatures")
> txdb <- makeTranscriptDbFromGFF(file=gffFile,
+ format="gff3",
+ dataSource="partial gtf file for Tomatoes for
testing",
+ species="Solanum lycopersicum",
+ circ_seqs=DEFAULT_CIRC_SEQS,
+ miRBaseBuild=NULL)
If you are interested in genes either exons by genes or transcripts by
genes might be useful,
exbygene <- exonsBy(txdb, "gene")
txbygene <- transcriptsBy(txdb, "gene")
The gene names are associated with each outer list element of the
GRangesList,
> exbygene
GRangesList of length 488:
$gene:Solyc00g005000.2
GRanges with 2 ranges and 2 elementMetadata cols:
seqnames ranges strand | exon_id
exon_name
<rle> <iranges> <rle> | <integer> <character>
[1] SL2.40ch00 [16437, 17275] + | 1
exon:Solyc00g005000.2.1.1
[2] SL2.40ch00 [17336, 18189] + | 2
exon:Solyc00g005000.2.1.2
$gene:Solyc00g005020.1
GRanges with 3 ranges and 2 elementMetadata cols:
seqnames ranges strand | exon_id
exon_name
[1] SL2.40ch00 [68062, 68211] + | 3
exon:Solyc00g005020.1.1.1
[2] SL2.40ch00 [68344, 68568] + | 4
exon:Solyc00g005020.1.1.2
[3] SL2.40ch00 [68654, 68764] + | 5
exon:Solyc00g005020.1.1.3
(2) findOverlaps()
For the sake of example, I'll take a subset of one of the GRangesLists
and overlap it with the full list,
subset <- exbygene[c(2,4,6)]
olap <- findOverlaps(subset, exbygene)
> olap
Hits of length 3
queryLength: 3
subjectLength: 488
queryHits subjectHits
<integer> <integer>
1 1 2
2 2 4
3 3 6
The information in the 'Hits' object can be used to map back to the
query and subject used in findOverlaps().
See ?queryHits and ?subjectHIts.
Next identify which subjects were hit by the queries and extract the
gene names,
names(exbygene[subjectHits(olap)])
(3) reconsiling gene sets
Take a look at the 'type' argument to findOverlaps(). To find exact
matches you'll want to find ranges with type='equal', to find any
overlaps use type='any' (default).
You may also want to look at summarizeOverlaps(). It is an extension
of
findOverlaps() that counts a read a maximum of once (i.e., no double
counting). If a read hits multiple features different rule sets are
applied to assign the read to a single feature.
Valerie
On 05/24/2012 10:05 AM, gowtham wrote:
> Hi Everyone,
> This is my first attempt in R/bioconductor outside of plot()/hist()
> function. Thanks to Martin Morgan, I was impressed with what
> bioconductor can do for me ( after attending bioc workshop last
week).
>
> When I try to findOverlap on two GFF files I get information on
which
> feature is overlapping with another feature. But, instead of using
"feature
> names" it just indicates 1,2, 3...etc. Is there a way, i can make it
output
> gene names. (my GRanges objects are read from gff files using
Rtracklayer&
> later have names (geneids) assigned to them using names() function).
>
>> findOverlaps(au, am)
> Hits of length 1740
> queryLength: 790
> subjectLength: 1054
> queryHits subjectHits
> <integer> <integer>
> 1 1 1
> 2 1 528
> 3 2 13
> 4 2 540
> 5 3 136
>
>
>
> And here is the background of what i am trying to do. In case, if
any of
> you have a better suggestion:
> I run 3 de novo gene prediction algorithm on our newly assembled
genome.
> And I would like to reconcile the gene sets from all of them to
genes (a)
> predicted with exact same boundaries in all three (b) boundaries not
same
> but overlap (c) predicted by only one or two of them.
>
> And use reduce in case of (b) to obtain outer most boundaries of
> overlapping prediction.
>
> As a novice users its bit overwhelming for me.
>
> Thanks very much,
> Gowthaman
>