findOverlaps question from a newbie...
1
0
Entering edit mode
gowtham ▴ 210
@gowtham-5301
Last seen 9.6 years ago
Hi Everyone, This is my first attempt in R/bioconductor outside of plot()/hist() function. Thanks to Martin Morgan, I was impressed with what bioconductor can do for me ( after attending bioc workshop last week). When I try to findOverlap on two GFF files I get information on which feature is overlapping with another feature. But, instead of using "feature names" it just indicates 1,2, 3...etc. Is there a way, i can make it output gene names. (my GRanges objects are read from gff files using Rtracklayer & later have names (geneids) assigned to them using names() function). > findOverlaps(au, am) Hits of length 1740 queryLength: 790 subjectLength: 1054 queryHits subjectHits <integer> <integer> 1 1 1 2 1 528 3 2 13 4 2 540 5 3 136 And here is the background of what i am trying to do. In case, if any of you have a better suggestion: I run 3 de novo gene prediction algorithm on our newly assembled genome. And I would like to reconcile the gene sets from all of them to genes (a) predicted with exact same boundaries in all three (b) boundaries not same but overlap (c) predicted by only one or two of them. And use reduce in case of (b) to obtain outer most boundaries of overlapping prediction. As a novice users its bit overwhelming for me. Thanks very much, Gowthaman -- Gowthaman Bioinformatics Systems Programmer. SBRI, 307 West lake Ave N Suite 500 Seattle, WA. 98109-5219 Phone : LAB 206-256-7188 (direct). [[alternative HTML version deleted]]
• 3.3k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
Hi Gowthaman, Here is one approach, (1) create the GRanges or GRangesList from gff Using a gff file in GenomicFeatures as an example, > gffFile <- system.file("extdata","a.gff3",package="GenomicFeatures") > txdb <- makeTranscriptDbFromGFF(file=gffFile, + format="gff3", + dataSource="partial gtf file for Tomatoes for testing", + species="Solanum lycopersicum", + circ_seqs=DEFAULT_CIRC_SEQS, + miRBaseBuild=NULL) If you are interested in genes either exons by genes or transcripts by genes might be useful, exbygene <- exonsBy(txdb, "gene") txbygene <- transcriptsBy(txdb, "gene") The gene names are associated with each outer list element of the GRangesList, > exbygene GRangesList of length 488: $gene:Solyc00g005000.2 GRanges with 2 ranges and 2 elementMetadata cols: seqnames ranges strand | exon_id exon_name <rle> <iranges> <rle> | <integer> <character> [1] SL2.40ch00 [16437, 17275] + | 1 exon:Solyc00g005000.2.1.1 [2] SL2.40ch00 [17336, 18189] + | 2 exon:Solyc00g005000.2.1.2 $gene:Solyc00g005020.1 GRanges with 3 ranges and 2 elementMetadata cols: seqnames ranges strand | exon_id exon_name [1] SL2.40ch00 [68062, 68211] + | 3 exon:Solyc00g005020.1.1.1 [2] SL2.40ch00 [68344, 68568] + | 4 exon:Solyc00g005020.1.1.2 [3] SL2.40ch00 [68654, 68764] + | 5 exon:Solyc00g005020.1.1.3 (2) findOverlaps() For the sake of example, I'll take a subset of one of the GRangesLists and overlap it with the full list, subset <- exbygene[c(2,4,6)] olap <- findOverlaps(subset, exbygene) > olap Hits of length 3 queryLength: 3 subjectLength: 488 queryHits subjectHits <integer> <integer> 1 1 2 2 2 4 3 3 6 The information in the 'Hits' object can be used to map back to the query and subject used in findOverlaps(). See ?queryHits and ?subjectHIts. Next identify which subjects were hit by the queries and extract the gene names, names(exbygene[subjectHits(olap)]) (3) reconsiling gene sets Take a look at the 'type' argument to findOverlaps(). To find exact matches you'll want to find ranges with type='equal', to find any overlaps use type='any' (default). You may also want to look at summarizeOverlaps(). It is an extension of findOverlaps() that counts a read a maximum of once (i.e., no double counting). If a read hits multiple features different rule sets are applied to assign the read to a single feature. Valerie On 05/24/2012 10:05 AM, gowtham wrote: > Hi Everyone, > This is my first attempt in R/bioconductor outside of plot()/hist() > function. Thanks to Martin Morgan, I was impressed with what > bioconductor can do for me ( after attending bioc workshop last week). > > When I try to findOverlap on two GFF files I get information on which > feature is overlapping with another feature. But, instead of using "feature > names" it just indicates 1,2, 3...etc. Is there a way, i can make it output > gene names. (my GRanges objects are read from gff files using Rtracklayer& > later have names (geneids) assigned to them using names() function). > >> findOverlaps(au, am) > Hits of length 1740 > queryLength: 790 > subjectLength: 1054 > queryHits subjectHits > <integer> <integer> > 1 1 1 > 2 1 528 > 3 2 13 > 4 2 540 > 5 3 136 > > > > And here is the background of what i am trying to do. In case, if any of > you have a better suggestion: > I run 3 de novo gene prediction algorithm on our newly assembled genome. > And I would like to reconcile the gene sets from all of them to genes (a) > predicted with exact same boundaries in all three (b) boundaries not same > but overlap (c) predicted by only one or two of them. > > And use reduce in case of (b) to obtain outer most boundaries of > overlapping prediction. > > As a novice users its bit overwhelming for me. > > Thanks very much, > Gowthaman >
ADD COMMENT
0
Entering edit mode
Thanks so much Valerie. This sounds like a good approach. I did not even know there is a way to read GFFs directly from Granges. I was using rtracklayer before. I will follow this advice and see how it goes. I really appreciate your help. Gowthaman On Tue, May 29, 2012 at 9:03 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hi Gowthaman, > > Here is one approach, > > (1) create the GRanges or GRangesList from gff > > Using a gff file in GenomicFeatures as an example, > > > gffFile <- system.file("extdata","a.gff3"**,package="GenomicFeatures") > > txdb <- makeTranscriptDbFromGFF(file=**gffFile, > + format="gff3", > + dataSource="partial gtf file for Tomatoes for testing", > + species="Solanum lycopersicum", > + circ_seqs=DEFAULT_CIRC_SEQS, > + miRBaseBuild=NULL) > > If you are interested in genes either exons by genes or transcripts by > genes might be useful, > > exbygene <- exonsBy(txdb, "gene") > txbygene <- transcriptsBy(txdb, "gene") > > The gene names are associated with each outer list element of the > GRangesList, > > > exbygene > GRangesList of length 488: > $gene:Solyc00g005000.2 > GRanges with 2 ranges and 2 elementMetadata cols: > seqnames ranges strand | exon_id > exon_name > <rle> <iranges> <rle> | <integer> <character> > [1] SL2.40ch00 [16437, 17275] + | 1 > exon:Solyc00g005000.2.1.1 > [2] SL2.40ch00 [17336, 18189] + | 2 > exon:Solyc00g005000.2.1.2 > > $gene:Solyc00g005020.1 > GRanges with 3 ranges and 2 elementMetadata cols: > seqnames ranges strand | exon_id exon_name > [1] SL2.40ch00 [68062, 68211] + | 3 exon:Solyc00g005020.1.1.1 > [2] SL2.40ch00 [68344, 68568] + | 4 exon:Solyc00g005020.1.1.2 > [3] SL2.40ch00 [68654, 68764] + | 5 exon:Solyc00g005020.1.1.3 > > > (2) findOverlaps() > > For the sake of example, I'll take a subset of one of the GRangesLists and > overlap it with the full list, > > subset <- exbygene[c(2,4,6)] > olap <- findOverlaps(subset, exbygene) > > > olap > Hits of length 3 > queryLength: 3 > subjectLength: 488 > queryHits subjectHits > <integer> <integer> > 1 1 2 > 2 2 4 > 3 3 6 > > The information in the 'Hits' object can be used to map back to the query > and subject used in findOverlaps(). > See ?queryHits and ?subjectHIts. > > Next identify which subjects were hit by the queries and extract the gene > names, > > names(exbygene[subjectHits(**olap)]) > > > (3) reconsiling gene sets > > Take a look at the 'type' argument to findOverlaps(). To find exact > matches you'll want to find ranges with type='equal', to find any overlaps > use type='any' (default). > > You may also want to look at summarizeOverlaps(). It is an extension of > findOverlaps() that counts a read a maximum of once (i.e., no double > counting). If a read hits multiple features different rule sets are applied > to assign the read to a single feature. > > Valerie > > > > > On 05/24/2012 10:05 AM, gowtham wrote: > >> Hi Everyone, >> This is my first attempt in R/bioconductor outside of plot()/hist() >> function. Thanks to Martin Morgan, I was impressed with what >> bioconductor can do for me ( after attending bioc workshop last week). >> >> When I try to findOverlap on two GFF files I get information on which >> feature is overlapping with another feature. But, instead of using >> "feature >> names" it just indicates 1,2, 3...etc. Is there a way, i can make it >> output >> gene names. (my GRanges objects are read from gff files using Rtracklayer& >> later have names (geneids) assigned to them using names() function). >> >> findOverlaps(au, am) >>> >> Hits of length 1740 >> queryLength: 790 >> subjectLength: 1054 >> queryHits subjectHits >> <integer> <integer> >> 1 1 1 >> 2 1 528 >> 3 2 13 >> 4 2 540 >> 5 3 136 >> >> >> >> And here is the background of what i am trying to do. In case, if any of >> you have a better suggestion: >> I run 3 de novo gene prediction algorithm on our newly assembled genome. >> And I would like to reconcile the gene sets from all of them to genes (a) >> predicted with exact same boundaries in all three (b) boundaries not same >> but overlap (c) predicted by only one or two of them. >> >> And use reduce in case of (b) to obtain outer most boundaries of >> overlapping prediction. >> >> As a novice users its bit overwhelming for me. >> >> Thanks very much, >> Gowthaman >> >> > -- Gowthaman Bioinformatics Systems Programmer. SBRI, 307 West lake Ave N Suite 500 Seattle, WA. 98109-5219 Phone : LAB 206-256-7188 (direct). [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
I should mention I was using the development version of the Bioconductor packages. Here is my session info. > sessionInfo() R version 2.15.0 Patched (2012-05-08 r59333) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.9.11 AnnotationDbi_1.19.9 Biobase_2.17.5 [4] BiocInstaller_1.5.8 GenomicRanges_1.9.21 IRanges_1.15.11 [7] BiocGenerics_0.3.0 loaded via a namespace (and not attached): [1] biomaRt_2.13.1 Biostrings_2.25.4 bitops_1.0-4.1 BSgenome_1.25.1 [5] DBI_0.2-5 RCurl_1.91-1 Rsamtools_1.9.16 RSQLite_0.11.1 [9] rtracklayer_1.17.8 stats4_2.15.0 tools_2.15.0 XML_3.9-4 [13] zlibbioc_1.3.0 Valerie On 05/29/2012 12:44 PM, gowtham wrote: > Thanks so much Valerie. > This sounds like a good approach. I did not even know there is a way > to read GFFs directly from Granges. I was using rtracklayer before. > > I will follow this advice and see how it goes. > > I really appreciate your help. > > Gowthaman > > On Tue, May 29, 2012 at 9:03 AM, Valerie Obenchain <vobencha@fhcrc.org> <mailto:vobencha@fhcrc.org>> wrote: > > Hi Gowthaman, > > Here is one approach, > > (1) create the GRanges or GRangesList from gff > > Using a gff file in GenomicFeatures as an example, > > > gffFile <- > system.file("extdata","a.gff3",package="GenomicFeatures") > > txdb <- makeTranscriptDbFromGFF(file=gffFile, > + format="gff3", > + dataSource="partial gtf file for Tomatoes for > testing", > + species="Solanum lycopersicum", > + circ_seqs=DEFAULT_CIRC_SEQS, > + miRBaseBuild=NULL) > > If you are interested in genes either exons by genes or > transcripts by genes might be useful, > > exbygene <- exonsBy(txdb, "gene") > txbygene <- transcriptsBy(txdb, "gene") > > The gene names are associated with each outer list element of the > GRangesList, > > > exbygene > GRangesList of length 488: > $gene:Solyc00g005000.2 > GRanges with 2 ranges and 2 elementMetadata cols: > seqnames ranges strand | exon_id > exon_name > <rle> <iranges> <rle> | <integer> <character> > [1] SL2.40ch00 [16437, 17275] + | 1 > exon:Solyc00g005000.2.1.1 > [2] SL2.40ch00 [17336, 18189] + | 2 > exon:Solyc00g005000.2.1.2 > > $gene:Solyc00g005020.1 > GRanges with 3 ranges and 2 elementMetadata cols: > seqnames ranges strand | exon_id > exon_name > [1] SL2.40ch00 [68062, 68211] + | 3 > exon:Solyc00g005020.1.1.1 > [2] SL2.40ch00 [68344, 68568] + | 4 > exon:Solyc00g005020.1.1.2 > [3] SL2.40ch00 [68654, 68764] + | 5 > exon:Solyc00g005020.1.1.3 > > > (2) findOverlaps() > > For the sake of example, I'll take a subset of one of the > GRangesLists and overlap it with the full list, > > subset <- exbygene[c(2,4,6)] > olap <- findOverlaps(subset, exbygene) > > > olap > Hits of length 3 > queryLength: 3 > subjectLength: 488 > queryHits subjectHits > <integer> <integer> > 1 1 2 > 2 2 4 > 3 3 6 > > The information in the 'Hits' object can be used to map back to > the query and subject used in findOverlaps(). > See ?queryHits and ?subjectHIts. > > Next identify which subjects were hit by the queries and extract > the gene names, > > names(exbygene[subjectHits(olap)]) > > > (3) reconsiling gene sets > > Take a look at the 'type' argument to findOverlaps(). To find > exact matches you'll want to find ranges with type='equal', to > find any overlaps use type='any' (default). > > You may also want to look at summarizeOverlaps(). It is an > extension of findOverlaps() that counts a read a maximum of once > (i.e., no double counting). If a read hits multiple features > different rule sets are applied to assign the read to a single > feature. > > Valerie > > > > > On 05/24/2012 10:05 AM, gowtham wrote: > > Hi Everyone, > This is my first attempt in R/bioconductor outside of > plot()/hist() > function. Thanks to Martin Morgan, I was impressed with what > bioconductor can do for me ( after attending bioc workshop > last week). > > When I try to findOverlap on two GFF files I get information > on which > feature is overlapping with another feature. But, instead of > using "feature > names" it just indicates 1,2, 3...etc. Is there a way, i can > make it output > gene names. (my GRanges objects are read from gff files using > Rtracklayer& > later have names (geneids) assigned to them using names() > function). > > findOverlaps(au, am) > > Hits of length 1740 > queryLength: 790 > subjectLength: 1054 > queryHits subjectHits > <integer> <integer> > 1 1 1 > 2 1 528 > 3 2 13 > 4 2 540 > 5 3 136 > > > > And here is the background of what i am trying to do. In case, > if any of > you have a better suggestion: > I run 3 de novo gene prediction algorithm on our newly > assembled genome. > And I would like to reconcile the gene sets from all of them > to genes (a) > predicted with exact same boundaries in all three (b) > boundaries not same > but overlap (c) predicted by only one or two of them. > > And use reduce in case of (b) to obtain outer most boundaries of > overlapping prediction. > > As a novice users its bit overwhelming for me. > > Thanks very much, > Gowthaman > > > > > > -- > Gowthaman > > Bioinformatics Systems Programmer. > SBRI, 307 West lake Ave N Suite 500 > Seattle, WA. 98109-5219 > Phone : LAB 206-256-7188 (direct). [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 906 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6