GRanges Overlaps

0

Entering edit mode

Giovanni Carosso ▴ 30

@giovanni-carosso-6254

Last seen 9.6 years ago

Hello, I have three CSV files of genomic intervals (chromosome start, stop, plus other information in different columns) each containing between 10k- 20k lines. I am looking for a way to pull out from one file only the lines that have genomic overlap with both of the other two files. In other words, subset file A to include only the lines of file A whose interval overlaps intervals in Files B and C. I'm able to create GRanges and manipulate these to get the subsetted GRanges list, but I have not found a way to then use this GRanges as an index to pull out the proper lines from CSV file. Is there a simple way to subset my file in this way that I'm missing? Thanks, Giovanni [[alternative HTML version deleted]]

• 2.4k views

ADD COMMENT • link updated 10.4 years ago by Michael Lawrence ★ 11k • written 10.4 years ago by Giovanni Carosso ▴ 30

0

Entering edit mode

Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 3.6 years ago

United States

turn them into BED files and use bedops/bedtools? Or use findOverlaps() to generate an index that you can walk backwards. (see the manpage or a set of results to get a general idea of how this can work; essentially you will want to match query results to subject results in order to get back to lines in your files, although that's a lot of work for the results you get) Please post dummy code when possible, it's easier to reply with a fix when one can just paste it into the example and send it back. (Just a general note) Best, --t *He that would live in peace and at ease, * *Must not speak all he knows, nor judge all he sees.* Benjamin Franklin, Poor Richard's Almanack<http: archive.org="" details="" poorrichardsalma00franrich=""> On Tue, Nov 19, 2013 at 10:49 AM, Giovanni Carosso <gacarosso@gmail.com>wrote: > Hello, > > I have three CSV files of genomic intervals (chromosome start, stop, plus > other information in different columns) each containing between 10k- 20k > lines. > > I am looking for a way to pull out from one file only the lines that have > genomic overlap with both of the other two files. In other words, subset > file A to include only the lines of file A whose interval overlaps > intervals in Files B and C. > > I'm able to create GRanges and manipulate these to get the subsetted > GRanges list, but I have not found a way to then use this GRanges as an > index to pull out the proper lines from CSV file. > > Is there a simple way to subset my file in this way that I'm missing? > > Thanks, > Giovanni > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 10.4 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

Simply run Rsamtools::bgzip() and indexTabix() on your BED file and then rtracklayer::import("my.bed.gz", which = roi_as_granges). On Tue, Nov 19, 2013 at 10:49 AM, Giovanni Carosso <gacarosso@gmail.com>wrote: > Hello, > > I have three CSV files of genomic intervals (chromosome start, stop, plus > other information in different columns) each containing between 10k- 20k > lines. > > I am looking for a way to pull out from one file only the lines that have > genomic overlap with both of the other two files. In other words, subset > file A to include only the lines of file A whose interval overlaps > intervals in Files B and C. > > I'm able to create GRanges and manipulate these to get the subsetted > GRanges list, but I have not found a way to then use this GRanges as an > index to pull out the proper lines from CSV file. > > Is there a simple way to subset my file in this way that I'm missing? > > Thanks, > Giovanni > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 10.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, Is it the case that BED files can be comma-separated values e.g. chrom,chromStart,chromEnd,otherStuffNotInBed9FileDefinition ? Because, otherwise, it seems like require(GenomicRanges) tbl <- read.csv(file1) GR1 <- as(tbl, 'GRanges') GR2 <- as(read.csv(file2), 'GRanges') GR3 <- as(read.csv(file3), 'GRanges') lines_in_file1_over_2and3 <- queryHits(findOverlaps(GR1, intersect(GR2,GR3))) might be the most straightforward way to get the requested line numbers. (?) Now if he has actual BED files already, then obviously your method is better, and I am making a note of this trick for later (I have been TABIX'ing my BEDs & other tab-delimited genome-format files ever since you added support for it). Thanks for the tip, --t *He that would live in peace and at ease, * *Must not speak all he knows, nor judge all he sees.* Benjamin Franklin, Poor Richard's Almanack<http: archive.org="" details="" poorrichardsalma00franrich=""> On Tue, Nov 19, 2013 at 5:21 PM, Michael Lawrence <lawrence.michael@gene.com> wrote: > Simply run Rsamtools::bgzip() and indexTabix() on your BED file and then > rtracklayer::import("my.bed.gz", which = roi_as_granges). > > > > > On Tue, Nov 19, 2013 at 10:49 AM, Giovanni Carosso <gacarosso@gmail.com> >wrote: > > > Hello, > > > > I have three CSV files of genomic intervals (chromosome start, stop, plus > > other information in different columns) each containing between 10k-20k > > lines. > > > > I am looking for a way to pull out from one file only the lines that have > > genomic overlap with both of the other two files. In other words, subset > > file A to include only the lines of file A whose interval overlaps > > intervals in Files B and C. > > > > I'm able to create GRanges and manipulate these to get the subsetted > > GRanges list, but I have not found a way to then use this GRanges as an > > index to pull out the proper lines from CSV file. > > > > Is there a simple way to subset my file in this way that I'm missing? > > > > Thanks, > > Giovanni > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Tim and Michael, Thanks so much for the responses. I tried Tim's method for subsetting my csv file for lines that overlap two other csv's, which was: require(GenomicRanges) tbl <- read.csv(file1) GR1 <- as(tbl, 'GRanges') GR2 <- as(read.csv(file2), 'GRanges') GR3 <- as(read.csv(file3), 'GRanges') lines_in_file1_over_2and3 <- queryHits(findOverlaps(GR1, intersect(GR2,GR3))) But, I got errors saying that data.frame(from read.csv) cannot be coerced into GRanges. I then tried a similar approach by first turning each file into GRanges, then using queryHits,findOverlaps. This generated an integer list, but I'm not sure how to then pull out these lines from the original csv file, to include all the columns in addition to the chrom, start, end. Should I change the files somehow before running Tim's code? Here's some dummy code I used: > GR1 <- as(tbl,'GRanges') Error in as(tbl, "GRanges") : no method or default for coercing data.frame to GRanges > class(tbl) [1] "data.frame" > GR1 <- GRanges(seqnames=tbl$chr, ranges=IRanges(start=tbl$start, end=tbl$end)) > GR2 <- GRanges(seqnames=tbl1$chr, ranges=IRanges(start=tbl1$start, end=tbl1$end))> GR3 <- GRanges(seqnames=tbl2$chr, ranges=IRanges(start=tbl2$start, end=tbl2$end)) > lines_in_1_over2and3 <- queryHits(findOverlaps(GR1, intersect(GR2,GR3))) > head(lines_in_1_over2and3) [1] 2 3 4 6 8 10 > test <- tbl[lines_in_1_over2and3] Error in `[.data.frame`(tbl, lines_in_1_over2and3) : undefined columns selected Sorry if these are kind of naive questions, I'm new to this kind of stuff. The main thing is that I want to make sure to extract all the columns out of the csv, in addition to chrom, start, end (which is why I think making BED file wouldn't work). I very much appreciate the help! Giovanni On Wed, Nov 20, 2013 at 1:26 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > Hi Michael, > > Is it the case that BED files can be comma-separated values e.g. > chrom,chromStart,chromEnd,otherStuffNotInBed9FileDefinition > ? > > Because, otherwise, it seems like > > require(GenomicRanges) > tbl <- read.csv(file1) > GR1 <- as(tbl, 'GRanges') > > GR2 <- as(read.csv(file2), 'GRanges') > GR3 <- as(read.csv(file3), 'GRanges') > > lines_in_file1_over_2and3 <- queryHits(findOverlaps(GR1, > > intersect(GR2,GR3))) > > might be the most straightforward way to get the requested line numbers. > (?) > > Now if he has actual BED files already, then obviously your method is > better, and I am making a note of this trick for later (I have been > TABIX'ing my BEDs & other tab-delimited genome-format files ever since you > added support for it). > > Thanks for the tip, > > --t > > > > *He that would live in peace and at ease, * > *Must not speak all he knows, nor judge all he sees.* > > Benjamin Franklin, Poor Richard's Almanack<http: archive.org="" details="" poorrichardsalma00franrich=""> > > > On Tue, Nov 19, 2013 at 5:21 PM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> Simply run Rsamtools::bgzip() and indexTabix() on your BED file and then >> rtracklayer::import("my.bed.gz", which = roi_as_granges). >> >> >> >> >> On Tue, Nov 19, 2013 at 10:49 AM, Giovanni Carosso <gacarosso@gmail.com>> >wrote: >> >> > Hello, >> > >> > I have three CSV files of genomic intervals (chromosome start, stop, >> plus >> > other information in different columns) each containing between 10k-20k >> > lines. >> > >> > I am looking for a way to pull out from one file only the lines that >> have >> > genomic overlap with both of the other two files. In other words, subset >> > file A to include only the lines of file A whose interval overlaps >> > intervals in Files B and C. >> > >> > I'm able to create GRanges and manipulate these to get the subsetted >> > GRanges list, but I have not found a way to then use this GRanges as an >> > index to pull out the proper lines from CSV file. >> > >> > Is there a simple way to subset my file in this way that I'm missing? >> > >> > Thanks, >> > Giovanni >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago Giovanni Carosso ▴ 30

0

Entering edit mode

On Tue, Nov 26, 2013 at 10:55 AM, Giovanni Carosso <gacarosso@gmail.com>wrote: > Hi Tim and Michael, > > Thanks so much for the responses. > > I tried Tim's method for subsetting my csv file for lines that overlap two > other csv's, which was: > > require(GenomicRanges) > tbl <- read.csv(file1) > > GR1 <- as(tbl, 'GRanges') > GR2 <- as(read.csv(file2), 'GRanges') > GR3 <- as(read.csv(file3), 'GRanges') > > lines_in_file1_over_2and3 <- queryHits(findOverlaps(GR1, > intersect(GR2,GR3))) > > But, I got errors saying that data.frame(from read.csv) cannot be coerced > into GRanges. I then tried a similar approach by first turning each file > into GRanges, then using queryHits,findOverlaps. This generated an integer > list, but I'm not sure how to then pull out these lines from the original > csv file, to include all the columns in addition to the chrom, start, end. > Should I change the files somehow before running Tim's code? > > Here's some dummy code I used: > > > GR1 <- as(tbl,'GRanges') > Error in as(tbl, "GRanges") : > no method or default for coercing data.frame to GRanges > I'm guessing your GenomicRanges version is old, so the data.frame coercion does not exist. Please attach your sessionInfo(). > class(tbl) > [1] "data.frame" > > GR1 <- GRanges(seqnames=tbl$chr, ranges=IRanges(start=tbl$start, > end=tbl$end)) > > GR2 <- GRanges(seqnames=tbl1$chr, ranges=IRanges(start=tbl1$start, > end=tbl1$end))> GR3 <- GRanges(seqnames=tbl2$chr, > ranges=IRanges(start=tbl2$start, end=tbl2$end)) > > lines_in_1_over2and3 <- queryHits(findOverlaps(GR1, intersect(GR2,GR3))) > > head(lines_in_1_over2and3) > [1] 2 3 4 6 8 10 > > test <- tbl[lines_in_1_over2and3] > Error in `[.data.frame`(tbl, lines_in_1_over2and3) : > undefined columns selected > > Those are row indices, so you need: test <- tbl[lines_in_1_over2and3,] > > Sorry if these are kind of naive questions, I'm new to this kind of stuff. > The main thing is that I want to make sure to extract all the columns out > of the csv, in addition to chrom, start, end (which is why I think making > BED file wouldn't work). > > You can tabix any tab-separated file with chrom, start, end. And then import with rtracklayer::import(TabixFile(file1), which = GRanges(...)) > I very much appreciate the help! > > Giovanni > > > On Wed, Nov 20, 2013 at 1:26 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > >> Hi Michael, >> >> Is it the case that BED files can be comma-separated values e.g. >> chrom,chromStart,chromEnd,otherStuffNotInBed9FileDefinition >> ? >> >> Because, otherwise, it seems like >> >> require(GenomicRanges) >> tbl <- read.csv(file1) >> GR1 <- as(tbl, 'GRanges') >> >> GR2 <- as(read.csv(file2), 'GRanges') >> GR3 <- as(read.csv(file3), 'GRanges') >> >> lines_in_file1_over_2and3 <- queryHits(findOverlaps(GR1, >> >> intersect(GR2,GR3))) >> >> might be the most straightforward way to get the requested line numbers. >> (?) >> >> Now if he has actual BED files already, then obviously your method is >> better, and I am making a note of this trick for later (I have been >> TABIX'ing my BEDs & other tab-delimited genome-format files ever since you >> added support for it). >> >> Thanks for the tip, >> >> --t >> >> >> >> *He that would live in peace and at ease, * >> *Must not speak all he knows, nor judge all he sees.* >> >> Benjamin Franklin, Poor Richard's Almanack<http: archive.org="" details="" poorrichardsalma00franrich=""> >> >> >> On Tue, Nov 19, 2013 at 5:21 PM, Michael Lawrence < >> lawrence.michael@gene.com> wrote: >> >>> Simply run Rsamtools::bgzip() and indexTabix() on your BED file and then >>> rtracklayer::import("my.bed.gz", which = roi_as_granges). >>> >>> >>> >>> >>> On Tue, Nov 19, 2013 at 10:49 AM, Giovanni Carosso <gacarosso@gmail.com>>> >wrote: >>> >>> > Hello, >>> > >>> > I have three CSV files of genomic intervals (chromosome start, stop, >>> plus >>> > other information in different columns) each containing between 10k-20k >>> > lines. >>> > >>> > I am looking for a way to pull out from one file only the lines that >>> have >>> > genomic overlap with both of the other two files. In other words, >>> subset >>> > file A to include only the lines of file A whose interval overlaps >>> > intervals in Files B and C. >>> > >>> > I'm able to create GRanges and manipulate these to get the subsetted >>> > GRanges list, but I have not found a way to then use this GRanges as an >>> > index to pull out the proper lines from CSV file. >>> > >>> > Is there a simple way to subset my file in this way that I'm missing? >>> > >>> > Thanks, >>> > Giovanni >>> > >>> > [[alternative HTML version deleted]] >>> > >>> > _______________________________________________ >>> > Bioconductor mailing list >>> > Bioconductor@r-project.org >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> > Search the archives: >>> > http://news.gmane.org/gmane.science.biology.informatics.conductor >>> > >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Michael, Thanks, forgot about the row comma. That got it to work! session info was: > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicRanges_1.12.5 IRanges_1.18.4 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] stats4_3.0.1 tools_3.0.1 On Tue, Nov 26, 2013 at 2:02 PM, Michael Lawrence <lawrence.michael@gene.com> wrote: > > > > On Tue, Nov 26, 2013 at 10:55 AM, Giovanni Carosso <gacarosso@gmail.com>wrote: > >> Hi Tim and Michael, >> >> Thanks so much for the responses. >> >> I tried Tim's method for subsetting my csv file for lines that overlap >> two other csv's, which was: >> >> require(GenomicRanges) >> tbl <- read.csv(file1) >> >> GR1 <- as(tbl, 'GRanges') >> GR2 <- as(read.csv(file2), 'GRanges') >> GR3 <- as(read.csv(file3), 'GRanges') >> >> lines_in_file1_over_2and3 <- queryHits(findOverlaps(GR1, >> intersect(GR2,GR3))) >> >> But, I got errors saying that data.frame(from read.csv) cannot be coerced >> into GRanges. I then tried a similar approach by first turning each file >> into GRanges, then using queryHits,findOverlaps. This generated an integer >> list, but I'm not sure how to then pull out these lines from the original >> csv file, to include all the columns in addition to the chrom, start, end. >> Should I change the files somehow before running Tim's code? >> >> Here's some dummy code I used: >> >> > GR1 <- as(tbl,'GRanges') >> Error in as(tbl, "GRanges") : >> no method or default for coercing data.frame to GRanges >> > > I'm guessing your GenomicRanges version is old, so the data.frame coercion > does not exist. Please attach your sessionInfo(). > > > class(tbl) >> [1] "data.frame" >> > GR1 <- GRanges(seqnames=tbl$chr, ranges=IRanges(start=tbl$start, >> end=tbl$end)) >> > GR2 <- GRanges(seqnames=tbl1$chr, ranges=IRanges(start=tbl1$start, >> end=tbl1$end))> GR3 <- GRanges(seqnames=tbl2$chr, >> ranges=IRanges(start=tbl2$start, end=tbl2$end)) >> > lines_in_1_over2and3 <- queryHits(findOverlaps(GR1, intersect(GR2,GR3))) >> > head(lines_in_1_over2and3) >> [1] 2 3 4 6 8 10 >> > test <- tbl[lines_in_1_over2and3] >> Error in `[.data.frame`(tbl, lines_in_1_over2and3) : >> undefined columns selected >> >> > Those are row indices, so you need: > test <- tbl[lines_in_1_over2and3,] > > >> >> Sorry if these are kind of naive questions, I'm new to this kind of >> stuff. The main thing is that I want to make sure to extract all the >> columns out of the csv, in addition to chrom, start, end (which is why I >> think making BED file wouldn't work). >> >> > You can tabix any tab-separated file with chrom, start, end. And then > import with > rtracklayer::import(TabixFile(file1), which = GRanges(...)) > > > >> I very much appreciate the help! >> >> Giovanni >> >> >> On Wed, Nov 20, 2013 at 1:26 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >> >>> Hi Michael, >>> >>> Is it the case that BED files can be comma-separated values e.g. >>> chrom,chromStart,chromEnd,otherStuffNotInBed9FileDefinition >>> ? >>> >>> Because, otherwise, it seems like >>> >>> require(GenomicRanges) >>> tbl <- read.csv(file1) >>> GR1 <- as(tbl, 'GRanges') >>> >>> GR2 <- as(read.csv(file2), 'GRanges') >>> GR3 <- as(read.csv(file3), 'GRanges') >>> >>> lines_in_file1_over_2and3 <- queryHits(findOverlaps(GR1, >>> >>> intersect(GR2,GR3))) >>> >>> might be the most straightforward way to get the requested line numbers. >>> (?) >>> >>> Now if he has actual BED files already, then obviously your method is >>> better, and I am making a note of this trick for later (I have been >>> TABIX'ing my BEDs & other tab-delimited genome-format files ever since you >>> added support for it). >>> >>> Thanks for the tip, >>> >>> --t >>> >>> >>> >>> *He that would live in peace and at ease, * >>> *Must not speak all he knows, nor judge all he sees.* >>> >>> Benjamin Franklin, Poor Richard's Almanack<http: archive.org="" details="" poorrichardsalma00franrich=""> >>> >>> >>> On Tue, Nov 19, 2013 at 5:21 PM, Michael Lawrence < >>> lawrence.michael@gene.com> wrote: >>> >>>> Simply run Rsamtools::bgzip() and indexTabix() on your BED file and then >>>> rtracklayer::import("my.bed.gz", which = roi_as_granges). >>>> >>>> >>>> >>>> >>>> On Tue, Nov 19, 2013 at 10:49 AM, Giovanni Carosso <gacarosso@gmail.com>>>> >wrote: >>>> >>>> > Hello, >>>> > >>>> > I have three CSV files of genomic intervals (chromosome start, stop, >>>> plus >>>> > other information in different columns) each containing between >>>> 10k-20k >>>> > lines. >>>> > >>>> > I am looking for a way to pull out from one file only the lines that >>>> have >>>> > genomic overlap with both of the other two files. In other words, >>>> subset >>>> > file A to include only the lines of file A whose interval overlaps >>>> > intervals in Files B and C. >>>> > >>>> > I'm able to create GRanges and manipulate these to get the subsetted >>>> > GRanges list, but I have not found a way to then use this GRanges as >>>> an >>>> > index to pull out the proper lines from CSV file. >>>> > >>>> > Is there a simple way to subset my file in this way that I'm missing? >>>> > >>>> > Thanks, >>>> > Giovanni >>>> > >>>> > [[alternative HTML version deleted]] >>>> > >>>> > _______________________________________________ >>>> > Bioconductor mailing list >>>> > Bioconductor@r-project.org >>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> > Search the archives: >>>> > http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> > >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> > [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago Giovanni Carosso ▴ 30

Login before adding your answer.