New to Bioconductor: is there a better way?
Davis, Brian
@davis-brian-5165
Last seen 10.3 years ago
I'm very new to Bioconductor (this is my first time using it) but not to R. I have a solution to my problem, but being new to Bioconductor I'm wondering whether there is a more appropriate or better way to solve it.

I have a data frame of chromosome/position pairs (along with other data for each location). For each pair I need to determine whether it falls within any of the ranges given in a second data frame. I need to keep only the pairs that fall within at least one of the ranges for further processing.

Example:

    snps<-NULL
    snps$CHR<-c("1","2","2","3","X")
    snps$POS<-as.integer(c(295,640,670,100,1100))
    snps$DAT<-seq(1:length(snps$CHR))
    snps<-as.data.frame(snps, stringsAsFactors=FALSE)

    snps
      CHR  POS DAT
    1   1  295   1
    2   2  640   2
    3   2  670   3
    4   3  100   4
    5   X 1100   5

    region<-NULL
    region$CHR<-c("1","1","2","2","2","X")
    region$START<-as.integer(c(10,210,430,650,810,1090))
    region$STOP<-as.integer(c(100,350,630,675,850,1111))
    region<-as.data.frame(region, stringsAsFactors=FALSE)

    region
      CHR START STOP
    1   1    10  100
    2   1   210  350
    3   2   430  630
    4   2   650  675
    5   2   810  850
    6   X  1090 1111

The result I need would look like:

    Res
    CHR  POS DAT
      1  295   1
      2  670   3
      X 1100   5

My current data set has ~100K SNP entries, and my regions table has ~200K entries. I have ~1500 files to go through.

My current solution is:

    library(GenomicRanges)
    snplist<-with(snps, GRanges(CHR, IRanges(POS, POS)))
    locations<-with(region, GRanges(CHR, IRanges(START, STOP)))
    olaps<-findOverlaps(snplist, locations)

Then I can easily use olaps to subset as needed. I'm just trying to see whether there are other functions or ways to go about solving this, in an effort to learn.

Thanks,

Brian Davis
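For readers following along, the "use olaps to subset" step is not spelled out in the question. One plausible way to do it, as a minimal sketch rather than the poster's actual code, is to take the query indices from the hits object with `queryHits()` and index the original data frame with them. The variable names `keep` and `res` are made up for illustration, and the data frames are rebuilt with `data.frame()` only to keep the example self-contained.

```r
library(GenomicRanges)

# Toy data from the question, rebuilt so this block runs on its own.
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Same construction as in the question.
snplist <- with(snps, GRanges(CHR, IRanges(POS, POS)))
locations <- with(region, GRanges(CHR, IRanges(START, STOP)))
olaps <- findOverlaps(snplist, locations)

# queryHits() returns, for each overlap, the index of the SNP involved.
# unique() guards against a SNP that falls in more than one region.
keep <- unique(queryHits(olaps))

# Index the original data frame so the DAT column comes along.
res <- snps[keep, ]
res
```

Indexing the data frame rather than the GRanges object keeps the extra columns without having to copy them into the ranges first.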
@kasper-daniel-hansen-2979
Last seen 18 months ago
United States
This is the way to do it.

There is a convenience function called subsetByOverlaps(); you can probably guess what it does.

Kasper

On Thu, Mar 15, 2012 at 10:01 AM, Davis, Brian wrote:
> [...]
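To make the subsetByOverlaps() suggestion concrete, here is a small sketch on the toy data from the question. It assumes the extra column is carried along as a metadata column by passing `DAT = ...` to the `GRanges()` constructor; the names `snp_gr`, `region_gr`, `kept` and `res` are illustrative only and do not come from the thread.

```r
library(GenomicRanges)

# Toy data from the question, rebuilt so this block runs on its own.
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Width-1 ranges for the SNPs; DAT travels along as a metadata column.
snp_gr <- GRanges(seqnames = snps$CHR,
                  ranges = IRanges(start = snps$POS, width = 1),
                  DAT = snps$DAT)
region_gr <- GRanges(seqnames = region$CHR,
                     ranges = IRanges(start = region$START, end = region$STOP))

# Keep only the SNPs that overlap at least one region.
kept <- subsetByOverlaps(snp_gr, region_gr)

# Convert back to the data frame layout used in the question.
res <- data.frame(CHR = as.character(seqnames(kept)),
                  POS = start(kept),
                  DAT = kept$DAT,
                  stringsAsFactors = FALSE)
res
```

Compared with calling findOverlaps() and indexing by hand, this does the filtering in one call, at the cost of not telling you which region each SNP landed in.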
Hi Brian,

Since you are new to Bioconductor, you may not be aware that there is a much more convenient container than data.frame for storing the kind of information you are dealing with: the GRanges container.

    library(GenomicRanges)
    # One width-1 range per SNP
    snps <- GRanges(seqnames=snps$CHR,
                    ranges=IRanges(start=snps$POS, width=1))
    # One range per region, built from the 'region' data frame
    regions <- GRanges(seqnames=region$CHR,
                       ranges=IRanges(start=region$START, end=region$STOP))

On 03/15/2012 07:05 AM, Kasper Daniel Hansen wrote:
> This is the way to do it.
>
> There is a convenience function called subsetByOverlaps(), you can
> probably guess what it does.

Yep. I would also recommend that you have a look at the various vignettes in the GenomicRanges package to familiarize yourself with the basic infrastructure.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
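Since the question mentions roughly 1500 files checked against the same regions table, one possible arrangement is to build the regions GRanges once and reuse it for every file. The sketch below is only an illustration: `filter_snps` is a hypothetical helper, not a GenomicRanges function, and it relies on the `%over%` operator, which returns one TRUE/FALSE per query range.

```r
library(GenomicRanges)

# Regions table from the question; the GRanges is built a single time.
region <- data.frame(CHR = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)
region_gr <- GRanges(region$CHR, IRanges(region$START, region$STOP))

# Hypothetical helper: takes a SNP data frame with CHR and POS columns and
# returns only the rows whose position lies inside at least one region.
filter_snps <- function(snps, region_gr) {
  snp_gr <- GRanges(snps$CHR, IRanges(snps$POS, width = 1))
  snps[snp_gr %over% region_gr, , drop = FALSE]
}

# Toy SNP table from the question, standing in for one input file.
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
res <- filter_snps(snps, region_gr)
res
```

With real input, the helper could be applied to each file in turn (for example with `lapply()` over a vector of file paths, reading each one into a data frame first), so that only the SNP side is rebuilt per file.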