New to Bioconductor: is there a better way?


Davis, Brian ▴ 40

@davis-brian-5165

Last seen 9.7 years ago

I'm very new to Bioconductor (this is my first time using it) but not to R. I have a solution to my problem, but being new to Bioconductor I'm wondering whether there is a more appropriate or better way to solve it.

I have a data frame of chromosome/position pairs (along with other data for each location). For each pair I need to determine whether it falls within any range in a given data frame of ranges. I need to keep only the pairs that are within at least one of the ranges for further processing.

Example:

    snps<-NULL
    snps$CHR<-c("1","2","2","3","X")
    snps$POS<-as.integer(c(295,640,670,100,1100))
    snps$DAT<-seq(1:length(snps$CHR))
    snps<-as.data.frame(snps, stringsAsFactors=FALSE)

    snps
      CHR  POS DAT
    1   1  295   1
    2   2  640   2
    3   2  670   3
    4   3  100   4
    5   X 1100   5

    region<-NULL
    region$CHR<-c("1","1","2","2","2","X")
    region$START<-as.integer(c(10,210,430,650,810,1090))
    region$STOP<-as.integer(c(100,350,630,675,850,1111))
    region<-as.data.frame(region, stringsAsFactors=FALSE)

    region
      CHR START STOP
    1   1    10  100
    2   1   210  350
    3   2   430  630
    4   2   650  675
    5   2   810  850
    6   X  1090 1111

The result I need would look like this:

    Res
    CHR  POS DAT
      1  295   1
      2  670   3
      X 1100   5

My current data set has ~100K SNP entries, and my regions table has ~200K entries. I have ~1500 files to go through.

My current solution is:

    library(GenomicRanges)
    snplist<-with(snps, GRanges(CHR, IRanges(POS, POS)))
    locations<-with(region, GRanges(CHR, IRanges(START, STOP)))
    olaps<-findOverlaps(snplist, locations)

Then I can easily use olaps to subset as needed. I'm just trying to see whether there are other functions or ways to go about solving this, in an effort to learn.

Thanks,

Brian Davis
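A minimal sketch of the subsetting step, assuming the snps and region data frames from the example above (rebuilt here with data.frame() so the snippet runs on its own). queryHits() is the accessor for the query index of each hit in the object returned by findOverlaps(); the names keep and res are only illustrative.

```r
library(GenomicRanges)

# Example data from the question, rebuilt with data.frame()
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Same construction as in the question
snplist <- with(snps, GRanges(CHR, IRanges(POS, POS)))
locations <- with(region, GRanges(CHR, IRanges(START, STOP)))
olaps <- findOverlaps(snplist, locations)

# queryHits() gives the row index of the SNP in each hit; unique()
# guards against a SNP that falls inside more than one region
keep <- unique(queryHits(olaps))
res <- snps[keep, ]
res
```

On the example data this keeps rows 1, 3 and 5 of snps, which matches the desired result in the question.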

SNP GO • 868 views

ADD COMMENT • link updated 12.2 years ago by Kasper Daniel Hansen ★ 6.5k • written 12.2 years ago by Davis, Brian ▴ 40


Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 10 months ago

United States

This is the way to do it.

There is a convenience function called subsetByOverlaps(); you can probably guess what it does.

Kasper
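For illustration, here is a sketch of how subsetByOverlaps() could replace the findOverlaps() step, assuming the example data from the question. Passing DAT as an extra named argument to GRanges() stores it as a metadata column so it travels with the ranges; that choice, and the variable name hits, are assumptions of this sketch rather than something stated in the thread.

```r
library(GenomicRanges)

# Example data from the question
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Keep DAT as a metadata column on the SNP ranges
snplist <- with(snps, GRanges(CHR, IRanges(POS, POS), DAT = DAT))
locations <- with(region, GRanges(CHR, IRanges(START, STOP)))

# Returns the elements of snplist that overlap at least one range
# in locations, in their original order
hits <- subsetByOverlaps(snplist, locations)
hits

# Back to a plain data frame if needed downstream
as.data.frame(hits)
```

The data frame returned by as.data.frame() has the standard seqnames, start, end, width and strand columns followed by DAT.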

ADD COMMENT • link 12.2 years ago Kasper Daniel Hansen ★ 6.5k


Hi Brian,

Since you are new to Bioconductor, you may not be aware that there is a much more convenient container than a data.frame for storing the kind of information you are dealing with: the GRanges container.

    library(GenomicRanges)
    snps <- GRanges(seqnames=snps$CHR,
                    ranges=IRanges(start=snps$POS, width=1))
    regions <- GRanges(seqnames=region$CHR,
                       ranges=IRanges(start=region$START, end=region$STOP))

On 03/15/2012 07:05 AM, Kasper Daniel Hansen wrote:
> This is the way to do it.
>
> There is a convenience function called subsetByOverlaps(), you can
> probably guess what it does.

Yep. I would also recommend having a look at the various vignettes in the GenomicRanges package to familiarize yourself with the basic infrastructure.

Cheers,
H.
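Since the question mentions ~1500 files checked against the same regions, one possible arrangement is to build the regions GRanges once and reuse it for every SNP table. The sketch below assumes each file has already been read into a data frame with CHR, POS and DAT columns; filter_snps() and snp_tables are hypothetical names introduced for illustration, and overlapsAny() is used to get a logical vector for indexing the original data frame.

```r
library(GenomicRanges)

# Regions from the question, converted to GRanges once and reused
region <- data.frame(CHR = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)
regions_gr <- GRanges(seqnames = region$CHR,
                      ranges = IRanges(start = region$START,
                                       end = region$STOP))

# Hypothetical helper (not from the thread): keep the rows of one SNP
# data frame whose position falls inside any of the regions
filter_snps <- function(snps_df, regions_gr) {
  snps_gr <- GRanges(seqnames = snps_df$CHR,
                     ranges = IRanges(start = snps_df$POS, width = 1))
  snps_df[overlapsAny(snps_gr, regions_gr), , drop = FALSE]
}

# Stand-in for the per-file tables; in practice each element would be
# read from one of the files (for example with read.table())
snp_tables <- list(
  a = data.frame(CHR = c("1", "2", "2", "3", "X"),
                 POS = c(295L, 640L, 670L, 100L, 1100L),
                 DAT = 1:5,
                 stringsAsFactors = FALSE),
  b = data.frame(CHR = c("2", "X"),
                 POS = c(820L, 5L),
                 DAT = 1:2,
                 stringsAsFactors = FALSE)
)

filtered <- lapply(snp_tables, filter_snps, regions_gr = regions_gr)
filtered
```

Returning plain data frames keeps any extra columns untouched; returning GRanges objects with metadata columns instead would be equally reasonable if later steps also work on ranges.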

ADD REPLY • link 12.2 years ago Hervé Pagès 16k
