pulling functional information for SNPs
1
0
Entering edit mode
@james-w-macdonald-5106
Last seen 8 minutes ago
United States
Hi Kay, Please don't take things off-list. The list archives are intended to be a useful repository of questions that were asked and answered, and if we go off-list, that function is invalidated. Kay Jaja wrote: > Hi James, > > Thank you very much for the response, I have tried to use BiomaRt and > about 700,000 SNPs on the Affy 6.0 chip to get the chromosome and the > base pair position but the script is taking forever (more than 24 hours) > , I have tried a test on a small set of SNPs and it works. below is my > script Getting a query for 700K things will likely take a long time. Had you mentioned that you were using the Affy 6.0 chip, we could have gone in a different direction. biocLite("pd.genomewidesnp.6") ## this will take a while library(pd.genomewidesnp.6) con <- pd.genomewidesnp.6 at getdb() ## there might be a better way of setting up the connection... ## If so, Benilton will correct me very soon ;-P ## check things out > dbListTables(con) [1] "featureSet" "featureSetCNV" "pmfeature" "pmfeatureCNV" [5] "sequence" "sequenceCNV" "sqlite_stat1" "table_info" > dbListFields(con, "featureSet") [1] "fsetid" "man_fsetid" "affy_snp_id" "dbsnp_rs_id" [5] "chrom" "physical_pos" "strand" "cytoband" [9] "allele_a" "allele_b" "gene_assoc" "fragment_length" [13] "fragment_length2" "dbsnp" "cnv" ## try a 70K query - I won't show how I made the snp vector... > length(snps) [1] 70000 > system.time(dbGetQuery(con, paste("select dbsnp_rs_id, chrom, physical_pos from featureSet where dbsnp_rs_id in ('", paste(snps, collapse = "','"), "');", sep = ""))) user system elapsed 4.89 1.09 119.09 So about 2 min for 70K query. Not bad. > dbGetQuery(con, "select dbsnp_rs_id, chrom, physical_pos from featureSet limit 10;") dbsnp_rs_id chrom physical_pos 1 rs2887286 1 1145994 2 rs1496555 1 2224111 3 rs41477744 1 2319424 4 rs3890745 1 2543484 5 rs10492936 1 2926730 6 rs10489588 1 2941694 7 rs2376495 1 3084986 8 rs4648462 1 3155127 9 rs10492939 1 3292731 10 rs9424283 1 3695086 Best, Jim > > ######################################################## > # snps is a vector of about 700,000 rs numbers from the Affy 6.0 ##### > ###################################################### > library(biomaRt) > mart <- useMart("snp") > mart<-useDataset("hsapiens_snp",mart) > Checking attributes and filters ... ok > > out=getBM(attributes=("refsnp_id","chr_name","chrom_start"), > + filters=c("refsnp_id"), values=snps, mart=mart) > > ############################################ > > is there a reason why you prefer to use RMySQL over biomaRt? > Do you know why my script is taking too long? > > Thanks for your help, > > -------------------------------------------------------------------- ---- > *From:* James W. MacDonald <jmacdon at="" med.umich.edu=""> > *To:* Kay Jaja <kjaja27 at="" yahoo.com=""> > *Cc:* bioconductor at stat.math.ethz.ch > *Sent:* Wed, April 28, 2010 1:42:28 PM > *Subject:* Re: [BioC] pulling functional information for SNPs > > Hi Kay, > > Kay Jaja wrote: > > Hi , > > > > I have a list of SNPS (rs numbers ) and I am interested in pulling > the functional data corresponding to each SNP from a data base like > ensemble, i.e.( is the gene name if the snp i sin a gene, intron, exon, > non_ synonymous snp, or synonymous snp). is it possible to do this in R > using BioMart or any other packages? > > Do you mean to ask if it is possible, or is it easy? It is certainly > possible, although it depends on exactly what you want. Your question is > not as complete as it could be. In the future, you should try to explain > exactly what you are trying to do rather than asking open-ended questions. > > You can get information about SNPs using biomaRt, but the available > information looks pretty sparse to me when compared to the small list of > interests you seem to have. But you can look to see what is available > easily enough: > > library(biomaRt) > mart <- useMart("snp","hsapiens_snp") > listAttributes(mart) > > There are one or two vignettes that come with biomaRt that should help > you get started if you like what you see. > > I generally don't use biomaRt for this sort of thing, instead preferring > to hit the UCSC database directly. Note that what I show below might be > done as easily using the rtracklayer package; you might explore the > vignettes for that package as well. Anyway, I would use the RMySQL > package and query directly: > > library(RMySQL) > con <- dbConnect("MySQL", host = "genome-mysql.cse.ucsc.edu > <http: genome-mysql.cse.ucsc.edu=""/>", dbname = "hg18", user = "genome") > > ## what type of info is available? > > > dbGetQuery(con, "select * from snp129 where name='rs25';") > bin chrom chromStart chromEnd name score strand refNCBI refUCSC observed > 1 673 chr7 11550666 11550667 rs25 0 - T T A/G > molType class valid avHet avHetSE func > 1 genomic single by-cluster,by-frequency,by-hapmap 0.499586 0.014383 intron > locType weight > 1 exact 1 > > Note two things here. First, you don't know the return order, so you > should always ask for the database to return what you are querying on > (this is true of biomaRt as well). Second, if you are querying lots of > SNPs, just do it in one big query instead of one by one. Repeatedly > querying an online database will get you banned. So say your rs IDs are > in a vector rsid, and you want the chromosome, the position, the bases, > and the function (intron, exon, intragenic, etc). > > sql <- paste("select name, chrom, chromEnd, observed, func from snp129 > where name in ('", paste(rsid, collapse = "','"), "');", sep = "") > > there are a lot of ' and " in there, because we want something that > looks like this: > > select name, chrom, chromEnd, observed, func from snp129 where name in > ('rs25','rs26','rs27','rs28'); > > so you want to make sure the sql statement looks OK first. Then just do > > dat <- dbGetQuery(con, sql) > > > rsid <- c("rs25","rs26","rs27","rs28") > > rsid > [1] "rs25" "rs26" "rs27" "rs28" > > sql <- paste("select name, chrom, chromEnd, observed, func from > snp129 where name in ('", paste(rsid, collapse = "','"), "');", sep = "") > > sql > [1] "select name, chrom, chromEnd, observed, func from snp129 where name > in ('rs25','rs26','rs27','rs28');" > > z <- dbGetQuery(con, sql) > > z > name chrom chromEnd observed func > 1 rs25 chr7 11550667 A/G intron > 2 rs26 chr7 11549996 -/A/G intron > 3 rs27 chr7 11549750 C/G intron > 4 rs28 chr7 11562590 A/G intron > > Best, > > Jim > > > > > > > I appreciate your help, > > thanks > > > > > > [[alternative HTML version deleted]] > > > > > > > > ----------------------------------------------------------------- ------- > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch <mailto:bioconductor at="" stat.math.ethz.ch=""> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- James W. MacDonald, M.S. > Biostatistician > Douglas Lab > University of Michigan > Department of Human Genetics > 5912 Buhl > 1241 E. Catherine St. > Ann Arbor MI 48109-5618 > 734-615-7826 > ********************************************************** > Electronic Mail is not secure, may not be read every day, and should not > be used for urgent or sensitive issues > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
SNP affy biomaRt rtracklayer SNP affy biomaRt rtracklayer • 1.4k views
ADD COMMENT
0
Entering edit mode
Seth Falcon ★ 7.4k
@seth-falcon-992
Last seen 10.3 years ago
Hi Jim, On 5/4/10 11:56 AM, James W. MacDonald wrote: > Getting a query for 700K things will likely take a long time. Had you > mentioned that you were using the Affy 6.0 chip, we could have gone in a > different direction. > > biocLite("pd.genomewidesnp.6") ## this will take a while > library(pd.genomewidesnp.6) > con <- pd.genomewidesnp.6 at getdb() > ## there might be a better way of setting up the connection... > ## If so, Benilton will correct me very soon ;-P > > ## check things out > > dbListTables(con) > [1] "featureSet" "featureSetCNV" "pmfeature" "pmfeatureCNV" > [5] "sequence" "sequenceCNV" "sqlite_stat1" "table_info" > > dbListFields(con, "featureSet") > [1] "fsetid" "man_fsetid" "affy_snp_id" "dbsnp_rs_id" > [5] "chrom" "physical_pos" "strand" "cytoband" > [9] "allele_a" "allele_b" "gene_assoc" "fragment_length" > [13] "fragment_length2" "dbsnp" "cnv" > > ## try a 70K query - I won't show how I made the snp vector... > > > length(snps) > [1] 70000 > > system.time(dbGetQuery(con, paste("select dbsnp_rs_id, chrom, > physical_pos from featureSet where dbsnp_rs_id in ('", paste(snps, > collapse = "','"), "');", sep = ""))) > user system elapsed > 4.89 1.09 119.09 > > So about 2 min for 70K query. Not bad. I was curious about this and tried the above on my laptop, a MacBook Pro running at 2.53GHz 4GB RAM (pretty sure it has a 5400 rpm disk) I get a result for the above in ~4 sec. And I can retrieve 800K in ~30 sec. Did you use a particularly slow system (NFS perhaps?) + seth -- Seth Falcon Bioconductor Core Team | FHCRC
ADD COMMENT
0
Entering edit mode
Hi Seth, Seth Falcon wrote: > Hi Jim, > > On 5/4/10 11:56 AM, James W. MacDonald wrote: >> Getting a query for 700K things will likely take a long time. Had you >> mentioned that you were using the Affy 6.0 chip, we could have gone in a >> different direction. >> >> biocLite("pd.genomewidesnp.6") ## this will take a while >> library(pd.genomewidesnp.6) >> con <- pd.genomewidesnp.6 at getdb() >> ## there might be a better way of setting up the connection... >> ## If so, Benilton will correct me very soon ;-P >> >> ## check things out >> > dbListTables(con) >> [1] "featureSet" "featureSetCNV" "pmfeature" "pmfeatureCNV" >> [5] "sequence" "sequenceCNV" "sqlite_stat1" "table_info" >> > dbListFields(con, "featureSet") >> [1] "fsetid" "man_fsetid" "affy_snp_id" "dbsnp_rs_id" >> [5] "chrom" "physical_pos" "strand" "cytoband" >> [9] "allele_a" "allele_b" "gene_assoc" "fragment_length" >> [13] "fragment_length2" "dbsnp" "cnv" >> >> ## try a 70K query - I won't show how I made the snp vector... >> >> > length(snps) >> [1] 70000 >> > system.time(dbGetQuery(con, paste("select dbsnp_rs_id, chrom, >> physical_pos from featureSet where dbsnp_rs_id in ('", paste(snps, >> collapse = "','"), "');", sep = ""))) >> user system elapsed >> 4.89 1.09 119.09 >> >> So about 2 min for 70K query. Not bad. > > I was curious about this and tried the above on my laptop, a MacBook Pro > running at 2.53GHz 4GB RAM (pretty sure it has a 5400 rpm disk) I get a > result for the above in ~4 sec. And I can retrieve 800K in ~30 sec. > > Did you use a particularly slow system (NFS perhaps?) Well, the hard drive isn't that slow - it's a 7200 rpm Seagate Barracuda. However, the comp is a Pentium 4 (3GHz) with 1 Gb RAM. I'm sort of shocked that this sled of a computer is that bad though... Jim > > + seth > > > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD REPLY

Login before adding your answer.

Traffic: 748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6