Problem using biomaRt to retrieve human SNPs given a list of gene symbols

0

Entering edit mode

Yong Li ▴ 80

@yong-li-5277

Last seen 11.4 years ago

Dear all, I have a task that given a list of hundreds human genes, retrieve the SNPs located in these genes. Using biomaRt seems to be a good option. I though to first get the chromosome locations of the genes and then find the SNPs in these regions. My codes is as the following: # start my R code library(biomaRt) ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl") dbsnp <- useMart("snp", dataset = "hsapiens_snp") # gene_symbols.txt is the file that has the list of gene symbols. genes <- read.table("./gene_symbols.txt") genes <- genes$V1 genes <- genes[1:50] locations <- getBM(attributes=c('ensembl_gene_id', 'hgnc_symbol', 'chromosome_name', 'start_position', 'end_position', 'strand'), filters = 'hgnc_symbol', values = genes, mart = ensembl) snps <- getBM(c('refsnp_id','allele','chrom_start','chrom_strand', 'consequence_type_tv'), filters = c('chr_name', 'chrom_start', 'chrom_end'), values = list(locations$chromosome_name, locations$start_position, locations$end_position), mart = dbsnp) # end my R code The step of using getBM to get the locations is extremely fast. But the step to get the snps never finishes, even when I limit my gene list to 50. Does anyone has an idea of the reason for this? Or any suggestions to solve this problem using other ways/packages? Thanks in advance! Yong PS: my sessioninfo. > sessionInfo() R version 2.14.2 (2012-02-29) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.10.0 loaded via a namespace (and not attached): [1] RCurl_1.91-1 tools_2.14.2 XML_3.9-4

biomaRt biomaRt • 3.4k views

ADD COMMENT • link updated 13.7 years ago by ying chen ▴ 340 • written 13.7 years ago by Yong Li ▴ 80

0

Entering edit mode

ying chen ▴ 340

@ying-chen-5085

Last seen 11.4 years ago

Hi, I have a list of probes and a table with porbes' chromosome coordinates. I want to retrieve probe's gene symbol, gene's chromosome coordinate,..... I can use biomaRt to retrieve info one probe at a time. For example: > genesymbol<-getBM(attributes=c('entrezgene','hgnc _symbol','description','chromosome_name','start_position','end_positio n'),filters=c('chromosome_name','start','end'),values=list(4,40354254, 40354313),mart=ensembl) > genesymbol entrezgene hgnc_symbol 1 55584 CHRNA9 description 1 cholinergic receptor, nicotinic, alpha 9 [Source:HGNC Symbol;Acc:14079] chromosome_name start_position end_position 1 4 40337346 40357234 > I just wonder if there is a quick way to retrieve the info for all probes I have at once? I have chromosome coordinates of my probes in a tab-delimited txt file like: 4 40354254 40354213 5 234567 3450404006 4736473 4897789............ Any suggestion? Thanks a lot for the help! Ying [[alternative HTML version deleted]]

ADD COMMENT • link 13.7 years ago ying chen ▴ 340

0

Entering edit mode

Hi Ying, I figured you could just extend your values list to include vectors of chr, start, end, like thus: > values $chromosome_name 1 2 3 4 5 6 7 8 9 10 "19" "12" "8" "8" "8" "8" "14" "3" "2" "17" $start 1 2 3 4 5 6 "58858174" "9220304" "18027971" "18067618" "18079177" "18248755" 7 8 9 10 "95078714" "151531861" "219128853" "74449433" $end 1 2 3 4 5 6 "58864865" "9268558" "18081197" "18081197" "18081197" "18258723" 7 8 9 10 "95090389" "151546277" "219134893" "74466198" But if you then use your call to getBM(), you get this: > dim(genesymbol) [1] 16376 6 and further exploration indicates that what you get are genes that fulfill the criteria of all three list items, so you get anything from chr19 that is between 58858174 and 219134893 (and the same for all of the other chromosomes). So not that helpful. Now there might be a nice way to get what you want directly from biomaRt, but I can't figure out how to do so. But the GenomicRanges package gives us a way out. > gr <- GRanges(seqnames = genesymbol$chrom, ranges = IRanges(start=genesymbol$start, end = genesymbol$end)) > gr2 <- GRanges(seqnames = values[[1]], ranges = IRanges(start = as.numeric(values[[2]]), end = as.numeric(values[[3]]))) So here I have made a GRanges object based on all the sequences we get back from biomaRt (gr) and a GRanges object based on the original 10 positions I queried on (gr2). We can now create an indicator that tells us which of the biomaRt sequences are in the original 10: > ind <- gr %in% gr2 and subset genesymbol using that indicator: > genesymbol <- genesymbol[ind,] Best, Jim On 5/11/2012 2:59 PM, ying chen wrote: > Hi, I have a list of probes and a table with porbes' chromosome coordinates. I want to retrieve probe's gene symbol, gene's chromosome coordinate,..... I can use biomaRt to retrieve info one probe at a time. For example:> genesymbol<-getBM(attributes=c('entrezgene','hgnc _symbol','description','chromosome_name','start_position','end_positio n'),filters=c('chromosome_name','start','end'),values=list(4,40354254, 40354313),mart=ensembl) >> genesymbol > entrezgene hgnc_symbol > 1 55584 CHRNA9 > description > 1 cholinergic receptor, nicotinic, alpha 9 [Source:HGNC Symbol;Acc:14079] > chromosome_name start_position end_position > 1 4 40337346 40357234 > I just wonder if there is a quick way to retrieve the info for all probes I have at once? I have chromosome coordinates of my probes in a tab-delimited txt file like: 4 40354254 40354213 > 5 234567 3450404006 4736473 4897789............ Any suggestion? Thanks a lot for the help! Ying > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 13.7 years ago James W. MacDonald 68k

Login before adding your answer.