Yong Li
Dear all,
My task is this: given a list of several hundred human genes, retrieve the
SNPs located in these genes. Using biomaRt seems to be a good option.
I thought I would first get the chromosomal locations of the genes and then
find the SNPs in these regions. My code is as follows:
# start my R code
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
dbsnp <- useMart("snp", dataset = "hsapiens_snp")
# gene_symbols.txt is the file that has the list of gene symbols.
genes <- read.table("./gene_symbols.txt")
genes <- genes$V1
genes <- genes[1:50]
locations <- getBM(
  attributes = c('ensembl_gene_id', 'hgnc_symbol', 'chromosome_name',
                 'start_position', 'end_position', 'strand'),
  filters = 'hgnc_symbol',
  values = genes,
  mart = ensembl)
snps <- getBM(
  attributes = c('refsnp_id', 'allele', 'chrom_start', 'chrom_strand',
                 'consequence_type_tv'),
  filters = c('chr_name', 'chrom_start', 'chrom_end'),
  values = list(locations$chromosome_name,
                locations$start_position,
                locations$end_position),
  mart = dbsnp)
# end my R code
The getBM call that retrieves the locations is extremely fast, but
the call that retrieves the SNPs never finishes, even when I limit my gene
list to 50. Does anyone have an idea why? Or any
suggestions for solving this problem in other ways or with other packages?
Thanks in advance!
Yong
PS: my sessionInfo() output:
> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.10.0
loaded via a namespace (and not attached):
[1] RCurl_1.91-1 tools_2.14.2 XML_3.9-4
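One possible explanation, offered as an assumption rather than a confirmed diagnosis: the three filter vectors passed to the SNP query may not be matched gene by gene, so the request could cover far more than 50 small regions. A minimal sketch of querying one region at a time is shown below. The helper names `snps_per_region` and `fetch_with_biomart` are hypothetical, and the filter and attribute names are copied from the code in the post, so they may differ in other biomaRt or Ensembl releases.

```r
# Query each gene's region separately and stack the results.
# `fetch_region` is a hypothetical callback: function(chr, start, end)
# that returns a data.frame of SNPs for a single region.
snps_per_region <- function(locations, fetch_region) {
  results <- lapply(seq_len(nrow(locations)), function(i) {
    hits <- fetch_region(locations$chromosome_name[i],
                         locations$start_position[i],
                         locations$end_position[i])
    # Tag every SNP row with the gene whose region produced it.
    hits$hgnc_symbol <- rep(locations$hgnc_symbol[i], nrow(hits))
    hits
  })
  do.call(rbind, results)
}

# One way to build the callback with biomaRt (needs network access).
# Filter and attribute names are taken from the post, not verified here.
fetch_with_biomart <- function(mart) {
  function(chr, start, end) {
    biomaRt::getBM(
      attributes = c("refsnp_id", "allele", "chrom_start", "chrom_strand",
                     "consequence_type_tv"),
      filters = c("chr_name", "chrom_start", "chrom_end"),
      values = list(chr, start, end),
      mart = mart
    )
  }
}

# Usage with the objects from the post:
# snps <- snps_per_region(locations, fetch_with_biomart(dbsnp))
```

Passing the query in as a callback keeps the loop testable without a network connection. Each request stays small, and a slow or failing gene can be identified by its index.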