I am fairly new to biomaRt and have read through the user's guide (pub Apr 11, 2014). It was extremely helpful in building general understanding of the package and in putting together several useful queries. However, I have not been able to figure out if there is a way to construct a query to get the particular output I would like.
Specifically, I have a list of 1.9 million SNPs that are referenced by rs ID (e.g., rs35418599), SNP chromosome, SNP BP position. I would like to get the HUGO (HGNC) gene symbol mappings to each of those SNPs. Thus, at a minimum, I would like the output to include the SNP rsID and the SNP chromosomal location of my index dataset as well as the corresponding gene with it's chromosome, gene start and gene end locations.
Initially I attempted the query using the 'snp' mart because this mart has the 'snp_filter' (rs ID) and chromosomal location filtering options as well as gene attributes (e.g., ensembl gene, 'associated gene'). However, the processing was quite slow and if I attempted to feed the call more than 50,000 rows of input data, the execution would not complete. I suspect the connection to the database times out. (I am working on a windows machine so do not see an option for downloading a local copy of the database that will work with biomaRt). Also, the 'associated_gene' field (which provides HGNC symbols) had what seemed to be an excessive number of missing values. Here is an example of a call that did execute, to map 10,000 rsIDs and some relevant output:
system.time( SNPgeneMapTEST.snp<-getBM(attributes=c("refsnp_id", "chr_name", "chrom_start", "ensembl_gene_stable_id", "associated_gene"), filters=c("snp_filter", "chr_name", "chrom_start", "chrom_end"), values=as.list(AnnotData[1:10000,c('rsID', 'chr', 'startBP', 'endBP')]), mart=snpmart)) # user system elapsed # 0.11 0.01 137.74 >head(SNPgeneMapTEST.snp) #refsnp_id chr_name chrom_start ensembl_gene_stable_id associated_gene #1 rs11252127 10 98087 ENSG00000173876 NA #2 rs11252546 10 104427 ENSG00000173876 NA #3 rs12255619 10 98481 ENSG00000173876 NA #4 rs35367238 10 71212 NA
Next, I tried used the 'ensembl' database. Since ensembl does not have an equivalent SNP-based filtering option (so far as I can find), I used chromosome/BP locations for mapping. This procedure executes much faster and I am able to provide chunks of 500,000 SNPs. However, since there is no rsID attribute, the output contains only the gene maps to my input SNPs, but not the input SNPs (rsIDs) themselves. So, it is not useful. Here is the script I submitted and some relevant output:
>system.time(SNPgeneMapTEST.ensembl <- getBM(c('hgnc_symbol','ensembl_gene_id','chromosome_name','start_position','end_position'), filters = c('chromosome_name','marker_start','marker_end'), values=as.list(AnnotData[1:500000,c('chr', 'startBP', 'endBP')]), mart=ensembl)) #output dimensions: 12688 x 6 #user system elapsed #1.29 0.03 60.18 >head(SNPgeneMapTEST.ensembl) hgnc_symbol ensembl_gene_id chromosome_name start_position end_position #1 PTCHD3 ENSG00000182077 10 27687116 27703297 #2 GDF2 ENSG00000128802 10 48413092 48416853
Does anyone know of a way to get the ensembl-based output to include the input (index) rsID and marker location values? Alternatively, is there a way to improve the efficiency of the code for the 'snp' mart to prevent the execution from timing out during mapping of a large number of rsIDs?
Thank you so much, in advance, for any guidance you can provide.
Kathleen
I should add that you can have a better column header as well: