biomaRt SNP (rs ID) -to-HGNC symbol mapping question
2
2
Entering edit mode
Last seen 6.0 years ago
United States

I am fairly new to biomaRt and have read through the user's guide (pub Apr 11, 2014). It was extremely helpful in building general understanding of the package and in putting together several useful queries. However, I have not been able to figure out if there is a way to construct a query to get the particular output I would like.

Specifically, I have a list of 1.9 million SNPs that are referenced by rs ID (e.g., rs35418599), SNP chromosome, SNP BP position. I would like to get the HUGO (HGNC) gene symbol mappings to each of those SNPs. Thus, at a minimum, I would like the output to include the SNP rsID and the SNP chromosomal location of my index dataset as well as the corresponding gene with it's chromosome, gene start and gene end locations.

Initially I attempted the query using the 'snp' mart because this mart has the 'snp_filter' (rs ID) and chromosomal location filtering options as well as gene attributes (e.g., ensembl gene, 'associated gene'). However, the processing was quite slow and if I attempted to feed the call more than 50,000 rows of input data, the execution would not complete. I suspect the connection to the database times out. (I am working on a windows machine so do not see an option for downloading a local copy of the database that will work with biomaRt). Also, the 'associated_gene' field (which provides HGNC symbols) had what seemed to be an excessive number of missing values. Here is an example of a call that did execute, to map 10,000 rsIDs and some relevant output:

system.time(

SNPgeneMapTEST.snp<-getBM(attributes=c("refsnp_id", "chr_name", "chrom_start", "ensembl_gene_stable_id", "associated_gene"),

filters=c("snp_filter", "chr_name", "chrom_start", "chrom_end"),

values=as.list(AnnotData[1:10000,c('rsID', 'chr', 'startBP', 'endBP')]), mart=snpmart))

# user  system elapsed

#  0.11    0.01  137.74

#refsnp_id chr_name chrom_start ensembl_gene_stable_id associated_gene

#1 rs11252127       10       98087        ENSG00000173876              NA

#2 rs11252546       10      104427        ENSG00000173876              NA

#3 rs12255619       10       98481        ENSG00000173876              NA

#4 rs35367238       10       71212                                                  NA

Next, I tried used the 'ensembl' database. Since ensembl does not have an equivalent SNP-based filtering option (so far as I can find), I used chromosome/BP locations for mapping. This procedure executes much faster and I am able to provide chunks of 500,000 SNPs.  However, since there is no rsID attribute, the output contains only the gene maps to my input SNPs, but not the input SNPs (rsIDs) themselves. So, it is not useful. Here is the script I submitted and some relevant output:

>system.time(SNPgeneMapTEST.ensembl <- getBM(c('hgnc_symbol','ensembl_gene_id','chromosome_name','start_position','end_position'),

filters = c('chromosome_name','marker_start','marker_end'),

values=as.list(AnnotData[1:500000,c('chr', 'startBP', 'endBP')]), mart=ensembl))

#output dimensions: 12688 x 6

#user  system elapsed

#1.29    0.03   60.18

hgnc_symbol ensembl_gene_id chromosome_name start_position end_position

#1      PTCHD3 ENSG00000182077          10       27687116     27703297

#2      GDF2 ENSG00000128802              10       48413092     48416853

Does anyone know of a way to get the ensembl-based output to include the input (index) rsID and marker location values? Alternatively, is there a way to improve the efficiency of the code for the 'snp' mart to prevent the execution from timing out during mapping of a large number of rsIDs?

Thank you so much, in advance, for any guidance you can provide.

Kathleen

biomart • 5.6k views
3
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

With that many SNPs, you would probably be better off using different infrastructure, although what I am going to show you isn't without its own pain. Here I am using the release version of Bioconductor, but you could use the devel version (soon to be released), which has GRCh38 data as well. Here I will just use 10,000 randomly selected rsIDs, in a character vector called 'rsids'.

library("SNPlocs.Hsapiens.dbSNP.20120608")

> gr <- snpid2grange(SNPlocs.Hsapiens.dbSNP.20120608, rsids)
> gr
GRanges with 10000 ranges and 2 metadata columns:
seqnames                 ranges strand   |   RefSNP_id
<Rle>              <IRanges>  <Rle>   | <character>
[1]     ch14 [102795914, 102795914]      +   |   180858818
[2]      ch7 [ 17587744,  17587744]      +   |    79709241
[3]      ch5 [ 49443237,  49443237]      +   |   149337550
[4]      ch5 [ 94400454,  94400454]      +   |   188481634
[5]     ch11 [104808780, 104808780]      +   |   149521384
...      ...                    ...    ... ...         ...
[9996]      ch1 [152135402, 152135402]      +   |    78872389
[9997]      ch1 [120390542, 120390542]      +   |   145043299
[9998]      ch7 [ 85057199,  85057199]      +   |   146211261
[9999]      ch6 [144300677, 144300677]      +   |   188976388
[10000]      ch2 [216296439, 216296439]      +   |   187032925
alleles_as_ambig
<character>
[1]                R
[2]                R
[3]                Y
[4]                Y
[5]                R
...              ...
[9996]                R
[9997]                W
[9998]                Y
[9999]                Y
[10000]                Y
---
seqlengths:
ch1       ch2       ch3       ch4 ...       chX       chY      chMT
249250621 243199373 198022430 191154276 ... 155270560  59373566     16569

Note that the chromosomes are 'ch1', etc. We will have to fix that.

> seqlevels(gr) <- genomeStyles()$Homo_sapiens$UCSC

In the new version of Bioc, you will be able to just do

> seqlevelsStyle(gr) <- "UCSC"

Now let's get gene info

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## get all genes
> gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
## remove unplaced sequences
> gns <- keepStandardChromosomes(gns, "Homo_sapiens","UCSC")

Now we want to know which of the SNPs (in the 'gr' object) are overlapping any genes (in the 'gns' object):

> hits <- findOverlaps(gr, gns)
Error in mergeNamedAtomicVectors(seqlengths(x), seqlengths(y), what = c("sequence",  :
sequence chrM has incompatible seqlengths:
- in 'x': 16569
- in 'y': 16571

URG. Stupid mitochondrial DNA. Let's remove that.

> gr <- keepSeqlevels(gr, seqlevels(gr)[1:24])
> gns <- keepSeqlevels(gns, seqlevels(gns)[1:24])
> hits <- findOverlaps(gr, gns)
Error in mergeNamedAtomicVectors(genome(x), genome(y), what = c("sequence",  :
sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY have incompatible genomes:
- in 'x': GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5, GRCh37.p5
- in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19

Double URG! Let's make the genomes match (while GRCh37 and hg19 are not technically exactly the same, I forget how they differ, so let's ignore that part).

> genome(gr) <- "hg19"
> hits <- findOverlaps(gr, gns)

Now we can get the overlapping data, and add the HUGO symbols:

> grolaps <- gr[queryHits(hits),]
> gnolaps <- gns[subjectHits(hits),]
> library(org.Hs.eg.db)

> symb <- select(org.Hs.eg.db, unlist(mcols(gnolaps)$gene_id), "SYMBOL") > mcols(gnolaps)$symbol <- symb[,2]

GRanges with 6 ranges and 2 metadata columns:
seqnames                 ranges strand |         gene_id      symbol
<Rle>              <IRanges>  <Rle> | <CharacterList> <character>
55778    chr14 [102783714, 102809511]      + |           55778      ZNF839
26054     chr6 [ 76311622,  76427994]      + |           26054       SENP6
54496    chr16 [ 68344877,  68391169]      + |           54496       PRMT7
51741    chr16 [ 78133327,  79246564]      + |           51741        WWOX
375612     chr7 [103969104, 104549003]      + |          375612      LHFPL3
28232    chr15 [ 92396938,  92715665]      + |           28232     SLCO3A1
---
seqlengths:
chr1      chr2      chr3      chr4 ...     chr22      chrX      chrY
249250621 243199373 198022430 191154276 ...  51304566 155270560  59373566
GRanges with 6 ranges and 2 metadata columns:
seqnames                 ranges strand |   RefSNP_id alleles_as_ambig
<Rle>              <IRanges>  <Rle> | <character>      <character>
[1]    chr14 [102795914, 102795914]      + |   180858818                R
[2]     chr6 [ 76317018,  76317018]      + |   183934912                M
[3]    chr16 [ 68349015,  68349015]      + |    56162254                W
[4]    chr16 [ 78396768,  78396768]      + |     8058491                Y
[5]     chr7 [104100154, 104100154]      + |    59912285                Y
[6]    chr15 [ 92517715,  92517715]      + |   191756230                K
---
seqlengths:
chr1      chr2      chr3      chr4 ...     chr22      chrX      chrY
249250621 243199373 198022430 191154276 ...  51304566 155270560  59373566

And then if you want to put it all together, you can use the DataFrame class:

> out <- cbind(as(grolaps, "DataFrame"), as(gnolaps, "DataFrame"), cbind(mcols(gnolaps), mcols(grolaps)))
DataFrame with 6 rows and 6 columns
X                            X.1         gene_id
<GRanges>                      <GRanges> <CharacterList>
1 chr14:+:[102795914, 102795914] chr14:+:[102783714, 102809511]           55778
2  chr6:+:[ 76317018,  76317018]  chr6:+:[ 76311622,  76427994]           26054
3 chr16:+:[ 68349015,  68349015] chr16:+:[ 68344877,  68391169]           54496
4 chr16:+:[ 78396768,  78396768] chr16:+:[ 78133327,  79246564]           51741
5  chr7:+:[104100154, 104100154]  chr7:+:[103969104, 104549003]          375612
6 chr15:+:[ 92517715,  92517715] chr15:+:[ 92396938,  92715665]           28232
symbol   RefSNP_id alleles_as_ambig
<character> <character>      <character>
1      ZNF839   180858818                R
2       SENP6   183934912                M
3       PRMT7    56162254                W
4        WWOX     8058491                Y
5      LHFPL3    59912285                Y
6     SLCO3A1   191756230                K

HTH,

Jim

0
Entering edit mode

I should add that you can have a better column header as well:

> out <- cbind(Genes = as(grolaps, "DataFrame"), SNPs = as(gnolaps, "DataFrame"), cbind(mcols(gnolaps), mcols(grolaps)))
DataFrame with 6 rows and 6 columns
Genes.X                         SNPs.X         gene_id
<GRanges>                      <GRanges> <CharacterList>
1 chr14:+:[102795914, 102795914] chr14:+:[102783714, 102809511]           55778
2  chr6:+:[ 76317018,  76317018]  chr6:+:[ 76311622,  76427994]           26054
3 chr16:+:[ 68349015,  68349015] chr16:+:[ 68344877,  68391169]           54496
4 chr16:+:[ 78396768,  78396768] chr16:+:[ 78133327,  79246564]           51741
5  chr7:+:[104100154, 104100154]  chr7:+:[103969104, 104549003]          375612
6 chr15:+:[ 92517715,  92517715] chr15:+:[ 92396938,  92715665]           28232
symbol   RefSNP_id alleles_as_ambig
<character> <character>      <character>
1      ZNF839   180858818                R
2       SENP6   183934912                M
3       PRMT7    56162254                W
4        WWOX     8058491                Y
5      LHFPL3    59912285                Y
6     SLCO3A1   191756230                K

0
Entering edit mode
Last seen 6.0 years ago
United States

Jim,

Thank you very much for your response. I will be back at my machine tomorrow to give it a try. One quick follow-up question: From your output, it appears that the RefSNP_ids are not in the dbSNP reference SNP ID format (i.e., rs#).  Will this procedure match an index list that is in that format?

Thanks again,
Kathleen

0
Entering edit mode

Hi Kathleen,

The SNP IDs are just the IDs with the rs part stripped off. For instance, the very first SNP is rs180858818. To be sure, http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=180858818, and note that for GRCh37, it is on chr14:102795914.

Also note that I got mixed up and put the Genes label on the SNP locations and vice versa.

0
Entering edit mode

Jim (or others),

I am getting inconsistent results when I submit the line:

gr <- snpid2grange(SNPlocs.Hsapiens.dbSNP.20120608, rsids)

which seems to be based on whether my vector of rsids has any(too many?) that are missing from the accessed database.

My question is: Will an error be generated if any/some/many of the submitted vector of rsids are not in the dbSNP database? Is there a way to get it to give you results for those rsids that are not missing and skip those that are missing?

0
Entering edit mode

Hi Kathleen,

There is an object in the SNPlocs.Hsapiens.dbSNP.20120608 package that contains all the SNP IDs that are in the package, that you can use to test. You can load using

load(system.file("extdata/all_rsids.rda", package="SNPlocs.Hsapiens.dbSNP.20120608"))

And then you can subset your existing rsIDs by doing

knownrs <- paste0("rs", all_rsids)
ind <- rsids %in% knownrs
sum(ind)/length(ind)*100 ## percentage of your SNPs in the SNPlocs package
rsids <- rsids[ind]

And go from there. If there are tons of SNPs missing, it might be due to the age of this particular SNP package, and you would be better off to upgrade to the devel version of R so you can use the updated SNPlocs and TxDb packages.

0
Entering edit mode

Thank you very much, Jim.

I'll try that!

0
Entering edit mode

Luckily, 1926013 of my 1936864 SNPs (99.4%) were known rsIDs, so it does not appear to be a problem with the age of the SNP db we are using.

Using the remainder of your code, I was able to map my SNPs to hgnc symbols.

One follow-up question:

Does findOverlaps() function return only the hits in which the SNP location is strictly within the gene boundaries (i.e., between the start and end location of the gene)? If so, is there any way to modify the script in order to put a buffer region, say 5000 bp, around the gene location?

Thanks, once again, for your invaluable assistance.

0
Entering edit mode

Yes, findOverlaps() is only looking for things that overlap each other. It's easy to resize the regions you are looking to overlap. First note that the 'gns' object is a GRanges object. It can be difficult to find the help for these beasts, so I usually just remember one thing you can do, and then search on that (e.g., ?narrow or ?shift or ?promoters, etc). But the direct way to find the help is

help("intra-range-methods", "GenomicRanges")

Edit: Ignore what I wrote before...

I think you just want flank(), and I think you have to do it on each end separately.

> onend <- flank(gns, 5000)
> otherend <- flank(gns, 5000,FALSE)
> gns2 <- punion(onend, gns)
> gns2 <- punion(otherend, gns2)
GRanges with 6 ranges and 1 metadata column:
seqnames                 ranges strand |         gene_id
<Rle>              <IRanges>  <Rle> | <CharacterList>
1    chr19 [ 58858172,  58874214]      - |               1
10     chr8 [ 18248755,  18258723]      + |              10
100    chr20 [ 43248163,  43280376]      - |             100
1000    chr18 [ 25530930,  25757445]      - |            1000
10000     chr1 [243651535, 244006886]      - |           10000
100008586     chrX [ 49217763,  49233491]      + |       100008586
---
seqlengths:
chr1                 chr2 ...       chrUn_gl000249
249250621            243199373 ...                38502
GRanges with 6 ranges and 0 metadata columns:
seqnames                 ranges strand
<Rle>              <IRanges>  <Rle>
1    chr19 [ 58853172,  58879214]      -
10     chr8 [ 18243755,  18263723]      +
100    chr20 [ 43243163,  43285376]      -
1000    chr18 [ 25525930,  25762445]      -
10000     chr1 [243646535, 244011886]      -
100008586     chrX [ 49212763,  49238491]      +
---
seqlengths:
chr1                 chr2 ...       chrUn_gl000249
249250621            243199373 ...                38502

Which I now think is OK.

0
Entering edit mode

Thanks, Jim.

It's odd. When I run the findOverlaps using gns2, I get far fewer symbol hits and would expect to increase that number (i.e., went from 497722 with gns down to 93800 with gns2). Perhaps I'm doing something incorrectly.

0
Entering edit mode

0
Entering edit mode

The ranges for gns and gns2 look correct. However, when I run the remaining code to generate the hits and  add the HUGO symbols I get a couple of errors. Not sure if the resulting output is correct:

onend <- flank(gns, 5000)
otherend <- flank(gns, 5000,FALSE)
gns2 <- punion(onend, gns) #** SHOULD THIS BE gns or gns2?
gns2 <- punion(otherend, gns2)

#Now we want to know which of the SNPs (in the 'gr' object) are overlapping any genes (in the 'gns' object):
hits <- findOverlaps(gr, gns2)

#Now we can get the overlapping data, and add the HUGO symbols:
grolaps <- gr[queryHits(hits),]
gnolaps <- gns2[subjectHits(hits),]

library(org.Hs.eg.db)

symb <- select(org.Hs.eg.db, unlist(mcols(gnolaps)$gene_id), "SYMBOL") ## Error in .testForValidKeys(x, keys, keytype) : # 'keys' must be a character vector mcols(gnolaps)$symbol <- symb[,2]

#Error in [[<-(*tmp*, name, value = c("TUBBP5", "ZMYND11", "ZMYND11",  :
#  93800 elements in value to replace 590686 elements (93800 from previous run)

out <- cbind(SNPs = as(grolaps, "DataFrame"), Genes = as(gnolaps, "DataFrame"), cbind(mcols(gnolaps), mcols(grolaps)))

dim(out) # 590686 x 4 # ** This is more than I got before. I just want to make sure the above error is resolved so that I know these are correct mappings.

0
Entering edit mode

Doing the punion() business is stripping off the metadata columns of the GRanges object. You should do

mcols(gns2) <- mcols(gns)

to put them back in there.

0
Entering edit mode

Or you could just do the following to compute gns2:

gns2 <- gns
start(gns2) <- start(gns2) - 5000
end(gns2) <- end(gns2) + 5000

H.

0
Entering edit mode

OK, Jim. This all works well. One 'final' question. I'm not familiar with the objects generated by these procedures and for some reason, I cannot write the output to a text file. I get the following error.

rsIDtoHGNCmaps_GRanges2 <- cbind(SNPs = as(grolaps, "DataFrame"), Genes = as(gnolaps, "DataFrame"), cbind(mcols(gnolaps), mcols(grolaps)))

write.table(rsIDtoHGNCmaps_GRanges2, 'D:/WORKING/rsIDtoHGNCmaps_GRanges2.txt')
#Error in data.frame(seqnames = as.factor(seqnames(x)), start = start(x),  :
#  duplicate row.names: 10771, 414235, 23560, 55853, 22884, 399706, 642394, 5214, 100507034, 338588, 83592, 1645, 8644, 1109, 114131, 10276, 810, 54906, 84893, 84991, 5209, 100422989, 399715, 439949, 100507127, 3698, 509, 83860, 2625, 10659, 79746, 254427, 55526, 55176, 8872, 57118, 100616151, 283070, 10133, 55388, 8559, 100302116, 51182, 79723, 644890, 55301, 10557, 9317, 7431, 100128098, 8027, 653567, 4360, 574446, 221074, 783, 221079, 84898, 100616383, 100128511, 8028, 100130992, 100532731, 23412, 648, 9576, 219681, 22921, 256297, 220213, 56243, 693188, 79896, 57512, 53904, 2572, 54518, 645528, 731789, 23590, 84930, 22931, 51322, 100847088, 25805, 84569, 387647, 1326, 6935, 221016, 79741, 1390, 219771, 219770, 91074, 7581, 7587, 100129055, 158160, 399744, 399746, 84856, 9790, 100847014, 5979, 55454, 53828, 642819, 220992, 414201, 414197, 414208, 414260, 100506835, 283033, 100130539, 643236, 83937, 7570, 338579, 100422988, 240, 253725, 414212, 414241, 9721, 5540, 439965, 244,

0
Entering edit mode

You could always do

rownames(rsIDtoHGNCmaps_GRanges2) <- seq_len(nrow(rsIDtoHGNCmaps_GRanges2))

But in general you should resist writing things out to disk in text format. The whole idea behind all this infrastructure is to have easy ways to manipulate really large complex data, and dumping to disk just results in some huge file that is now really cumbersome to deal with.

What is your next step? Opening in Excel? That is a horrible idea on many levels, one of which is Excel's lovely habit of taking HUGO symbols that look like a date, and converting irreversibly to date format (did you know there are a whole host of genes with symbols like Apr1, Mar1, etc, which will be converted to 4/1/2014, etc?).

You are much better off learning to manipulate the data in R. Granted, the learning curve is steep, but exporting to suboptimal tools like Excel will cause more pain than enlightenment.

0
Entering edit mode

Thanks, Jim.

I actually just want it in a universally recognizable format so that I can import it into other applications that do not necessarily recognize the .rda format.  And I am very familiar with excel's destruction of HUGO symbols, particularly those that appear to it to be dates.

0
Entering edit mode

Not sure what a universally recognizable format would be for that. In addition to tabulated text, you could try GFF3:

mcols(gnolaps)$gene_id <- as.character(mcols(gnolaps)$gene_id)
mcols(gnolaps) <- cbind(DataFrame(type=factor("gene")), mcols(gnolaps), DataFrame(SNPloc=start(grolaps), RefSNP_id=paste0("rs", mcols(grolaps)$RefSNP_id), alleles_as_ambig=mcols(grolaps)$alleles_as_ambig))

library(rtracklayer)
export(gnolaps, "gnolaps.gff3")

To import the gnolaps object later on:

library(rtracklayer)
gnolaps2 <- import("gnolaps.gff3")

Unfortunately, there is a little bit of massage that needs to be done on gnolaps2 in order to get it back in its exact original state:

mcols(gnolaps2)$source <- mcols(gnolaps2)$score <- mcols(gnolaps2)$phase <- NULL mcols(gnolaps2)$SNPloc <- as.integer(mcols(gnolaps2)$SNPloc) names(mcols(gnolaps2)$type) <- NULL
identical(gnolaps2, gnolaps)  # TRUE

H.

PS: I edited all the code chunks in this thread to replace use of elementMetadata() with use of mcols() (the 2 are equivalent but the latter is the recommended one).

0
Entering edit mode

Jim. Thanks again for the suggestion for exporting as gff3. Any chance you know how to modify the 'out' file in order to export as .txt as well?