Entering edit mode
In my opinion, the problem is ill-posed. If we make it more specific
to say
that we want to obtain a map between DNA locations connected with
dbSNP
identifiers and "coding" regions of genes, we could use the resources
noted
by Herve. If instead we want to use literature-based SNP-gene
associations,
some progress could be made using the pd.genomewidesnp.6 annotation
resource
but it seems to me not too many of the rsnumbers supplied can be found
in
that. I expected to hear from biomaRt users that a relevant mapping
could
be obtained with that tool.
More typically investigators want to define some aspect of the
physical
context of a gene and ask whether DNA in there is polymorphic -- could
involve looking upstream or downstream of the conventional coding
region.
Here are some possible solutions to the problem I think is being
posed.
> x
[1] "rs7547453" "rs2840542" "rs1999527" "rs4648545" "rs10915459"
[6] "rs16838750" "rs12128230"
> mapSNP(x,txdb=tx) # uses resources noted by Herve
rsn sym
1 rs2840542 MORN1
2 rs2840542 MORN1
3 rs4648545 TP73
> mapSNP(x) # uses org.Hs.eg.db to define gene extents
rsn sym
6497 rs7547453 SKI
79906 rs2840542 MORN1
63976 rs1999527 PRDM16
7161 rs4648545 TP73
> mapSNP(x,rad=50000) # looks 50kb up or downstream of
# each coding region
rsn sym
6497 rs7547453 SKI
5192 rs2840542 PEX10
11079 rs2840542 RER1
79906 rs2840542 MORN1
100129534 rs2840542 LOC100129534
63976 rs1999527 PRDM16
1953 rs4648545 MEGF6
7161 rs4648545 TP73
49856 rs4648545 WDR8
127262 rs4648545 TPRG1L
284661 rs12128230 LOC284661
> unix.time(mapSNP(x,rad=50000))
user system elapsed
0.768 0.002 0.776
Here is the challenge to the bioconductor community: provide the
well-documented code to outperform (and/or correct?) mapSNP by 2359
PST 31
Dec 2010 -- the best solution will win an award of $150USD from the
Bioconductor foundation. My solution, which is chromosome-specific
and
allows selection of SNP mapping and gene mapping resources, takes 72
lines
with comments. If a biomaRt one-liner does the trick, so be it --
wrap it
up so that it works nicely to solve Simon's problem and it will be
considered. Judges' decisions will be final; foundation officers
ineligible. To avoid cluttering up the mailing system, submissions
must be
sent to carey.vj@gmail.com with BIOC CHALLENGE in the header. The
best
submissions, along with my effort that generated the solution
exemplars
above, will be presented and analyzed on a blog page to be announced
early
next year.
On Wed, Dec 22, 2010 at 9:51 AM, Simon Noël <simon.noel.2@ulaval.ca>
wrote:
> Hello,
>
> Sorry to be slow for a response. What you say that should take 3-5
min
> took 3-5 days... I don't have the best computer in the world.
>
> We do have a supercomputer but even if it's the same version of R
and
> package than my litle computer, it's not working... Well... It's
not
> working on my computer to but not at the same place. We both have R
2.12.0
> and the lastest version of every package.
>
> Here is what I have. I have try with the test you sugested.
>
> On the supercomputer :
>
> > getSNPcount()
> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9
> ch10
> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
995075
> 1158707
> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19
> ch20
> 1147722 1105364 815729 740129 657719 757926 641905 645646
520666
> 586708
> ch21 ch22 chX chY chMT
> 338254 331060 529608 67438 624
> > ch22snps <- getSNPlocs("ch22")
> > ch22snps[1:5, ]
> RefSNP_id alleles_as_ambig loc
> 1 56342815 K 16050353
> 2 7288968 S 16050994
> 3 6518357 M 16051107
> 4 7292503 R 16051209
> 5 6518368 Y 16051241
> >
> >
> > makeGRangesFromRefSNPids <- function(myids)
> + {
> + ans_seqnames <- character(length(myids))
> + ans_seqnames[] <- "unknown"
> + ans_locs <- integer(length(myids))
> + for (seqname in names(getSNPcount())) {
> + locs <- getSNPlocs(seqname)
> + ids <- paste("rs", locs$RefSNP_id, sep="")
> + myrows <- match(myids, ids)
> + ans_seqnames[!is.na(myrows)] <- seqname
> + ans_locs[!is.na(myrows)] <- locs$loc[myrows]
> + }
> + GRanges(seqnames=ans_seqnames,
> + IRanges(start=ans_locs, width=1),
> + RefSNP_id=myids)
> + }
> >
> >
> > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
> "rs10915459", "rs16838750", "rs12128230", "rs999999999")
> > mysnps <- makeGRangesFromRefSNPids(myids)
> Warning message:
> In ans_locs[!is.na(myrows)] <- locs$loc[myrows] :
> number of items to replace is not a multiple of replacement length
> > mysnps # a GRanges object with 1 SNP per row
> GRangeswith 8 ranges and 1 elementMetadata value
> seqnames ranges strand | RefSNP_id
> <rle> <iranges> <rle> | <character>
> [1] ch1 [2195117, 2195117] * | rs7547453
> [2] ch1 [2291680, 2291680] * | rs2840542
> [3] ch1 [3256108, 3256108] * | rs1999527
> [4] ch1 [3577321, 3577321] * | rs4648545
> [5] ch1 [4230463, 4230463] * | rs10915459
> [6] ch1 [4404344, 4404344] * | rs16838750
> [7] ch1 [4501911, 4501911] * | rs12128230
> [8] unknown [ 0, 0] * | rs999999999
> seqlengths
> ch1 unknown
> NA NA
> > txdb <- makeTranscriptDbFromUCSC(genome="hg19",
tablename="refGene")
> Downloadthe refGene table ... OK
> Downloadthe refLink table ... OK
> Extractthe 'transcripts' data frame ... OK
> Extractthe 'splicings' data frame ... OK
> Downloadand preprocess the 'chrominfo' data frame ... OK
> Preparethe 'metadata' data frame ... OK
> Makethe TranscriptDb object ... Error in .writeMetadataTable(conn,
> metadata) : subscript out of bounds
> In addition: There were 50 or more warnings (use warnings() to see
the
> first 50)
>
>
>
>
>
> On my computer :
>
>
> > getSNPcount()
> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9
> ch10
> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
995075
> 1158707
> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19
> ch20
> 1147722 1105364 815729 740129 657719 757926 641905 645646
520666
> 586708
> ch21 ch22 chX chY chMT
> 338254 331060 529608 67438 624
> > ch22snps <- getSNPlocs("ch22")
> > ch22snps[1:5, ]
> RefSNP_id alleles_as_ambig loc
> 1 56342815 K 16050353
> 2 7288968 S 16050994
> 3 6518357 M 16051107
> 4 7292503 R 16051209
> 5 6518368 Y 16051241
> >
> >
> > makeGRangesFromRefSNPids <- function(myids)
> + {
> + ans_seqnames <- character(length(myids))
> + ans_seqnames[] <- "unknown"
> + ans_locs <- integer(length(myids))
> + for (seqname in names(getSNPcount())) {
> + locs <- getSNPlocs(seqname)
> + ids <- paste("rs", locs$RefSNP_id, sep="")
> + myrows <- match(myids, ids)
> + ans_seqnames[!is.na(myrows)] <- seqname
> + ans_locs[!is.na(myrows)] <- locs$loc[myrows]
> + }
> + GRanges(seqnames=ans_seqnames,
> + IRanges(start=ans_locs, width=1),
> + RefSNP_id=myids)
> + }
> >
> >
> > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
> "rs10915459", "rs16838750", "rs12128230", "rs999999999")
> > mysnps <- makeGRangesFromRefSNPids(myids)
> Warning message:
> In ans_locs[!is.na(myrows)] <- locs$loc[myrows] :
> number of items to replace is not a multiple of replacement length
> > mysnps # a GRanges object with 1 SNP per row
> GRangeswith 8 ranges and 1 elementMetadata value
> seqnames ranges strand | RefSNP_id
> <rle> <iranges> <rle> | <character>
> [1] ch1 [2195117, 2195117] * | rs7547453
> [2] ch1 [2291680, 2291680] * | rs2840542
> [3] ch1 [3256108, 3256108] * | rs1999527
> [4] ch1 [3577321, 3577321] * | rs4648545
> [5] ch1 [4230463, 4230463] * | rs10915459
> [6] ch1 [4404344, 4404344] * | rs16838750
> [7] ch1 [4501911, 4501911] * | rs12128230
> [8] unknown [ 0, 0] * | rs999999999
> seqlengths
> ch1 unknown
> NA NA
> > txdb <- makeTranscriptDbFromUCSC(genome="hg19",
tablename="refGene")
> Downloadthe refGene table ... OK
> Downloadthe refLink table ... OK
> Extractthe 'transcripts' data frame ... OK
> Extractthe 'splicings' data frame ... OK
> Downloadand preprocess the 'chrominfo' data frame ... OK
> Preparethe 'metadata' data frame ... OK
> Makethe TranscriptDb object ... OK
> Il y a eu 50 avis ou plus (utilisez warnings() pour voir les 50
premiers)
> > txdb
> TranscriptDbobject:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: refGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 38098
> | exon_nrow: 230201
> | cds_nrow: 204683
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-12-22 09:34:40 -0500 (Wed, 22 Dec 2010)
> | GenomicFeatures version at creation time: 1.2.3
> | RSQLite version at creation time: 0.9-4
> | DBSCHEMAVERSION: 1.0
> > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> > tx # a GRanges object with 1 transcript per row
> GRangeswith 38098 ranges and 3 elementMetadata values
> seqnames ranges strand | tx_id
tx_name
> <rle> <iranges> <rle> | <integer>
<character>
> [1] chr1 [ 69091, 70008] + | 1021
NM_001005484
> [2] chr1 [323892, 328581] + | 1023
NR_028327
> [3] chr1 [323892, 328581] + | 1024
NR_028322
> [4] chr1 [323892, 328581] + | 1025
NR_028325
> [5] chr1 [367659, 368597] + | 1022
NM_001005277
> [6] chr1 [367659, 368597] + | 1026
NM_001005221
> [7] chr1 [367659, 368597] + | 1027
NM_001005224
> [8] chr1 [763064, 789740] + | 174
NR_015368
> [9] chr1 [861121, 879961] + | 1035
NM_152486
> ... ... ... ... ... ...
...
> [38090] chrY [27177050, 27198251] - | 18991
NM_004678
> [38091] chrY [27177050, 27198251] - | 18992
NM_001002760
> [38092] chrY [27177050, 27198251] - | 18993
NM_001002761
> [38093] chrY [27209230, 27246039] - | 18994
NR_001525
> [38094] chrY [27209230, 27246039] - | 18995
NR_002177
> [38095] chrY [27209230, 27246039] - | 18996
NR_002178
> [38096] chrY [27329790, 27330920] - | 18997
NR_001526
> [38097] chrY [27329790, 27330920] - | 18998
NR_002179
> [38098] chrY [27329790, 27330920] - | 18999
NR_002180
> gene_id
> <compressedcharacterlist>
> [1] 79501
> [2] 100133331
> [3] 100132287
> [4] 100132062
> [5] 81399
> [6] 729759
> [7] 26683
> [8] 643837
> [9] 148398
> ... ...
> [38090] 9083
> [38091] 442867
> [38092] 442868
> [38093] 114761
> [38094] 474150
> [38095] 474149
> [38096] 252949
> [38097] 474152
> [38098] 474151
> seqlengths
> chr1 chr2 ...
chr18_gl000207_random
> 249250621 243199373 ...
4262
> > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
> > map <- as.matrix(findOverlaps(mysnps, tx))
> Warning message :
> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
> Only some seqnames from 'query' and 'subject' were not identical
> > mapped_genes <- values(tx)$gene_id[map[, 2]]
> > mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]],
> elementLengths(mapped_genes))
> Erreur dans base::rep.int(x, ...) : valeur 'times' incorrecte
>
>
>
> Simon Noël
> CdeC
> ________________________________________
> De : Hervé Pagès [hpages@fhcrc.org]
> Date d'envoi : 5 décembre 2010 23:43
> À : Simon Noël
> Cc : bioconductor@r-project.org
> Objet : Re: [BioC] maping SNPs
>
> Hi Simon,
>
> On 12/03/2010 10:17 AM, Simon Noël wrote:
> >
> > Hi,
> >
> >
> >
> > I have a really big list of SNPs names like :
> >
> >
> >
> > SNPNAME
> >
> > rs7547453
> >
> > rs2840542
> >
> > rs1999527
> >
> > rs4648545
> >
> > rs10915459
> >
> > rs16838750
> >
> > rs12128230
> >
> > ...
> >
> >
> >
> > I woudlike to map them to their official gene symbol. What
the best
> way to
> > procede?
>
> Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
> from SNPs to genes and I don't think we have this kind of mappings
> either in our collection of annotations (*.db packages).
>
> But if your SNPs are Human then you can do the mapping yourself by
> using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
> packages.
>
> The latest SNPlocs.Hsapies.dbSNP.* package is
> SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
> SNP locations relative to the GRCh37 genome:
>
> > library(SNPlocs.Hsapiens.dbSNP.20101109)
> > getSNPcount()
> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9
> ch10
> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
995075
> 1158707
> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19
> ch20
> 1147722 1105364 815729 740129 657719 757926 641905 645646
520666
> 586708
> ch21 ch22 chX chY chMT
> 338254 331060 529608 67438 624
>
> Note that it doesn't contain *all* SNPs from dbSNP Build 132:
> only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
> for the details).
>
> > ch22snps <- getSNPlocs("ch22")
> > ch22snps[1:5, ]
> RefSNP_id alleles_as_ambig loc
> 1 56342815 K 16050353
> 2 7288968 S 16050994
> 3 6518357 M 16051107
> 4 7292503 R 16051209
> 5 6518368 Y 16051241
>
> Note that the rs prefix has been dropped.
>
> So here is how to proceed:
>
> First you can use the following function to make a GRanges object
from
> your SNP ids:
>
> makeGRangesFromRefSNPids <- function(myids)
> {
> ans_seqnames <- character(length(myids))
> ans_seqnames[] <- "unknown"
> ans_locs <- integer(length(myids))
> for (seqname in names(getSNPcount())) {
> locs <- getSNPlocs(seqname)
> ids <- paste("rs", locs$RefSNP_id, sep="")
> myrows <- match(myids, ids)
> ans_seqnames[!is.na(myrows)] <- seqname
> ans_locs[!is.na(myrows)] <- locs$loc[myrows]
> }
> GRanges(seqnames=ans_seqnames,
> IRanges(start=ans_locs, width=1),
> RefSNP_id=myids)
> }
>
> This takes between 3 and 5 minutes:
>
> > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
> "rs10915459", "rs16838750", "rs12128230",
"rs999999999")
> > mysnps <- makeGRangesFromRefSNPids(myids)
> > mysnps # a GRanges object with 1 SNP per row
> GRanges with 8 ranges and 1 elementMetadata value
> seqnames ranges strand | myids
> <rle> <iranges> <rle> | <character>
> [1] ch1 [2195117, 2195117] * | rs7547453
> [2] ch1 [2291680, 2291680] * | rs2840542
> [3] ch1 [3256108, 3256108] * | rs1999527
> [4] ch1 [3577321, 3577321] * | rs4648545
> [5] ch1 [4230463, 4230463] * | rs10915459
> [6] ch1 [4404344, 4404344] * | rs16838750
> [7] ch1 [4501911, 4501911] * | rs12128230
> [8] unknown [ 0, 0] * | rs999999999
>
> seqlengths
> ch1 unknown
> NA NA
>
> The next step is to create a TranscriptDb object with
> makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
> from the GenomicFeatures package. This TranscriptDb object will
> contain the transcript locations and their associated
> genes extracted from the annotation source you choose.
> For example, if you want to use RefSeq genes:
>
> ## Takes about 3 minutes:
> > txdb <- makeTranscriptDbFromUCSC(genome="hg19",
tablename="refGene")
> > txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: refGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 37924
> | exon_nrow: 230024
> | cds_nrow: 204571
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
> | GenomicFeatures version at creation time: 1.2.2
> | RSQLite version at creation time: 0.9-4
> | DBSCHEMAVERSION: 1.0
>
> Note the type of gene IDs (Entrez Gene ID) stored in this
TranscriptDb
> object: this means that later you will be able to use the
org.Hs.eg.db
> package to map your gene ids to their symbol (the org.*.eg.db
packages
> are Entrez Gene ID centric).
>
> To extract the transcript locations together with their genes:
>
> > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> > tx # a GRanges object with 1 transcript per row
> GRanges with 37924 ranges and 1 elementMetadata value
> seqnames ranges strand |
gene_id
> <rle> <iranges> <rle> |
<compressedcharacterlist>
> [1] chr1 [ 69091, 70008] + |
79501
> [2] chr1 [323892, 328581] + |
100133331
> [3] chr1 [323892, 328581] + |
100132287
> [4] chr1 [323892, 328581] + |
100132062
> [5] chr1 [367659, 368597] + |
81399
> [6] chr1 [367659, 368597] + |
729759
> [7] chr1 [367659, 368597] + |
26683
> [8] chr1 [763064, 789740] + |
643837
> [9] chr1 [861121, 879961] + |
148398
> ... ... ... ... ...
...
> [37916] chrY [27177050, 27198251] - |
9083
> [37917] chrY [27177050, 27198251] - |
442867
> [37918] chrY [27177050, 27198251] - |
442868
> [37919] chrY [27209230, 27246039] - |
114761
> [37920] chrY [27209230, 27246039] - |
474150
> [37921] chrY [27209230, 27246039] - |
474149
> [37922] chrY [27329790, 27330920] - |
252949
> [37923] chrY [27329790, 27330920] - |
474152
> [37924] chrY [27329790, 27330920] - |
474151
>
> seqlengths
> chr1 chr2 ...
chr18_gl000207_random
> 249250621 243199373 ...
4262
>
> Now you can use findOverlaps() on 'mysnps' and 'tx' to find
> the transcripts hits by your snps. But before you can do this,
> you need to rename the sequences in 'mysnps' because dbSNPs and
> UCSC use different naming conventions for the chromosomes:
>
> > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
>
> Then:
>
> > map <- as.matrix(findOverlaps(mysnps, tx))
>
> 'map' contains the mapping between your SNPs and their genes but not
> in a readable form (this matrix contains indices) so we make the
> 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
> the associated gene ids:
>
> > mapped_genes <- values(tx)$gene_id[map[, 2]]
> > mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]],
> elementLengths(mapped_genes))
> > snp2gene <- unique(data.frame(snp_id=mapped_snps,
> gene_id=unlist(mapped_genes)))
> > rownames(snp2gene) <- NULL
> > snp2gene[1:4, ]
> snp_id gene_id
> 1 rs7547453 6497
> 2 rs2840542 79906
> 3 rs1999527 63976
> 4 rs4648545 7161
>
> Note that there is no guarantee that the number of rows in this
> data frame is the number of your original SNP ids because the
> relation between SNP ids and gene ids is of course not one-to-one.
>
> Also the method described above considers that a SNP hits a gene
> if it's located between the start and the end of one of its
> transcripts but it doesn't take in account the exon structure of
> the transcripts. If you want to do this you need to use exonsBy()
> (from GenomicFeatures) to extract the exons grouped by transcripts
> (this will be stored in a GRangesList object) and use this object
> instead of 'tx' in the call to findOverlaps().
>
> Hope this helps,
> H.
>
>
> >
> >
> >
> > Simon Noël
> > CdeC
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]