RE : maping SNPs

0

Entering edit mode

SimonNoël ▴ 450

@simonnoel-3455

Last seen 9.7 years ago

Hello Mr. Pag?s, At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow. As you remember, with your help we changed it a little bit and we have got with R2.10 : library("IRanges") library("GenomicRanges") library("GenomicFeatures") #? changer si une version plus r?cente de la librairie est t?l?charg?e. library(SNPlocs.Hsapiens.dbSNP.20101109) library("org.Hs.eg.db") #Allocation de la m?moire sous windows memory.limit(size = 4095) #v?rification de la librairie SNPlocs.Hsapiens.dbSNP getSNPcount() ch22snps <- getSNPlocs("ch22") ch22snps[1:5, ] #Create a GRange objetc to use with GenomicRanges library makeGRangesFromRefSNPids <- function(myids, verbose=FALSE) { ans_seqnames <- character(length(myids)) ans_seqnames[] <- "unknown" ans_locs <- integer(length(myids)) for (seqname in names(getSNPcount())) { if (verbose) cat("Processing ", seqname, " SNPs ...\n", sep="") locs <- getSNPlocs(seqname) ids <- paste("rs", locs$RefSNP_id, sep="") myrows <- match(myids, ids) hit_idx <- !is.na(myrows) ans_seqnames[hit_idx] <- seqname ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]] } GRanges(seqnames=ans_seqnames, IRanges(start=ans_locs, width=1), RefSNP_id=myids) } #Test en utilisant les premi?res sondes du premier et second chormosome #myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") #ouverture du fichier pour aller chercher nos num?ros rs rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE) myids <- rs_SNPs[,1] mysnps <- makeGRangesFromRefSNPids(myids) mysnps # a GRanges object with 1 SNP per row #create a TranscriptDb txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") txdb #extract the transcript locations together with their genes tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) tx # a GRanges object with 1 transcript per row #rename chromosome to fit USCS standard seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) #v?rifier pour X/Y -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps)) #mapping but not on a readable format map <- as.matrix(findOverlaps(mysnps, tx)) #making the mapping readable mapped_genes <- values(tx)$gene_id[map[, 2]] mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) rownames(snp2gene) <- NULL snp2gene #snp2gene se travaille mal alors on le transf?re en matrice snp2geneTmp = t(t(snp2gene)) #aller chercher les symboles correspondants ? nos gene id symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) save.image(file = "map.RData") And everything was working perfectly. Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping. I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error... Can you help me? The problems seems to start with "map <- as.matrix(findOverlaps(mysnps, tx))" and the other error seems to result from that problem. sessionInfo() R version 2.14.1 (2011-12-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] org.Hs.eg.db_2.6.4 [2] RSQLite_0.11.1 [3] DBI_0.2-5 [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 [5] GenomicFeatures_1.6.7 [6] AnnotationDbi_1.16.11 [7] Biobase_2.14.0 [8] GenomicRanges_1.6.7 [9] IRanges_1.12.6 loaded via a namespace (and not attached): [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0 RCurl_1.9-5 [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4 zlibbioc_1.0.0 > library("IRanges") Attaching package: ?IRanges? The following object(s) are masked from ?package:base?: cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int, pmin, pmin.int, rbind, rep.int, setdiff, table, union > library("GenomicRanges") > library("GenomicFeatures") Loading required package: AnnotationDbi Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material. To view, type 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation("pkgname")'. Attaching package: ?Biobase? The following object(s) are masked from ?package:IRanges?: updateObject > #? changer si une version plus r?cente de la librairie est t?l?charg?e. > library(SNPlocs.Hsapiens.dbSNP.20110815) > library("org.Hs.eg.db") Loading required package: DBI > > > #Allocation de la m?moire sous windows > memory.limit(size = 4095) [1] Inf Warning message: 'memory.limit()' is Windows-specific > > #v?rification de la librairie SNPlocs.Hsapiens.dbSNP > getSNPcount() ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307 ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 1542256 1521919 1104719 1031214 949642 1084538 917737 886293 732039 788556 ch21 ch22 chX chY chMT 468379 454939 920890 75108 942 > ch22snps <- getSNPlocs("ch22") > ch22snps[1:5, ] RefSNP_id alleles_as_ambig loc 1 56342815 K 16050353 2 149201999 Y 16050408 3 146752890 S 16050612 4 139377059 Y 16050678 5 143300205 R 16050822 > > #########################? FAIRE CANOPUS################### > > #Create a GRange objetc to use with GenomicRanges library > makeGRangesFromRefSNPids <- function(myids, verbose=FALSE) + { + ans_seqnames <- character(length(myids)) + ans_seqnames[] <- "unknown" + ans_locs <- integer(length(myids)) + for (seqname in names(getSNPcount())) { + if (verbose) + cat("Processing ", seqname, " SNPs ...\n", sep="") + locs <- getSNPlocs(seqname) + ids <- paste("rs", locs$RefSNP_id, sep="") + myrows <- match(myids, ids) + hit_idx <- !is.na(myrows) + ans_seqnames[hit_idx] <- seqname + ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]] + } + GRanges(seqnames=ans_seqnames, + IRanges(start=ans_locs, width=1), + RefSNP_id=myids) + } > > > #Test en utilisant les premi?res sondes du premier et second chormosome > #myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") > > #ouverture du fichier pour aller chercher nos num?ros rs > rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE) > myids <- rs_SNPs[,1] > > mysnps <- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row GRanges with 348411 ranges and 1 elementMetadata value: seqnames ranges strand | RefSNP_id <rle> <iranges> <rle> | <factor> [1] ch1 [2195117, 2195117] * | rs7547453 [2] ch1 [2291680, 2291680] * | rs2840542 [3] ch1 [3256108, 3256108] * | rs1999527 [4] ch1 [3577321, 3577321] * | rs4648545 [5] ch1 [4230463, 4230463] * | rs10915459 [6] ch1 [4404344, 4404344] * | rs16838750 [7] ch1 [4501911, 4501911] * | rs12128230 [8] ch1 [4535148, 4535148] * | rs7541288 [9] ch1 [4581230, 4581230] * | rs12039682 ... ... ... ... ... ... [348403] chX [154514047, 154514047] * | rs499428 [348404] chX [154514919, 154514919] * | rs507127 [348405] chX [154737376, 154737376] * | rs5940372 [348406] chX [154780283, 154780283] * | rs6642287 [348407] chX [154830377, 154830377] * | rs5983658 [348408] chX [154870197, 154870197] * | rs473772 [348409] chX [154892230, 154892230] * | rs553678 [348410] chX [154899846, 154899846] * | rs473491 [348411] chX [154929412, 154929412] * | rs557132 --- seqlengths: ch1 ch10 ch11 ch12 ch13 ... ch8 ch9 chX unknown NA NA NA NA NA ... NA NA NA NA > > #create a TranscriptDb > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") Download the refGene table ... OK Download the refLink table ... OK Extract the 'transcripts' data frame ... OK Extract the 'splicings' data frame ... OK Download and preprocess the 'chrominfo' data frame ... OK Prepare the 'metadata' data frame ... OK Make the TranscriptDb object ... OK There were 50 or more warnings (use warnings() to see the first 50) > txdb TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: hg19 | Genus and Species: Homo sapiens | UCSC Table: refGene | Resource URL: http://genome.ucsc.edu/ | Type of Gene ID: Entrez Gene ID | Full dataset: yes | transcript_nrow: 41677 | exon_nrow: 235596 | cds_nrow: 206518 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012) | GenomicFeatures version at creation time: 1.6.7 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 | package: GenomicFeatures > > #extract the transcript locations together with their genes > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row GRanges with 41677 ranges and 3 elementMetadata values: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr1 [ 11874, 14408] + | 1127 NR_046018 [2] chr1 [ 69091, 70008] + | 1128 NM_001005484 [3] chr1 [323892, 328581] + | 1130 NR_028327 [4] chr1 [323892, 328581] + | 1132 NR_028325 [5] chr1 [323892, 328581] + | 1133 NR_028322 [6] chr1 [367659, 368597] + | 1131 NM_001005221 [7] chr1 [367659, 368597] + | 1134 NM_001005224 [8] chr1 [367659, 368597] + | 1135 NM_001005277 [9] chr1 [763064, 789740] + | 198 NR_015368 ... ... ... ... ... ... ... [41669] chrY [27177050, 27198251] - | 20790 NM_004678 [41670] chrY [27177050, 27198251] - | 20793 NM_001002761 [41671] chrY [27177050, 27198251] - | 20794 NM_001002760 [41672] chrY [27209230, 27246039] - | 20791 NR_002177 [41673] chrY [27209230, 27246039] - | 20792 NR_002178 [41674] chrY [27209230, 27246039] - | 20795 NR_001525 [41675] chrY [27329790, 27330920] - | 20796 NR_002179 [41676] chrY [27329790, 27330920] - | 20797 NR_002180 [41677] chrY [27329790, 27330920] - | 20798 NR_001526 gene_id <compressedcharacterlist> [1] 100287102 [2] 79501 [3] 100133331 [4] 100132062 [5] 100132287 [6] 729759 [7] 26683 [8] 81399 [9] 643837 ... ... [41669] 9083 [41670] 442868 [41671] 442867 [41672] 474150 [41673] 474149 [41674] 114761 [41675] 474152 [41676] 474151 [41677] 252949 --- seqlengths: chr1 chr2 ... chr18_gl000207_random 249250621 243199373 ... 4262 > > #rename chromosome to fit USCS standard > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) : levels of supplied 'seqnames' must be identical to 'seqlevels(x)' > #v?rifier pour X/Y -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps)) > > #mapping but not on a readable format > map <- as.matrix(findOverlaps(mysnps, tx)) Warning message: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated] > > > #making the mapping readable > mapped_genes <- values(tx)$gene_id[map[, 2]] > mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) > snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene) <- NULL > snp2gene [1] SNPNAME gene_id <0 rows> (or 0-length row.names) > > > #snp2gene se travaille mal alors on le transf?re en matrice > snp2geneTmp = t(t(snp2gene)) > > #aller chercher les symboles correspondants ? nos gene id > symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) : error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) : keys must be supplied in a character vector with no NAs > > > save.image(file = "map.RData") > Simon No?l CdeC ________________________________________ De : Hervé Pagès [hpages at fhcrc.org] Date d'envoi : 5 d?cembre 2010 23:43 ? : Simon No?l Cc : bioconductor at r-project.org Objet : Re: [BioC] maping SNPs Hi Simon, On 12/03/2010 10:17 AM, Simon No?l wrote: > > Hi, > > > > I have a really big list of SNPs names like : > > > > SNPNAME > > rs7547453 > > rs2840542 > > rs1999527 > > rs4648545 > > rs10915459 > > rs16838750 > > rs12128230 > > ... > > > > I woudlike to map them to their official gene symbol. What the best way to > procede? Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings from SNPs to genes and I don't think we have this kind of mappings either in our collection of annotations (*.db packages). But if your SNPs are Human then you can do the mapping yourself by using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures packages. The latest SNPlocs.Hsapies.dbSNP.* package is SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains SNP locations relative to the GRCh37 genome: > library(SNPlocs.Hsapiens.dbSNP.20101109) > getSNPcount() ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 1158707 ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 1147722 1105364 815729 740129 657719 757926 641905 645646 520666 586708 ch21 ch22 chX chY chMT 338254 331060 529608 67438 624 Note that it doesn't contain *all* SNPs from dbSNP Build 132: only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109 for the details). > ch22snps <- getSNPlocs("ch22") > ch22snps[1:5, ] RefSNP_id alleles_as_ambig loc 1 56342815 K 16050353 2 7288968 S 16050994 3 6518357 M 16051107 4 7292503 R 16051209 5 6518368 Y 16051241 Note that the rs prefix has been dropped. So here is how to proceed: First you can use the following function to make a GRanges object from your SNP ids: makeGRangesFromRefSNPids <- function(myids) { ans_seqnames <- character(length(myids)) ans_seqnames[] <- "unknown" ans_locs <- integer(length(myids)) for (seqname in names(getSNPcount())) { locs <- getSNPlocs(seqname) ids <- paste("rs", locs$RefSNP_id, sep="") myrows <- match(myids, ids) ans_seqnames[!is.na(myrows)] <- seqname ans_locs[!is.na(myrows)] <- locs$loc[myrows] } GRanges(seqnames=ans_seqnames, IRanges(start=ans_locs, width=1), RefSNP_id=myids) } This takes between 3 and 5 minutes: > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs10915459", "rs16838750", "rs12128230", "rs999999999") > mysnps <- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row GRanges with 8 ranges and 1 elementMetadata value seqnames ranges strand | myids <rle> <iranges> <rle> | <character> [1] ch1 [2195117, 2195117] * | rs7547453 [2] ch1 [2291680, 2291680] * | rs2840542 [3] ch1 [3256108, 3256108] * | rs1999527 [4] ch1 [3577321, 3577321] * | rs4648545 [5] ch1 [4230463, 4230463] * | rs10915459 [6] ch1 [4404344, 4404344] * | rs16838750 [7] ch1 [4501911, 4501911] * | rs12128230 [8] unknown [ 0, 0] * | rs999999999 seqlengths ch1 unknown NA NA The next step is to create a TranscriptDb object with makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart() from the GenomicFeatures package. This TranscriptDb object will contain the transcript locations and their associated genes extracted from the annotation source you choose. For example, if you want to use RefSeq genes: ## Takes about 3 minutes: > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > txdb TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: hg19 | UCSC Table: refGene | Type of Gene ID: Entrez Gene ID | Full dataset: yes | transcript_nrow: 37924 | exon_nrow: 230024 | cds_nrow: 204571 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010) | GenomicFeatures version at creation time: 1.2.2 | RSQLite version at creation time: 0.9-4 | DBSCHEMAVERSION: 1.0 Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb object: this means that later you will be able to use the org.Hs.eg.db package to map your gene ids to their symbol (the org.*.eg.db packages are Entrez Gene ID centric). To extract the transcript locations together with their genes: > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row GRanges with 37924 ranges and 1 elementMetadata value seqnames ranges strand | gene_id <rle> <iranges> <rle> | <compressedcharacterlist> [1] chr1 [ 69091, 70008] + | 79501 [2] chr1 [323892, 328581] + | 100133331 [3] chr1 [323892, 328581] + | 100132287 [4] chr1 [323892, 328581] + | 100132062 [5] chr1 [367659, 368597] + | 81399 [6] chr1 [367659, 368597] + | 729759 [7] chr1 [367659, 368597] + | 26683 [8] chr1 [763064, 789740] + | 643837 [9] chr1 [861121, 879961] + | 148398 ... ... ... ... ... ... [37916] chrY [27177050, 27198251] - | 9083 [37917] chrY [27177050, 27198251] - | 442867 [37918] chrY [27177050, 27198251] - | 442868 [37919] chrY [27209230, 27246039] - | 114761 [37920] chrY [27209230, 27246039] - | 474150 [37921] chrY [27209230, 27246039] - | 474149 [37922] chrY [27329790, 27330920] - | 252949 [37923] chrY [27329790, 27330920] - | 474152 [37924] chrY [27329790, 27330920] - | 474151 seqlengths chr1 chr2 ... chr18_gl000207_random 249250621 243199373 ... 4262 Now you can use findOverlaps() on 'mysnps' and 'tx' to find the transcripts hits by your snps. But before you can do this, you need to rename the sequences in 'mysnps' because dbSNPs and UCSC use different naming conventions for the chromosomes: > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Then: > map <- as.matrix(findOverlaps(mysnps, tx)) 'map' contains the mapping between your SNPs and their genes but not in a readable form (this matrix contains indices) so we make the 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for the associated gene ids: > mapped_genes <- values(tx)$gene_id[map[, 2]] > mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]], elementLengths(mapped_genes)) > snp2gene <- unique(data.frame(snp_id=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene) <- NULL > snp2gene[1:4, ] snp_id gene_id 1 rs7547453 6497 2 rs2840542 79906 3 rs1999527 63976 4 rs4648545 7161 Note that there is no guarantee that the number of rows in this data frame is the number of your original SNP ids because the relation between SNP ids and gene ids is of course not one-to-one. Also the method described above considers that a SNP hits a gene if it's located between the start and the end of one of its transcripts but it doesn't take in account the exon structure of the transcripts. If you want to do this you need to use exonsBy() (from GenomicFeatures) to extract the exons grouped by transcripts (this will be stored in a GRangesList object) and use this object instead of 'tx' in the call to findOverlaps(). Hope this helps, H. > > > > Simon No??l > CdeC > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

SNP Annotation Cancer SNPlocs TranscriptDb GenomicFeatures GenomicRanges SNP Annotation • 1.8k views

ADD COMMENT • link updated 12.3 years ago by Hervé Pagès 16k • written 12.3 years ago by SimonNoël ▴ 450

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 9 hours ago

Seattle, WA, United States

Hi Simon, On 02/09/2012 12:28 PM, Simon No?l wrote: > Hello Mr. Pag?s, > > At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow. As you remember, with your help we changed it a little bit and we have got with R2.10 : > > library("IRanges") > library("GenomicRanges") > library("GenomicFeatures") > #? changer si une version plus r?cente de la librairie est t?l?charg?e. > library(SNPlocs.Hsapiens.dbSNP.20101109) > library("org.Hs.eg.db") > > #Allocation de la m?moire sous windows > memory.limit(size = 4095) > #v?rification de la librairie SNPlocs.Hsapiens.dbSNP > getSNPcount() > ch22snps<- getSNPlocs("ch22") > ch22snps[1:5, ] > > #Create a GRange objetc to use with GenomicRanges library > makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > if (verbose) > cat("Processing ", seqname, " SNPs ...\n", sep="") > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > hit_idx<- !is.na(myrows) > ans_seqnames[hit_idx]<- seqname > ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > #Test en utilisant les premi?res sondes du premier et second chormosome > #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") > #ouverture du fichier pour aller chercher nos num?ros rs > rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) > myids<- rs_SNPs[,1] > mysnps<- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row > #create a TranscriptDb > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > txdb > #extract the transcript locations together with their genes > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row > #rename chromosome to fit USCS standard > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) > #mapping but not on a readable format > map<- as.matrix(findOverlaps(mysnps, tx)) > > #making the mapping readable > mapped_genes<- values(tx)$gene_id[map[, 2]] > mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) > snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene)<- NULL > snp2gene > > #snp2gene se travaille mal alors on le transf?re en matrice > snp2geneTmp = t(t(snp2gene)) > #aller chercher les symboles correspondants ? nos gene id > symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > > save.image(file = "map.RData") > > > > > > And everything was working perfectly. > > Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping. I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error... You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which means a lot of things could have changed. You should not assume that your problems are caused only because you are using a more recent SNPlocs package. > Can you help me? The problems seems to start with "map<- as.matrix(findOverlaps(mysnps, tx))" The problem starts earlier with: > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) : levels of supplied 'seqnames' must be identical to 'seqlevels(x)' The right way to do this with recent version of GenomicRanges is to use seqlevels() instead of seqnames(). > and the other error seems to result from that problem. Failing to rename the seqlevels will surely cause you some troubles later down so I would try to address this issue first see if that solves the other problems. Also note that with recent versions of the SNPlocs packages (i.e. version >= 0.99.6), you can use rsidsToGRanges() to do what your home made makeGRangesFromRefSNPids() function does. The latter is much faster BUT, unlike the former, it will fail if some of the supplied rs ids are not found in the SNPlocs package (it will issue an error showing the list of rs ids that could not be found). I've already received some request to improve this so I'll try to work on this soon. Cheers, H. > > > > sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: x86_64-pc-linux-gnu (64-bit) > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] org.Hs.eg.db_2.6.4 > [2] RSQLite_0.11.1 > [3] DBI_0.2-5 > [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 > [5] GenomicFeatures_1.6.7 > [6] AnnotationDbi_1.16.11 > [7] Biobase_2.14.0 > [8] GenomicRanges_1.6.7 > [9] IRanges_1.12.6 > loaded via a namespace (and not attached): > [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0 RCurl_1.9-5 > [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4 zlibbioc_1.0.0 > >> library("IRanges") > Attaching package: ?IRanges? > The following object(s) are masked from ?package:base?: > cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int, > pmin, pmin.int, rbind, rep.int, setdiff, table, union >> library("GenomicRanges") >> library("GenomicFeatures") > Loading required package: AnnotationDbi > Loading required package: Biobase > Welcome to Bioconductor > Vignettes contain introductory material. To view, type > 'browseVignettes()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation("pkgname")'. > > Attaching package: ?Biobase? > The following object(s) are masked from ?package:IRanges?: > updateObject >> #? changer si une version plus r?cente de la librairie est t?l?charg?e. >> library(SNPlocs.Hsapiens.dbSNP.20110815) >> library("org.Hs.eg.db") > Loading required package: DBI >> >> >> #Allocation de la m?moire sous windows >> memory.limit(size = 4095) > [1] Inf > Warning message: > 'memory.limit()' is Windows-specific >> >> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP >> getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 > 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 > 1542256 1521919 1104719 1031214 949642 1084538 917737 886293 732039 788556 > ch21 ch22 chX chY chMT > 468379 454939 920890 75108 942 >> ch22snps<- getSNPlocs("ch22") >> ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 149201999 Y 16050408 > 3 146752890 S 16050612 > 4 139377059 Y 16050678 > 5 143300205 R 16050822 >> >> #########################? FAIRE CANOPUS################### >> >> #Create a GRange objetc to use with GenomicRanges library >> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > + { > + ans_seqnames<- character(length(myids)) > + ans_seqnames[]<- "unknown" > + ans_locs<- integer(length(myids)) > + for (seqname in names(getSNPcount())) { > + if (verbose) > + cat("Processing ", seqname, " SNPs ...\n", sep="") > + locs<- getSNPlocs(seqname) > + ids<- paste("rs", locs$RefSNP_id, sep="") > + myrows<- match(myids, ids) > + hit_idx<- !is.na(myrows) > + ans_seqnames[hit_idx]<- seqname > + ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > + } > + GRanges(seqnames=ans_seqnames, > + IRanges(start=ans_locs, width=1), > + RefSNP_id=myids) > + } >> >> >> #Test en utilisant les premi?res sondes du premier et second chormosome >> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") >> >> #ouverture du fichier pour aller chercher nos num?ros rs >> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) >> myids<- rs_SNPs[,1] >> >> mysnps<- makeGRangesFromRefSNPids(myids) >> mysnps # a GRanges object with 1 SNP per row > GRanges with 348411 ranges and 1 elementMetadata value: > seqnames ranges strand | RefSNP_id > <rle> <iranges> <rle> |<factor> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] ch1 [4535148, 4535148] * | rs7541288 > [9] ch1 [4581230, 4581230] * | rs12039682 > ... ... ... ... ... ... > [348403] chX [154514047, 154514047] * | rs499428 > [348404] chX [154514919, 154514919] * | rs507127 > [348405] chX [154737376, 154737376] * | rs5940372 > [348406] chX [154780283, 154780283] * | rs6642287 > [348407] chX [154830377, 154830377] * | rs5983658 > [348408] chX [154870197, 154870197] * | rs473772 > [348409] chX [154892230, 154892230] * | rs553678 > [348410] chX [154899846, 154899846] * | rs473491 > [348411] chX [154929412, 154929412] * | rs557132 > --- > seqlengths: > ch1 ch10 ch11 ch12 ch13 ... ch8 ch9 chX unknown > NA NA NA NA NA ... NA NA NA NA >> >> #create a TranscriptDb >> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > Download the refGene table ... OK > Download the refLink table ... OK > Extract the 'transcripts' data frame ... OK > Extract the 'splicings' data frame ... OK > Download and preprocess the 'chrominfo' data frame ... OK > Prepare the 'metadata' data frame ... OK > Make the TranscriptDb object ... OK > There were 50 or more warnings (use warnings() to see the first 50) >> txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | Genus and Species: Homo sapiens > | UCSC Table: refGene > | Resource URL: http://genome.ucsc.edu/ > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 41677 > | exon_nrow: 235596 > | cds_nrow: 206518 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012) > | GenomicFeatures version at creation time: 1.6.7 > | RSQLite version at creation time: 0.11.1 > | DBSCHEMAVERSION: 1.0 > | package: GenomicFeatures >> >> #extract the transcript locations together with their genes >> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) >> tx # a GRanges object with 1 transcript per row > GRanges with 41677 ranges and 3 elementMetadata values: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> |<integer> <character> > [1] chr1 [ 11874, 14408] + | 1127 NR_046018 > [2] chr1 [ 69091, 70008] + | 1128 NM_001005484 > [3] chr1 [323892, 328581] + | 1130 NR_028327 > [4] chr1 [323892, 328581] + | 1132 NR_028325 > [5] chr1 [323892, 328581] + | 1133 NR_028322 > [6] chr1 [367659, 368597] + | 1131 NM_001005221 > [7] chr1 [367659, 368597] + | 1134 NM_001005224 > [8] chr1 [367659, 368597] + | 1135 NM_001005277 > [9] chr1 [763064, 789740] + | 198 NR_015368 > ... ... ... ... ... ... ... > [41669] chrY [27177050, 27198251] - | 20790 NM_004678 > [41670] chrY [27177050, 27198251] - | 20793 NM_001002761 > [41671] chrY [27177050, 27198251] - | 20794 NM_001002760 > [41672] chrY [27209230, 27246039] - | 20791 NR_002177 > [41673] chrY [27209230, 27246039] - | 20792 NR_002178 > [41674] chrY [27209230, 27246039] - | 20795 NR_001525 > [41675] chrY [27329790, 27330920] - | 20796 NR_002179 > [41676] chrY [27329790, 27330920] - | 20797 NR_002180 > [41677] chrY [27329790, 27330920] - | 20798 NR_001526 > gene_id > <compressedcharacterlist> > [1] 100287102 > [2] 79501 > [3] 100133331 > [4] 100132062 > [5] 100132287 > [6] 729759 > [7] 26683 > [8] 81399 > [9] 643837 > ... ... > [41669] 9083 > [41670] 442868 > [41671] 442867 > [41672] 474150 > [41673] 474149 > [41674] 114761 > [41675] 474152 > [41676] 474151 > [41677] 252949 > --- > seqlengths: > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 >> >> #rename chromosome to fit USCS standard >> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > Error in `seqnames<-`(`*tmp*`, value =<s4 object="" of="" class="" "rle"="">) : > levels of supplied 'seqnames' must be identical to 'seqlevels(x)' >> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) >> >> #mapping but not on a readable format >> map<- as.matrix(findOverlaps(mysnps, tx)) > Warning message: > In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown > - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated] >> >> >> #making the mapping readable >> mapped_genes<- values(tx)$gene_id[map[, 2]] >> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) >> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) >> rownames(snp2gene)<- NULL >> snp2gene > [1] SNPNAME gene_id > <0 rows> (or 0-length row.names) >> >> >> #snp2gene se travaille mal alors on le transf?re en matrice >> snp2geneTmp = t(t(snp2gene)) >> >> #aller chercher les symboles correspondants ? nos gene id >> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) : > error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) : > keys must be supplied in a character vector with no NAs >> >> >> save.image(file = "map.RData") >> > > > > > > Simon No?l > CdeC > ________________________________________ > De : Hervé Pagès [hpages at fhcrc.org] > Date d'envoi : 5 d?cembre 2010 23:43 > ? : Simon No?l > Cc : bioconductor at r-project.org > Objet : Re: [BioC] maping SNPs > > Hi Simon, > > On 12/03/2010 10:17 AM, Simon No?l wrote: >> >> Hi, >> >> >> >> I have a really big list of SNPs names like : >> >> >> >> SNPNAME >> >> rs7547453 >> >> rs2840542 >> >> rs1999527 >> >> rs4648545 >> >> rs10915459 >> >> rs16838750 >> >> rs12128230 >> >> ... >> >> >> >> I woudlike to map them to their official gene symbol. What the best way to >> procede? > > Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings > from SNPs to genes and I don't think we have this kind of mappings > either in our collection of annotations (*.db packages). > > But if your SNPs are Human then you can do the mapping yourself by > using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures > packages. > > The latest SNPlocs.Hsapies.dbSNP.* package is > SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains > SNP locations relative to the GRCh37 genome: > > > library(SNPlocs.Hsapiens.dbSNP.20101109) > > getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 > ch10 > 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 > 1158707 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 > ch20 > 1147722 1105364 815729 740129 657719 757926 641905 645646 520666 > 586708 > ch21 ch22 chX chY chMT > 338254 331060 529608 67438 624 > > Note that it doesn't contain *all* SNPs from dbSNP Build 132: > only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109 > for the details). > > > ch22snps<- getSNPlocs("ch22") > > ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 7288968 S 16050994 > 3 6518357 M 16051107 > 4 7292503 R 16051209 > 5 6518368 Y 16051241 > > Note that the rs prefix has been dropped. > > So here is how to proceed: > > First you can use the following function to make a GRanges object from > your SNP ids: > > makeGRangesFromRefSNPids<- function(myids) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > ans_seqnames[!is.na(myrows)]<- seqname > ans_locs[!is.na(myrows)]<- locs$loc[myrows] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > This takes between 3 and 5 minutes: > > > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", > "rs10915459", "rs16838750", "rs12128230", "rs999999999") > > mysnps<- makeGRangesFromRefSNPids(myids) > > mysnps # a GRanges object with 1 SNP per row > GRanges with 8 ranges and 1 elementMetadata value > seqnames ranges strand | myids > <rle> <iranges> <rle> |<character> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] unknown [ 0, 0] * | rs999999999 > > seqlengths > ch1 unknown > NA NA > > The next step is to create a TranscriptDb object with > makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart() > from the GenomicFeatures package. This TranscriptDb object will > contain the transcript locations and their associated > genes extracted from the annotation source you choose. > For example, if you want to use RefSeq genes: > > ## Takes about 3 minutes: > > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > > txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | UCSC Table: refGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 37924 > | exon_nrow: 230024 > | cds_nrow: 204571 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010) > | GenomicFeatures version at creation time: 1.2.2 > | RSQLite version at creation time: 0.9-4 > | DBSCHEMAVERSION: 1.0 > > Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb > object: this means that later you will be able to use the org.Hs.eg.db > package to map your gene ids to their symbol (the org.*.eg.db packages > are Entrez Gene ID centric). > > To extract the transcript locations together with their genes: > > > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > > tx # a GRanges object with 1 transcript per row > GRanges with 37924 ranges and 1 elementMetadata value > seqnames ranges strand | gene_id > <rle> <iranges> <rle> |<compressedcharacterlist> > [1] chr1 [ 69091, 70008] + | 79501 > [2] chr1 [323892, 328581] + | 100133331 > [3] chr1 [323892, 328581] + | 100132287 > [4] chr1 [323892, 328581] + | 100132062 > [5] chr1 [367659, 368597] + | 81399 > [6] chr1 [367659, 368597] + | 729759 > [7] chr1 [367659, 368597] + | 26683 > [8] chr1 [763064, 789740] + | 643837 > [9] chr1 [861121, 879961] + | 148398 > ... ... ... ... ... ... > [37916] chrY [27177050, 27198251] - | 9083 > [37917] chrY [27177050, 27198251] - | 442867 > [37918] chrY [27177050, 27198251] - | 442868 > [37919] chrY [27209230, 27246039] - | 114761 > [37920] chrY [27209230, 27246039] - | 474150 > [37921] chrY [27209230, 27246039] - | 474149 > [37922] chrY [27329790, 27330920] - | 252949 > [37923] chrY [27329790, 27330920] - | 474152 > [37924] chrY [27329790, 27330920] - | 474151 > > seqlengths > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 > > Now you can use findOverlaps() on 'mysnps' and 'tx' to find > the transcripts hits by your snps. But before you can do this, > you need to rename the sequences in 'mysnps' because dbSNPs and > UCSC use different naming conventions for the chromosomes: > > > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > > Then: > > > map<- as.matrix(findOverlaps(mysnps, tx)) > > 'map' contains the mapping between your SNPs and their genes but not > in a readable form (this matrix contains indices) so we make the > 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for > the associated gene ids: > > > mapped_genes<- values(tx)$gene_id[map[, 2]] > > mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]], > elementLengths(mapped_genes)) > > snp2gene<- unique(data.frame(snp_id=mapped_snps, > gene_id=unlist(mapped_genes))) > > rownames(snp2gene)<- NULL > > snp2gene[1:4, ] > snp_id gene_id > 1 rs7547453 6497 > 2 rs2840542 79906 > 3 rs1999527 63976 > 4 rs4648545 7161 > > Note that there is no guarantee that the number of rows in this > data frame is the number of your original SNP ids because the > relation between SNP ids and gene ids is of course not one-to-one. > > Also the method described above considers that a SNP hits a gene > if it's located between the start and the end of one of its > transcripts but it doesn't take in account the exon structure of > the transcripts. If you want to do this you need to use exonsBy() > (from GenomicFeatures) to extract the exons grouped by transcripts > (this will be stored in a GRangesList object) and use this object > instead of 'tx' in the call to findOverlaps(). > > Hope this helps, > H. > > >> >> >> >> Simon No??l >> CdeC >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 12.3 years ago Hervé Pagès 16k

0

Entering edit mode

Thank's alot. It seem to work now:) But for the home made makeGRangesFromRefSNPids, it's not my work. You gave me that function ;) Simon No?l CdeC ________________________________________ De : Hervé Pagès [hpages at fhcrc.org] Date d'envoi : 9 f?vrier 2012 16:34 ? : Simon No?l Cc : bioconductor at r-project.org Objet : Re: RE : [BioC] maping SNPs Hi Simon, On 02/09/2012 12:28 PM, Simon No?l wrote: > Hello Mr. Pag?s, > > At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow. As you remember, with your help we changed it a little bit and we have got with R2.10 : > > library("IRanges") > library("GenomicRanges") > library("GenomicFeatures") > #? changer si une version plus r?cente de la librairie est t?l?charg?e. > library(SNPlocs.Hsapiens.dbSNP.20101109) > library("org.Hs.eg.db") > > #Allocation de la m?moire sous windows > memory.limit(size = 4095) > #v?rification de la librairie SNPlocs.Hsapiens.dbSNP > getSNPcount() > ch22snps<- getSNPlocs("ch22") > ch22snps[1:5, ] > > #Create a GRange objetc to use with GenomicRanges library > makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > if (verbose) > cat("Processing ", seqname, " SNPs ...\n", sep="") > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > hit_idx<- !is.na(myrows) > ans_seqnames[hit_idx]<- seqname > ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > #Test en utilisant les premi?res sondes du premier et second chormosome > #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") > #ouverture du fichier pour aller chercher nos num?ros rs > rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) > myids<- rs_SNPs[,1] > mysnps<- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row > #create a TranscriptDb > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > txdb > #extract the transcript locations together with their genes > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row > #rename chromosome to fit USCS standard > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) > #mapping but not on a readable format > map<- as.matrix(findOverlaps(mysnps, tx)) > > #making the mapping readable > mapped_genes<- values(tx)$gene_id[map[, 2]] > mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) > snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene)<- NULL > snp2gene > > #snp2gene se travaille mal alors on le transf?re en matrice > snp2geneTmp = t(t(snp2gene)) > #aller chercher les symboles correspondants ? nos gene id > symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > > save.image(file = "map.RData") > > > > > > And everything was working perfectly. > > Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping. I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error... You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which means a lot of things could have changed. You should not assume that your problems are caused only because you are using a more recent SNPlocs package. > Can you help me? The problems seems to start with "map<- as.matrix(findOverlaps(mysnps, tx))" The problem starts earlier with: > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) : levels of supplied 'seqnames' must be identical to 'seqlevels(x)' The right way to do this with recent version of GenomicRanges is to use seqlevels() instead of seqnames(). > and the other error seems to result from that problem. Failing to rename the seqlevels will surely cause you some troubles later down so I would try to address this issue first see if that solves the other problems. Also note that with recent versions of the SNPlocs packages (i.e. version >= 0.99.6), you can use rsidsToGRanges() to do what your home made makeGRangesFromRefSNPids() function does. The latter is much faster BUT, unlike the former, it will fail if some of the supplied rs ids are not found in the SNPlocs package (it will issue an error showing the list of rs ids that could not be found). I've already received some request to improve this so I'll try to work on this soon. Cheers, H. > > > > sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: x86_64-pc-linux-gnu (64-bit) > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] org.Hs.eg.db_2.6.4 > [2] RSQLite_0.11.1 > [3] DBI_0.2-5 > [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 > [5] GenomicFeatures_1.6.7 > [6] AnnotationDbi_1.16.11 > [7] Biobase_2.14.0 > [8] GenomicRanges_1.6.7 > [9] IRanges_1.12.6 > loaded via a namespace (and not attached): > [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0 RCurl_1.9-5 > [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4 zlibbioc_1.0.0 > >> library("IRanges") > Attaching package: ?IRanges? > The following object(s) are masked from ?package:base?: > cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int, > pmin, pmin.int, rbind, rep.int, setdiff, table, union >> library("GenomicRanges") >> library("GenomicFeatures") > Loading required package: AnnotationDbi > Loading required package: Biobase > Welcome to Bioconductor > Vignettes contain introductory material. To view, type > 'browseVignettes()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation("pkgname")'. > > Attaching package: ?Biobase? > The following object(s) are masked from ?package:IRanges?: > updateObject >> #? changer si une version plus r?cente de la librairie est t?l?charg?e. >> library(SNPlocs.Hsapiens.dbSNP.20110815) >> library("org.Hs.eg.db") > Loading required package: DBI >> >> >> #Allocation de la m?moire sous windows >> memory.limit(size = 4095) > [1] Inf > Warning message: > 'memory.limit()' is Windows-specific >> >> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP >> getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 > 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 > 1542256 1521919 1104719 1031214 949642 1084538 917737 886293 732039 788556 > ch21 ch22 chX chY chMT > 468379 454939 920890 75108 942 >> ch22snps<- getSNPlocs("ch22") >> ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 149201999 Y 16050408 > 3 146752890 S 16050612 > 4 139377059 Y 16050678 > 5 143300205 R 16050822 >> >> #########################? FAIRE CANOPUS################### >> >> #Create a GRange objetc to use with GenomicRanges library >> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > + { > + ans_seqnames<- character(length(myids)) > + ans_seqnames[]<- "unknown" > + ans_locs<- integer(length(myids)) > + for (seqname in names(getSNPcount())) { > + if (verbose) > + cat("Processing ", seqname, " SNPs ...\n", sep="") > + locs<- getSNPlocs(seqname) > + ids<- paste("rs", locs$RefSNP_id, sep="") > + myrows<- match(myids, ids) > + hit_idx<- !is.na(myrows) > + ans_seqnames[hit_idx]<- seqname > + ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > + } > + GRanges(seqnames=ans_seqnames, > + IRanges(start=ans_locs, width=1), > + RefSNP_id=myids) > + } >> >> >> #Test en utilisant les premi?res sondes du premier et second chormosome >> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") >> >> #ouverture du fichier pour aller chercher nos num?ros rs >> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) >> myids<- rs_SNPs[,1] >> >> mysnps<- makeGRangesFromRefSNPids(myids) >> mysnps # a GRanges object with 1 SNP per row > GRanges with 348411 ranges and 1 elementMetadata value: > seqnames ranges strand | RefSNP_id > <rle> <iranges> <rle> |<factor> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] ch1 [4535148, 4535148] * | rs7541288 > [9] ch1 [4581230, 4581230] * | rs12039682 > ... ... ... ... ... ... > [348403] chX [154514047, 154514047] * | rs499428 > [348404] chX [154514919, 154514919] * | rs507127 > [348405] chX [154737376, 154737376] * | rs5940372 > [348406] chX [154780283, 154780283] * | rs6642287 > [348407] chX [154830377, 154830377] * | rs5983658 > [348408] chX [154870197, 154870197] * | rs473772 > [348409] chX [154892230, 154892230] * | rs553678 > [348410] chX [154899846, 154899846] * | rs473491 > [348411] chX [154929412, 154929412] * | rs557132 > --- > seqlengths: > ch1 ch10 ch11 ch12 ch13 ... ch8 ch9 chX unknown > NA NA NA NA NA ... NA NA NA NA >> >> #create a TranscriptDb >> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > Download the refGene table ... OK > Download the refLink table ... OK > Extract the 'transcripts' data frame ... OK > Extract the 'splicings' data frame ... OK > Download and preprocess the 'chrominfo' data frame ... OK > Prepare the 'metadata' data frame ... OK > Make the TranscriptDb object ... OK > There were 50 or more warnings (use warnings() to see the first 50) >> txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | Genus and Species: Homo sapiens > | UCSC Table: refGene > | Resource URL: http://genome.ucsc.edu/ > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 41677 > | exon_nrow: 235596 > | cds_nrow: 206518 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012) > | GenomicFeatures version at creation time: 1.6.7 > | RSQLite version at creation time: 0.11.1 > | DBSCHEMAVERSION: 1.0 > | package: GenomicFeatures >> >> #extract the transcript locations together with their genes >> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) >> tx # a GRanges object with 1 transcript per row > GRanges with 41677 ranges and 3 elementMetadata values: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> |<integer> <character> > [1] chr1 [ 11874, 14408] + | 1127 NR_046018 > [2] chr1 [ 69091, 70008] + | 1128 NM_001005484 > [3] chr1 [323892, 328581] + | 1130 NR_028327 > [4] chr1 [323892, 328581] + | 1132 NR_028325 > [5] chr1 [323892, 328581] + | 1133 NR_028322 > [6] chr1 [367659, 368597] + | 1131 NM_001005221 > [7] chr1 [367659, 368597] + | 1134 NM_001005224 > [8] chr1 [367659, 368597] + | 1135 NM_001005277 > [9] chr1 [763064, 789740] + | 198 NR_015368 > ... ... ... ... ... ... ... > [41669] chrY [27177050, 27198251] - | 20790 NM_004678 > [41670] chrY [27177050, 27198251] - | 20793 NM_001002761 > [41671] chrY [27177050, 27198251] - | 20794 NM_001002760 > [41672] chrY [27209230, 27246039] - | 20791 NR_002177 > [41673] chrY [27209230, 27246039] - | 20792 NR_002178 > [41674] chrY [27209230, 27246039] - | 20795 NR_001525 > [41675] chrY [27329790, 27330920] - | 20796 NR_002179 > [41676] chrY [27329790, 27330920] - | 20797 NR_002180 > [41677] chrY [27329790, 27330920] - | 20798 NR_001526 > gene_id > <compressedcharacterlist> > [1] 100287102 > [2] 79501 > [3] 100133331 > [4] 100132062 > [5] 100132287 > [6] 729759 > [7] 26683 > [8] 81399 > [9] 643837 > ... ... > [41669] 9083 > [41670] 442868 > [41671] 442867 > [41672] 474150 > [41673] 474149 > [41674] 114761 > [41675] 474152 > [41676] 474151 > [41677] 252949 > --- > seqlengths: > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 >> >> #rename chromosome to fit USCS standard >> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > Error in `seqnames<-`(`*tmp*`, value =<s4 object="" of="" class="" "rle"="">) : > levels of supplied 'seqnames' must be identical to 'seqlevels(x)' >> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) >> >> #mapping but not on a readable format >> map<- as.matrix(findOverlaps(mysnps, tx)) > Warning message: > In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown > - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated] >> >> >> #making the mapping readable >> mapped_genes<- values(tx)$gene_id[map[, 2]] >> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) >> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) >> rownames(snp2gene)<- NULL >> snp2gene > [1] SNPNAME gene_id > <0 rows> (or 0-length row.names) >> >> >> #snp2gene se travaille mal alors on le transf?re en matrice >> snp2geneTmp = t(t(snp2gene)) >> >> #aller chercher les symboles correspondants ? nos gene id >> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) : > error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) : > keys must be supplied in a character vector with no NAs >> >> >> save.image(file = "map.RData") >> > > > > > > Simon No?l > CdeC > ________________________________________ > De : Hervé Pagès [hpages at fhcrc.org] > Date d'envoi : 5 d?cembre 2010 23:43 > ? : Simon No?l > Cc : bioconductor at r-project.org > Objet : Re: [BioC] maping SNPs > > Hi Simon, > > On 12/03/2010 10:17 AM, Simon No?l wrote: >> >> Hi, >> >> >> >> I have a really big list of SNPs names like : >> >> >> >> SNPNAME >> >> rs7547453 >> >> rs2840542 >> >> rs1999527 >> >> rs4648545 >> >> rs10915459 >> >> rs16838750 >> >> rs12128230 >> >> ... >> >> >> >> I woudlike to map them to their official gene symbol. What the best way to >> procede? > > Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings > from SNPs to genes and I don't think we have this kind of mappings > either in our collection of annotations (*.db packages). > > But if your SNPs are Human then you can do the mapping yourself by > using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures > packages. > > The latest SNPlocs.Hsapies.dbSNP.* package is > SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains > SNP locations relative to the GRCh37 genome: > > > library(SNPlocs.Hsapiens.dbSNP.20101109) > > getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 > ch10 > 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 > 1158707 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 > ch20 > 1147722 1105364 815729 740129 657719 757926 641905 645646 520666 > 586708 > ch21 ch22 chX chY chMT > 338254 331060 529608 67438 624 > > Note that it doesn't contain *all* SNPs from dbSNP Build 132: > only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109 > for the details). > > > ch22snps<- getSNPlocs("ch22") > > ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 7288968 S 16050994 > 3 6518357 M 16051107 > 4 7292503 R 16051209 > 5 6518368 Y 16051241 > > Note that the rs prefix has been dropped. > > So here is how to proceed: > > First you can use the following function to make a GRanges object from > your SNP ids: > > makeGRangesFromRefSNPids<- function(myids) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > ans_seqnames[!is.na(myrows)]<- seqname > ans_locs[!is.na(myrows)]<- locs$loc[myrows] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > This takes between 3 and 5 minutes: > > > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", > "rs10915459", "rs16838750", "rs12128230", "rs999999999") > > mysnps<- makeGRangesFromRefSNPids(myids) > > mysnps # a GRanges object with 1 SNP per row > GRanges with 8 ranges and 1 elementMetadata value > seqnames ranges strand | myids > <rle> <iranges> <rle> |<character> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] unknown [ 0, 0] * | rs999999999 > > seqlengths > ch1 unknown > NA NA > > The next step is to create a TranscriptDb object with > makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart() > from the GenomicFeatures package. This TranscriptDb object will > contain the transcript locations and their associated > genes extracted from the annotation source you choose. > For example, if you want to use RefSeq genes: > > ## Takes about 3 minutes: > > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > > txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | UCSC Table: refGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 37924 > | exon_nrow: 230024 > | cds_nrow: 204571 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010) > | GenomicFeatures version at creation time: 1.2.2 > | RSQLite version at creation time: 0.9-4 > | DBSCHEMAVERSION: 1.0 > > Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb > object: this means that later you will be able to use the org.Hs.eg.db > package to map your gene ids to their symbol (the org.*.eg.db packages > are Entrez Gene ID centric). > > To extract the transcript locations together with their genes: > > > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > > tx # a GRanges object with 1 transcript per row > GRanges with 37924 ranges and 1 elementMetadata value > seqnames ranges strand | gene_id > <rle> <iranges> <rle> |<compressedcharacterlist> > [1] chr1 [ 69091, 70008] + | 79501 > [2] chr1 [323892, 328581] + | 100133331 > [3] chr1 [323892, 328581] + | 100132287 > [4] chr1 [323892, 328581] + | 100132062 > [5] chr1 [367659, 368597] + | 81399 > [6] chr1 [367659, 368597] + | 729759 > [7] chr1 [367659, 368597] + | 26683 > [8] chr1 [763064, 789740] + | 643837 > [9] chr1 [861121, 879961] + | 148398 > ... ... ... ... ... ... > [37916] chrY [27177050, 27198251] - | 9083 > [37917] chrY [27177050, 27198251] - | 442867 > [37918] chrY [27177050, 27198251] - | 442868 > [37919] chrY [27209230, 27246039] - | 114761 > [37920] chrY [27209230, 27246039] - | 474150 > [37921] chrY [27209230, 27246039] - | 474149 > [37922] chrY [27329790, 27330920] - | 252949 > [37923] chrY [27329790, 27330920] - | 474152 > [37924] chrY [27329790, 27330920] - | 474151 > > seqlengths > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 > > Now you can use findOverlaps() on 'mysnps' and 'tx' to find > the transcripts hits by your snps. But before you can do this, > you need to rename the sequences in 'mysnps' because dbSNPs and > UCSC use different naming conventions for the chromosomes: > > > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > > Then: > > > map<- as.matrix(findOverlaps(mysnps, tx)) > > 'map' contains the mapping between your SNPs and their genes but not > in a readable form (this matrix contains indices) so we make the > 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for > the associated gene ids: > > > mapped_genes<- values(tx)$gene_id[map[, 2]] > > mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]], > elementLengths(mapped_genes)) > > snp2gene<- unique(data.frame(snp_id=mapped_snps, > gene_id=unlist(mapped_genes))) > > rownames(snp2gene)<- NULL > > snp2gene[1:4, ] > snp_id gene_id > 1 rs7547453 6497 > 2 rs2840542 79906 > 3 rs1999527 63976 > 4 rs4648545 7161 > > Note that there is no guarantee that the number of rows in this > data frame is the number of your original SNP ids because the > relation between SNP ids and gene ids is of course not one-to-one. > > Also the method described above considers that a SNP hits a gene > if it's located between the start and the end of one of its > transcripts but it doesn't take in account the exon structure of > the transcripts. If you want to do this you need to use exonsBy() > (from GenomicFeatures) to extract the exons grouped by transcripts > (this will be stored in a GRangesList object) and use this object > instead of 'tx' in the call to findOverlaps(). > > Hope this helps, > H. > > >> >> >> >> Simon No??l >> CdeC >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.3 years ago SimonNoël ▴ 450

0

Entering edit mode

On 02/09/2012 01:43 PM, Simon No?l wrote: > Thank's alot. It seem to work now:) But for the home made makeGRangesFromRefSNPids, it's not my work. You gave me that function ;) Sure, I remember now. No problem, you can keep it ;-) H.

ADD REPLY • link 12.3 years ago Hervé Pagès 16k

Login before adding your answer.