From location to get the SNP rs ID

0

Entering edit mode

Fabrice Tourre ▴ 970

@fabrice-tourre-4394

Last seen 11.5 years ago

Hervé Pagès, When I get rsid from position, I think some of them cannot mapped to a rsid. How can we let they also return a value, otherwise we will have a problem to map the position to rsid. For example: posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T) chrs <- paste("ch",posi[,1],sep="") mylocs <- GRanges(chrs, IRanges(as.integer(posi[,2]), width=1)) rsids <- loc2rsid(mylocs) dat$SNP <- unlist(rsids) Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570", "rs145939433", : replacement has 92428 rows, data has 92489 > length(chrs) [1] 92489 > length(unlist(rsids)) [1] 92428 On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > On 05/31/2013 03:31 AM, Fabrice Tourre wrote: >> >> Hervé Pagès, >> >> It helps. Thank you very much. >> >> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042 >> and SNPlocs.Hsapiens.dbSNP.20120608. > > > Nope. That's why I sent you the code ;-) > > > H. > >> >> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>> >>> Hi Fabrice, >>> >>> With the loc2rsid() function (see code at the bottom) and using >>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have >>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment): >>> >>> >>> mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1, >>> 1)), >>> IRanges(c(200, 16369861, 146, 73, 9411609, >>> 16051107), >>> width=1)) >>> >>> > mylocs >>> GRanges with 6 ranges and 0 metadata columns: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] ch22 [ 200, 200] * >>> [2] ch22 [16369861, 16369861] * >>> [3] chM [ 146, 146] * >>> [4] chM [ 73, 73] * >>> [5] ch21 [ 9411609, 9411609] * >>> [6] ch22 [16051107, 16051107] * >>> --- >>> seqlengths: >>> ch21 ch22 chM >>> NA NA NA >>> >>> Then: >>> >>> library(SNPlocs.Hsapiens.dbSNP.20100427) >>> >>> > loc2rsid(mylocs) >>> CharacterList of length 6 >>> [[1]] character(0) >>> [[2]] rs75258394 rs78180314 >>> [[3]] character(0) >>> [[4]] character(0) >>> [[5]] rs76676778 >>> [[6]] rs6518357 >>> >>> You need to be careful to use the same chromosome naming convention as >>> the SNPlocs package (which uses dbSNP naming convention). For example >>> in the 'mylocs' object, the mitochondrial chromosome is named chM but >>> in the SNPlocs package it's chMT: >>> >>> > getSNPcount() >>> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 >>> ch10 >>> 1369185 1422439 1186807 1191984 1064540 1040203 977275 919207 740516 >>> 859433 >>> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 >>> ch20 >>> 855225 822825 615382 543041 499101 556394 464686 482359 375259 >>> 450685 >>> ch21 ch22 chX chY chMT >>> 253480 244482 445596 53908 667 >>> >>> This is why the locations on chM are not mapped to anything. But >>> after renaming: >>> >>> > seqlevels(mylocs)[3] <- "chMT" >>> > seqlevels(mylocs) >>> [1] "ch21" "ch22" "chMT" >>> >>> > loc2rsid(mylocs) >>> CharacterList of length 6 >>> [[1]] character(0) >>> [[2]] rs75258394 rs78180314 >>> [[3]] rs72619361 >>> [[4]] rs3087742 >>> [[5]] rs76676778 >>> [[6]] rs6518357 >>> >>> they get mapped. >>> >>> The CharacterList object returned by loc2rsid() can be set as a >>> metadata column of the input GRanges object: >>> >>> > mcols(mylocs)$rsid <- loc2rsid(mylocs) >>> > mylocs >>> GRanges with 6 ranges and 1 metadata column: >>> seqnames ranges strand | rsid >>> <rle> <iranges> <rle> | <characterlist> >>> [1] ch22 [ 200, 200] * | >>> [2] ch22 [16369861, 16369861] * | rs75258394,rs78180314 >>> [3] chMT [ 146, 146] * | rs72619361 >>> [4] chMT [ 73, 73] * | rs3087742 >>> [5] ch21 [ 9411609, 9411609] * | rs76676778 >>> [6] ch22 [16051107, 16051107] * | rs6518357 >>> --- >>> seqlengths: >>> ch21 ch22 chMT >>> NA NA NA >>> >>> Hope this helps, >>> H. >>> >>> >>> ## Maps the genomic locations in 'locs' to the corresponding rs ids. >>> ## 'locs' must be a GRanges object where all ranges are of with 1. >>> ## Because dbSNP can assign distinct ids to SNPs located at the same >>> ## position, the mapping is a 1-to-many relationship. >>> ## Returns a CharacterList of the same length as 'locs'. >>> loc2rsid <- function(locs) >>> { >>> if (!is(locs, "GRanges")) >>> stop("'locs' must be a GRanges object") >>> if (!all(width(locs) == 1L)) >>> stop("all ranges in 'locs' must be of width 1") >>> common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount())) >>> if (length(common_seqlevels) == 0L) >>> stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem >>> to ", >>> "be\n compatible with the chromosome names in the SNPlocs >>> ", >>> "package. Maybe they\n use a different naming convention? >>> ", >>> "If that's the case then you first need\n to rename the ", >>> "seqlevels in 'locs'. See '?seqlevels' for how to do >>> this.") >>> f <- as.factor(seqnames(locs)) >>> locs_by_chrom <- split(start(locs), f) >>> rsids_by_chrom <- lapply(seq_along(locs_by_chrom), >>> function(i) >>> { >>> seqname <- levels(f)[i] >>> locs2 <- locs_by_chrom[[i]] >>> ans2 <- vector("list", length=length(locs2)) >>> if (length(locs2) == 0L || !(seqname %in% >>> common_seqlevels)) >>> return(ans2) >>> snplocs <- getSNPlocs(seqname, as.GRanges=FALSE) >>> hits <- findMatches(locs2, snplocs$loc) >>> rsids <- paste0("rs", >>> snplocs$RefSNP_id[subjectHits(hits)]) >>> q_hits <- queryHits(hits) >>> tmp <- split(rsids, q_hits) >>> ans2[as.integer(names(tmp))] <- tmp >>> ans2 >>> }) >>> CharacterList(unsplit(rsids_by_chrom, f)) >>> } >>> >>> PS: This functionality will make it to the SNPlocs packages (in this >>> form or another). >>> >>> >>> >>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote: >>>> >>>> >>>> Dear list, >>>> >>>> I have a list of genome positions. Is there is a package to get the >>>> snp name at this position? >>>> >>>> It is the reverse function as rsid2loc in >>>> SNPlocs.Hsapiens.dbSNP.20120608 >>>> >>>> Thank you. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319

SNP Cancer SNPlocs ASSIGN SNP Cancer SNPlocs ASSIGN • 4.8k views

ADD COMMENT • link updated 12.7 years ago by Hervé Pagès 16k • written 12.7 years ago by Fabrice Tourre ▴ 970

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi Fabrice, On 06/03/2013 11:55 AM, Fabrice Tourre wrote: > Hervé Pagès, > > When I get rsid from position, I think some of them cannot mapped to a > rsid. How can we let they also return a value, otherwise we will have > a problem to map the position to rsid. Any genomic position can be mapped to 0, 1, or more than 1 rs ids (the "more than 1 rs ids" case can happen because dbSNP sometimes assign distinct ids to SNPs located at the same position, don't ask me why). This is summarized by saying that the mapping is a 1-to-many relationship. > > For example: > > posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T) > chrs <- paste("ch",posi[,1],sep="") > mylocs <- GRanges(chrs, > IRanges(as.integer(posi[,2]), width=1)) > rsids <- loc2rsid(mylocs) > > dat$SNP <- unlist(rsids) > Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570", > "rs145939433", : > replacement has 92428 rows, data has 92489 >> length(chrs) > [1] 92489 >> length(unlist(rsids)) > [1] 92428 THE PROBLEM: Because the relationship is 1-to-many, loc2rsid() cannot in general return a character vector of the same length as the input vector (i.e. the GRanges object holding the genomic positions). This is why the function returns a *CharacterList* object. This CharacterList is guaranteed to have the length of the input vector. However the list elements of that CharacterList can be of length 0, 1, or more than 1. Because of that, you should not expect that the result of unlisting this CharacterList will give you a character vector of the length of the input vector. THE SOLUTION: The general advice is to not use unlist() if you want to preserve length and positioning (i.e. if you want the i-th element in the result to correspond to the i-th in the input): > x <- CharacterList("a", "bb", character(), "ccc") > x CharacterList of length 4 [[1]] a [[2]] bb [[3]] character(0) [[4]] ccc > unlist(x) [1] "a" "bb" "ccc" Use as.character() instead: > as.character(x) [1] "a" "bb" NA "ccc" as.character() is guaranteed to preserve length and positioning... or to fail. It will fail if a list element has a length >= 2. Because that means a choice needs to be made and as.character() won't make that choice for you: > x2 <- CharacterList(letters[1:3], "bb", character(), "ccc") > x2 CharacterList of length 4 [[1]] a b c [[2]] bb [[3]] character(0) [[4]] ccc > as.character(x2) Error in as.vector(x, mode = "character") : coercing an AtomicList object to an atomic vector is supported only for objects with top-level elements of length <= 1 Possible choices are (the goal being to end up with a CharacterList object where all the list elements have a length <= 1): (1) Keep 1 arbitrary element of any list element of length >= 2. For example, to keep the 1st element: idx <- elementLengths(x2) >= 2 x2_subset <- x2[idx] x2_subset <- sapply(x2_subset, `[`, 1L) x2[idx] <- x2_subset Equivalently, you could choose to keep the last element: idx <- elementLengths(x2) >= 2 x2_subset <- x2[idx] x2_subset <- sapply(x2_subset, function(x) x[length(x)]) x2[idx] <- x2_subset Note that the above sapply() can be replaced by much faster x2_subset <- unlist(x2_subset)[cumsum(elementLengths(x2_subset))] (2) Keep nothing: idx <- elementLengths(x2) >= 2 x2[idx] <- NA (3) Keep an element randomly... Then you should be able to do 'as.character(x2)'. Genomic positions mapped to more than 1 rs id are rare though so it could be that most of the times you won't run into that problem. However if the code you're writing is going to be re-used by other people, they might supply those problematic genomic positions in the input so it's good if your code can handle them gracefully... Cheers, H. > > On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >> On 05/31/2013 03:31 AM, Fabrice Tourre wrote: >>> >>> Hervé Pagès, >>> >>> It helps. Thank you very much. >>> >>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042 >>> and SNPlocs.Hsapiens.dbSNP.20120608. >> >> >> Nope. That's why I sent you the code ;-) >> >> >> H. >> >>> >>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>>> >>>> Hi Fabrice, >>>> >>>> With the loc2rsid() function (see code at the bottom) and using >>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have >>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment): >>>> >>>> >>>> mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1, >>>> 1)), >>>> IRanges(c(200, 16369861, 146, 73, 9411609, >>>> 16051107), >>>> width=1)) >>>> >>>> > mylocs >>>> GRanges with 6 ranges and 0 metadata columns: >>>> seqnames ranges strand >>>> <rle> <iranges> <rle> >>>> [1] ch22 [ 200, 200] * >>>> [2] ch22 [16369861, 16369861] * >>>> [3] chM [ 146, 146] * >>>> [4] chM [ 73, 73] * >>>> [5] ch21 [ 9411609, 9411609] * >>>> [6] ch22 [16051107, 16051107] * >>>> --- >>>> seqlengths: >>>> ch21 ch22 chM >>>> NA NA NA >>>> >>>> Then: >>>> >>>> library(SNPlocs.Hsapiens.dbSNP.20100427) >>>> >>>> > loc2rsid(mylocs) >>>> CharacterList of length 6 >>>> [[1]] character(0) >>>> [[2]] rs75258394 rs78180314 >>>> [[3]] character(0) >>>> [[4]] character(0) >>>> [[5]] rs76676778 >>>> [[6]] rs6518357 >>>> >>>> You need to be careful to use the same chromosome naming convention as >>>> the SNPlocs package (which uses dbSNP naming convention). For example >>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but >>>> in the SNPlocs package it's chMT: >>>> >>>> > getSNPcount() >>>> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 >>>> ch10 >>>> 1369185 1422439 1186807 1191984 1064540 1040203 977275 919207 740516 >>>> 859433 >>>> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 >>>> ch20 >>>> 855225 822825 615382 543041 499101 556394 464686 482359 375259 >>>> 450685 >>>> ch21 ch22 chX chY chMT >>>> 253480 244482 445596 53908 667 >>>> >>>> This is why the locations on chM are not mapped to anything. But >>>> after renaming: >>>> >>>> > seqlevels(mylocs)[3] <- "chMT" >>>> > seqlevels(mylocs) >>>> [1] "ch21" "ch22" "chMT" >>>> >>>> > loc2rsid(mylocs) >>>> CharacterList of length 6 >>>> [[1]] character(0) >>>> [[2]] rs75258394 rs78180314 >>>> [[3]] rs72619361 >>>> [[4]] rs3087742 >>>> [[5]] rs76676778 >>>> [[6]] rs6518357 >>>> >>>> they get mapped. >>>> >>>> The CharacterList object returned by loc2rsid() can be set as a >>>> metadata column of the input GRanges object: >>>> >>>> > mcols(mylocs)$rsid <- loc2rsid(mylocs) >>>> > mylocs >>>> GRanges with 6 ranges and 1 metadata column: >>>> seqnames ranges strand | rsid >>>> <rle> <iranges> <rle> | <characterlist> >>>> [1] ch22 [ 200, 200] * | >>>> [2] ch22 [16369861, 16369861] * | rs75258394,rs78180314 >>>> [3] chMT [ 146, 146] * | rs72619361 >>>> [4] chMT [ 73, 73] * | rs3087742 >>>> [5] ch21 [ 9411609, 9411609] * | rs76676778 >>>> [6] ch22 [16051107, 16051107] * | rs6518357 >>>> --- >>>> seqlengths: >>>> ch21 ch22 chMT >>>> NA NA NA >>>> >>>> Hope this helps, >>>> H. >>>> >>>> >>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids. >>>> ## 'locs' must be a GRanges object where all ranges are of with 1. >>>> ## Because dbSNP can assign distinct ids to SNPs located at the same >>>> ## position, the mapping is a 1-to-many relationship. >>>> ## Returns a CharacterList of the same length as 'locs'. >>>> loc2rsid <- function(locs) >>>> { >>>> if (!is(locs, "GRanges")) >>>> stop("'locs' must be a GRanges object") >>>> if (!all(width(locs) == 1L)) >>>> stop("all ranges in 'locs' must be of width 1") >>>> common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount())) >>>> if (length(common_seqlevels) == 0L) >>>> stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem >>>> to ", >>>> "be\n compatible with the chromosome names in the SNPlocs >>>> ", >>>> "package. Maybe they\n use a different naming convention? >>>> ", >>>> "If that's the case then you first need\n to rename the ", >>>> "seqlevels in 'locs'. See '?seqlevels' for how to do >>>> this.") >>>> f <- as.factor(seqnames(locs)) >>>> locs_by_chrom <- split(start(locs), f) >>>> rsids_by_chrom <- lapply(seq_along(locs_by_chrom), >>>> function(i) >>>> { >>>> seqname <- levels(f)[i] >>>> locs2 <- locs_by_chrom[[i]] >>>> ans2 <- vector("list", length=length(locs2)) >>>> if (length(locs2) == 0L || !(seqname %in% >>>> common_seqlevels)) >>>> return(ans2) >>>> snplocs <- getSNPlocs(seqname, as.GRanges=FALSE) >>>> hits <- findMatches(locs2, snplocs$loc) >>>> rsids <- paste0("rs", >>>> snplocs$RefSNP_id[subjectHits(hits)]) >>>> q_hits <- queryHits(hits) >>>> tmp <- split(rsids, q_hits) >>>> ans2[as.integer(names(tmp))] <- tmp >>>> ans2 >>>> }) >>>> CharacterList(unsplit(rsids_by_chrom, f)) >>>> } >>>> >>>> PS: This functionality will make it to the SNPlocs packages (in this >>>> form or another). >>>> >>>> >>>> >>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote: >>>>> >>>>> >>>>> Dear list, >>>>> >>>>> I have a list of genome positions. Is there is a package to get the >>>>> snp name at this position? >>>>> >>>>> It is the reverse function as rsid2loc in >>>>> SNPlocs.Hsapiens.dbSNP.20120608 >>>>> >>>>> Thank you. >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 12.7 years ago Hervé Pagès 16k

0

Entering edit mode

Hervé Pagès, It helps. Thank you very much. On Mon, Jun 3, 2013 at 4:58 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Fabrice, > > > On 06/03/2013 11:55 AM, Fabrice Tourre wrote: >> >> Hervé Pagès, >> >> When I get rsid from position, I think some of them cannot mapped to a >> rsid. How can we let they also return a value, otherwise we will have >> a problem to map the position to rsid. > > > Any genomic position can be mapped to 0, 1, or more than 1 rs ids (the > "more than 1 rs ids" case can happen because dbSNP sometimes assign > distinct ids to SNPs located at the same position, don't ask me why). > This is summarized by saying that the mapping is a 1-to-many > relationship. > >> >> For example: >> >> posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T) >> chrs <- paste("ch",posi[,1],sep="") >> mylocs <- GRanges(chrs, >> IRanges(as.integer(posi[,2]), width=1)) >> rsids <- loc2rsid(mylocs) >> >> dat$SNP <- unlist(rsids) >> Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570", >> "rs145939433", : >> replacement has 92428 rows, data has 92489 >>> >>> length(chrs) >> >> [1] 92489 >>> >>> length(unlist(rsids)) >> >> [1] 92428 > > > THE PROBLEM: Because the relationship is 1-to-many, loc2rsid() cannot > in general return a character vector of the same length as the input > vector (i.e. the GRanges object holding the genomic positions). This > is why the function returns a *CharacterList* object. This CharacterList > is guaranteed to have the length of the input vector. However the list > elements of that CharacterList can be of length 0, 1, or more than 1. > Because of that, you should not expect that the result of unlisting this > CharacterList will give you a character vector of the length of the > input vector. > > THE SOLUTION: The general advice is to not use unlist() if you want to > preserve length and positioning (i.e. if you want the i-th element in > the result to correspond to the i-th in the input): > > > x <- CharacterList("a", "bb", character(), "ccc") > > > x > CharacterList of length 4 > [[1]] a > [[2]] bb > [[3]] character(0) > [[4]] ccc > > > unlist(x) > [1] "a" "bb" "ccc" > > Use as.character() instead: > > > as.character(x) > [1] "a" "bb" NA "ccc" > > as.character() is guaranteed to preserve length and positioning... or > to fail. It will fail if a list element has a length >= 2. Because that > means a choice needs to be made and as.character() won't make that > choice for you: > > > x2 <- CharacterList(letters[1:3], "bb", character(), "ccc") > > > x2 > CharacterList of length 4 > [[1]] a b c > [[2]] bb > [[3]] character(0) > [[4]] ccc > > > as.character(x2) > Error in as.vector(x, mode = "character") : > coercing an AtomicList object to an atomic vector is supported only for > objects with top-level elements of length <= 1 > > Possible choices are (the goal being to end up with a CharacterList > object where all the list elements have a length <= 1): > > (1) Keep 1 arbitrary element of any list element of length >= 2. > For example, to keep the 1st element: > > idx <- elementLengths(x2) >= 2 > x2_subset <- x2[idx] > x2_subset <- sapply(x2_subset, `[`, 1L) > x2[idx] <- x2_subset > > Equivalently, you could choose to keep the last element: > > idx <- elementLengths(x2) >= 2 > x2_subset <- x2[idx] > x2_subset <- sapply(x2_subset, function(x) x[length(x)]) > x2[idx] <- x2_subset > > Note that the above sapply() can be replaced by much faster > > x2_subset <- unlist(x2_subset)[cumsum(elementLengths(x2_subset))] > > (2) Keep nothing: > > idx <- elementLengths(x2) >= 2 > x2[idx] <- NA > > (3) Keep an element randomly... > > Then you should be able to do 'as.character(x2)'. > > Genomic positions mapped to more than 1 rs id are rare though so it > could be that most of the times you won't run into that problem. > However if the code you're writing is going to be re-used by other > people, they might supply those problematic genomic positions in > the input so it's good if your code can handle them gracefully... > > Cheers, > > H. > >> >> On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>> >>> On 05/31/2013 03:31 AM, Fabrice Tourre wrote: >>>> >>>> >>>> Hervé Pagès, >>>> >>>> It helps. Thank you very much. >>>> >>>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042 >>>> and SNPlocs.Hsapiens.dbSNP.20120608. >>> >>> >>> >>> Nope. That's why I sent you the code ;-) >>> >>> >>> H. >>> >>>> >>>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>>>> >>>>> >>>>> Hi Fabrice, >>>>> >>>>> With the loc2rsid() function (see code at the bottom) and using >>>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have >>>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment): >>>>> >>>>> >>>>> mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1, >>>>> 1)), >>>>> IRanges(c(200, 16369861, 146, 73, 9411609, >>>>> 16051107), >>>>> width=1)) >>>>> >>>>> > mylocs >>>>> GRanges with 6 ranges and 0 metadata columns: >>>>> seqnames ranges strand >>>>> <rle> <iranges> <rle> >>>>> [1] ch22 [ 200, 200] * >>>>> [2] ch22 [16369861, 16369861] * >>>>> [3] chM [ 146, 146] * >>>>> [4] chM [ 73, 73] * >>>>> [5] ch21 [ 9411609, 9411609] * >>>>> [6] ch22 [16051107, 16051107] * >>>>> --- >>>>> seqlengths: >>>>> ch21 ch22 chM >>>>> NA NA NA >>>>> >>>>> Then: >>>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20100427) >>>>> >>>>> > loc2rsid(mylocs) >>>>> CharacterList of length 6 >>>>> [[1]] character(0) >>>>> [[2]] rs75258394 rs78180314 >>>>> [[3]] character(0) >>>>> [[4]] character(0) >>>>> [[5]] rs76676778 >>>>> [[6]] rs6518357 >>>>> >>>>> You need to be careful to use the same chromosome naming convention as >>>>> the SNPlocs package (which uses dbSNP naming convention). For example >>>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but >>>>> in the SNPlocs package it's chMT: >>>>> >>>>> > getSNPcount() >>>>> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 >>>>> ch10 >>>>> 1369185 1422439 1186807 1191984 1064540 1040203 977275 919207 >>>>> 740516 >>>>> 859433 >>>>> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 >>>>> ch19 >>>>> ch20 >>>>> 855225 822825 615382 543041 499101 556394 464686 482359 >>>>> 375259 >>>>> 450685 >>>>> ch21 ch22 chX chY chMT >>>>> 253480 244482 445596 53908 667 >>>>> >>>>> This is why the locations on chM are not mapped to anything. But >>>>> after renaming: >>>>> >>>>> > seqlevels(mylocs)[3] <- "chMT" >>>>> > seqlevels(mylocs) >>>>> [1] "ch21" "ch22" "chMT" >>>>> >>>>> > loc2rsid(mylocs) >>>>> CharacterList of length 6 >>>>> [[1]] character(0) >>>>> [[2]] rs75258394 rs78180314 >>>>> [[3]] rs72619361 >>>>> [[4]] rs3087742 >>>>> [[5]] rs76676778 >>>>> [[6]] rs6518357 >>>>> >>>>> they get mapped. >>>>> >>>>> The CharacterList object returned by loc2rsid() can be set as a >>>>> metadata column of the input GRanges object: >>>>> >>>>> > mcols(mylocs)$rsid <- loc2rsid(mylocs) >>>>> > mylocs >>>>> GRanges with 6 ranges and 1 metadata column: >>>>> seqnames ranges strand | rsid >>>>> <rle> <iranges> <rle> | <characterlist> >>>>> [1] ch22 [ 200, 200] * | >>>>> [2] ch22 [16369861, 16369861] * | rs75258394,rs78180314 >>>>> [3] chMT [ 146, 146] * | rs72619361 >>>>> [4] chMT [ 73, 73] * | rs3087742 >>>>> [5] ch21 [ 9411609, 9411609] * | rs76676778 >>>>> [6] ch22 [16051107, 16051107] * | rs6518357 >>>>> --- >>>>> seqlengths: >>>>> ch21 ch22 chMT >>>>> NA NA NA >>>>> >>>>> Hope this helps, >>>>> H. >>>>> >>>>> >>>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids. >>>>> ## 'locs' must be a GRanges object where all ranges are of with 1. >>>>> ## Because dbSNP can assign distinct ids to SNPs located at the same >>>>> ## position, the mapping is a 1-to-many relationship. >>>>> ## Returns a CharacterList of the same length as 'locs'. >>>>> loc2rsid <- function(locs) >>>>> { >>>>> if (!is(locs, "GRanges")) >>>>> stop("'locs' must be a GRanges object") >>>>> if (!all(width(locs) == 1L)) >>>>> stop("all ranges in 'locs' must be of width 1") >>>>> common_seqlevels <- intersect(seqlevels(locs), >>>>> names(getSNPcount())) >>>>> if (length(common_seqlevels) == 0L) >>>>> stop("chromosome names (a.k.a. seqlevels) in 'locs' don't >>>>> seem >>>>> to ", >>>>> "be\n compatible with the chromosome names in the >>>>> SNPlocs >>>>> ", >>>>> "package. Maybe they\n use a different naming >>>>> convention? >>>>> ", >>>>> "If that's the case then you first need\n to rename the >>>>> ", >>>>> "seqlevels in 'locs'. See '?seqlevels' for how to do >>>>> this.") >>>>> f <- as.factor(seqnames(locs)) >>>>> locs_by_chrom <- split(start(locs), f) >>>>> rsids_by_chrom <- lapply(seq_along(locs_by_chrom), >>>>> function(i) >>>>> { >>>>> seqname <- levels(f)[i] >>>>> locs2 <- locs_by_chrom[[i]] >>>>> ans2 <- vector("list", length=length(locs2)) >>>>> if (length(locs2) == 0L || !(seqname %in% >>>>> common_seqlevels)) >>>>> return(ans2) >>>>> snplocs <- getSNPlocs(seqname, as.GRanges=FALSE) >>>>> hits <- findMatches(locs2, snplocs$loc) >>>>> rsids <- paste0("rs", >>>>> snplocs$RefSNP_id[subjectHits(hits)]) >>>>> q_hits <- queryHits(hits) >>>>> tmp <- split(rsids, q_hits) >>>>> ans2[as.integer(names(tmp))] <- tmp >>>>> ans2 >>>>> }) >>>>> CharacterList(unsplit(rsids_by_chrom, f)) >>>>> } >>>>> >>>>> PS: This functionality will make it to the SNPlocs packages (in this >>>>> form or another). >>>>> >>>>> >>>>> >>>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote: >>>>>> >>>>>> >>>>>> >>>>>> Dear list, >>>>>> >>>>>> I have a list of genome positions. Is there is a package to get the >>>>>> snp name at this position? >>>>>> >>>>>> It is the reverse function as rsid2loc in >>>>>> SNPlocs.Hsapiens.dbSNP.20120608 >>>>>> >>>>>> Thank you. >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319

ADD REPLY • link 12.7 years ago Fabrice Tourre ▴ 970

Login before adding your answer.