News: New Bioconductor package for dbSNP151

Hervé Pagès 16k

@herve-pages-1542

Seattle, WA, United States

Hi all,

A new SNPlocs package is now available with the content of dbSNP build 151:

http://bioconductor.org/packages/SNPlocs.Hsapiens.dbSNP151.GRCh38

SNPlocs packages contain SNP locations and alleles for various dbSNP builds. SNPs can be looked up by chromosome, range, or rs id, using snpsBySeqname(), snpsByOverlaps(), or snpsById(), respectively.
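
As a rough illustration of the three lookup styles, here is a minimal sketch. It assumes BSgenome and SNPlocs.Hsapiens.dbSNP151.GRCh38 are installed; the region and the rs ids are arbitrary examples (the ids may not exist in every build, so they are dropped if not found), and the variable names are only illustrative.

```r
# Illustrative sketch only: assumes BSgenome and the dbSNP151 SNPlocs
# package are installed. Nothing runs if they are missing.
pkg <- "SNPlocs.Hsapiens.dbSNP151.GRCh38"
if (requireNamespace("BSgenome", quietly = TRUE) &&
    requireNamespace(pkg, quietly = TRUE)) {
    library(BSgenome)
    library(pkg, character.only = TRUE)
    snps <- get(pkg)  # the SNPlocs object is named after its package

    # By chromosome: all SNPs located on chromosome 22 (a large result)
    chr22_snps <- snpsBySeqname(snps, "22")

    # By range: SNPs overlapping positions 3,000,000-8,000,000 on chromosome X
    region_snps <- snpsByOverlaps(snps, "X:3e6-8e6")

    # By rs id: example ids only; ids absent from this build are dropped
    id_snps <- snpsById(snps, c("rs369882522", "rs73437584"),
                        ifnotfound = "drop")
}
```

Each call returns the matching SNPs as a genomic positions object whose metadata columns include the rs id and the alleles.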

SNPlocs.Hsapiens.dbSNP151.GRCh38 contains 596,534,862 single-base substitutions (i.e. SNPs of class snp). See ?SNPlocs.Hsapiens.dbSNP151.GRCh38 for how the SNPs in the package were curated. For comparison, the SNPlocs package for dbSNP build 150 (SNPlocs.Hsapiens.dbSNP150.GRCh38) contains 305,513,252 single-base substitutions.

The XtraSNPlocs packages contain the SNP locations and alleles for all the other classes of SNPs: in-del, heterozygous, microsatellite, etc.

BSgenome::available.SNPs() gives the list of all available SNPlocs and XtraSNPlocs packages:

> library(BSgenome)
> available.SNPs()
 [1] "SNPlocs.Hsapiens.dbSNP.20101109"     
 [2] "SNPlocs.Hsapiens.dbSNP.20120608"     
 [3] "SNPlocs.Hsapiens.dbSNP141.GRCh38"    
 [4] "SNPlocs.Hsapiens.dbSNP142.GRCh37"    
 [5] "SNPlocs.Hsapiens.dbSNP144.GRCh37"    
 [6] "SNPlocs.Hsapiens.dbSNP144.GRCh38"    
 [7] "SNPlocs.Hsapiens.dbSNP149.GRCh38"    
 [8] "SNPlocs.Hsapiens.dbSNP150.GRCh38"    
 [9] "SNPlocs.Hsapiens.dbSNP151.GRCh38"    
[10] "XtraSNPlocs.Hsapiens.dbSNP141.GRCh38"
[11] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh37"
[12] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh38"

H.

SNPlocs.Hsapiens.dbSNP151.GRCh38 BSgenome News • 3.9k views

ADD COMMENT • link 7.6 years ago Hervé Pagès 16k

Hi!

I really like this package, and I think that is worth saying because I saw you write in another forum that it does not appear to be that popular. For me at least, it matters a lot: it considerably reduces the time I spend computing overlaps with dbSNP. Googling around shows that a lot of people want to intersect their data with dbSNP and often complain about how long it takes. Perhaps people haven't found the package yet, or they are unfamiliar with the GRanges concept. I don't know what the cause could be, but I think this is a fantastic package!

My question: would it be possible to include the ref and alt alleles in the output of findOverlaps() and snpsById() in the next version or release of this package? Or is there already a way to access that information that I am not aware of? Right now I have to step outside of R and do this with bcftools isec, which is neither pretty nor safe.

Jesper

ADD REPLY • link 6.2 years ago jesper.gadin ▴ 10

Hi Jesper,

Thanks for the positive feedback. Much appreciated and very motivating!

In BioC devel (BSgenome 1.55.2) I've improved the SNPlocs extractors to return the inferred ref allele and alt allele(s):

library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
snpsByOverlaps(snps, "X:3e6-8e6", genome="GRCh38")
# UnstitchedGPos object with 166953 positions and 5 metadata columns:
#            seqnames       pos strand |   RefSNP_id alleles_as_ambig
#               <Rle> <integer>  <Rle> | <character>      <character>
#        [1]        X   3000004      * | rs369882522                Y
#        [2]        X   3000013      * | rs374307143                Y
#        [3]        X   3000014      * |  rs73437584                R
#        [4]        X   3000036      * | rs113897265                R
#        [5]        X   3000038      * |  rs79982205                M
#        ...      ...       ...    ... .         ...              ...
#   [166949]        X   7999835      * | rs775860267                S
#   [166950]        X   7999839      * | rs368090328                R
#   [166951]        X   7999902      * | rs190898710                R
#   [166952]        X   7999926      * | rs772716820                R
#   [166953]        X   7999951      * | rs181626818                R
#            genome_compat  ref_allele     alt_alleles
#                <logical> <character> <CharacterList>
#        [1]          TRUE           C               T
#        [2]          TRUE           C               T
#        [3]          TRUE           G               A
#        [4]          TRUE           A               G
#        [5]          TRUE           A               C
#        ...           ...         ...             ...
#   [166949]          TRUE           C               G
#   [166950]          TRUE           A               G
#   [166951]          TRUE           G               A
#   [166952]          TRUE           A               G
#   [166953]          TRUE           G               A
#   -------
#   seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome

See ?SNPlocs for all the details (don't miss the "Obtaining the ref allele and alt allele(s)" subsection at the end of the examples section).
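
A lookup by rs id might look like the sketch below. It assumes that snpsById() accepts the same genome= argument as snpsByOverlaps() in BSgenome >= 1.55.2 (?SNPlocs is the authority here), and that BSgenome.Hsapiens.NCBI.GRCh38 is installed so that "GRCh38" can be resolved to a reference genome. The rs ids are taken from the output above.

```r
# Illustrative sketch only: assumes BSgenome >= 1.55.2, the dbSNP144 SNPlocs
# package, and the GRCh38 BSgenome data package are all installed.
pkg <- "SNPlocs.Hsapiens.dbSNP144.GRCh38"
ids <- c("rs369882522", "rs73437584", "rs79982205")

if (requireNamespace("BSgenome", quietly = TRUE) &&
    packageVersion("BSgenome") >= "1.55.2" &&
    requireNamespace(pkg, quietly = TRUE) &&
    requireNamespace("BSgenome.Hsapiens.NCBI.GRCh38", quietly = TRUE)) {
    library(BSgenome)
    library(pkg, character.only = TRUE)
    snps <- get(pkg)

    # Supplying the reference genome triggers inference of ref/alt alleles
    # (assumed to work for snpsById() as it does for snpsByOverlaps())
    by_id <- snpsById(snps, ids, ifnotfound = "drop", genome = "GRCh38")

    # Keep only the id and allele columns
    print(mcols(by_id)[, c("RefSNP_id", "ref_allele", "alt_alleles")])
}
```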

Cheers,

H.

ADD REPLY • link 6.2 years ago Hervé Pagès 16k

Thanks!

It works like a charm. Extracting 100 variants from dbSNP151 took 4 seconds on my machine, which is very impressive. Compare that with 17 minutes for an awk script looping through dbSNP, and 4 minutes using a Unix join on sorted files.

Jesper

ADD REPLY • link 6.1 years ago jesper.gadin ▴ 10