Hi all,
A new SNPlocs package is now available with the content of dbSNP build 151:
http://bioconductor.org/packages/SNPlocs.Hsapiens.dbSNP151.GRCh38
SNPlocs packages contain SNP locations and alleles for various dbSNP builds. SNPs can be looked up by chromosome, range, or rs id, using snpsBySeqname(), snpsByOverlaps(), or snpsById(), respectively.
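A minimal sketch of the three lookups, assuming the package is installed (the chromosome, the region, and the rs ids below are arbitrary illustrations, not taken from this announcement):

```r
library(SNPlocs.Hsapiens.dbSNP151.GRCh38)
library(GenomicRanges)

snps <- SNPlocs.Hsapiens.dbSNP151.GRCh38

# By chromosome: all SNPs on chromosome 22
# (sequence names follow the NCBI style: "1", "2", ..., "X", "Y").
chr22_snps <- snpsBySeqname(snps, "22")

# By range: SNPs overlapping an arbitrary 10 kb window on chromosome X.
region <- GRanges("X", IRanges(start = 3000000, end = 3010000))
x_snps <- snpsByOverlaps(snps, region)

# By rs id: ids absent from the package are dropped instead of raising an error.
my_snps <- snpsById(snps, c("rs7412", "rs429358"), ifnotfound = "drop")
```

Each call returns a genomic-positions object with the rs id and the alleles stored as metadata columns.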
SNPlocs.Hsapiens.dbSNP151.GRCh38 contains 596,534,862 single-base substitutions (i.e. SNPs of class snp). See ?SNPlocs.Hsapiens.dbSNP151.GRCh38 for how the SNPs in the package were curated. For comparison, the SNPlocs package for dbSNP build 150 (SNPlocs.Hsapiens.dbSNP150.GRCh38) contains 305,513,252 single-base substitutions.
The XtraSNPlocs packages contain the SNP locations and alleles for all the other classes of SNPs: in-del, heterozygous, microsatellite, etc.
BSgenome::available.SNPs() gives the list of all available SNPlocs and XtraSNPlocs packages:
> library(BSgenome)
> available.SNPs()
 [1] "SNPlocs.Hsapiens.dbSNP.20101109"
 [2] "SNPlocs.Hsapiens.dbSNP.20120608"
 [3] "SNPlocs.Hsapiens.dbSNP141.GRCh38"
 [4] "SNPlocs.Hsapiens.dbSNP142.GRCh37"
 [5] "SNPlocs.Hsapiens.dbSNP144.GRCh37"
 [6] "SNPlocs.Hsapiens.dbSNP144.GRCh38"
 [7] "SNPlocs.Hsapiens.dbSNP149.GRCh38"
 [8] "SNPlocs.Hsapiens.dbSNP150.GRCh38"
 [9] "SNPlocs.Hsapiens.dbSNP151.GRCh38"
[10] "XtraSNPlocs.Hsapiens.dbSNP141.GRCh38"
[11] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh37"
[12] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh38"
H.
Hi!
I really like this package, which I think is important to say, as I saw you write in another forum that it does not appear to be that popular. For me at least, this package is important, as it considerably reduces the time I spend computing overlaps with dbSNP. Googling around, I see that a lot of people want to intersect their data with dbSNP and often complain about the time it takes. Perhaps people haven't found the package yet, or they are unfamiliar with the GRanges concept. I don't know what the cause could be, but I think this is a fantastic package!
My question: is it possible to include the ref and alt alleles in the output from findOverlaps() and snpsById() in the next version of this package? Or is there already a way to access that information that I am not aware of? Right now I have to step outside of R and do this using bcftools isec, but it is neither pretty nor safe.
Jesper
Hi Jesper,
Thanks for the positive feedback. Much appreciated and motivating!
In BioC devel (BSgenome 1.55.2) I've improved the SNPlocs extractors to return the inferred ref allele and alt allele(s):
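A minimal sketch of such a call, assuming the genome argument documented in ?SNPlocs and an installed BSgenome data package for GRCh38 that getBSgenome() can resolve by name (the rs ids are arbitrary illustrations, and this is not the example from the original post):

```r
library(SNPlocs.Hsapiens.dbSNP151.GRCh38)

snps <- SNPlocs.Hsapiens.dbSNP151.GRCh38

# Supplying a reference genome asks the extractor to infer the alleles.
# Assumption: "GRCh38" resolves to an installed BSgenome data package
# (e.g. BSgenome.Hsapiens.NCBI.GRCh38).
my_snps <- snpsById(snps, c("rs7412", "rs429358"),
                    ifnotfound = "drop", genome = "GRCh38")

# The inferred alleles come back as extra metadata columns.
mcols(my_snps)[, c("RefSNP_id", "ref_allele", "alt_alleles")]
```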
See ?SNPlocs for all the details (don't miss the "Obtaining the ref allele and alt allele(s)" subsection at the end of the examples section).
Cheers,
H.
Thanks!
It works like a charm. Extracting 100 variants from dbSNP151 took 4 seconds on my machine, which is very impressive. Compare that to 17 minutes for an awk script looping through dbSNP, and 4 minutes using a Unix join on sorted files.
Jesper