Hi all, I am developing a method to automatically select primers to validate splicing events. I am using SNPlocs.Hsapiens.dbSNP144.GRCh37 and injectSNPs to identify regions with genomic variants (to avoid placing the primers on them). So far, so good.
However, the number of SNPs is very high (around 150 million), and it is almost impossible to find a sufficiently large region with no variants in which to place the primers. Is it possible to inject into the reference genome only the SNPs with a minor allele frequency (MAF) above a threshold (say 5%)? Is the MAF information available somewhere in the Bioconductor annotation data?
Thanks,
Angel
MAF is a population-specific parameter. You may be able to get some information out of AnnotationHub; I can't verify as I have a terrible connection at the moment.
It seems that UCSC has a table that should contain this information:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp141Common.sql
so you may be able to get useful statistics with rtracklayer. Another approach is to query the 1000 Genomes VCFs: VariantAnnotation::genotypesToSnpMatrix converts the genotypes to a SnpMatrix, and snpStats::col.summary then reports the MAF for each SNP.
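To make the filtering step concrete, here is a minimal, language-neutral sketch in Python of what "keep only SNPs with MAF above a threshold" amounts to once you have genotype calls. It is not Bioconductor code: the function names, the toy data and the encoding (alternate-allele counts of 0, 1 or 2 per diploid sample, None for a missing call) are assumptions made for illustration. In R, snpStats::col.summary would give you the MAF column directly.

```python
from typing import Dict, List, Optional, Sequence


def minor_allele_frequency(genotypes: Sequence[Optional[int]]) -> Optional[float]:
    """Return the MAF for one SNP, or None if no sample was called.

    Assumed encoding: each entry is the alternate-allele count of a
    diploid sample (0, 1 or 2); None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # Each called diploid sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of ref/alt is rarer.
    return min(alt_freq, 1 - alt_freq)


def common_snps(calls: Dict[str, Sequence[Optional[int]]],
                threshold: float = 0.05) -> List[str]:
    """Return the IDs of SNPs whose MAF is strictly above the threshold."""
    keep = []
    for snp_id, genotypes in calls.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and maf > threshold:
            keep.append(snp_id)
    return keep


# Hypothetical toy data, not real dbSNP or 1000 Genomes records.
toy_calls = {
    "snpA": [0, 1, 2, 1, 0, None],  # alt freq 4/10, MAF 0.4
    "snpB": [0] * 19 + [1],         # alt freq 1/40, MAF 0.025
    "snpC": [2, 2, 2, 1],           # alt freq 7/8, MAF 0.125
}

print(common_snps(toy_calls))  # ['snpA', 'snpC']
```

Only the SNPs that pass such a filter would then be injected into the reference. Keep in mind that the result depends on which population the genotypes come from, since MAF is population-specific.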