How to Query Only SNPs from the biomaRt Variation Database
1
0
Entering edit mode
kschecht • 0
@38d13df8
Last seen 3 months ago
United States

Hi all, my apologies if this question is trivial or if I'm misunderstanding something fundamental to biomaRt. I'm having difficulty figuring out how to query only SNPs, and not indels, somatic variations, etc., for a specific gene/transcript in the biomaRt package. On the Ensembl website, the Variation Table allows you to filter by 'Class' and lets you select SNPs, but I've been unsuccessful in finding something similar in the 'filters' or 'attributes' in the biomaRt package.

For example, when I query the variations within the very short human gene RNY3 (transcript is 102 bp) with:

ensembl_snp <- useMart("ENSEMBL_MART_SNP")
dataset <- useDataset("hsapiens_snp", mart = ensembl_snp)
snps <- getBM(
attributes = c("refsnp_id", "chrom_start", "chrom_end", "chr_name", "allele"),
filters = "ensembl_gene",
values = "ENSG00000202354",
mart = dataset
)

print(snps)

I receive 94 variations which I can tell are a mix of different 'Classes' due to their 'allele':

refsnp_id chrom_start chrom_end chr_name        allele
1   rs146500940   148680859 148680859        7         T/C/G
2   rs188517771   148680903 148680903        7         T/C/G
...
74 rs1468891485   148680901 148680903        7      TTT/TTTT
...
81 rs1823259365   148680850 148680850        7           T/-

I'm also not sure about filtering 'alleles' that look like SNPs because I'm nervous that they are somatic SNVs.

If someone knows a good way to solve this I would be very grateful!

biomaRt • 594 views
ADD COMMENT
0
Entering edit mode

When you say 'Variation table' are you describing something you see in the Biomart interface, or an entirely different part of the website? The biomaRt package is simply an interface to the Biomart server, and it doesn't do generalized queries to Ensembl. If you can't filter using Biomart directly, I don't believe you can do so using biomaRt, and I don't see anything in the Biomart interface that filters by allele class. But maybe I am missing something obvious?

ADD REPLY
0
Entering edit mode

Thanks for the reply! When I said 'Variation Table' I meant the feature from the Ensembl website, but thanks to your explanation I understand now that a similar feature is probably not present in biomaRt. Do you have any recommendations on a different package that would return only SNPs for a given gene/transcript/genome range?

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 5 minutes ago
United States

Depending on how current you want the data to be, you could use a SNPLocs package like the one based on dbSNP 155.

All the functions to get stuff have been ported over to the BSgenome package, so you want to load that package and then do ?SNPlocs-class to get the relevant functions. As an example, if you want the SNPs for BRCA1, it goes like this:

> library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
<snip>
## BRCA is on Chr17, from 43044678-43124096
## SNPlocs objects use NCBI chromosomes (no chr part)
## so we use 17:43044678-43124096
> snps <- snpsByOverlaps(SNPlocs.Hsapiens.dbSNP144.GRCh38, GRanges("17:43044678-43124096"))
> snps
UnstitchedGPos object with 5042 positions and 2 metadata columns:
         seqnames       pos strand |   RefSNP_id alleles_as_ambig
            <Rle> <integer>  <Rle> | <character>      <character>
     [1]       17  43044721      * | rs780228876                Y
     [2]       17  43044745      * | rs535563803                Y
     [3]       17  43044750      * | rs758556342                R
     [4]       17  43044778      * |   rs1060921                W
     [5]       17  43044783      * | rs556653436                R
     ...      ...       ...    ... .         ...              ...
  [5038]       17  43124091      * | rs754763517                R
  [5039]       17  43124093      * | rs778775133                Y
  [5040]       17  43124094      * |  rs80357475                M
  [5041]       17  43124095      * |  rs80357111                V
  [5042]       17  43124096      * |  rs80357287                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome

And if you want to get really fancy, you can add the underlying alleles.

> library(BSgenome.Hsapiens.UCSC.hg38)
> Hsapiens
| BSgenome object for Human
| - organism: Homo sapiens
| - provider: UCSC
| - genome: hg38
| - release date: 2023/01/31
| - 711 sequence(s):
|     chr1                    chr2                    chr3                   
|     chr4                    chr5                    chr6                   
|     chr7                    chr8                    chr9                   
|     chr10                   chr11                   chr12                  
|     chr13                   chr14                   chr15                  
|     ...                     ...                     ...                    
|     chr19_KV575256v1_alt    chr19_KV575257v1_alt    chr19_KV575258v1_alt   
|     chr19_KV575259v1_alt    chr19_KV575260v1_alt    chr19_MU273387v1_alt   
|     chr22_KN196485v1_alt    chr22_KN196486v1_alt    chr22_KQ458387v1_alt   
|     chr22_KQ458388v1_alt    chr22_KQ759761v1_alt    chrX_KV766199v1_alt    
|     chrX_MU273395v1_alt     chrX_MU273396v1_alt     chrX_MU273397v1_alt    
| 
| Tips: call 'seqnames()' on the object to get all the sequence names, call
| 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
| access a given sequence, see '?BSgenome' for more information.

## Note that this object has UCSC chromosome identifiers, so we have to change to NCBI
> seqlevelsStyle(Hsapiens) <- "NCBI"
> snps <- snpsByOverlaps(SNPlocs.Hsapiens.dbSNP144.GRCh38, GRanges("17:43044678-43124096"), genome = Hsapiens)
> snps
UnstitchedGPos object with 5042 positions and 5 metadata columns:
         seqnames       pos strand |   RefSNP_id alleles_as_ambig genome_compat
            <Rle> <integer>  <Rle> | <character>      <character>     <logical>
     [1]       17  43044721      * | rs780228876                Y          TRUE
     [2]       17  43044745      * | rs535563803                Y          TRUE
     [3]       17  43044750      * | rs758556342                R          TRUE
     [4]       17  43044778      * |   rs1060921                W          TRUE
     [5]       17  43044783      * | rs556653436                R          TRUE
     ...      ...       ...    ... .         ...              ...           ...
  [5038]       17  43124091      * | rs754763517                R          TRUE
  [5039]       17  43124093      * | rs778775133                Y          TRUE
  [5040]       17  43124094      * |  rs80357475                M          TRUE
  [5041]       17  43124095      * |  rs80357111                V          TRUE
  [5042]       17  43124096      * |  rs80357287                Y          TRUE
          ref_allele     alt_alleles
         <character> <CharacterList>
     [1]           T               C
     [2]           C               T
     [3]           G               A
     [4]           T               A
     [5]           A               G
     ...         ...             ...
  [5038]           A               G
  [5039]           C               T
  [5040]           C               A
  [5041]           A             C,G
  [5042]           T               C
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome
0
Entering edit mode

This is exactly what I was searching for, I appreciate your help!

ADD REPLY

Login before adding your answer.

Traffic: 954 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6