Search
Question: adding attributes affects number of results returned
0
gravatar for Marianna Foos
6 weeks ago by
Marianna Foos0 wrote:

I'm trying to get gene names for the chromosomal location of SNPs (i.e. not trying to gene expression, etc. with SNP, literally just which gene the SNP falls in) using the dbSNP IDs (the ones that start with "rs"). When I supply a long list of attributes to return, I get no results (for this arbitrary SNP), but when I ask only for ENSG gene ID, I get a result. I've looked through the documentation for a flag to change, but have been unable to find one. Thanks!

mart.snps <- useMart('ENSEMBL_MART_SNP', 'hsapiens_snp')

snp = "rs368991480"

getBM(attributes = c("refsnp_id", "ensembl_gene_stable_id", "associated_gene", "chr_name", "chrom_start", "chrom_end", "chrom_strand"),
      filters = "snp_filter",
      values = snp,
      mart = mart.snps)

# returns 0 rows

getBM(attributes = c("refsnp_id", "ensembl_gene_stable_id"),
      filters = "snp_filter",
      values = snp,
      mart = mart.snps)

# returns 1 row

 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.30.0

loaded via a namespace (and not attached):
 [1] IRanges_2.8.1        parallel_3.3.2       DBI_0.5-1            tools_3.3.2          RCurl_1.95-4.8       memoise_1.0.0        Rcpp_0.12.8         
 [8] Biobase_2.34.0       AnnotationDbi_1.36.0 RSQLite_1.1-1        S4Vectors_0.12.1     BiocGenerics_0.20.0  digest_0.6.11        stats4_3.3.2        
[15] bitops_1.0-6         XML_3.98-1.5        

ADD COMMENTlink modified 6 weeks ago by Thomas Maurel570 • written 6 weeks ago by Marianna Foos0
2
gravatar for James W. MacDonald
6 weeks ago by
United States
James W. MacDonald42k wrote:

A good rule of thumb is to try doing a similar query at the Ensembl Biomart site if it fails within biomaRt. When I try that, I can make a query for everything but the associated gene (same as in biomaRt). But when I add in the associated_gene (which if you check listAttributes is the 'Associated gene with phenotype' filter), this pops up an extra dataset that I then have to choose. Since we are now using two datasets, this implies we are doing a join across two different databases.

Since biomaRt is designed to do directed queries against a single database, it appears to fail when asked to do a join across different databases. In addition, when I try to do the query including associated_gene at the Ensembl Biomart, it just hourglasses for a while and then pops up a dialog box saying there appears to be a problem with my query. So it's not clear if this query is something you can do at this time, regardless.

ADD COMMENTlink written 6 weeks ago by James W. MacDonald42k

Yes, it smelled like a join issue :) When I looked earlier, I checked to make sure that all the attributes were on the same "page" of attributes (as per attributePages()) to debug the join issue, saw that they were, and moved along. Now I see that there are multiple attributes named refsnp_id, chr_name, etc., and so maybe a join is [effectively] being forced where it's not needed?

The block starting at 448 here https://github.com/Bioconductor-mirror/biomaRt/blob/master/R/biomaRt.R should either prevent or object to this, but it appears to be wrapped in an "if(false)"

I guess, to summarize - bug report: listAttributes() appears to tell me I can do something I can't do, which silently fails.

Good tip about trying to perform the query on the Biomart site, thanks. I don't get the hourglassing, I just get the empty query, which is the same result in the long run, I suppose. Well, you're right that it's a phenotype associated gene anyway, so I guess it's moot. Thanks!

ADD REPLYlink written 6 weeks ago by Marianna Foos0

Thank you both for looking into this.  James' suggestion of trying the query on the Ensembl web interface is exactly how I debug a lot of the issues that get reported.  At the moment I find it much easier than trying to process the interim pages returned by the biomaRt code.

I'll take a look at the code block that can never be run, try to understand why it was blanked out and hopefully write something to catch this problem in the future.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Mike Smith1.4k
0
gravatar for Mike Smith
6 weeks ago by
Mike Smith1.4k
EMBL Heidelberg
Mike Smith1.4k wrote:

I've had a more detailed look at this, and I don't think the issues are to do with selecting attributes from different pages or datasets.  To me it more seems that when the underlying ensembl database contains no phenotypic information associated with a SNP, Biomart simply doesn't return anything for that variant, no matter what other attributes you've asked for.  To demonstrate this, the code below looks for 3 SNPs, including the one in your original question.

library(biomaRt)
mart.snps <- useMart('ENSEMBL_MART_SNP', 'hsapiens_snp')

snps = c("rs368991480",
         "rs1082714",
         "rs1085093")

getBM(attributes = c("refsnp_id", 
                     "ensembl_gene_stable_id", 
                     "associated_gene"),
      filters = "snp_filter",
      values = snps,
      mart = mart.snps)
  refsnp_id ensembl_gene_stable_id associated_gene
1 rs1082714                                  CAND1
2 rs1085093        ENSG00000203943            GLTP

You can see that associated gene information is returned for the two new SNPs, and our original one is simply dropped.  This seems inconsistent since it doesn't happen to the intronic SNP (rs1082714) that doesn't have an associated ensembl gene id. To me this is obviously undesirable behaviour, but the issue is with ensembl / Biomart, rather than the biomaRt package itself.

ADD COMMENTlink written 6 weeks ago by Mike Smith1.4k

Dear Mike,

I had a look on the Ensembl biomart Interface and I get a different behaviour than what you have reported below with the biomaRt package.

If I filter on the following variants rs368991480,rs1082714,rs1085093 and ask for the refsnp_id and ensembl_gene_stable_id attributes, I get the three variants back:

http://www.ensembl.org/biomart/martview?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_snp.default.snp.refsnp_id|hsapiens_snp.default.snp.ensembl_gene_stable_id&FILTERS=hsapiens_snp.default.filters.snp_filter."rs368991480,rs1082714,rs1085093"&VISIBLEPANEL=resultspanel

(Please see attached screenshot).

If I joint the Ensembl variation mart with the Ensembl gene mart to get the associated_gene attribute, I get the two variants associated with a gene back (rs368991480 and rs1085093):

http://www.ensembl.org/biomart/martview?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_snp.default.snp.refsnp_id|hsapiens_snp.default.snp.ensembl_gene_stable_id|hsapiens_gene_ensembl.default.snp.external_gene_name&FILTERS=hsapiens_snp.default.filters.snp_filter."rs368991480,rs1082714,rs1085093"&VISIBLEPANEL=resultspanel

(please see attached screenshot).

This is because rs1082714 is not mapped to any genes so it can’t be found in the Ensembl gene mart and as a result is not returned back by Biomart.

I’ve also checked with the biomart perl and Restful service and I get the same results back.

Do you know how and where the biomaRt package get the associated_gene attribute?

When I remove the associated_gene attribute, I get the three variants back from the BiomaRt package: 

> getBM(attributes = c("refsnp_id", 
+                      "ensembl_gene_stable_id"),
+       filters = "snp_filter",
+       values = snps,
+       mart = mart.snps)
    refsnp_id ensembl_gene_stable_id
1   rs1082714                       
2   rs1085093        ENSG00000203943
3 rs368991480        ENSG00000175445

Kind Regards,

Thomas

 

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Thomas Maurel570

Hi Thomas, 

Thanks for looking into this.  The example I gave doesn't try to join two datasets at all.  It's working exclusively in the Human Short Variants dataset.  From attributes I'm selecting  refsnp_id and ensembl_gene_stable_id as you've done.  I then also tick VARIANT ASSOCIATED INFORMATION -> Phenotype annotation -> Associated gene with phenotype.  You should be able to reproduce the query with the link here:

Biomart Query

This final attribute is the one actually called associated_gene in the dataset.

I actually get two results for rs1085093 (I assume this is because the SNP affects two transcripts), so I'm not sure why I only get a single result for it from biomaRt.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Mike Smith1.4k
0
gravatar for Thomas Maurel
6 weeks ago by
Thomas Maurel570
United Kingdom
Thomas Maurel570 wrote:

Hi Mike,

Oh I see sorry, I got confused with an other associated gene name attribute. I guess what I gave above is a temporary solution then ;-).

Yes, this look like a bug from our side, rs368991480 seems to be missing from our BioMart phenotype table. I will investigate and see if we can get it fixed for our next release e!88.

Yes, you are right it's because rs1085093 is linked to two transcripts (We do our mapping from variants to Transcript). Maybe the package is doing some post-processing in the background, if you add the "ensembl_transcript_stable_id" attribute, you get the two transcripts back:

> getBM(attributes = c("refsnp_id", 
+                      "ensembl_gene_stable_id",
+                      "ensembl_transcript_stable_id",
+                      "associated_gene"),
+       filters = "snp_filter",
+       values = snps,
+       mart = mart.snps)
  refsnp_id ensembl_gene_stable_id ensembl_transcript_stable_id associated_gene
1 rs1082714                                                               CAND1
2 rs1085093        ENSG00000203943              ENST00000370673            GLTP
3 rs1085093        ENSG00000203943              ENST00000370671            GLTP

 

ADD COMMENTlink written 6 weeks ago by Thomas Maurel570

Cool, glad what I've said make sense.  I've no idea how the various ensembl interfaces reference the data, but I note that there's no phenotype data available on the standard page for the SNP either (rs368991480).  The link on the left side is simply greyed out.

I'll have a look at the biomaRt code to check whether the duplicate entries are removed by default somewhere.

ADD REPLYlink written 6 weeks ago by Mike Smith1.4k
1

Dear Mike,

You are right, rs368991480 don't have any associated phenotypes so we can't bring in ENSG00000175445 associated gene name. Regarding the issue with rs368991480 being dropped from the result, I changed the way we generate our phenotype table so for our next release (e!88), you will get something like the following:

library(biomaRt)
mart.snps <- useMart('ENSEMBL_MART_SNP', 'hsapiens_snp')

snps = c("rs368991480",
         "rs1082714",
         "rs1085093")

getBM(attributes = c("refsnp_id", 
                     "ensembl_gene_stable_id", 
                     "associated_gene"),
      filters = "snp_filter",
      values = snps,
      mart = mart.snps)
  refsnp_id ensembl_gene_stable_id associated_gene
1 rs1082714                                  CAND1
2 rs1085093        ENSG00000203943            GLTP
3 rs368991480      ENSG00000175445

Kind Regards,

Thomas

ADD REPLYlink written 5 weeks ago by Thomas Maurel570

Awesome, that looks like a result that is much more consistent with how it works with other combinations of variables.  Thanks very much.

ADD REPLYlink written 5 weeks ago by Mike Smith1.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 161 users visited in the last hour