Question

Inconsistencies in biomart results

0

Entering edit mode

Laurent Gatto 1.6k

@laurent-gatto-5645

Last seen 4 months ago

Belgium

Has anyone any insight into the following annoying inconsistent result?: I get incompatible results for different Biomart queries, each being a subset of the other.

My original query consists of retrieve GO terms for 5032 UniProt identifiers. The full query is described below

library("biomaRt")
attr <- c("uniprot_swissprot", "go_id", "namespace_1003", "name_1006",
          "go_linkage_type")
filtername <- "uniprot_swissprot"

## the filters can be sourced from here
# source("http://cpu.sysbiol.cam.ac.uk/misc/filtervalues.R")
# length(filtervalues)

## [1] 5032

head(filtervalues)
## [1] "Q9JHU4"   "Q9QXS1-3" "Q9ERU9"   "P26039"   "Q8BTM8"   "A2ARV4"

## or extracted from a proteomics data
# suppressPackageStartupMessages(library(pRolocdata))
# data(hyperLOPIT2015)
# filtervalues <- featureNames(hyperLOPIT2015)

mart <- useDataset("mmusculus_gene_ensembl",
                   mart = useMart("ENSEMBL_MART_ENSEMBL"))

res_all <- getBM(attributes = attr, filters = filtername,
                 values = filtervalues, mart = mart)

head(res_all)
##   uniprot_swissprot      go_id     namespace_1003
## 1            A2A432 GO:0005737 cellular_component
## 2            A2A432 GO:0070062 cellular_component
## 3            A2A432 GO:0007049 biological_process
## 4            A2A432 GO:0061630 molecular_function
## 5            A2A432 GO:0005654 cellular_component
## 6            A2A432 GO:0003684 molecular_function
##                           name_1006 go_linkage_type
## 1                         cytoplasm             ISO
## 2             extracellular exosome             ISO
## 3                        cell cycle             IEA
## 4 ubiquitin protein ligase activity             IBA
## 5                       nucleoplasm             ISO
## 6               damaged DNA binding             ISO

I am focusing in the results for 3 protein identifiers in particular

id0 <- c("Q8VDM4", "Q5SSW2", "Q9QUM9")
sort(i <- match(id0, filtervalues))
## [1]  359 1063 2717

I am now repeating the query above, but only with a subset of the input that contains my features of interest

res1 <- getBM(attributes = attr, filters = filtername,
              values = filtervalues[i], mart = mart)

res2 <- getBM(attributes = attr, filters = filtername,
              values = filtervalues[300:3000], mart = mart)

res3 <- getBM(attributes = attr, filters = filtername,
              values = filtervalues[300:4000], mart = mart)

res4 <- getBM(attributes = attr, filters = filtername,
              values = filtervalues[300:5000], mart = mart)

And here, I check how many of my results match my 3 protein of interest

sum(res_all$uniprot_swissprot %in% id0)
## [1] 0
sum(res1$uniprot_swissprot %in% id0)
## [1] 70
sum(res2$uniprot_swissprot %in% id0)
## [1] 70
sum(res3$uniprot_swissprot %in% id0)
## [1] 15
sum(res4$uniprot_swissprot %in% id0)
## [1] 0

I suppose the issue lies on the Biomart side, rather than biomaRt, but I was wondering of anyone had an idea?

sessionInfo()
## R version 3.3.1 Patched (2016-08-02 r71022)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
##
## other attached packages:
## [1] biomaRt_2.29.2
##
## loaded via a namespace (and not attached):
##  [1] IRanges_2.7.12       msdata_0.12.1        XML_3.98-1.4        
##  [4] bitops_1.0-6         DBI_0.5              stats4_3.3.1        
##  [7] magrittr_1.5         evaluate_0.9         RSQLite_1.0.0       
## [10] stringi_1.1.1        S4Vectors_0.11.10    tools_3.3.1         
## [13] stringr_1.0.0        Biobase_2.33.0       RCurl_1.95-4.8      
## [16] parallel_3.3.1       BiocGenerics_0.19.2  AnnotationDbi_1.35.4
## [19] knitr_1.14

biomart • 2.6k views

ADD COMMENT • link updated 8.2 years ago by Mike Smith ★ 6.6k • written 8.9 years ago by Laurent Gatto 1.6k

2

Entering edit mode

This is really weird. You're right, it seems to be a problem with biomart itself as I get the same inconsistencies if I use their web interface and avoid R completely.

I wonder if there's some limit to the number of values you can use as a filter? Biomart recommend a maximum of 500, which I always assumed was just for speed, but perhaps it's more fundamental.

The example below shows how adding one extra ID to filter against, can remove another ID from the results.

source("http://cpu.sysbiol.cam.ac.uk/misc/filtervalues.R")
attr <- c("uniprot_swissprot", "go_id", "namespace_1003", "name_1006",
          "go_linkage_type")
filtername <- "uniprot_swissprot"

## several very similar sets of IDs to filter
filter_list <- list()
filter_list[[1]] <- filtervalues[1:3423]
filter_list[[2]] <- filtervalues[1:3424]
filter_list[[3]] <- filtervalues[2:3424]
filter_list[[4]] <- filtervalues[3:3425]
filter_list[[5]] <- filtervalues[3:3426]

res <- lapply(filter_list, getBM, 
       attributes = attr, 
       filters = filtername,
       mart = mart)

## print how many results match our ID of interest
lapply(res, function(x) { 
    length(which(x$uniprot_swissprot == "Q8VDM4"))
})

[[1]]
[1] 16

[[2]]
[1] 0

[[3]]
[1] 16

[[4]]
[1] 16

[[5]]
[1] 0

That isn't really conclusive, but it might be somewhere to dig if you want to get the bottom of this.

ADD REPLY • link 8.9 years ago Mike Smith ★ 6.6k

0

Entering edit mode

Thanks, Mike. Nice test with the extra id!

ADD REPLY • link 8.9 years ago Laurent Gatto 1.6k

1

Entering edit mode

Mike Smith ★ 6.6k

@mike-smith

Last seen 10 hours ago

EMBL Heidelberg

Dredging up an old post here, but I've modified the getBM() function in biomaRt to submit queries in batches if the number of values exceeds 500. If you have multiple filters each of which have more than 500 values it should generate multiple mutually exclusive queries so that all combinations are run without breaking the 500 value limit. All of this is done internally, so existing biomaRt scripts shouldn't need to be changed. It will also display a progress bar so you can tell it is still proceeding.

If anyone finds any issues with this, please let me know.

You can test the code with the following example:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl" )

Download a list of 20,000 Uniprot/TrEMBL IDs to use as our query values, and then submit the biomaRt query.

protein_ids <- read.table("http://msmith.de/data/20k_uniprot_ids.txt",
                          header = TRUE,
                          stringsAsFactors = FALSE,
                          sep = "\t")

GO_terms <- getBM(attributes = c("uniprotsptrembl",
                                 "go_id",
                                 "name_1006",
                                 "namespace_1003",
                                 "go_linkage_type"),
                      filters = c('uniprotsptrembl'),
                      values = protein_ids,
                      mart = ensembl)

If we run this with the current release version, we see that ~8% of the protein IDs were silently dropped from the return:

> packageVersion("biomaRt")
[1] ‘2.32.0’
> table(protein_ids[,1] %in% GO_terms$uniprotsptrembl)

FALSE  TRUE 
 1618 18382

Using the devel version this no longer happens:

> packageVersion("biomaRt")
[1] ‘2.33.1’
> table(protein_ids[,1] %in% GO_terms$uniprotsptrembl)

 TRUE 
20000

ADD COMMENT • link 8.2 years ago Mike Smith ★ 6.6k

0

Entering edit mode

That's great news, thanks a lot for implementing this Mike.

Kind Regards,

Thomas

ADD REPLY • link 8.2 years ago Thomas Maurel ▴ 800

0

Entering edit mode

No problem, this seemed like a bit of an insidious bug when I thought about it.

Let me know if it causes any issues your end. I don't know if you ever ban IPs or anything, but if someone asks for data on 30,000 genes you'll get 60 queries in quick succession now, which might look strange in the logs. That said, right now it only submits in one at a time, so you won't be bombarded with queries in parallel.

ADD REPLY • link 8.2 years ago Mike Smith ★ 6.6k

score 3 · Accepted Answer · 2016-08-22

3

Entering edit mode

Thomas Maurel ▴ 800

@thomas-maurel-5295

Last seen 2.4 years ago

United Kingdom

Dear Laurent,

I am afraid that the issue is coming from the volume of data that you are requesting, this is causing BioMart to time out part way through the query. As a result the data coming back is truncated. Could you please split your queries in batches of 500 ids?

Also, I've noticed that some ids are badly formatted in filtervalues, e.g:

[2671] "O54950"               "O55028"               "O70281"               "O88379"               "P26043;P26040;P26041"

I am afraid that the last one won't be processed by the Biomart filter as it will see this as one ID.

Also, I am afraid that we don't have Uniprot isoforms in Biomart at the moment, so the following won't be processed by the filter:

[2551] "Q9CX30-2"

Hope this helps,

Kind Regards,

Thomas

ADD COMMENT • link 8.9 years ago Thomas Maurel ▴ 800

0

Entering edit mode

Thank, Thomas. Splitting is exactly what I did, but I used batches of 1000; will update to 500.

Re "P26043;P26040;P26041", it is a protein group composed of these three proteins, and not getting any result is actually a feature in this case. Too bad however for the isoforms, I might consider doing additional processing for there. Thanks for the clarifications!

ADD REPLY • link 8.9 years ago Laurent Gatto 1.6k