Search
Question: Inconsistencies in biomart results
0
2.1 years ago by
Laurent Gatto1.0k
United Kingdom
Laurent Gatto1.0k wrote:

Has anyone any insight into the following annoying inconsistent result?: I get incompatible results for different Biomart queries, each being a subset of the other.

My original query consists of retrieve GO terms for 5032 UniProt identifiers. The full query is described below

library("biomaRt")
attr <- c("uniprot_swissprot", "go_id", "namespace_1003", "name_1006",
filtername <- "uniprot_swissprot"

## the filters can be sourced from here
# source("http://cpu.sysbiol.cam.ac.uk/misc/filtervalues.R")
# length(filtervalues)

## [1] 5032

## [1] "Q9JHU4"   "Q9QXS1-3" "Q9ERU9"   "P26039"   "Q8BTM8"   "A2ARV4"

## or extracted from a proteomics data
# suppressPackageStartupMessages(library(pRolocdata))
# data(hyperLOPIT2015)
# filtervalues <- featureNames(hyperLOPIT2015)

mart <- useDataset("mmusculus_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL"))

res_all <- getBM(attributes = attr, filters = filtername,
values = filtervalues, mart = mart)

##   uniprot_swissprot      go_id     namespace_1003
## 1            A2A432 GO:0005737 cellular_component
## 2            A2A432 GO:0070062 cellular_component
## 3            A2A432 GO:0007049 biological_process
## 4            A2A432 GO:0061630 molecular_function
## 5            A2A432 GO:0005654 cellular_component
## 6            A2A432 GO:0003684 molecular_function
## 1                         cytoplasm             ISO
## 2             extracellular exosome             ISO
## 3                        cell cycle             IEA
## 4 ubiquitin protein ligase activity             IBA
## 5                       nucleoplasm             ISO
## 6               damaged DNA binding             ISO


I am focusing in the results for 3 protein identifiers in particular

id0 <- c("Q8VDM4", "Q5SSW2", "Q9QUM9")
sort(i <- match(id0, filtervalues))
## [1]  359 1063 2717


I am now repeating the query above, but only with a subset of the input that contains my features of interest

res1 <- getBM(attributes = attr, filters = filtername,
values = filtervalues[i], mart = mart)

res2 <- getBM(attributes = attr, filters = filtername,
values = filtervalues[300:3000], mart = mart)

res3 <- getBM(attributes = attr, filters = filtername,
values = filtervalues[300:4000], mart = mart)

res4 <- getBM(attributes = attr, filters = filtername,
values = filtervalues[300:5000], mart = mart)


And here, I check how many of my results match my 3 protein of interest

sum(res_all$uniprot_swissprot %in% id0) ## [1] 0 sum(res1$uniprot_swissprot %in% id0)
## [1] 70
sum(res2$uniprot_swissprot %in% id0) ## [1] 70 sum(res3$uniprot_swissprot %in% id0)
## [1] 15
sum(res4$uniprot_swissprot %in% id0) ## [1] 0  I suppose the issue lies on the Biomart side, rather than biomaRt, but I was wondering of anyone had an idea? sessionInfo() ## R version 3.3.1 Patched (2016-08-02 r71022) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 14.04.5 LTS ## ## locale: ## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 ## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 ## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] biomaRt_2.29.2 ## ## loaded via a namespace (and not attached): ## [1] IRanges_2.7.12 msdata_0.12.1 XML_3.98-1.4 ## [4] bitops_1.0-6 DBI_0.5 stats4_3.3.1 ## [7] magrittr_1.5 evaluate_0.9 RSQLite_1.0.0 ## [10] stringi_1.1.1 S4Vectors_0.11.10 tools_3.3.1 ## [13] stringr_1.0.0 Biobase_2.33.0 RCurl_1.95-4.8 ## [16] parallel_3.3.1 BiocGenerics_0.19.2 AnnotationDbi_1.35.4 ## [19] knitr_1.14 ADD COMMENTlink modified 16 months ago by Mike Smith2.9k • written 2.1 years ago by Laurent Gatto1.0k 2 This is really weird. You're right, it seems to be a problem with biomart itself as I get the same inconsistencies if I use their web interface and avoid R completely. I wonder if there's some limit to the number of values you can use as a filter? Biomart recommend a maximum of 500, which I always assumed was just for speed, but perhaps it's more fundamental. The example below shows how adding one extra ID to filter against, can remove another ID from the results. source("http://cpu.sysbiol.cam.ac.uk/misc/filtervalues.R") attr <- c("uniprot_swissprot", "go_id", "namespace_1003", "name_1006", "go_linkage_type") filtername <- "uniprot_swissprot" ## several very similar sets of IDs to filter filter_list <- list() filter_list[[1]] <- filtervalues[1:3423] filter_list[[2]] <- filtervalues[1:3424] filter_list[[3]] <- filtervalues[2:3424] filter_list[[4]] <- filtervalues[3:3425] filter_list[[5]] <- filtervalues[3:3426] res <- lapply(filter_list, getBM, attributes = attr, filters = filtername, mart = mart) ## print how many results match our ID of interest lapply(res, function(x) { length(which(x$uniprot_swissprot == "Q8VDM4"))
})
[[1]]
[1] 16

[[2]]
[1] 0

[[3]]
[1] 16

[[4]]
[1] 16

[[5]]
[1] 0

That isn't really conclusive, but it might be somewhere to dig if you want to get the bottom of this.

Thanks, Mike. Nice test with the extra id!

3
2.1 years ago by
Thomas Maurel740
United Kingdom
Thomas Maurel740 wrote:

Dear Laurent,

I am afraid that the issue is coming from the volume of data that you are requesting, this is causing BioMart to time out part way through the query. As a result the data coming back is truncated. Could you please split your queries in batches of 500 ids?

Also, I've noticed that some ids are badly formatted in filtervalues, e.g:

[2671] "O54950"               "O55028"               "O70281"               "O88379"               "P26043;P26040;P26041"

I am afraid that the last one won't be processed by the Biomart filter as it will see this as one ID.

Also, I am afraid that we don't have Uniprot isoforms in Biomart at the moment, so the following won't be processed by the filter:

[2551] "Q9CX30-2"

Hope this helps,

Kind Regards,

Thomas

Thank, Thomas. Splitting is exactly what I did, but I used batches of 1000; will update to 500.

Re "P26043;P26040;P26041", it is a protein group composed of these three proteins, and not getting any result is actually a feature in this case. Too bad however for the isoforms, I might consider doing additional processing for there. Thanks for the clarifications!

1
16 months ago by
Mike Smith2.9k
EMBL Heidelberg / de.NBI
Mike Smith2.9k wrote:

Dredging up an old post here, but I've modified the getBM() function in biomaRt to submit queries in batches if the number of values exceeds 500.  If you have multiple filters each of which have more than 500 values it should generate multiple mutually exclusive queries so that all combinations are run without breaking the 500 value limit.  All of this is done internally, so existing biomaRt scripts shouldn't need to be changed.  It will also display a progress bar so you can tell it is still proceeding.

If anyone finds any issues with this, please let me know.

You can test the code with the following example:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl" )

Download a list of 20,000 Uniprot/TrEMBL IDs to use as our query values, and then submit the biomaRt query.

protein_ids <- read.table("http://msmith.de/data/20k_uniprot_ids.txt",
stringsAsFactors = FALSE,
sep = "\t")

GO_terms <- getBM(attributes = c("uniprotsptrembl",
"go_id",
"name_1006",
"namespace_1003",
filters = c('uniprotsptrembl'),
values = protein_ids,
mart = ensembl)

If we run this with the current release version, we see that ~8% of the protein IDs were silently dropped from the return:

> packageVersion("biomaRt")
[1] ‘2.32.0’
> table(protein_ids[,1] %in% GO_terms$uniprotsptrembl) FALSE TRUE 1618 18382  Using the devel version this no longer happens: > packageVersion("biomaRt") [1] ‘2.33.1’ > table(protein_ids[,1] %in% GO_terms$uniprotsptrembl)

TRUE
20000 

That's great news, thanks a lot for implementing this Mike.

Kind Regards,

Thomas

No problem, this seemed like a bit of an insidious bug when I thought about it.

Let me know if it causes any issues your end.  I don't know if you ever ban IPs or anything, but if someone asks for data on 30,000 genes you'll get 60 queries in quick succession now, which might look strange in the logs.  That said, right now it only submits in one at a time, so you won't be bombarded with queries in parallel.