Does anyone have any insight into the following annoyingly inconsistent result? I get incompatible results from different Biomart queries whose inputs are subsets of one another.
My original query retrieves GO terms for 5032 UniProt identifiers. The full query is described below.
    library("biomaRt")
    attr <- c("uniprot_swissprot", "go_id", "namespace_1003",
              "name_1006", "go_linkage_type")
    filtername <- "uniprot_swissprot"

    ## the filters can be sourced from here
    # source("http://cpu.sysbiol.cam.ac.uk/misc/filtervalues.R")
    # length(filtervalues)
    ## [1] 5032
    head(filtervalues)
    ## [1] "Q9JHU4"   "Q9QXS1-3" "Q9ERU9"   "P26039"   "Q8BTM8"   "A2ARV4"

    ## or extracted from a proteomics data
    # suppressPackageStartupMessages(library(pRolocdata))
    # data(hyperLOPIT2015)
    # filtervalues <- featureNames(hyperLOPIT2015)

    mart <- useDataset("mmusculus_gene_ensembl",
                       mart = useMart("ENSEMBL_MART_ENSEMBL"))
    res_all <- getBM(attributes = attr, filters = filtername,
                     values = filtervalues, mart = mart)
    head(res_all)
    ##   uniprot_swissprot      go_id     namespace_1003
    ## 1            A2A432 GO:0005737 cellular_component
    ## 2            A2A432 GO:0070062 cellular_component
    ## 3            A2A432 GO:0007049 biological_process
    ## 4            A2A432 GO:0061630 molecular_function
    ## 5            A2A432 GO:0005654 cellular_component
    ## 6            A2A432 GO:0003684 molecular_function
    ##                           name_1006 go_linkage_type
    ## 1                         cytoplasm             ISO
    ## 2             extracellular exosome             ISO
    ## 3                        cell cycle             IEA
    ## 4 ubiquitin protein ligase activity             IBA
    ## 5                       nucleoplasm             ISO
    ## 6               damaged DNA binding             ISO
I am focusing on the results for 3 protein identifiers in particular:
    id0 <- c("Q8VDM4", "Q5SSW2", "Q9QUM9")
    sort(i <- match(id0, filtervalues))
    ## [1]  359 1063 2717
I now repeat the query above, but only with subsets of the input that contain my features of interest:
    res1 <- getBM(attributes = attr, filters = filtername,
                  values = filtervalues[i], mart = mart)
    res2 <- getBM(attributes = attr, filters = filtername,
                  values = filtervalues[300:3000], mart = mart)
    res3 <- getBM(attributes = attr, filters = filtername,
                  values = filtervalues[300:4000], mart = mart)
    res4 <- getBM(attributes = attr, filters = filtername,
                  values = filtervalues[300:5000], mart = mart)
And here I check how many rows of each result match my 3 proteins of interest:
    sum(res_all$uniprot_swissprot %in% id0)
    ## [1] 0
    sum(res1$uniprot_swissprot %in% id0)
    ## [1] 70
    sum(res2$uniprot_swissprot %in% id0)
    ## [1] 70
    sum(res3$uniprot_swissprot %in% id0)
    ## [1] 15
    sum(res4$uniprot_swissprot %in% id0)
    ## [1] 0
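The totals above do not show which of the three identifiers go missing from which query. A minimal sketch of a per-identifier breakdown follows; `count_hits` is a hypothetical helper, not part of biomaRt, and the `toy` data frames are made up purely to make the snippet runnable without a Biomart connection.

```r
## Hypothetical helper (not part of biomaRt): count, for each result table,
## how many rows belong to each identifier of interest.
count_hits <- function(results, ids, column = "uniprot_swissprot") {
    counts <- matrix(0L, nrow = length(results), ncol = length(ids),
                     dimnames = list(names(results), ids))
    for (r in seq_along(results)) {
        for (j in seq_along(ids)) {
            counts[r, j] <- sum(results[[r]][[column]] %in% ids[j])
        }
    }
    counts
}

## Made-up toy data standing in for real getBM() results.
toy <- list(
    full   = data.frame(uniprot_swissprot = character(0)),
    subset = data.frame(uniprot_swissprot = c("Q8VDM4", "Q8VDM4", "Q5SSW2"))
)
print(count_hits(toy, c("Q8VDM4", "Q5SSW2", "Q9QUM9")))
```

With the real objects, the call would be something like `count_hits(list(all = res_all, res1 = res1, res2 = res2, res3 = res3, res4 = res4), id0)`.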
I suppose the issue lies on the Biomart side rather than with biomaRt, but I was wondering if anyone had an idea?
    sessionInfo()
    ## R version 3.3.1 Patched (2016-08-02 r71022)
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## Running under: Ubuntu 14.04.5 LTS
    ##
    ## locale:
    ##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
    ##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
    ##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
    ##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
    ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
    ## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
    ##
    ## attached base packages:
    ## [1] stats     graphics  grDevices utils     datasets  methods   base
    ##
    ## other attached packages:
    ## [1] biomaRt_2.29.2
    ##
    ## loaded via a namespace (and not attached):
    ##  [1] IRanges_2.7.12       msdata_0.12.1        XML_3.98-1.4
    ##  [4] bitops_1.0-6         DBI_0.5              stats4_3.3.1
    ##  [7] magrittr_1.5         evaluate_0.9         RSQLite_1.0.0
    ## [10] stringi_1.1.1        S4Vectors_0.11.10    tools_3.3.1
    ## [13] stringr_1.0.0        Biobase_2.33.0       RCurl_1.95-4.8
    ## [16] parallel_3.3.1       BiocGenerics_0.19.2  AnnotationDbi_1.35.4
    ## [19] knitr_1.14
This is really weird. You're right, it seems to be a problem with biomart itself as I get the same inconsistencies if I use their web interface and avoid R completely.
I wonder if there's some limit to the number of values you can use as a filter? Biomart recommend a maximum of 500, which I always assumed was just for speed, but perhaps it's more fundamental.
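If the number of filter values is indeed the culprit, one workaround to try is to send the identifiers in batches of at most 500 and combine the results. The sketch below is only that: `chunk_values` and `get_chunked` are hypothetical helpers, `query_go` assumes the `attr`/`filtername` settings from the question, and whether batching actually removes the inconsistency has not been verified here.

```r
## Hypothetical helper: split filter values into batches of at most `size`.
chunk_values <- function(values, size = 500) {
    split(values, ceiling(seq_along(values) / size))
}

## Hypothetical helper: run `query` (a function taking a vector of ids and
## returning a data.frame) on each batch, then row-bind and de-duplicate.
get_chunked <- function(values, query, size = 500) {
    res <- lapply(chunk_values(unique(values), size), query)
    out <- unique(do.call(rbind, res))
    rownames(out) <- NULL
    out
}

## Wrapper around the query from the question; it is only defined here,
## not called, so the snippet runs without a Biomart connection.
attr <- c("uniprot_swissprot", "go_id", "namespace_1003",
          "name_1006", "go_linkage_type")
filtername <- "uniprot_swissprot"
query_go <- function(ids, mart) {
    biomaRt::getBM(attributes = attr, filters = filtername,
                   values = ids, mart = mart)
}
```

With a `mart` object as in the question, this would be called as `res_chunked <- get_chunked(filtervalues, function(ids) query_go(ids, mart))`, and `sum(res_chunked$uniprot_swissprot %in% id0)` could then be compared against the counts from the single large query.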
The example below shows how adding one extra ID to filter against can remove another ID from the results.
That isn't really conclusive, but it might be somewhere to dig if you want to get to the bottom of this.

Thanks, Mike. Nice test with the extra id!