Question

Error: This function only works when used with the ensembl BioMart.

0

Entering edit mode

anchel.gonzalez • 0

@anchelgonzalez-9223

Last seen 8.4 years ago

Netherlands

Hello everyone,

First of all, I am a biologist, not a bioinformatician or statistician, who is trying to use R for a particular application. Therefore the answer could be very obvious but not easy to understand to me, so I would like to ask you to be patient, please.

I received the following code from a bioinformatician in the past to get the hgnc symbols for the gene ids in biomaRt:

> mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")
> ann <- getGene(id = rownames(data), type = "ensembl_gene_id", mart = mart)
> m <- match(rownames(data),ann[,9])
> genes <- ann[m,1:8]

After running it, it gave me an error because, apparently, the BioMart site doesn't provide the biomart service at present. So I switched to the ensembl biomart server as recommended here: biomaRt no longer works with proxy after upgrade

So this is how my code looks like now:
> ensembl = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="www.ensembl.org")
> ann <- getGene(id = rownames(data), type = "ensembl_gene_id", mart = mart)
> m <- match(rownames(data),ann[,9])
> genes <- ann[m,1:8]

However, now I get the following error after the second row:

"Error in martCheck(mart, "ensembl") :
This function only works when used with the ensembl BioMart."

I have been reading the biomaRt user's guide and checking online for a solution, but I just don't know enough to understand how I should change the code to fix it. I would really appreciate any help!

The package version of bioMart that I am using is 2.26.0

My R software version is 3.2.2

biomart error • 6.2k views

ADD COMMENT • link 8.4 years ago anchel.gonzalez • 0

0

Entering edit mode

anchel.gonzalez • 0

@anchelgonzalez-9223

Last seen 8.4 years ago

Netherlands

Thank you, switching to the getBM() function solved the problem, but not completely...

The function returns 14778 genes, whereas my data contains 15130 annotated genes:

 #Choose samples
> A=which(Metadata_Anchel$comparison=="A")
> B=which(Metadata_Anchel$comparison=="B")
> Data_A=Data_DM[,A]
> Metadata_A=Metadata_Anchel[A,]
> Data_B=Data_DM[,B]
> Metadata_B=Metadata_Anchel[B,]
>
> #Merge the samples
> Data_A_vs_B=cbind(Data_A,Data_B)
> Metadata_A_vs_B=rbind(Metadata_A,Metadata_B)
>
> #Define the vector that will be used for the design matrix to test differential expression
> matrix_A_vs_B=c(rep(1,2),rep(2,2))
>
> #Get the hgnc symbols for the gene ids
> mart = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="www.ensembl.org")
> ann <- getBM(c("hgnc_symbol","description","chromosome_name","band","strand","start_position","end_position","ensembl_gene_id"), "ensembl_gene_id", rownames(Data_A_vs_B), mart)
>
> dim(Data_A_vs_B)
[1] 15130     4
> dim(ann)
[1] 14778     8

Then, when I run the Limma package to cross the tables and do the statistical analysis, the "ensemble_gene_id" entries don't match, and I have 352 gaps assigned as NA (the difference between "Data_A_vs_B" and "ann"):

 
> #Run the limma script to test for differential expression
> design=model.matrix(~matrix_A_vs_B)
> nf=calcNormFactors(Data_A_vs_B)
> y=voom(Data_A_vs_B,design,plot=TRUE,lib.size=colSums(Data_A_vs_B)*nf)
> y$genes <- ann
> fit=lmFit(y,design)
> fit=eBayes(fit)
> summary(decideTests(fit))
   (Intercept) matrix_A_vs_B
-1         254           158
0         4837         14680
1        10039           292
> topTable(fit,coef=2,n=20,sort.by="p")
                 hgnc_symbol
ENSG00000130600        WDR91
ENSG00000179051        CDH15
ENSG00000009709          VGF
ENSG00000171004        PANK3
ENSG00000130702 TRAF3IP2-AS1
ENSG00000155011      RPS6KA1
ENSG00000107859      BHLHE41
ENSG00000158258         <NA>
ENSG00000132329       HHIPL1
ENSG00000129757         ARL2
ENSG00000165495       ZNF787
ENSG00000108823        SLFN5
ENSG00000013297        FBXO9
ENSG00000106003       SNHG19
ENSG00000129965        ALDH2
ENSG00000174600   ST6GALNAC4
ENSG00000049130       TP53I3
ENSG00000108001        TLCD1
ENSG00000112759        IGSF8
ENSG00000148677            
                                                                                                                                                      description
ENSG00000130600                                                                                           WD repeat domain 91 [Source:HGNC Symbol;Acc:HGNC:24997]
ENSG00000179051                                                                    cadherin 15, type 1, M-cadherin (myotubule) [Source:HGNC Symbol;Acc:HGNC:1754]
ENSG00000009709                                                                             VGF nerve growth factor inducible [Source:HGNC Symbol;Acc:HGNC:12684]
ENSG00000171004                                                                                         pantothenate kinase 3 [Source:HGNC Symbol;Acc:HGNC:19365]
ENSG00000130702                                                                                      TRAF3IP2 antisense RNA 1 [Source:HGNC Symbol;Acc:HGNC:40005]
ENSG00000155011                                                             ribosomal protein S6 kinase, 90kDa, polypeptide 1 [Source:HGNC Symbol;Acc:HGNC:10430]
ENSG00000107859                                                                     basic helix-loop-helix family, member e41 [Source:HGNC Symbol;Acc:HGNC:16617]
ENSG00000158258                                                                                                                                              <NA>
ENSG00000132329                                                                                                   HHIP-like 1 [Source:HGNC Symbol;Acc:HGNC:19710]
ENSG00000129757                                                                                  ADP-ribosylation factor-like 2 [Source:HGNC Symbol;Acc:HGNC:693]
ENSG00000165495                                                                                       zinc finger protein 787 [Source:HGNC Symbol;Acc:HGNC:26998]
ENSG00000108823                                                                                      schlafen family member 5 [Source:HGNC Symbol;Acc:HGNC:28286]
ENSG00000013297                                                                                               F-box protein 9 [Source:HGNC Symbol;Acc:HGNC:13588]
ENSG00000106003                                                                              small nucleolar RNA host gene 19 [Source:HGNC Symbol;Acc:HGNC:49574]
ENSG00000129965                                                                 aldehyde dehydrogenase 2 family (mitochondrial) [Source:HGNC Symbol;Acc:HGNC:404]
ENSG00000174600 ST6 (alpha-N-acetyl-neuraminyl-2,3-beta-galactosyl-1,3)-N-acetylgalactosaminide alpha-2,6-sialyltransferase 4 [Source:HGNC Symbol;Acc:HGNC:17846]
ENSG00000049130                                                                         tumor protein p53 inducible protein 3 [Source:HGNC Symbol;Acc:HGNC:19373]
ENSG00000108001                                                                                       TLC domain containing 1 [Source:HGNC Symbol;Acc:HGNC:25177]
ENSG00000112759                                                                          immunoglobulin superfamily, member 8 [Source:HGNC Symbol;Acc:HGNC:17813]
ENSG00000148677                                                                                     Uncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:H3BRN7]
                chromosome_name   band strand start_position end_position ensembl_gene_id     logFC
ENSG00000130600               7    q33     -1      135183839    135211534 ENSG00000105875 10.061646
ENSG00000179051              16  q24.3      1       89171767     89195492 ENSG00000129910  5.349813
ENSG00000009709               7  q22.1     -1      101162509    101165593 ENSG00000128564  6.032908
ENSG00000171004               5    q34     -1      168548495    168579600 ENSG00000120137  4.727610
ENSG00000130702               6    q21      1      111483511    111598302 ENSG00000231889  4.111670
ENSG00000155011               1 p36.11      1       26529761     26575030 ENSG00000117676  3.905615
ENSG00000107859              12  p12.1     -1       26120026     26125127 ENSG00000123095  4.608072
ENSG00000158258            <NA>   <NA>     NA             NA           NA            <NA>  3.849851
ENSG00000132329              14  q32.2      1       99645110     99680569 ENSG00000182218  5.040559
ENSG00000129757              11  q13.1      1       65014113     65022184 ENSG00000213465  3.424920
ENSG00000165495              19 q13.43     -1       56087366     56121280 ENSG00000142409  6.335747
ENSG00000108823              17    q12      1       35243036     35273655 ENSG00000166750  3.348464
ENSG00000013297               6  p12.1      1       53051991     53100873 ENSG00000112146 -5.514139
ENSG00000106003              16  p13.3     -1        2154797      2155358 ENSG00000260260  4.412275
ENSG00000129965              12 q24.12      1      111766887    111817529 ENSG00000111275  3.215444
ENSG00000174600               9 q34.11     -1      127907886    127917038 ENSG00000136840  4.261586
ENSG00000049130               2  p23.3     -1       24077433     24085861 ENSG00000115129 -4.344771
ENSG00000108001              17  q11.2     -1       28724348     28727935 ENSG00000160606  4.565692
ENSG00000112759               1  q23.2     -1      160091340    160098943 ENSG00000162729  3.681301
ENSG00000148677              15    q23     -1       68193801     68229718 ENSG00000260007  3.163246
                 AveExpr         t      P.Value   adj.P.Val        B
ENSG00000130600 8.300644 15.521686 9.471365e-08 0.001433018 7.731591
ENSG00000179051 5.047187 11.720653 1.036651e-06 0.007842266 5.729553
ENSG00000009709 4.203276 10.955718 1.824472e-06 0.009201422 5.027085
ENSG00000171004 4.067789  9.371134 6.623554e-06 0.011789728 4.132719
ENSG00000130702 5.035937  9.246298 7.388210e-06 0.011789728 4.194915
ENSG00000155011 5.449131  9.172570 7.885189e-06 0.011789728 4.168109
ENSG00000107859 3.962637  9.122268 8.245367e-06 0.011789728 3.948020
ENSG00000158258 5.424068  9.115529 8.294977e-06 0.011789728 4.124020
ENSG00000132329 4.008954  9.062435 8.697583e-06 0.011789728 3.863129
ENSG00000129757 8.724135  8.920984 9.879237e-06 0.011789728 4.032573
ENSG00000165495 3.087339  8.746510 1.158675e-05 0.011789728 3.295786
ENSG00000108823 6.724002  8.614238 1.309798e-05 0.011789728 3.750933
ENSG00000013297 8.228763 -8.586334 1.344376e-05 0.011789728 3.729286
ENSG00000106003 3.620577  8.577444 1.355602e-05 0.011789728 3.489965
ENSG00000129965 8.727431  8.564536 1.372087e-05 0.011789728 3.723597
ENSG00000174600 3.884994  8.536712 1.408374e-05 0.011789728 3.512196
ENSG00000049130 5.237542 -8.515026 1.437390e-05 0.011789728 3.529719
ENSG00000108001 3.433202  8.507214 1.448001e-05 0.011789728 3.392038
ENSG00000112759 7.332290  8.481276 1.483863e-05 0.011789728 3.640597
ENSG00000148677 7.241413  8.306703 1.752203e-05 0.011789728 3.484323
>

I think this is the last obstacle between me and the data :)

Do you know how I can fix the problem?

Thank you once again in advance,

Anchel

ADD COMMENT • link 8.4 years ago anchel.gonzalez • 0

0

Entering edit mode

anchel.gonzalez • 0

@anchelgonzalez-9223

Last seen 8.4 years ago

Netherlands

Dear James,

I appreciate very much your patience and I hope you still have some time to help me.

The match() function is not doing the job correctly and I can't figure out why (although I realize it can be something very stupid that I am missing...)

This is how I am trying to merge and match my data "Data_A_vs_B" with the object I get back from getBM():

> #Get the hgnc symbols for the gene ids
> mart = useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host="www.ensembl.org")
> ann <- getBM(c("hgnc_symbol","description","chromosome_name","band","strand","start_position","end_position","ensembl_gene_id"), "ensembl_gene_id", rownames(Data_A_vs_B), mart)
>
> head(Data_A_vs_B)
                M14  M20 M13 M19
ENSG00000110514 163  240 204 166
ENSG00000086015 296  224 139 122
ENSG00000169740 207  454 182 165
ENSG00000261609  31   75  56  67
ENSG00000169744 375 1151  84 159
ENSG00000261604  17   69  19  13
>
> head(ann)
  hgnc_symbol
1      TSPAN6
2        DPM1
3       SCYL3
4    C1orf112
5         CFH
6       FUCA2
                                                                                                 description
1                                                          tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
2 dolichyl-phosphate mannosyltransferase polypeptide 1, catalytic subunit [Source:HGNC Symbol;Acc:HGNC:3005]
3                                               SCY1-like, kinase-like 3 [Source:HGNC Symbol;Acc:HGNC:19285]
4                                    chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
5                                                     complement factor H [Source:HGNC Symbol;Acc:HGNC:4883]
6                                          fucosidase, alpha-L- 2, plasma [Source:HGNC Symbol;Acc:HGNC:4008]
  chromosome_name   band strand start_position end_position ensembl_gene_id
1               X  q22.1     -1      100627109    100639991 ENSG00000000003
2              20 q13.13     -1       50934867     50958555 ENSG00000000419
3               1  q24.2     -1      169849631    169894267 ENSG00000000457
4               1  q24.2      1      169662007    169854080 ENSG00000000460
5               1  q31.3      1      196651878    196747504 ENSG00000000971
6               6  q24.2     -1      143494811    143511690 ENSG00000001036
>
> #Match your data with "ann" information based on "ensembl_gene_id" column:
> ann.fixed <- ann[match(Data_A_vs_B[,1], ann[,8]),]
>
> head(ann.fixed)
     hgnc_symbol description chromosome_name band strand start_position end_position ensembl_gene_id
NA          <NA>        <NA>            <NA> <NA>     NA             NA           NA            <NA>
NA.1        <NA>        <NA>            <NA> <NA>     NA             NA           NA            <NA>
NA.2        <NA>        <NA>            <NA> <NA>     NA             NA           NA            <NA>
NA.3        <NA>        <NA>            <NA> <NA>     NA             NA           NA            <NA>
NA.4        <NA>        <NA>            <NA> <NA>     NA             NA           NA            <NA>
NA.5        <NA>        <NA>            <NA> <NA>     NA             NA           NA            <NA>
>

So I don't think the function is working as I want to, because I find "NA" in my entire dataset. This is also clear when I try to do the differential expression:

> #Run the limma script to test for differential expression
> design=model.matrix(~matrix_A_vs_B)
> nf=calcNormFactors(Data_A_vs_B)
> y=voom(Data_A_vs_B,design,plot=TRUE,lib.size=colSums(Data_A_vs_B)*nf)
> y$genes <- ann.fixed
> fit=lmFit(y,design)
> fit=eBayes(fit)
> summary(decideTests(fit))
   (Intercept) matrix_A_vs_B
-1         254           158
0         4837         14680
1        10039           292
> topTable(fit,coef=2,n=20,sort.by="p")
                hgnc_symbol description chromosome_name band strand start_position end_position
ENSG00000130600        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000179051        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000009709        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000171004        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000130702        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000155011        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000107859        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000158258        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000132329        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000129757        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000165495        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000108823        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000013297        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000106003        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000129965        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000174600        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000049130        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000108001        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000112759        <NA>        <NA>            <NA> <NA>     NA             NA           NA
ENSG00000148677        <NA>        <NA>            <NA> <NA>     NA             NA           NA
                ensembl_gene_id     logFC  AveExpr         t      P.Value   adj.P.Val        B
ENSG00000130600            <NA> 10.061646 8.300644 15.521686 9.471365e-08 0.001433018 7.731591
ENSG00000179051            <NA>  5.349813 5.047187 11.720653 1.036651e-06 0.007842266 5.729553
ENSG00000009709            <NA>  6.032908 4.203276 10.955718 1.824472e-06 0.009201422 5.027085
ENSG00000171004            <NA>  4.727610 4.067789  9.371134 6.623554e-06 0.011789728 4.132719
ENSG00000130702            <NA>  4.111670 5.035937  9.246298 7.388210e-06 0.011789728 4.194915
ENSG00000155011            <NA>  3.905615 5.449131  9.172570 7.885189e-06 0.011789728 4.168109
ENSG00000107859            <NA>  4.608072 3.962637  9.122268 8.245367e-06 0.011789728 3.948020
ENSG00000158258            <NA>  3.849851 5.424068  9.115529 8.294977e-06 0.011789728 4.124020
ENSG00000132329            <NA>  5.040559 4.008954  9.062435 8.697583e-06 0.011789728 3.863129
ENSG00000129757            <NA>  3.424920 8.724135  8.920984 9.879237e-06 0.011789728 4.032573
ENSG00000165495            <NA>  6.335747 3.087339  8.746510 1.158675e-05 0.011789728 3.295786
ENSG00000108823            <NA>  3.348464 6.724002  8.614238 1.309798e-05 0.011789728 3.750933
ENSG00000013297            <NA> -5.514139 8.228763 -8.586334 1.344376e-05 0.011789728 3.729286
ENSG00000106003            <NA>  4.412275 3.620577  8.577444 1.355602e-05 0.011789728 3.489965
ENSG00000129965            <NA>  3.215444 8.727431  8.564536 1.372087e-05 0.011789728 3.723597
ENSG00000174600            <NA>  4.261586 3.884994  8.536712 1.408374e-05 0.011789728 3.512196
ENSG00000049130            <NA> -4.344771 5.237542 -8.515026 1.437390e-05 0.011789728 3.529719
ENSG00000108001            <NA>  4.565692 3.433202  8.507214 1.448001e-05 0.011789728 3.392038
ENSG00000112759            <NA>  3.681301 7.332290  8.481276 1.483863e-05 0.011789728 3.640597
ENSG00000148677            <NA>  3.163246 7.241413  8.306703 1.752203e-05 0.011789728 3.484323
>

Most probably I have to fix something in the code that I am missing, right? But I cannot figure out what.

Kind regards,

Anchel

ADD COMMENT • link 8.4 years ago anchel.gonzalez • 0

0

Entering edit mode

anchel.gonzalez • 0

@anchelgonzalez-9223

Last seen 8.4 years ago

Netherlands

Hello, I figured out the problem. It was indeed kind of silly, but I warned you about my complete lack of experience using code... :)

My data frame "Data_A_vs_B" contains the "ensemble gene ids" inserted as rownames (not as a separate column):

> head(Data_A_vs_B)
                M14  M20 M13 M19
ENSG00000110514 163  240 204 166
ENSG00000086015 296  224 139 122
ENSG00000169740 207  454 182 165
ENSG00000261609  31   75  56  67
ENSG00000169744 375 1151  84 159
ENSG00000261604  17   69  19  13

Therefore I have to specify this when using the match() function:

ann.fixed <- ann[match(rownames(Data_A_vs_B), ann[,8]),]

This worked, and now my data is perfectly matched with all the relevant information!

ADD COMMENT • link 8.4 years ago anchel.gonzalez • 0

score 7 · Accepted Answer · 2015-11-18

This has to do with the martCheck() function in getGene(), which is sort of broken right now. But do note that getGene() is just a thin wrapper around getBM(), which is the function that all the cool kids use anyway, so you should be using that too.

Something like

ann <- getBM(c("hgnc_symbol","description","chromosome_name","band","strand","start_position","end_position","ensembl_gene_id"), "ensembl_gene_id", rownames(data), mart)

Will do the exact same thing.

>getBM(c("hgnc_symbol","description","chromosome_name","band","strand","start_position","end_position","ensembl_gene_id"),"ensembl_gene_id", ensids[1], mart)
  hgnc_symbol                                            description
1        A1BG alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]
  chromosome_name   band strand start_position end_position ensembl_gene_id
1              19 q13.43     -1       58345178     58353499 ENSG00000121410

Or you could use the Homo.sapiens package:

> select(Homo.sapiens, ensids[1], c("SYMBOL","GENENAME","MAP","CDSSTART","CDSEND","CDSSTRAND"), "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
          ENSEMBL               GENENAME     MAP SYMBOL CDSSTRAND CDSSTART
1 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58864770
2 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58864658
3 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58864294
4 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58863649
5 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58862757
6 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58861736
7 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58858719
8 ENSG00000121410 alpha-1-B glycoprotein 19q13.4   A1BG         - 58858388
    CDSEND
1 58864803
2 58864693
3 58864563
4 58863921
5 58863053
6 58862017
7 58859006
8 58858395
>

score 3 · Accepted Answer · 2015-11-21

The problem here is that you are assuming that there is some check when you do

y$genes <- ann

when in fact there is not. So a significant portion of your results are probably incorrectly annotated. In other words, limma assumes that the first row of the y$genes data.frame corresponds to the first row of the other data in that MArrayLM object, but I don't see anywhere that you have ensured that to be true. You almost certainly need to reorder the 'ann' object, and stick in some NA rows, to 'bump it out' to match the other data in your 'y' object. The match() function is the tool of choice for that job. Here is a simple example to give you an idea.

> z <- data.frame(ID = LETTERS[1:5], vals = 1:5)
> z
  ID vals
1  A    1
2  B    2
3  C    3
4  D    4
5  E    5
> annot <- data.frame(ID = LETTERS[c(3,2,4,5)], annot = paste0("gene", 1:4))
> annot
  ID annot
1  C gene1
2  B gene2
3  D gene3
4  E gene4

So let's say 'z' is your data, and 'annot' is what you got back from getBM(). Note that we are missing one ID (A), and the order is wrong. This should be your expectation! If you just put 'annot' into the gene slot, then it will think that the first row of annot matches up with the first row of z, and there isn't anything that matches up with the last row of z. Obviously this is a problem.

> annot.fixed <- annot[match(z[,1], annot[,1]),]
> annot.fixed
     ID annot
NA <NA>  <NA>
2     B gene2
1     C gene1
3     D gene3
4     E gene4

Now we have reordered the annot, and added a row of missing data for the ID that we couldn't annotate, and if we stick that into the MArrayLM object (e.g, do y$genes <- annot.fixed), then it will be OK.

As to why you don't get back everything you sent to getBM()? Some of those Ensembl Gene IDs may have been retired would be my best guess.