Biomart cannot get multiple genes?
Mike ▴ 10
@mike-18117
Last seen 4.9 years ago

I am trying to get gene information for the MCA genes.

I used:

library(biomaRt)


human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mca_filter <- mca@var.genes
attr <- c("ensembl_gene_id", "hgnc_symbol","chromosome_name",'entrezgene', "start_position", "end_position")
Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = mca_filter,
              mart = human)

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

The problem is that I cannot retrieve information for multiple genes at once with biomaRt (querying a single gene works fine).

For example, 'mca_filter' contains the "Selenop" gene:

> which(mca_filter == "Selenop")
[1] 400

and I can retrieve its information on its own with this code:

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = "Selenop",
              mart = human)

which gives this result

>Info
  ensembl_gene_id hgnc_symbol chromosome_name entrezgene start_position end_position
1 ENSG00000250722     SELENOP               5       6414       42799880     42887392

 

However, if I pass mca_filter instead of a single gene:

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = mca_filter,
              mart = human)

the result seems to be missing genes that I can retrieve individually:

> which(Info$hgnc_symbol == "Selenop")
integer(0)

Do you know why? Please let me know. Thank you!

 

 

R biomart • 1.1k views

Before digging any deeper, can you check that this isn't due to case-sensitive matching? The command which(Info$hgnc_symbol == "Selenop") will only match entries spelled exactly Selenop, but your query returns SELENOP. Your first example fails this test too:

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = c("Selenop"),
              mart = human)
> which(Info$hgnc_symbol == "Selenop")
integer(0)

You can use a function like grep to perform a case-insensitive search, e.g.

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = c("Selenop", "CDC6"),
              mart = human)
> Info
  ensembl_gene_id hgnc_symbol chromosome_name entrezgene start_position end_position
1 ENSG00000094804        CDC6              17        990       40287633     40304657
2 ENSG00000250722     SELENOP               5       6414       42799880     42887392
> grep(x = Info$hgnc_symbol, pattern = 'Selenop', ignore.case = TRUE)
[1] 2

If this doesn't resolve the issue then please include the output of is(mca_filter) and head(mca_filter) so we can see examples of what values are present.
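
As a minimal sketch of the same check in base R: the `Info` data frame below is a hand-built stand-in for the getBM() result shown above (an assumption for illustration, not a live query), and the comparison is made case-insensitive by upper-casing both sides.

```r
# Stand-in for the getBM() result shown above (hand-built, not a live query)
Info <- data.frame(
  hgnc_symbol = c("CDC6", "SELENOP"),
  chromosome_name = c("17", "5"),
  stringsAsFactors = FALSE
)

# Exact comparison is case sensitive, so the mixed-case spelling finds nothing
exact_hits <- which(Info$hgnc_symbol == "Selenop")

# Upper-casing both sides makes the comparison case-insensitive
loose_hits <- which(toupper(Info$hgnc_symbol) == toupper("Selenop"))

print(exact_hits)
print(loose_hits)
```

Unlike grep with a bare pattern, this matches whole symbols only, so a query for "Selenop" would not also pick up a longer symbol that merely contains it.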


Thanks for the comment, Mike. I am afraid it is not a case-sensitive matching problem.

mca_filter contains the "Selenop" gene, so I tried both values = mca_filter and values = c("Selenop").

But only values = c("Selenop") gives the correct result.

=================================================================

> Info <- getBM(attributes = attr,
+               filters = "hgnc_symbol",
+               values = c("Selenop", "CDC6"),
+               mart = human)
> Info
  ensembl_gene_id hgnc_symbol chromosome_name entrezgene start_position end_position
1 ENSG00000094804        CDC6              17        990       40287633     40304657
2 ENSG00000250722     SELENOP               5       6414       42799880     42887392

This also works for me, but when I pass mca_filter rather than one or a few gene names, the result has only one gene in common with my list.

> intersect(Info$hgnc_symbol, mca_filter)
[1] "H19"

This means it only retrieved information for the "H19" gene.

> length(Info$hgnc_symbol)
[1] 697
> length(mca_filter)
[1] 1000

These are the numbers of genes in each list.

==========================================================

I will also give you the information that you asked for.

> is(mca_filter)
 [1] "character"             "vector"                "data.frameRowLabels"   "SuperClassMethod"      "index"                 "atomicVector"          "kfunction"            
 [8] "EnumerationValue"      "characterORconnection" "characterORMIAME"      "character_OR_NULL"     "atomic"                "listI"                 "output"               
[15] "vector_OR_factor"     

 

> head(mca_filter, 10)
 [1] "Spink1"  "Gast"    "Sbp"     "Wap"     "Csn1s2a" "Ins2"    "Igha"    "Igkc"    "Sftpc"   "Scgb1a1"

=============================================================================

With the pbmc data from Seurat this worked perfectly, but with the mca data from Seurat (https://satijalab.org/seurat/mca_loom.html) it doesn't. I also tried setting mca_filter to hv.genes from that tutorial.

Hence,

mca_filter <- hv.genes

================================================================

Thank you so much again.

 

 

Mike Smith ★ 6.5k
@mike-smith
Last seen 4 hours ago
EMBL Heidelberg

I still think this might be a case-sensitivity issue. The lines I've reproduced below show that your query does return multiple values. It's not the full 1000, but 697 gene symbols are matched.

> length(Info$hgnc_symbol)
[1] 697
> length(mca_filter)
[1] 1000

It is normal for biomaRt to return nothing if it doesn't find a match for an element in your values vector, and presumably here 303 elements of mca_filter return nothing.

I suspect the reason intersect(Info$hgnc_symbol, mca_filter) gives only one result is probably down to the fact that it is case sensitive. The example below demonstrates the same behaviour:

> intersect(c("H19", "Selenop", "Cdc6"), c("H19", "SELENOP", "CDC6"))
[1] "H19"

This leaves two questions:

  • why are we only finding 697 hits?
  • why does the capitalization change between mca_filter and the biomaRt results?

I think both of these are because the MCA data are from mouse, but you are querying the human dataset at Ensembl. The number of matches is incomplete because you wouldn't expect every mouse gene symbol to have a counterpart among the human symbols, and it is standard for mouse gene symbols to be stylised with only the first letter capitalised (e.g. Selenop) while human symbols are all capitals (e.g. SELENOP).
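
To see how many query symbols really came back, the comparison can be done on upper-cased copies. This is a rough base R sketch with small made-up vectors standing in for mca_filter and Info$hgnc_symbol; the absence of "WAP" from the result vector is purely illustrative and not a statement about what Ensembl contains.

```r
# Small stand-ins for mca_filter and Info$hgnc_symbol (illustrative values only)
query_symbols  <- c("H19", "Selenop", "Cdc6", "Wap")
result_symbols <- c("H19", "SELENOP", "CDC6")

# Case-sensitive intersect only finds symbols spelled identically in both
strict <- intersect(query_symbols, result_symbols)

# Compare upper-cased copies, then index back into the original query vector
found     <- toupper(query_symbols) %in% toupper(result_symbols)
matched   <- query_symbols[found]
unmatched <- query_symbols[!found]

print(strict)
print(matched)
print(unmatched)
```

Applied to the real vectors, length(unmatched) would show how many of the 1000 query symbols have no hit in the human dataset at all.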


Thank you so much for the comments, Mike. I think I now understand what I did wrong.

Thank you very much again!


Great, glad it makes sense. Depending on what you're trying to do, you can either use the mmusculus_gene_ensembl dataset if you want to annotate the mouse genes, or take a look at the getLDS() function if you want to find matches across species.
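
A rough sketch of both options, wrapped in functions so nothing is queried until you call them. The helper names annotate_mouse_genes and map_mouse_to_human are hypothetical, the attribute and filter names are assumed to be available in the Ensembl release you connect to (listAttributes() and listFilters() will confirm), and running either function needs the biomaRt package plus network access.

```r
# Sketch only: helper names are hypothetical, and calling these functions
# requires the biomaRt package and a working connection to Ensembl.

# Option 1: annotate mouse symbols against the mouse dataset
annotate_mouse_genes <- function(symbols) {
  mouse <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
                   "start_position", "end_position"),
    filters = "mgi_symbol",  # mouse symbols are filtered by MGI symbol (assumed available)
    values = symbols,
    mart = mouse
  )
}

# Option 2: link mouse symbols to human symbols across the two datasets
map_mouse_to_human <- function(symbols) {
  mouse <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  human <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  biomaRt::getLDS(
    attributes = "mgi_symbol", filters = "mgi_symbol",
    values = symbols, mart = mouse,
    attributesL = "hgnc_symbol", martL = human
  )
}

# Usage (not run here because it contacts Ensembl):
# annotate_mouse_genes(c("Selenop", "Cdc6"))
# map_mouse_to_human(c("Selenop", "Cdc6"))
```

With the thread's data, either function would be called with mca_filter as the symbols argument.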
