Dear Bioconductor Forum,
I am converting ensembl gene ids to hgnc symbols, but I am getting this error message I can't make sense of.
Here is my code:
seqdata <- read.delim("featureCount_Dup.txt", stringsAsFactors = FALSE)
head(seqdata)
Geneid REX.accepted_hits_nameSorted.bam
1 ENSG00000223972 2
2 ENSG00000227232 380
3 ENSG00000243485 0
4 ENSG00000237613 0
5 ENSG00000268020 0
6 ENSG00000240361 0
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
G_list <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id","hgnc_symbol","ensembl_gene_id_version"),values=seqdata$Geneid,mart= mart)
table(G_list$ensembl_gene_id==seqdata$Geneid)
FALSE TRUE
63790 4
Warning message:
In G_list$ensembl_gene_id == seqdata$Geneid) :
longer object length is not a multiple of shorter object length
Why has it not worked? Why am I only getting 4 TRUE (matches) between seqdata$Geneid
and the vector G_list$ensembl_gene_id
? I thought they should match 100% if it had worked correct? Also, surely the length of G_list
should be identical to one of the dimensions of seqdata$Geneid
?
dim(G_list)
[1] 54544 3
dim(seqdata)
[1] 63794 2
Any help would be really appreciated :-)
Thank you for spotting those typos in my code, James. I have amended my code above (I am getting the same error even with these typos fixed).
You are querying a relational database when you use biomaRt, and one facet of that process is that the returned values are not necessarily in the same order.
Which is why you should always ask to get back the values you searched on, as you have. You can then use
match
to align the results with your incoming data, which you need to do first.Also, do note that
==
is, in this instance, probably ok once you've reordered, but in general something like%in%
orall.equal
is to be preferred.Thanks for the help, James.
I'm still not sure why it isn't working, though. For example, I noticed that ENSG00000241670 is in
seqdata$Geneid
list, but it isn't returned from my biomaRt query. This is a bona fide ensembl id (Gene Name: 'AP006222.1' (HUGO: N/A), see here). What is going on here?Out of 63794 ensembl ids in
seqdata$Geneid
, 9253 were not found in my biomaRt query. That is about 15%. Is that quite a lot, or is this normal?The gene you give as an example was removed as of release 89. So you shouldn't expect to find it in the current Ensembl release.
If you care to use GRCh37 and old genes that are no longer thought to exist, you can use an archived version with biomaRt.
You can also hypothetically use other archives, by doing something like
But as you can see, that is failing me, because it uses the main ensembl.org site, and that site is currently down. But if you were to be able to connect to the main site, you would find that the HUGO gene symbol for that gene is RPL23AP21 (and the 'Gene Name' you show there is definitely not a gene name, but instead a GenBank ID.)
Thank you James that works now!!