biomaRt "longer object length is not a multiple of shorter object length"
1
0
Entering edit mode
Linda • 0
@linda-23123
Last seen 3.0 years ago
United Kingdom

Dear Bioconductor Forum,

I am converting ensembl gene ids to hgnc symbols, but I am getting this error message I can't make sense of.

Here is my code:

seqdata <- read.delim("featureCount_Dup.txt", stringsAsFactors = FALSE)

head(seqdata)
           Geneid                  REX.accepted_hits_nameSorted.bam
1 ENSG00000223972                                                 2
2 ENSG00000227232                                               380
3 ENSG00000243485                                                 0
4 ENSG00000237613                                                 0
5 ENSG00000268020                                                 0
6 ENSG00000240361                                                 0

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
G_list <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id","hgnc_symbol","ensembl_gene_id_version"),values=seqdata$Geneid,mart= mart)

table(G_list$ensembl_gene_id==seqdata$Geneid)

FALSE  TRUE 
63790     4 
Warning message:
In G_list$ensembl_gene_id == seqdata$Geneid) :
  longer object length is not a multiple of shorter object length

Why has it not worked? Why am I only getting 4 TRUE (matches) between seqdata$Geneid and the vector G_list$ensembl_gene_id? I thought they should match 100% if it had worked correct? Also, surely the length of G_list should be identical to one of the dimensions of seqdata$Geneid?

dim(G_list)
[1] 54544     3
dim(seqdata)
[1] 63794     2

Any help would be really appreciated :-)

biomaRt • 1.4k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

The rownames of your count data are 1:N, so you should have no expectation that any Ensembl Gene ID should match. Also what is seqdata? You show that you read something in, but then you do stuff with countdata.

ADD COMMENT
0
Entering edit mode

Thank you for spotting those typos in my code, James. I have amended my code above (I am getting the same error even with these typos fixed).

ADD REPLY
1
Entering edit mode

You are querying a relational database when you use biomaRt, and one facet of that process is that the returned values are not necessarily in the same order.

Which is why you should always ask to get back the values you searched on, as you have. You can then use match to align the results with your incoming data, which you need to do first.

Also, do note that == is, in this instance, probably ok once you've reordered, but in general something like %in% or all.equal is to be preferred.

ADD REPLY
0
Entering edit mode

Thanks for the help, James.

I'm still not sure why it isn't working, though. For example, I noticed that ENSG00000241670 is in seqdata$Geneid list, but it isn't returned from my biomaRt query. This is a bona fide ensembl id (Gene Name: 'AP006222.1' (HUGO: N/A), see here). What is going on here?

"ENSG00000241670" %in% G_list$ensembl_gene_id
[1] FALSE
"ENSG00000241670" %in% seqdata$Geneid
[1] TRUE

Out of 63794 ensembl ids in seqdata$Geneid, 9253 were not found in my biomaRt query. That is about 15%. Is that quite a lot, or is this normal?

ADD REPLY
2
Entering edit mode

The gene you give as an example was removed as of release 89. So you shouldn't expect to find it in the current Ensembl release.

If you care to use GRCh37 and old genes that are no longer thought to exist, you can use an archived version with biomaRt.

> martold <- useEnsembl("ensembl","hsapiens_gene_ensembl", GRCh = "37")
> getBM(c("ensembl_gene_id","hgnc_symbol"), "ensembl_gene_id","ENSG00000241670", martold)
  ensembl_gene_id hgnc_symbol
1 ENSG00000241670          NA

You can also hypothetically use other archives, by doing something like

> martold <- useEnsembl("ensembl","hsapiens_gene_ensembl", host = "may2017.archive.ensembl.org")
Error in read_html.response(GET(url)) : Not Found (HTTP 404).

But as you can see, that is failing me, because it uses the main ensembl.org site, and that site is currently down. But if you were to be able to connect to the main site, you would find that the HUGO gene symbol for that gene is RPL23AP21 (and the 'Gene Name' you show there is definitely not a gene name, but instead a GenBank ID.)

ADD REPLY
0
Entering edit mode

Thank you James that works now!!

ADD REPLY

Login before adding your answer.

Traffic: 488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6