Question: How to retrieve all attributes from BioMart?
11 days ago by
Bioinformatics30 wrote:

Basically, I want to extract all attributes for several genes.

When I use the following as an example, I get an error. Would anyone know why?

hsapiens_inf <- getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id','ensembl_peptide_id','ensembl_exon_id',
+                                    'description','chromosome_name','start_position','end_position','strand','band',
+                                    'transcript_start','transcript_end','external_gene_id','external_transcript_id',
+                                    'external_gene_db','transcript_db_name','transcript_count',
+                                    'percentage_gc_content','gene_biotype','transcript_biotype','source',
+                                    'transcript_source','status,transcript_status','phenotype_description',
+                                    'source_name','study_external_id','go_id','name_1006','definition_1006',
+                                    'arrayexpress','chembl'),mart = mart)
Error in getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id",  :
Invalid attribute(s): external_gene_id, external_transcript_id, external_gene_db, transcript_db_name, status,transcript_status
Please use the function 'listAttributes' to get valid attribute names

modified 11 days ago by James W. MacDonald48k • written 11 days ago by Bioinformatics30
11 days ago by
United States
James W. MacDonald48k wrote:

You get this error:

Error in getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id",  :
Invalid attribute(s): external_gene_id, external_transcript_id, external_gene_db, transcript_db_name, status,transcript_status
Please use the function 'listAttributes' to get valid attribute names

Can you say why that isn't sufficient/descriptive enough for you to diagnose this yourself?
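
One way to act on that message is to compare the requested names against what the mart actually offers. The following is a minimal sketch, not the poster's code: `find_invalid_attributes` is a hypothetical helper, and the commented-out connection lines assume the `hsapiens_gene_ensembl` dataset. Note that `'status,transcript_status'` in the original call is a single quoted string, so it is checked as one name rather than two.

```r
# Hypothetical helper: return the requested names that the mart does not offer.
find_invalid_attributes <- function(wanted, valid_names) {
  setdiff(wanted, valid_names)
}

# A few of the names from the failing call, for illustration.
wanted <- c("ensembl_gene_id", "external_gene_id", "status,transcript_status")

# With a live connection (needs biomaRt and network access, so not run here):
# library(biomaRt)
# mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# find_invalid_attributes(wanted, listAttributes(mart)$name)

# Searching the attribute table by pattern helps locate similarly named attributes:
# attrs <- listAttributes(mart)
# attrs[grep("external", attrs$name), ]
```

Anything the helper returns has to be corrected or dropped before calling `getBM`.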

I made the query smaller to show that I have tried a lot, but I cannot figure out what the problem is. For example:

hsapiens_inf <- getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id',
+                                    'ensembl_peptide_id','ensembl_exon_id'),mart = hsapiens6)
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Connection timed out after 10003 milliseconds

Or this one:

hsapiens_inf <- getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id",
+                                    "ensembl_peptide_id","ensembl_exon_id"),mart = hsapiens6)
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Connection timed out after 10003 milliseconds

I have checked listAttributes and the names seem to be OK. Would you please tell me what the problem is?


Sure. You are asking for a metric ton of data, with an arbitrarily large amount of replication, and evidently the Biomart server is taking longer than curl_fetch_memory is willing to wait. I did get it to go, and I wonder what you plan to do with a data.frame with almost 1.4 million rows?

> dim(hsapiens_inf)
[1] 1383187       4
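
Since the original goal was information for several genes, one way to keep the request small is to pass those genes through the `filters` and `values` arguments of `getBM`, in chunks if the list is long. This is only a sketch under assumptions: `chunk_ids` and `fetch_gene_info` are hypothetical helpers, the chunk size of 50 is arbitrary, and `mart` is assumed to be a Mart object for the human Ensembl dataset.

```r
# Hypothetical helper: split a vector of IDs into chunks of at most `size`.
chunk_ids <- function(ids, size = 50) {
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Assumed workflow: query only the genes of interest, one chunk at a time.
# `mart` must be a Mart object (e.g. from biomaRt::useMart()); needs network access.
fetch_gene_info <- function(gene_ids, mart, size = 50) {
  pieces <- lapply(chunk_ids(gene_ids, size), function(chunk) {
    biomaRt::getBM(
      attributes = c("ensembl_gene_id", "ensembl_transcript_id"),
      filters = "ensembl_gene_id",
      values = chunk,
      mart = mart
    )
  })
  do.call(rbind, pieces)  # stack the per-chunk data.frames
}

# Example call with two illustrative Ensembl gene IDs (not run here):
# fetch_gene_info(c("ENSG00000139618", "ENSG00000141510"), mart)
```

Each request then covers only the listed genes instead of every gene in the dataset.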

Do note that the Biomart server is going to return a fully normalized table that is joined across each of the attributes you are requesting. So if a gene has two transcripts, you get two rows. And if the transcripts have say three exons each, you now get six rows. And if there are different proteins in there, you get more rows still. All for one gene.
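
The row multiplication described above can be reproduced without a server. This toy sketch uses base R `merge` on made-up tables (the gene, transcript, and exon names are invented) to mirror the two-transcripts-times-three-exons case.

```r
# Made-up data: one gene with two transcripts.
transcripts <- data.frame(
  gene = "geneA",
  transcript = c("tx1", "tx2")
)

# Made-up data: three exons per transcript.
exons <- data.frame(
  transcript = rep(c("tx1", "tx2"), each = 3),
  exon = paste0("ex", 1:6)
)

# Joining on transcript yields one row per (gene, transcript, exon) combination.
joined <- merge(transcripts, exons, by = "transcript")
nrow(joined)  # 6 rows for a single gene
```

Adding a further one-to-many attribute, such as peptide IDs, multiplies the row count again in the same way.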

Here's the worst of it:

> tail(table(table(hsapiens_inf[,1])), 30)

 742  745  758  777  790  810  832  857  863  916  924  931  936  939  974  981
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
 986  996 1050 1095 1115 1139 1151 1220 1266 1380 1454 2012 2045 2136
   1    1    1    1    2    1    1    1    1    1    1    1    1    1

So you have one gene that takes up 2136 rows of the data.frame! That's legit. But of what use is that?
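
For readers unfamiliar with the `table(table(...))` idiom used above: the inner `table` counts rows per gene, and the outer one counts how many genes share each row count. A small sketch on invented IDs (the `ids` vector is made up for illustration):

```r
# Invented gene IDs standing in for hsapiens_inf[, 1].
ids <- c("g1", "g1", "g1", "g2", "g3", "g4", "g4", "g4")

# Inner table: number of rows per gene.
rows_per_gene <- table(ids)

# Outer table: how many genes have 1 row, 2 rows, 3 rows, and so on.
row_count_distribution <- table(rows_per_gene)

# Genes with the largest number of rows.
heaviest <- names(rows_per_gene)[rows_per_gene == max(rows_per_gene)]
```

Here two genes have one row each and two genes have three rows each, so `heaviest` holds `g1` and `g4`.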

Perhaps it would be better for you to say what you are trying to do, and then maybe somebody could offer a suggestion.

@James W. MacDonald Basically, I am trying to do gene ontology. There are 100 packages, but I prefer to retrieve data from UniProt or Ensembl.

What is gene ontology?

@James W. MacDonald Gene ontology means you can find various information about any given gene. Look at http://www.geneontology.org