biomaRt question -- not getting a gene
@elizabeth-purdom-2486
Last seen 3.0 years ago
USA/ Berkeley/UC Berkeley
Hello,

I am baffled by something I happened to discover in the results of my query with biomaRt, and I can't figure out what's going on. I am using getBM to pull down a large number of gene coordinates, filtering to restrict to chromosomes 1-22 and X,Y. For some reason this procedure (which gives no errors) is not pulling down some genes that I think it should.

My basic code for pulling down all of this information is:

tempAll<-getBM(c("ensembl_gene_id", "start_position", "end_position","strand","chromosome_name","biotype"),filter = "chromosome_name", values = c(1:22, "X", "Y"),mart = mart)

A particular gene, "ENSG00000011677", is found by 'getGene' (and by other getBM queries with different filters, as I discuss below) but not in my main query:

> getGene("ENSG00000011677","ensembl_gene_id",mart)
  ensembl_gene_id hgnc_symbol
1 ENSG00000011677      GABRA3
  description
1 Gamma-aminobutyric acid receptor subunit alpha-3 precursor (GABA(A) receptor subunit alpha-3). [Source:Uniprot/SWISSPROT;Acc:P34903]
  chromosome_name band strand start_position end_position ensembl_gene_id
1               X  q28     -1      151086290    151370993 ENSG00000011677

> tempAll[match("ENSG00000011677",tempAll$ensembl_gene_id),]
   ensembl_gene_id start_position end_position strand chromosome_name biotype
NA            <NA>             NA           NA     NA            <NA>    <NA>

Oddly, if I change my main code to filter on chromosome_name but with just "X", just c("X","Y"), just c(1,"X"), or a couple of other combinations I picked, then this gene correctly appears. It also appears if I filter on 'biotype' equals 'protein_coding'. I won't show all of these results unless someone wants them, but I just copied and pasted, so that was definitely the only thing changing.

When I looked, of the 21,021 genes on chr1-22,X,Y brought down with the filter 'biotype' equals 'protein_coding', only 16,236 were in my main query that was limited by chromosome ('tempAll' above). The ~5,000 missing ones are only in chr 5-9 and X,Y. I'm thinking there is some matching problem going on, but I don't know where (or whether it's my error).

For now I'm just pulling it all down and filtering myself, but I would like to know what's going on here.

Best,
Elizabeth
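The workaround mentioned above (pull everything down, filter locally, then check which genes the server-side filter dropped) could be sketched roughly as follows. This is only an illustration in base R: the helper names `filter_chromosomes` and `missing_by_chromosome` are hypothetical, and the small data frame stands in for real getBM results. Only ENSG00000011677 on chromosome X comes from the post; the other rows are made-up placeholders.

```r
# Hypothetical helper: keep only chromosomes 1-22, X and Y from an
# unfiltered query result (a data frame with a chromosome_name column).
filter_chromosomes <- function(genes, keep = c(1:22, "X", "Y")) {
  genes[genes$chromosome_name %in% as.character(keep), , drop = FALSE]
}

# Hypothetical helper: count, per chromosome, the genes present in
# `reference` but absent from `query` (matched on ensembl_gene_id).
missing_by_chromosome <- function(reference, query) {
  absent <- !reference$ensembl_gene_id %in% query$ensembl_gene_id
  table(reference$chromosome_name[absent])
}

# Placeholder data standing in for an unfiltered getBM result.
all_genes <- data.frame(
  ensembl_gene_id = c("ENSG00000011677", "TOY_GENE_1", "TOY_GENE_2", "TOY_GENE_3"),
  chromosome_name = c("X", "1", "5", "MT"),
  stringsAsFactors = FALSE
)

# Placeholder for a server-side filtered result that lost some genes.
server_filtered <- all_genes[all_genes$chromosome_name == "1", , drop = FALSE]

local_filtered <- filter_chromosomes(all_genes)
missing_counts <- missing_by_chromosome(local_filtered, server_filtered)
print(missing_counts)
```

With real results, `all_genes` would be the unfiltered query and `server_filtered` would be `tempAll`; the table then shows at a glance whether the missing genes cluster on particular chromosomes, as described for chr 5-9 and X,Y.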
biomaRt biomaRt • 1.4k views
@steffenstatberkeleyedu-2907
Last seen 10.2 years ago
Hi Elizabeth,

It would be great if you could report this to helpdesk at ensembl.org. Ideally, when you see inconsistencies like this, you run the biomaRt queries again with verbose=TRUE set in the getBM function. This prints the exact XML query that is sent to the Ensembl BioMart. Add this XML message to your email to the helpdesk, and they can then use it to figure out what is going on.

biomaRt only provides an interface to the Ensembl BioMart system and doesn't change anything in the query results, so whatever Ensembl gives back is what getBM returns.

Cheers,
Steffen
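For reference, re-running the original query with verbose=TRUE, as suggested in the reply, might look something like the sketch below. The wrapper name `verbose_chromosome_query` and its `query_fun` parameter are hypothetical and exist only so the call can be tried without a network connection; the attribute names are copied from the original post and may differ in current Ensembl releases.

```r
# Hypothetical wrapper around the query from the original post, with
# verbose = TRUE so the XML sent to the BioMart server is printed.
# `query_fun` defaults to biomaRt::getBM; it is a parameter only so the
# wrapper can be exercised with a stand-in function when offline.
verbose_chromosome_query <- function(mart,
                                     chromosomes = c(1:22, "X", "Y"),
                                     query_fun = biomaRt::getBM) {
  query_fun(
    attributes = c("ensembl_gene_id", "start_position", "end_position",
                   "strand", "chromosome_name", "biotype"),
    filters = "chromosome_name",
    values = chromosomes,
    mart = mart,
    verbose = TRUE
  )
}

# Usage (needs the biomaRt package and network access; the dataset name
# is an assumption, since the post does not show how `mart` was created):
# mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# tempAll <- verbose_chromosome_query(mart)
```

The printed XML can then be pasted into the message to the Ensembl helpdesk.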