unexpected genes names list using getBM{biomaRt}
0
0
Entering edit mode
@steffenstatberkeleyedu-2907
Last seen 10.2 years ago
Hi Ramzi, When using multiple filters with more than one value for each filter, the BioMart webservice won't match up the different values for each filter. For example the following query: getBM(c("hgnc_symbol","chromosome_name","start_position","end_position "),filters=c("chromosome_name","start","end"),values=list(c(11,2),c(30 576841,86479831),c(30576891,86479881)),mart=mart) in XML: <query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="biomaRt"> <dataset name="hsapiens_gene_ensembl"><attribute name="hgnc_symbol"/><attribute name="chromosome_name"/><attribute name="start_position"/><attribute name="end_position"/><filter name="chromosome_name" value="11,2"/><filter name="start" value="30576841,86479831"/><filter name="end" value="30576891,86479881"/></dataset></query> will return hgnc_symbols from genes on chromosome 11 between position 30576841 and 30576891 but probably also genes on chromosome 11 between position 30576841 and 86479881 (and all other combinations). For this type of query you'll either have to query region by region at once, or even better query all hgnc_symbols and their positions first and then select in R which genes fall in your regions of interest. You could use the following query to do this. mart=useMart("ensembl", dataset="hsapiens_gene_ensembl") allGenePositions = getBM(c("hgnc_symbol","ensembl_gene_id","chromosome_name","start_posit ion","end_position"), mart=mart) Cheers, Steffen Ramzi TEMANNI wrote: Hi James, I found out that I'm using UCSC hg19 ref but Biomart uses GRCh37 ref and that's why the output generates many genes. i'm trying to find out a way to convert coordinates but seems I have to align again using the GRCh37 ref. I don't think that is the problem, as UCSC hg19 is based on GRCh37. I think there might be a problem on the Biomart server side. As far as I can tell, the query string produced by biomaRt is correct, but the Biomart server returns more than I think it should. In fact, if you do the query for each chromosome individually, you either get one or zero genes returned. It's only if you do the query for more than one chromosomal region at a time that you get these extra genes: > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[3,1], t.cpd[3,2], t.cpd[3,2]+50), enseembl) [1] hgnc_symbol <0 rows> (or 0-length row.names) > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[4,1], t.cpd[4,2], t.cpd[4,2]+50), enseembl) hgnc_symbol 1 MPPED2 > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[5,1], t.cpd[5,2], t.cpd[5,2]+50), enseembl) hgnc_symbol 1 REEP1 > tmp <- getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[4:5,1], t.cpd[4:5,2], t.cpd[4:5,2]+50), enseembl) > dim(tmp) [1] 946 1 > head(tmp) hgnc_symbol 1 ZFP91 2 CNTF 3 C11orf76 4 RPLP0P2 5 C11orf64 6 OR4D8P > grep("REEP1", tmp[,1], value=T) [1] "REEP1" The XMLQuery that is sent to the Biomart server looks like this (if Rhoda Kinsella or Steffen Durinck are following this at all). Browse[2]> xmlQuery [1] "<query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="\" biomaRt\""=""> <dataset name="hsapiens_gene_ensembl"><attribute name="hgnc_symbol"/><filter name="chromosome_name" value="11,2"/><filter name="start" value="30576841,86479831"/><filter name="end" value="30576891,86479881"/></dataset></query>" Best, Jim Regards, Ramzi ---------------------------------------------------------------- On Mon, Dec 7, 2009 at 4:32 PM, Ramzi TEMANNI <ramzi.temanni at="" gmail.com="" <mailto:ramzi.temanni="" at="" gmail.com="">> wrote: Hi James, Thanks for your reply, As i have short reads of 50b, I've modified the code accordigly to your suggestion: gn.m1<-getBM(attributes= c("hgnc_symbol"), filters=c("chromosome_name"," start",*"end"*), values=list(t.cpd[1:10,1],as.numeric(t.cpd[1:10,2]),*as.n umeric(t.cpd[1:10,2])+50*), mart=ensembl) But still having more genes than expected: hgnc_symbol 1 UBE2G1 2 DUSP5P 3 HIST3H2BA 4 ZNF847P 5 ACTBP11 6 ZC3H11B ........ 8488 PPP2R2C 8489 WFS1 8490 SNORD73A 8491 SNORA24 8492 SNORA26 Did I miss something ? Thanks for the remark regarding the position, you are right ! I did not notice that, have to get back to the bowtie alignment file and see what is the reason behind that. I'm using bowtie with the last hg19 ref. Thanks again for your comment. Best Regards, Ramzi ---------------------------------------------------------------- On Mon, Dec 7, 2009 at 4:20 PM, James W. MacDonald <jmacdon at="" med.umich.edu="" <mailto:jmacdon="" at="" med.umich.edu="">> wrote: Hi Ramzi, Ramzi TEMANNI wrote: Hi, I want to extract the gene names knowing the chromosome and the position for each genes: t.cpd[1:10,1:2] CHR.M1 POS.M1 [1,] "12" "140059033" [2,] "19" "164634640" [3,] "10" "32347784" [4,] "11" "30576841" [5,] "2" "86479831" [6,] "12" "237019866" [7,] "4" "76487174" [8,] "20" "136121868" [9,] "2" "6255547" [10,] "1" "67658137" i use the following commands: library(biomaRt) mart = useMart("ensembl") ensembl = useDataset("hsapiens_gene_ensembl", mart = mart) gn.m1<-getBM(attributes= c("hgnc_symbol"), filters=c("chromosome_name","start"), values=list(t.cpd[1:10,1],t.cpd[1:10,2]), mart=ensembl) I'm expecting having a list of 10 genes names, but instead i get 8652 genes: hgnc_symbol 1 OR2M1P 2 OR2L1P 3 HSD17B7P1 4 OR14L1P 5 OR2W5 6 VN1R5 ...... 8649 WFS1 8650 SNORD73A 8651 SNORA24 8652 SNORA26 Did I miss something ? Yes. You are giving the start position, but not the end. Without explicitly telling the Biomart server where to stop looking for genes, where do you think it will stop by default? Also, several of your coordinates are nonsensical. For instance, chr12 is only 133851859 bases long, chr20 is 63025520 bases long, etc. Best, Jim Thanks in advance for your help Best Regards, Ramzi ---------------------------------------------------------------- [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing at r-project.org <mailto:bioc-sig-sequencing at="" r-project.org=""> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing at r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
Alignment convert biomaRt Alignment convert biomaRt • 1.7k views
ADD COMMENT

Login before adding your answer.

Traffic: 712 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6