Entering edit mode
steffen@stat.Berkeley.EDU
▴
600
@steffenstatberkeleyedu-2907
Last seen 10.2 years ago
Hi Ramzi,
When using multiple filters with more than one value for each filter,
the
BioMart webservice won't match up the different values for each
filter.
For example the following query:
getBM(c("hgnc_symbol","chromosome_name","start_position","end_position
"),filters=c("chromosome_name","start","end"),values=list(c(11,2),c(30
576841,86479831),c(30576891,86479881)),mart=mart)
in XML:
<query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="biomaRt"> <dataset name="hsapiens_gene_ensembl"><attribute name="hgnc_symbol"/><attribute name="chromosome_name"/><attribute name="start_position"/><attribute name="end_position"/><filter name="chromosome_name" value="11,2"/><filter name="start" value="30576841,86479831"/><filter name="end" value="30576891,86479881"/></dataset></query>
will return hgnc_symbols from genes on chromosome 11 between position
30576841 and 30576891 but probably also genes on chromosome 11
between
position 30576841 and 86479881 (and all other combinations).
For this type of query you'll either have to query region by region at
once, or even better query all hgnc_symbols and their positions first
and
then select in R which genes fall in your regions of interest.
You could use the following query to do this.
mart=useMart("ensembl", dataset="hsapiens_gene_ensembl")
allGenePositions =
getBM(c("hgnc_symbol","ensembl_gene_id","chromosome_name","start_posit
ion","end_position"),
mart=mart)
Cheers,
Steffen
Ramzi TEMANNI wrote:
Hi James,
I found out that I'm using UCSC hg19 ref but Biomart uses GRCh37 ref
and
that's why the output generates many genes.
i'm trying to find out a way to convert coordinates but seems I have
to
align again using the GRCh37 ref.
I don't think that is the problem, as UCSC hg19 is based on GRCh37. I
think there might be a problem on the Biomart server side. As far as I
can
tell, the query string produced by biomaRt is correct, but the Biomart
server returns more than I think it should.
In fact, if you do the query for each chromosome individually, you
either
get one or zero genes returned. It's only if you do the query for more
than one chromosomal region at a time that you get these extra genes:
> getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[3,1], t.cpd[3,2], t.cpd[3,2]+50), enseembl)
[1] hgnc_symbol
<0 rows> (or 0-length row.names)
> getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[4,1], t.cpd[4,2], t.cpd[4,2]+50), enseembl)
hgnc_symbol
1 MPPED2
> getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[5,1], t.cpd[5,2], t.cpd[5,2]+50), enseembl)
hgnc_symbol
1 REEP1
> tmp <- getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[4:5,1], t.cpd[4:5,2], t.cpd[4:5,2]+50), enseembl)
> dim(tmp)
[1] 946 1
> head(tmp)
hgnc_symbol
1 ZFP91
2 CNTF
3 C11orf76
4 RPLP0P2
5 C11orf64
6 OR4D8P
> grep("REEP1", tmp[,1], value=T)
[1] "REEP1"
The XMLQuery that is sent to the Biomart server looks like this (if
Rhoda
Kinsella or Steffen Durinck are following this at all).
Browse[2]> xmlQuery
[1] "<query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="\" biomaRt\""=""> <dataset name="hsapiens_gene_ensembl"><attribute name="hgnc_symbol"/><filter name="chromosome_name" value="11,2"/><filter name="start" value="30576841,86479831"/><filter name="end" value="30576891,86479881"/></dataset></query>"
Best,
Jim
Regards,
Ramzi
----------------------------------------------------------------
On Mon, Dec 7, 2009 at 4:32 PM, Ramzi TEMANNI <ramzi.temanni at="" gmail.com="" <mailto:ramzi.temanni="" at="" gmail.com="">> wrote:
Hi James,
Thanks for your reply,
As i have short reads of 50b, I've modified the code accordigly to
your suggestion:
gn.m1<-getBM(attributes= c("hgnc_symbol"),
filters=c("chromosome_name","
start",*"end"*),
values=list(t.cpd[1:10,1],as.numeric(t.cpd[1:10,2]),*as.n
umeric(t.cpd[1:10,2])+50*),
mart=ensembl)
But still having more genes than expected:
hgnc_symbol
1 UBE2G1
2 DUSP5P
3 HIST3H2BA
4 ZNF847P
5 ACTBP11
6 ZC3H11B
........
8488 PPP2R2C
8489 WFS1
8490 SNORD73A
8491 SNORA24
8492 SNORA26
Did I miss something ?
Thanks for the remark regarding the position, you are right ! I did
not notice that, have to get back to the bowtie alignment file and
see what is the reason behind that. I'm using bowtie with the last
hg19 ref.
Thanks again for your comment. Best Regards,
Ramzi
----------------------------------------------------------------
On Mon, Dec 7, 2009 at 4:20 PM, James W. MacDonald
<jmacdon at="" med.umich.edu="" <mailto:jmacdon="" at="" med.umich.edu="">> wrote:
Hi Ramzi,
Ramzi TEMANNI wrote:
Hi,
I want to extract the gene names knowing the chromosome and
the position for
each genes:
t.cpd[1:10,1:2]
CHR.M1 POS.M1
[1,] "12" "140059033"
[2,] "19" "164634640"
[3,] "10" "32347784"
[4,] "11" "30576841"
[5,] "2" "86479831"
[6,] "12" "237019866"
[7,] "4" "76487174"
[8,] "20" "136121868"
[9,] "2" "6255547"
[10,] "1" "67658137"
i use the following commands:
library(biomaRt)
mart = useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
gn.m1<-getBM(attributes= c("hgnc_symbol"),
filters=c("chromosome_name","start"),
values=list(t.cpd[1:10,1],t.cpd[1:10,2]),
mart=ensembl)
I'm expecting having a list of 10 genes names, but instead
i
get 8652 genes:
hgnc_symbol
1 OR2M1P
2 OR2L1P
3 HSD17B7P1
4 OR14L1P
5 OR2W5
6 VN1R5
......
8649 WFS1
8650 SNORD73A
8651 SNORA24
8652 SNORA26
Did I miss something ?
Yes. You are giving the start position, but not the end.
Without
explicitly telling the Biomart server where to stop looking for
genes, where do you think it will stop by default?
Also, several of your coordinates are nonsensical. For
instance,
chr12 is only 133851859 bases long, chr20 is 63025520 bases
long, etc.
Best,
Jim
Thanks in advance for your help
Best Regards,
Ramzi
----------------------------------------------------------------
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing at r-project.org
<mailto:bioc-sig-sequencing at="" r-project.org="">
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
-- James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and
should not be used for urgent or sensitive issues
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should
not
be used for urgent or sensitive issues
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing