annotate and biomaRt: inconsistent behaviour; nsFilter question

Saira Mian ▴ 10

@saira-mian-2400

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070926/8c6e1704/attachment.pl

ADD COMMENT • link updated 18.4 years ago by James W. MacDonald 68k • written 18.4 years ago by Saira Mian ▴ 10

James W. MacDonald 68k

@james-w-macdonald-5106

Hi Saira,

Saira Mian wrote:
> I noticed that for some Affymetrix probe sets, "genenames" (annotate)
> returns a single gene whereas "getBM" (biomaRt) returns two:
>
> annotate:
> > library(hgu133a)
> > genenames <- as.list(hgu133aGENENAME)
> > genenames[["200710_at"]]
> [1] "acyl-Coenzyme A dehydrogenase, very long chain"
>
> biomaRt:
> > ensemblhuman <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
> > getBM(attributes=c("affy_hg_u133a", "hgnc_symbol",
> "ensembl_transcript_id"),filters="affy_hg_u133a",values="200710_at", mart=ensemblhuman)
>   affy_hg_u133a hgnc_symbol ensembl_transcript_id
> 1 200710_at     ACADVL      ENST00000356839
> 2 200710_at     ACADVL      ENST00000322910
> 3 200710_at     ACADVL      ENST00000350303
> 4 200710_at     DVL2        ENST00000380838
> 5 200710_at     DVL2        ENST00000005340
>
> Why are the results from annotate and biomaRt inconsistent? Is there a
> "correct" answer? The above probe set is just one of the examples I came
> across when learning biomaRt using the first 30 rows of my ExpressionSet
> object "eset" produced by nsFilter (see below). My cursory examination
> of ACADVL and DVL2 using the UCSC genome browser suggests that the
> one-to-many behaviour may occur because the genes are physically
> adjacent in the genome (for this and one other example I inspected, the
> genes were head-to-tail).

The inconsistency arises because of the annotation you are using. In the
first case you are using Entrez Gene (which, as the name implies, is a
_gene_ level annotation). In the second case you are using Ensembl
transcript-level annotations, which is annotation at the mRNA level.
Since there can be splice variants for a given gene that may result in
different protein products, you can always get different names.

The Entrez Gene ID for this probeset is 37. If you look that up on NCBI
you will see that there are two RefSeq IDs associated, which indicates
that NCBI thinks there are two isoforms for this gene.

Other inconsistencies may arise from the fact that you are using two
different sources for annotation. You cannot always assume that two
different groups will have the same information.

> > ans <- nsFilter(eset)
> > eset <- ans$eset
> > affyids <- rownames(exprs(eset[1:30, ]))
> > affyids
> [1] "214440_at" "202376_at" "201511_at" "201000_at" "209459_s_at"
> [6] "203504_s_at" "212772_s_at" "204343_at" "209620_s_at" "200045_at"
> [11] "202123_s_at" "206411_s_at" "212895_s_at" "214274_s_at" "212186_at"
> [16] "43427_at" "202502_at" "202366_at" "205355_at" "200710_at"
> [21] "205412_at" "209608_s_at" "210337_s_at" "207071_s_at" "200793_s_at"
> [26] "213501_at" "201629_s_at" "202767_at" "204393_s_at" "200974_at"
> > annotation <- unique(getBM(c("affy_hg_u133a", "hgnc_symbol"),
> filters="affy_hg_u133a", values=affyids, mart=ensemblhuman))
> > annotation
>    affy_hg_u133a hgnc_symbol
> 1  200045_at     ABCF1
> 4  200710_at     ACADVL
> 7  200710_at     DVL2
> 9  200793_s_at   ACO2
> 10 200793_s_at   POLR3H
> 13 200974_at
> 14 200974_at     ACTA2
> 16 201000_at     AARS
> 19 201511_at     GPBAR1
> 20 201511_at     AAMP
> 21 201629_s_at   ACP1
> 23 202123_s_at   ABL1
> 25 202366_at     ACADS
> 26 202376_at     SERPINA3
> 28 202502_at     ACADM
> 30 202767_at     DDB2
> 34 202767_at     ACP2
> 35 203504_s_at   ABCA1
> 36 204343_at     ABCA3
> 39 204393_s_at   ACPP
> 40 205355_at     ACADSB
> 42 205412_at     ACAT1
> 43 206411_s_at   ABL2
> 46 207071_s_at   ACO1
> 49 209459_s_at   ABAT
> 50 209608_s_at   ACAT2
> 51 209608_s_at   TCP1
> 52 209620_s_at   ABCB7
> 55 210337_s_at   ACLY
> 57 212186_at     ACACA
> 61 212772_s_at   ABCA2
> 64 212895_s_at   TIMM22
> 65 212895_s_at   ABR
> 69 213501_at     ACOX1
> 71 214274_s_at   DLEC1
> 74 214274_s_at   ACAA1
> 77 214440_at     NAT1
> 79 43427_at      ACACB
>
> I don't understand why the results for "200974_at" are a gene with no
> hgnc_symbol and ACTA2 since I thought nsFilter would have removed the
> gene with no name.

Why would you think that? I don't see anything in the help page for
nsFilter() that would indicate any probeset without a gene symbol would
be removed.

Best,

Jim

> I'm an inexperienced R/Bioconductor user and so am unsure whether I've
> simply made some elementary mistakes.
>
> Saira Mian
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

ADD COMMENT • link 18.4 years ago James W. MacDonald 68k
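One way to see the one-to-many mapping discussed in this answer is to collapse the transcript-level rows to probe set / symbol pairs and count the distinct symbols per probe set. The sketch below is base R only and works on a hand-typed copy of the getBM() rows quoted in the thread rather than a live query; the variable names (bm, gene_level, multi_mapped) are illustrative and not from the original posts.

```r
# Rows transcribed by hand from the getBM() output quoted in the thread.
bm <- data.frame(
  affy_hg_u133a = rep("200710_at", 5),
  hgnc_symbol = c("ACADVL", "ACADVL", "ACADVL", "DVL2", "DVL2"),
  ensembl_transcript_id = c("ENST00000356839", "ENST00000322910",
                            "ENST00000350303", "ENST00000380838",
                            "ENST00000005340"),
  stringsAsFactors = FALSE
)

# Drop the transcript column and keep one row per probe set / symbol pair.
gene_level <- unique(bm[, c("affy_hg_u133a", "hgnc_symbol")])

# Count distinct symbols per probe set; a count above 1 flags a
# probe set that the source maps to more than one gene.
symbols_per_probe <- tapply(gene_level$hgnc_symbol,
                            gene_level$affy_hg_u133a,
                            function(x) length(unique(x)))
multi_mapped <- names(symbols_per_probe)[symbols_per_probe > 1]

print(gene_level)
print(multi_mapped)
```

The same two steps should apply unchanged to a full getBM() result for all 30 probe sets, which makes it easy to list the ambiguous ones before comparing against the annotate package.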

Hi Saira,

You can add an extra filter so that only probe sets with an associated gene symbol are returned:

annotation <- unique(getBM(c("affy_hg_u133a", "hgnc_symbol"),
    filters=c("affy_hg_u133a","with_hgnc_symbol"),
    values=list(affyids,TRUE), mart=ensemblhuman))

Note as well that in the version of biomaRt included in the new BioC release there will be no need to apply the unique function to the getBM output, as getBM will do this by default on the web service side.

Ensembl does an independent mapping of the Affymetrix probes to the genome. If they find multiple gene matches for one probe, they will return all of these matches. The two genes retrieved in your query are indeed next to each other, not overlapping, and on opposite strands. As Ensembl associates both of these genes with the 200710_at Affymetrix probe, there must be a match for this probe to each of the two genes; maybe they share some homology and the affy probe happens to be in that region? We could mail the Ensembl helpdesk at helpdesk at ensembl.org to get more details on this particular mapping.

Cheers,
Steffen

ADD REPLY • link 18.4 years ago Steffen ▴ 500
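If the query has already been run without the with_hgnc_symbol filter, the rows lacking a symbol can also be dropped locally. This is a minimal base R sketch, assuming the blank entry printed for 200974_at is an empty string (it could equally be NA, so both cases are handled). The data frame is a hand-typed subset of the output shown in the thread, and the names has_symbol and named_only are made up for illustration.

```r
# Hand-typed subset of the 'annotation' output shown in the thread.
# Assumption: the blank hgnc_symbol for 200974_at is an empty string.
annotation <- data.frame(
  affy_hg_u133a = c("200974_at", "200974_at", "201000_at",
                    "201511_at", "201511_at"),
  hgnc_symbol = c("", "ACTA2", "AARS", "GPBAR1", "AAMP"),
  stringsAsFactors = FALSE
)

# Keep only rows whose symbol is neither NA nor an empty string.
has_symbol <- !is.na(annotation$hgnc_symbol) & nzchar(annotation$hgnc_symbol)
named_only <- annotation[has_symbol, ]

print(named_only)
```

Filtering on the server side, as in the getBM() call above, avoids transferring the unnamed rows at all; the local version is mainly useful when the unfiltered result is needed for something else as well.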