unmapped keys in hugene10stprobeset.db

Paul Shannon ★ 1.1k

@paul-shannon-578

Last seen 9.6 years ago

Here's an annotation question someone might be able to help me out with. I'll be grateful.

Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':

"Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3'-based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content."

This sounds to me like Affy started with sequence from exons of ~29k genes and created probes. But when I look at the bioc annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probesets are NOT annotated to geneIDs. The sibling mapping, hugene10sttranscriptclusterENTREZID, though smaller, has a higher proportion of unmapped keys.

    library(hugene10stprobeset.db)
    library(hugene10sttranscriptcluster.db)
    bm = hugene10stprobesetENTREZID
    length(keys(bm))        # 257022
    count.mappedkeys(bm)    # 238141
                            # unmapped: 18881
    cm = hugene10sttranscriptclusterENTREZID
    length(keys(cm))        # 33257
    count.mappedkeys(cm)    # 21787

The same proportion (unmapped/mapped) seems to hold true for hugene10stprobesetENSEMBL as well.

Can anyone suggest where I can get Entrez geneID annotations for these unmapped probes? Or otherwise clear up my confusion?

Thanks!

- Paul
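As a quick check of the proportions mentioned in the question, here is a minimal base-R sketch that works only from the four counts quoted above. It does not load the annotation packages; the data frame and its column names are made up for illustration.

```r
# Counts copied from the question: total keys and mapped keys per Bimap.
counts <- data.frame(
  map    = c("hugene10stprobesetENTREZID",
             "hugene10sttranscriptclusterENTREZID"),
  keys   = c(257022, 33257),
  mapped = c(238141, 21787),
  stringsAsFactors = FALSE
)

# Unmapped keys and their share of all keys, in percent.
counts$unmapped     <- counts$keys - counts$mapped
counts$pct_unmapped <- round(100 * counts$unmapped / counts$keys, 1)

print(counts)
```

On these numbers the probeset-level map has roughly 7% unmapped keys, while the transcript-cluster map has roughly a third unmapped, which is the gap the question is pointing at.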

Annotation affy • 1.2k views

updated 13.7 years ago by Marc Carlson ★ 7.2k • written 13.7 years ago by Paul Shannon ★ 1.1k

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi Paul,

I looked into this for you. Discrepancies like this often exist for purely historical reasons. For example, Affy may have designed the probes based on one idea of what the transcriptome looked like, and that could have changed by the time they shipped their product. That kind of discrepancy happens all the time, especially with older chips. But in your case, you also seem to have a lot of control probes on this platform.

You can extract the unmapped probes like this:

    library(hugene10stprobeset.db)
    a = hugene10stprobesetENTREZID
    oddProbes = keys(a)[!(keys(a) %in% mappedkeys(a))]

I actually pulled down the .csv mapping from Affymetrix that Arthur Li would have used to generate this database, and I noticed that all the oddProbes I was looking at were control probes. In fact, more than 4 thousand of these probes are control probes. Looking more closely at this file, you will see that many other probes have no gene mapping even though they are not listed as control probes. What is going on with those probesets? Why has Affy declined to assign an identity to them? That is really more of a question for Affymetrix than for us.

When we map these IDs to make annotation packages, we look for known gene IDs from the manufacturer (UniGene, RefSeq etc.), then map those onto Entrez Gene IDs from NCBI, and from there onto other annotations. But if the people who make the array are not willing to tell us what these things map to, then we could really only speculate about what they are.

But if you have some external information that helps you decide what these probes really map to (maybe you have mapped the probesets onto the genome yourself, or maybe you feel that you can extract a little more data out of Affy's .csv file than this author did), then you can always feed that "improved" mapping into the SQLForge code in the AnnotationDbi package and generate your very own version of this annotation package. It is pretty straightforward to do so and is described in the SQLForge vignette here:

http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html

I hope this helps explain things,

Marc
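The check described above (are the unmapped probesets control probes, or something else?) could be sketched roughly as follows. This is a toy illustration, not the procedure actually used: the data frame stands in for the Affymetrix .csv, the column names `probeset_id`, `category` and `gene_assignment` and the category labels are assumptions about that file, and treating every category other than "main" as control-like is also an assumption.

```r
# Toy stand-in for the Affymetrix annotation .csv (assumed column names;
# all values are invented for illustration).
affy <- data.frame(
  probeset_id     = c("p1", "p2", "p3", "p4", "p5"),
  category        = c("main", "control->affx", "main",
                      "normgene->intron", "main"),
  gene_assignment = c("GENE_A", "---", "---", "---", "GENE_B"),
  stringsAsFactors = FALSE
)

# Stand-in for keys(a)[!(keys(a) %in% mappedkeys(a))] from the post.
oddProbes <- c("p2", "p3", "p4")

# Restrict the CSV to the unmapped probesets and flag control-like rows.
odd <- affy[affy$probeset_id %in% oddProbes, ]
odd$is_control <- odd$category != "main"   # assumption: non-"main" = control

print(table(odd$is_control))

# Unmapped probesets that are NOT explained by being controls.
unexplained <- odd$probeset_id[!odd$is_control]
print(unexplained)
```

With the real file, `affy` would come from something like `read.csv()` on the downloaded annotation, after skipping its comment header; the exact arguments depend on the file version.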

13.7 years ago • Marc Carlson ★ 7.2k

Hi Paul & Marc,

In addition to the thousands of control probes, there are non-protein-coding genes on these arrays, things like snoRNAs and precursor microRNAs, which might not have a classical gene symbol. I find that the mrna_assignment column from the Affy csv has a lot more information for these genes than the BioC annotation packages, so I'll do what you suggest, Marc, and try to 'improve' the mapping via the SQLForge code. I've done a fair amount of the groundwork on this already, so who could I communicate these changes to?

cheers,
Mark

Mark Cowley, PhD
Peter Wills Bioinformatics Centre
Garvan Institute of Medical Research, Sydney, Australia

13.7 years ago • Mark Cowley ▴ 910

Hi Mark,

You should talk to me about annotations. I maintain the annotation repository here and make sure that all of the packages get re-made for each release. This particular package was contributed and is maintained by Arthur Li, so I will contact the two of you off list as needed, depending on what you find out in the "improvement" department.

Something that may help as you explore this: the annotations and the SQLForge code that generates them are all Entrez Gene centric. So to "improve" them, you need to be able to connect a probe to an Entrez Gene ID that it was not mapped to before. But if you have new information about probes that map to things like microRNAs, then that really could help, since there *are* Entrez Gene IDs for those things in NCBI (and in our supporting "org" packages). This is true even though these things are not really genes in the strictest sense of the word.

Marc
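Since the pipeline is Entrez Gene centric, an "improved" mapping ultimately has to be expressed as probe-to-Entrez-ID pairs. Below is a minimal base-R sketch of preparing such a table as a two-column, tab-delimited file. The probe names and Entrez IDs are placeholders, not real annotations, and the assumption that SQLForge accepts a headerless two-column file of this shape (and how it wants unmapped probes represented) should be checked against the SQLForge vignette before use.

```r
# Hypothetical probe-to-Entrez assignments recovered from an external source.
# All IDs are placeholders invented for this sketch.
improved <- data.frame(
  probe  = c("probe_0001", "probe_0002", "probe_0003"),
  entrez = c("1001", NA, "1003"),
  stringsAsFactors = FALSE
)

# Keep only probes that actually gained an Entrez Gene ID
# (assumption: unmapped probes are simply left out here).
usable <- improved[!is.na(improved$entrez), ]

# Write a tab-delimited file with no header and no quoting.
out_file <- tempfile(fileext = ".txt")
write.table(usable, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

print(readLines(out_file))
```

Probes that only resolve to a GenBank accession or a description, with no Entrez Gene ID, would not fit into a table like this, which is the limitation discussed in the reply above.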

13.7 years ago • Marc Carlson ★ 7.2k

Thanks for the clarification, Marc.

Most of my 'improvements' are not Entrez Gene-centric. I glean far more annotation for each probeset by parsing the mrna_assignment field when the gene_assignment field is empty. This usually yields at least a GenBank ID and a description of the transcript and, as I pointed out earlier, microRNAs and snoRNAs. I'll investigate the potential relationships between the new stuff that I'm uncovering and Entrez Gene.

cheers,
Mark
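The fallback described above (use mrna_assignment when gene_assignment is empty) might look roughly like this in base R. This is a sketch on invented rows, not the code actually used: the `---` empty marker, the ` /// ` record separator, the ` // ` field separator and the order of fields within a record are all assumptions about the .csv layout, and `first_mrna` is a hypothetical helper name.

```r
# Toy rows standing in for the Affymetrix .csv (all values invented).
annot <- data.frame(
  probeset_id     = c("p1", "p2", "p3"),
  gene_assignment = c("ACC_1 // GENE_A // some gene // 1p36 // 111",
                      "---", "---"),
  mrna_assignment = c(
    "ACC_1 // RefSeq // some gene mRNA",
    "ACC_2 // GenBank // hypothetical snoRNA /// ACC_3 // GenBank // alt record",
    "---"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return accession, source and description of the
# first record in an mrna_assignment string, or NAs if the field is empty.
first_mrna <- function(x) {
  empty <- c(accession = NA_character_, source = NA_character_,
             description = NA_character_)
  if (is.na(x) || x == "---") return(empty)
  rec    <- strsplit(x, " /// ", fixed = TRUE)[[1]][1]
  fields <- trimws(strsplit(rec, " // ", fixed = TRUE)[[1]])
  c(accession = fields[1], source = fields[2], description = fields[3])
}

# Only probesets without a gene assignment need the fallback.
needs_fallback <- annot$gene_assignment == "---"
parsed <- t(vapply(annot$mrna_assignment[needs_fallback], first_mrna,
                   character(3), USE.NAMES = FALSE))
fallback <- data.frame(probeset_id = annot$probeset_id[needs_fallback],
                       parsed, stringsAsFactors = FALSE)

print(fallback)
```

Taking only the first record is a simplification; a probeset with several records may deserve all of them, or a rule for choosing among sources.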

13.7 years ago • Mark Cowley ▴ 910
