Transcript clusters missing from hugene20sttranscriptcluster.db

Cornwell, Adam ▴ 110

Hello,

I've been working with hugene20sttranscriptcluster.db_2.14.0 (the most recent release version) for the last couple of days, and noticed that some of our usual marker genes appear to be absent from the annotation package. These genes are present in current and previous versions of the Affymetrix probe -> gene mappings from NETAFFX.

For example, transcript cluster 16966809 should correspond to gene symbol PDGFRA and Entrez ID 5156 (which is included in the NA34 annotation release for the platform), but any(mappedkeys(hugene20sttranscriptclusterSYMBOL) == "16966809") returns FALSE. Picking a random transcript cluster, 16748695 (PDE6H), returns TRUE and will return the symbol. I'm not sure whether other genes are missing as well, since I only happened to stumble across this one.

For now I can try to build an annotation database from the Affymetrix annotation. Am I missing something or can someone else confirm that things are missing?

Quick copy-paste example:

library(hugene20sttranscriptcluster.db)
library(annotate)
any(mappedkeys(hugene20sttranscriptclusterSYMBOL) == "16966809")

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  LC_MONETARY=English_United States.1252  LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] splines parallel stats graphics grDevices utils datasets methods base

other attached packages:
 [1] hugene20sttranscriptcluster.db_2.14.0 OrderedList_1.36.0 twilight_1.40.0 BiocInstaller_1.14.2
 [5] doParallel_1.0.8 iterators_1.0.7 limma_3.20.4 gplots_2.13.0
 [9] xlsx_0.5.5 xlsxjars_0.6.0 rJava_0.9-6 annotate_1.42.0
[13] SCAN.UPC_2.6.0 sva_3.10.0 mgcv_1.7-29 nlme_3.1-117
[17] corpcor_1.6.6 foreach_1.4.2 affyio_1.32.0 affy_1.42.2
[21] GEOquery_2.30.0 oligo_1.28.2 Biostrings_2.32.0 XVector_0.4.0
[25] IRanges_1.22.7 oligoClasses_1.26.0 org.Hs.eg.db_2.14.0 RSQLite_0.11.4
[29] DBI_0.2-7 AnnotationDbi_1.26.0 GenomeInfoDb_1.0.2 Biobase_2.24.0
[33] BiocGenerics_0.10.0

loaded via a namespace (and not attached):
 [1] affxparser_1.36.0 bit_1.1-12 bitops_1.0-6 caTools_1.17 codetools_0.2-8 ff_2.2-13 gdata_2.13.3
 [8] GenomicRanges_1.16.3 grid_3.1.0 gtools_3.4.0 KernSmooth_2.23-12 lattice_0.20-29 MASS_7.3-31 Matrix_1.1-3
[15] preprocessCore_1.26.1 RCurl_1.95-4.1 stats4_3.1.0 tools_3.1.0 XML_3.98-1.1 xtable_1.7-3 zlibbioc_1.10.0

Adam Cornwell
Programmer/Analyst

Annotation probe • 1.6k views

ADD COMMENT • link updated 9.9 years ago by James W. MacDonald 65k • written 9.9 years ago by Cornwell, Adam ▴ 110

James W. MacDonald 65k

Hi Adam,

On 6/3/2014 6:07 PM, Cornwell, Adam wrote:
> Am I missing something or can someone else confirm that things are missing?

What you are missing is that this probeset maps to two symbols, and is thus masked in the conventional get() and bimap interfaces.

> get("16966809", hugene20sttranscriptclusterSYMBOL)
[1] NA
> any(mappedkeys(hugene20sttranscriptclusterSYMBOL) == "16966809")
[1] FALSE
> z <- toggleProbes(hugene20sttranscriptclusterSYMBOL, "all")
> get("16966809", z)
[1] "PDGFRA" "FIP1L1"
> any(mappedkeys(z) == "16966809")
[1] TRUE

These older methods have been supplanted by the select() method, which you should use instead:

> select(hugene20sttranscriptcluster.db, "16966809", "SYMBOL")
   PROBEID SYMBOL
1 16966809 PDGFRA
2 16966809 FIP1L1
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

ADD COMMENT • link 9.9 years ago James W. MacDonald 65k
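
A minimal sketch, for readers following along, of how one might check which transcript clusters are affected by this kind of masking once a table in the shape returned by select() is in hand. The data frame below is a hand-built stand-in so that the example runs without the annotation package; the first two IDs and their symbols come from this thread, while the ID 17000001 is made up for illustration. Treating "keep only single-mapped IDs" as what the default bimap view exposes is an approximation based on the explanation above, not a description of the package internals.

```r
# Toy stand-in for the result of
#   select(hugene20sttranscriptcluster.db, keys, "SYMBOL")
# "17000001" is a hypothetical ID with no annotation, added for illustration.
ann <- data.frame(
  PROBEID = c("16966809", "16966809", "16748695", "17000001"),
  SYMBOL  = c("PDGFRA", "FIP1L1", "PDE6H", NA),
  stringsAsFactors = FALSE
)

# Count the distinct, non-missing symbols attached to each transcript cluster.
n_symbols <- tapply(ann$SYMBOL, ann$PROBEID,
                    function(s) length(unique(s[!is.na(s)])))

multi_mapped  <- names(n_symbols)[n_symbols > 1]   # more than one symbol
single_mapped <- names(n_symbols)[n_symbols == 1]  # exactly one symbol
unmapped      <- names(n_symbols)[n_symbols == 0]  # no symbol at all

# Keeping only single-mapped IDs approximates the default (masked) bimap view.
ann_single <- ann[ann$PROBEID %in% single_mapped, ]

print(multi_mapped)
print(ann_single)
```

With a real annotation table, the length of multi_mapped would give a rough idea of how many transcript clusters look "missing" through the older interface.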

Thank you for the clarification. I didn't come across anything to indicate that the bimap interfaces were effectively deprecated. Is there a page or post that describes current best practices for working with annotation? Many vignettes and example workflows still include methods other than select().

So on to disambiguating multimapping transcript clusters, I suppose...

Adam Cornwell

ADD REPLY • link 9.9 years ago Cornwell, Adam ▴ 110

Hi Adam,

The Bimaps are not (in the strictest sense) deprecated. In other words, we are not planning to make the ones that people have traditionally used go away anytime in the near future, so old code that relies on them should still be fine. At the same time, we are not really adding any new ones, and we are not exposing newer annotation resources through the older bimap interface going forward. There are a lot of different reasons for this that I don't need to go into here. Regardless, we have had a better way of doing these things for a couple of years now, and you are encouraged to take advantage of it.

So please look here for a short walkthrough of how we expect most users will interact with annotations today and in the future:

http://www.bioconductor.org/help/workflows/annotation/annotation/

I hope this helps explain things better,

Marc

ADD REPLY • link 9.9 years ago Marc Carlson ★ 7.2k

Hi James,

I'm currently working with hugene20sttranscriptcluster.db as well. Do you know whether the order of the symbols means anything (for example, "PDGFRA" being listed before "FIP1L1"), or is it just arbitrary? I'm trying to include probes that map to multiple genes, so I was wondering whether you would recommend using just the first symbol for each probe or collapsing all the possible symbols for a probe. However, the latter approach will be more troublesome when merging the probes down to the single-gene level.

Best,

Sylvia

ADD REPLY • link 8.3 years ago sylvia ▴ 10
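
The options raised in this question can be compared side by side on a small table. The sketch below only shows the mechanics in base R on a hand-built stand-in for select() output (IDs and symbols taken from this thread); it does not settle whether the row order returned by select() carries any meaning, which this thread does not establish. The third strategy, dropping ambiguous IDs, is an extra option added here for comparison.

```r
# Toy stand-in for the result of
#   select(hugene20sttranscriptcluster.db, keys, "SYMBOL")
ann <- data.frame(
  PROBEID = c("16966809", "16966809", "16748695"),
  SYMBOL  = c("PDGFRA", "FIP1L1", "PDE6H"),
  stringsAsFactors = FALSE
)

# Strategy 1: keep the first row seen for each ID.
# This relies on row order, which may or may not be meaningful.
first_only <- ann[!duplicated(ann$PROBEID), ]

# Strategy 2: collapse all symbols for an ID into one delimited string.
collapsed <- aggregate(SYMBOL ~ PROBEID, data = ann,
                       FUN = function(s) paste(unique(s), collapse = ";"))

# Strategy 3 (added for comparison): drop IDs that map to several symbols.
ambiguous   <- unique(ann$PROBEID[duplicated(ann$PROBEID)])
unambiguous <- ann[!ann$PROBEID %in% ambiguous, ]

print(first_only)
print(collapsed)
print(unambiguous)
```

Collapsed strings such as "PDGFRA;FIP1L1" keep all the information but no longer match any single gene, which is the difficulty with gene-level merging mentioned in the question. Current AnnotationDbi releases also document a mapIds() function with a multiVals argument that covers similar choices; its help page is the place to check the exact behavior.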
