lumi annotation using nuID, missing gene symbols


John Coulthard ▴ 170

@john-coulthard-3077

Last seen 9.6 years ago

Dear List,

I'm analyzing some Illumina data (HumanHT12_V3_0_R1_11283641_A) using the lumi package. The raw data has 48803 probes, 36157 of which have a gene symbol annotation and 12646 of which don't. When I use lumi to convert the probe IDs to nuIDs and then annotate the nuIDs, only 25935 probes get a gene symbol.

There can't be that many probes whose gene symbol annotation has been deleted, so what am I doing wrong? Is there a way to get the probe_ids and gene symbols that came with the raw data onto my topTable output after the analysis?

My working is below (not the full analysis, just an example of how I did the annotation step).

Thanks for your time.

John

> lumidata <- lumiR("Sample Probe Profile_rawdata.txt", lib.mapping='lumiHumanIDMapping')
Perform Quality Control assessment of the LumiBatch object ...
Duplicated IDs found and were merged!
> f <- exprs(lumidata)
> g <- as.matrix(rownames(f))
> f <- as.data.frame(cbind(f, g))
> head(f)
                     1   2   3   4                V25
Ku8QhfS0n_hIOABXuE  92  84  75  79 Ku8QhfS0n_hIOABXuE
fqPEquJRRlSVSfL.8A 113 120 111 109 fqPEquJRRlSVSfL.8A
ckiehnugOno9d7vf1Q 107 104  94  94 ckiehnugOno9d7vf1Q
x57Vw5B5Fbt5JUnQkI  93  83  94  94 x57Vw5B5Fbt5JUnQkI
ritxUH.kuHlYqjozpE  93  97  77  89 ritxUH.kuHlYqjozpE
QpE5UiUgmJOJEkPXpc 102  95  97  92 QpE5UiUgmJOJEkPXpc
> f$Symbol <- if (require(lumiHumanAll.db)) getSYMBOL(f$V25, 'lumiHumanAll.db')
> sum(is.na(f$Symbol))
[1] 22868
> data <- read.csv("Sample Probe Profile_rawdata.txt", header = TRUE, sep="\t")
> names(data)
[1] "PROBE_ID" "SYMBOL" "X1.AVG_Signal" "X1.Detection.Pval" "X1.NARRAYS" "X1.ARRAY_STDEV" "X1.BEAD_STDERR"
...
> sum(is.na(data$SYMBOL))
[1] 0
> sum(data$SYMBOL=="")
[1] 12646
> sum(data$SYMBOL!="")
[1] 36157
> sessionInfo()
R version 2.10.1 (2009-12-14)
i386-redhat-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=C
 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] beadarray_1.14.0 lumiHumanIDMapping_1.4.0 limma_3.2.3 lumi_1.12.4 MASS_7.3-4 preprocessCore_1.8.0
 [7] mgcv_1.6-1 affy_1.24.2 lumiHumanAll.db_1.8.1 org.Hs.eg.db_2.3.6 RSQLite_0.8-4 DBI_0.2-5
[13] annotate_1.24.1 AnnotationDbi_1.8.2 Biobase_2.6.1

loaded via a namespace (and not attached):
[1] affyio_1.14.0 grid_2.10.1 hwriter_1.2 KernSmooth_2.23-3 lattice_0.17-26 Matrix_0.999375-33 nlme_3.1-96
[8] tcltk_2.10.1 tools_2.10.1 xtable_1.5-6
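On the second question (carrying the vendor's PROBE_ID and SYMBOL columns onto a results table), one possible approach is a plain lookup with match(). The sketch below is only an illustration in base R: the probe IDs, symbols, logFC values, and the `results` table are invented stand-ins, and it assumes the results table still carries a PROBE_ID column. If the rows are keyed by nuID instead, an extra nuID to probe ID lookup step (not shown) would be needed first.

```r
# Toy stand-in for the PROBE_ID and SYMBOL columns of the vendor file.
# All IDs and symbols here are invented for illustration.
vendor <- data.frame(
  PROBE_ID = c("ILMN_0000001", "ILMN_0000002", "ILMN_0000003"),
  SYMBOL   = c("GENE_A", "", "GENE_C"),
  stringsAsFactors = FALSE
)

# Toy stand-in for a results table (for example, output of limma's topTable)
# that is assumed to keep the vendor probe ID in a PROBE_ID column.
results <- data.frame(
  PROBE_ID = c("ILMN_0000003", "ILMN_0000001", "ILMN_0000002"),
  logFC    = c(1.8, -0.9, 0.2),
  stringsAsFactors = FALSE
)

# The vendor file uses empty strings for unannotated probes; treat them as NA.
vendor$SYMBOL[vendor$SYMBOL == ""] <- NA

# match() looks up each results row in the vendor table and keeps the
# row order of the results table unchanged.
results$VendorSymbol <- vendor$SYMBOL[match(results$PROBE_ID, vendor$PROBE_ID)]

print(results)
```

Using match() rather than merge() keeps the ranking of the results table intact, which matters when the table is already sorted by significance.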

Annotation probe annotate lumi • 1.8k views

ADD COMMENT • link updated 14.0 years ago by Marc Carlson ★ 7.2k • written 14.0 years ago by John Coulthard ▴ 170


Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi John,

I don't have enough of your code to know for sure, but you might not be doing anything wrong. The lumi package maps the individual Illumina probes onto the appropriate (current) RefSeq or Entrez Gene IDs, which in turn connects them to the appropriate gene symbols. In doing so, you are no longer blindly trusting the mappings from Illumina, so your mappings will be more cautious/conservative than the output you got directly from Illumina. Being more careful can mean finding fewer matches, because not all probes may be measuring what they were originally designed to measure. You can read more about the details in the vignettes for the lumi package (section 3.2 of the first vignette):

http://www.bioconductor.org/packages/release/bioc/html/lumi.html

How best to match probes, or sets of probes, onto annotations is a pretty huge topic. But if you feel that the lumi package is being too conservative in its assignments, you can always make your own annotation package using SQLForge from the AnnotationDbi package, matching the probes or groups of probes with whatever RefSeq/GenBank/Entrez IDs you are willing to trust with your data. The instructions for using SQLForge are here in case you want to do that:

http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html

I hope you find this helpful,

Marc

On 05/07/2010 06:20 AM, John Coulthard wrote:
> Dear List
> [...]
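To see how much of the gap comes from the more conservative mapping, it can help to cross-tabulate the two annotation sources probe by probe. The following is a minimal base R sketch with invented symbols; `vendor_symbol` and `db_symbol` are hypothetical vectors standing in for the vendor SYMBOL column and the database lookup result, assumed to be aligned to the same probes in the same order.

```r
# Invented example vectors, one entry per probe, in the same probe order.
# vendor_symbol: "" means the vendor file has no symbol for the probe.
# db_symbol:     NA means the annotation database has no symbol for it.
vendor_symbol <- c("GENE_A", "", "GENE_C", "GENE_D", "")
db_symbol     <- c("GENE_A", NA, NA, "GENE_D2", NA)

has_vendor <- vendor_symbol != ""
has_db     <- !is.na(db_symbol)

# 2 x 2 table of agreement on whether a probe is annotated at all.
print(table(vendor = has_vendor, database = has_db))

# Probes the vendor annotates but the database leaves unannotated.
vendor_only <- which(has_vendor & !has_db)

# Probes annotated by both sources but with different symbols.
disagree <- which(has_vendor & has_db & vendor_symbol != db_symbol)

print(vendor_only)
print(disagree)
```

The vendor-only group is the one to inspect when deciding whether the database is being too conservative for a given analysis.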

ADD COMMENT • link 14.0 years ago Marc Carlson ★ 7.2k


Gilbert Feng ▴ 300

@gilbert-feng-3778

Last seen 9.6 years ago

Hi John,

Thanks for choosing lumi. First of all, we recommend using the latest Bioconductor release (2.6) and the latest lumiHumanAll.db (1.10.0), which supports mapping one nuID to more than one gene.

Not all annotated probes align well to their corresponding genes, so only probes with good alignment scores are annotated in lumiHumanAll.db. The annotation mapping is based on the annotation files from the computational biology group at the University of Cambridge (http://www.compbio.group.cam.ac.uk/Resources/Annotation/). From each annotation file, only probes with perfect or good alignment scores are kept in lumiHumanAll.db, lumiMouseAll.db and lumiRatAll.db. You can use those annotation files to do the mapping yourself.

Hope this is helpful for your question!

Gilbert

On 5/7/10 8:20 AM, "John Coulthard" <bahhab at hotmail.com> wrote:
> Dear List
> [...]

-----------------------------------------------
Gang (Gilbert) Feng, PhD
Biomedical Informatics Center
Robert H. Lurie Comprehensive Cancer Center
Northwestern University
750 N. Lake Shore Drive, 11th Floor (11-175e)
Chicago, IL 60611
Phone: 312-503-2358
Email: g-feng (at) northwestern.edu
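When one nuID can map to several genes, a lookup yields a list of symbol vectors rather than a single column, so it usually has to be collapsed before it can be attached to a results table. The sketch below shows one way to do that in base R. The list `hits` is an invented stand-in, assumed to be shaped like the named list that a one-to-many lookup (for example via mget() on an annotation map) would return; the nuID names and symbols are hypothetical.

```r
# Invented one-to-many lookup result: one element per nuID, each holding
# zero, one, or several gene symbols (NA when nothing was found).
hits <- list(
  nuID_1 = c("GENE_A"),
  nuID_2 = c("GENE_B", "GENE_C"),
  nuID_3 = NA_character_
)

# Number of gene symbols found for each nuID.
n_genes <- vapply(hits, function(x) sum(!is.na(x)), integer(1))

# Collapse each element to a single string so it fits in one table column;
# nuIDs without any symbol stay NA.
collapsed <- vapply(
  hits,
  function(x) if (all(is.na(x))) NA_character_ else paste(x, collapse = ";"),
  character(1)
)

print(n_genes)
print(collapsed)
```

Joining with ";" is an arbitrary choice made here for illustration; keeping only the first symbol, or dropping multi-gene nuIDs altogether, are equally defensible depending on the downstream analysis.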

ADD COMMENT • link 14.0 years ago Gilbert Feng ▴ 300
