NA geneSymbol with lumi

Sebastien Gerega ▴ 370

@sebastien-gerega-2229

Last seen 9.6 years ago

Hi,

I am using the lumi package to analyse Illumina microarray data. When it finally comes to getting the top 10 DE genes with topTable, I get many hits with the geneSymbol <NA>. However, if I look up the ProbeIDs corresponding to the nuIDs that give <NA>, I find that they do correspond to genes. Why aren't the symbols being displayed in the topTable?

thanks,
Sebastien

          ID                 geneSymbol logFC     t         P.Value      adj.P.Val    B
    1917  fwfUovXT3rjAjqbpJU S100A8     -5.307223 -50.43759 9.854174e-09 0.0001383625 8.724832
    12632 Qd_S7V4OkLjsX3jkt4 KRT6B      -5.281406 -39.54237 3.896317e-08 0.0002735409 8.229157
    12149 BjSTT6BOqGLhpKKFGI <NA>       -3.118669 -30.01505 1.844180e-07 0.0008631377 7.451766
    7474  6ipCUUDxcp4ryIj6Uk <NA>       -3.155916 -24.45685 5.835502e-07 0.0013366890 6.716048
    3831  3nivfFfvk55Rd18lLk <NA>       -2.690362 -24.10891 6.324511e-07 0.0013366890 6.659617

Microarray lumi • 1.1k views

ADD COMMENT • link 16.5 years ago Sebastien Gerega ▴ 370

I have looked into this problem a little more.

I downloaded the Human6_v2_sequence spreadsheet from the Illumina website and found that many of the targets that give NA as gene symbol have no symbol in the Illumina database either. For example:

          ID          geneSymbol
    5903  ILMN_21212  FAM43A
    3103  ILMN_1425   FOXO4
    11993 ILMN_6504   PPL
    5153  ILMN_19390  ST3GAL4
    1723  ILMN_12716  CREB3L2
    4484  ILMN_17676  TNS3
    2700  ILMN_138461 <NA>
    1358  ILMN_12133  FSCN1
    3507  ILMN_15271  CITED4
    12401 ILMN_73087  <NA>

ILMN_73087 gives NA as gene symbol and does not have a gene symbol in the Illumina DB either. However, ILMN_138461 gives NA as gene symbol but does have a gene symbol in the Illumina DB: it is APM-1.

In addition, ILMN_73087 has no entries in either the Illumina or BioC DB, but when I search for ILMN_73087 in Ensembl I get a hit that has multiple EntrezGene listings.

Is there any fix for the NA entries? Is this problem being addressed?

thanks,
Sebastien
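The comparison described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: `pkg` stands in for the symbols returned by the annotation package, `manifest` for the Illumina spreadsheet read into a data frame, and the column names `ProbeID` and `Symbol` (and the use of an empty string for a blank cell) are hypothetical. The three probes and their symbols are the ones quoted in this post.

```r
# Symbols as returned by the annotation package (NA = no symbol found)
pkg <- c(ILMN_21212 = "FAM43A", ILMN_138461 = NA, ILMN_73087 = NA)

# Toy stand-in for the Illumina spreadsheet; column names are assumptions
manifest <- data.frame(
  ProbeID = c("ILMN_21212", "ILMN_138461", "ILMN_73087"),
  Symbol  = c("FAM43A", "APM-1", ""),
  stringsAsFactors = FALSE
)

# Line the manifest symbols up with the package IDs (exact match on ID)
m <- manifest$Symbol[match(names(pkg), manifest$ProbeID)]
m[!is.na(m) & m == ""] <- NA   # treat blank cells as missing

# Classify each probe by where a symbol is available
status <- ifelse(!is.na(pkg), "annotated",
          ifelse(!is.na(m), "manifest only", "missing in both"))
print(status)
```

Probes marked "manifest only" are the ones where the Illumina file could be used as a fallback; "missing in both" probes need another source, such as a sequence search.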

ADD COMMENT • link 16.5 years ago Sebastien Gerega ▴ 370

Hi Sebastien,

Yes, I get this all the time as well. It does not seem to matter whether you use nuID or other Illumina annotations (illuminaMousev1p1, for example).

As a quick fix I have ended up using the Illumina annotations as supplemental data: where there is an "NA", I look up the targetID and use the Illumina annotation for that targetID. In most cases the missing ones are Riken cDNAs. The code below will get you started; note the pitfalls it has. The table ann used below is just the Illumina annotation data read into a data frame:

    > dim(ann)
    [1] 46643    13
    > colnames(ann)
     [1] "Search_key"     "Target"         "ProbeId"        "Gid"
     [5] "Transcript"     "Accession"      "Symbol"         "Type"
     [9] "Start"          "Probe_Sequence" "Definition"     "Ontology"
    [13] "Synonym"

(Note that in ann, "Target" and "ProbeId" do not contain unique entries, so they can't be used as rownames in the table.)

Rough solution:

    # Look up the symbol for every ID in the results (NA where there is no mapping)
    LL <- mget(as.character(results[, "ID"]), envir = illuminaMousev1p1SYMBOL, ifnotfound = NA)
    # IDs for which the annotation package returned NA
    lost_targets <- names(LL)[is.na(LL)]
    # Row(s) of ann whose Target contains each lost ID
    locations <- lapply(lost_targets, function(x) grep(x, ann[, "Target"], fixed = TRUE))
    # WARNING: if length(unlist(locations)) != length(lost_targets) this will go wrong,
    # as it means a targetID was not found, or was found multiple times
    # (have not had this happen yet)
    locations <- lapply(locations, function(x) x[1])  # keep the first hit in case of more than one
    lost_ann <- ann[unlist(locations), ]
    LL[is.na(LL)] <- as.character(lost_ann[, "Symbol"])  # yes, this assumes the same order

Cheers
Paul
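For readers who want to try the fallback idea without the real annotation files, here is a minimal self-contained sketch. It is not the code from the post above: it swaps the substring search (`grep`) for exact matching with `match()`, which returns the first hit or NA and so sidesteps the "not found or found several times" pitfall. The IDs, symbols and the helper name `fill_missing_symbols` are made up for illustration; only the column names `Target` and `Symbol` come from the thread.

```r
# Toy stand-ins (hypothetical data): symbols from the annotation package,
# named by target ID, with NA where the package has no symbol
symbols <- c(t1 = "S100A8", t2 = NA, t3 = NA, t4 = "KRT6B")

# Toy version of the Illumina annotation table; Target is not unique
ann <- data.frame(
  Target = c("t2", "t2", "t4", "t5"),
  Symbol = c("Rik1", "Rik1", "KRT6B", "Abc"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NA symbols from the supplementary table
fill_missing_symbols <- function(symbols, ann) {
  lost <- names(symbols)[is.na(symbols)]
  # match() does exact matching: first matching row, or NA if the ID is absent
  idx <- match(lost, ann$Target)
  symbols[lost] <- as.character(ann$Symbol[idx])
  symbols
}

filled <- fill_missing_symbols(symbols, ann)
print(filled)
```

IDs that are absent from the supplementary table (here `t3`) simply stay NA, and assigning by name avoids relying on the two vectors being in the same order.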

ADD REPLY • link 16.5 years ago Paul Leo ▴ 970

The illuminaMousev1p1 annotation package is made using the RefSeq identifiers provided by Illumina. You can see which identifier was used by looking at as.list(illuminaMousev1psACCNUM()). Some of these are RefSeq IDs and some are GenBank IDs. Only the RefSeq IDs are used for getting the rest of the annotations by searching against NCBI databases. The probes with GenBank IDs are not used because those probes are not in exonic sections of the transcript.

If you are looking for more information on a probe that does not have a gene symbol in the annotation package, you should start with illuminaMousev1psACCNUM() rather than going back to the Illumina manifest file. Or, as was suggested earlier, BLAST the probe sequence. I may do that for the next Bioconductor release rather than using the RefSeq IDs provided by Illumina.

Lynn Amon
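To see which probes fall into the "GenBank ID, therefore no further annotation" group, one rough approach is to split the accession numbers by format. The sketch below assumes the accessions have already been pulled out of the package's ACCNUM mapping into a named character vector; the vector `acc`, its values and the helper `is_refseq` are made up for illustration. It relies only on the fact that RefSeq accessions start with a two-letter prefix followed by an underscore (NM_, NR_, XM_ and so on), while GenBank accessions have no underscore.

```r
# Hypothetical accession numbers, named by probe ID (NA = no accession at all)
acc <- c(p1 = "NM_011313", p2 = "AK012345", p3 = "XM_123456", p4 = NA)

# TRUE for RefSeq-style accessions: two capital letters, then an underscore
is_refseq <- function(acc) !is.na(acc) & grepl("^[A-Z]{2}_", acc)

refseq <- is_refseq(acc)

# Probes that have an accession but not a RefSeq one: candidates for NA symbols
genbank_only <- names(acc)[!refseq & !is.na(acc)]
print(refseq)
print(genbank_only)
```

Probes listed in `genbank_only` are the ones worth checking by hand or by a sequence search, since by the explanation above they would not have been annotated further.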

ADD REPLY • link 16.5 years ago Lynn Amon ▴ 280

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071116/ea59a611/attachment.pl

ADD REPLY • link 16.5 years ago Wei Shi ★ 3.6k
