AnnBuilder's parseData() doesn't recognize ACCs with underscore?


Benjamin Otto ▴ 830

@benjamin-otto-1519

Last seen 11.4 years ago

Hi,

parseData() seems to have problems in recognition of accession numbers including an underscore like "NM_001815". The function just doesn't find them although they do exist in the database file.

Here is the example I'm trying to get working:

>library(AnnBuilder)
>pkgpath <- .find.package("AnnBuilder")
># unigene infos
>ugUrl <- "C:/Programme/R/R-2.4.1/library/AnnBuilder/data/Ths.data"
># parsing
>ug <- UG(srcUrl = ugUrl, parser = file.path(pkgpath,
>"scripts", "gbUGParser"), baseFile = "geneNMap",
>organism = "Homo sapiens", built = "N/A", fromWeb = FALSE)
>parseData(ug)

The geneNMap file has the entries:

32468_f_at  D90278;M16652
32469_at    L00693
NM_001815   NM_001815
BF897514    BF897514
38912_at    D90042
BC028014    BC028014
D90042      D90042

I get out:

           [,1]         [,2]
32468_f_at "32468_f_at" "1084;63036"
32469_at   "32469_at"   "1084"
38912_at   "38912_at"   "10"
BF897514   "BF897514"   "1084"
D90042     "D90042"     "10"

Thanks a lot for your help in advance..

Regards,

Benjamin

--
Benjamin Otto
Universitaetsklinikum Eppendorf Hamburg
Institut fuer Klinische Chemie
Martinistrasse 52
20246 Hamburg

• 1.7k views

ADD COMMENT • link updated 19.0 years ago by John Zhang ★ 2.9k • written 19.0 years ago by Benjamin Otto ▴ 830
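A quick way to see exactly which entries were dropped is to compare the keys of the base file with the row names of the result. The sketch below does not call AnnBuilder; it hard-codes the values shown in the post, and the variable names are illustrative only.

```r
# Keys (first column) of the geneNMap base file, copied from the post
base_ids <- c("32468_f_at", "32469_at", "NM_001815", "BF897514",
              "38912_at", "BC028014", "D90042")

# Row names of the matrix shown in the post as the parseData() result
parsed_ids <- c("32468_f_at", "32469_at", "38912_at", "BF897514", "D90042")

# Identifiers present in the base file but absent from the parsed result
missing_ids <- setdiff(base_ids, parsed_ids)
print(missing_ids)
```

For these values the comparison reports NM_001815 and BC028014. BC028014 contains no underscore, so its absence may have a different cause, for example no matching entry in the source file.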


John Zhang ★ 2.9k

@john-zhang-6

Last seen 11.4 years ago

> parseData() seems to have problems in recognition of accession numbers
> including an underscore like "NM_001815". The function just doesn't find
> them although they do exist in the database file.

You have used the wrong parser. There are parsers, such as egRefseqParser and gbNRef2LLParser, that handle RefSeq IDs with underscores. You need to pick one that fits your data.

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Jianhua Zhang
Department of Medical Oncology
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084

ADD COMMENT • link 19.0 years ago John Zhang ★ 2.9k


Hi John,

You're right, my problem is tied to the mix of accession and RefSeq IDs, so strictly speaking gbUGParser wouldn't be expected to find the RefSeqs (my description of "accessions including underscores" was pretty dopey, I admit). I just, probably in an attack of wild speculation, thought the "gb" scripts would automatically include the RefSeqs, because there are no REF2xxxParsers and gbNRef2LLParser is the only parser with RefSeq on the input side (as far as I can remember). gbNRef2LLParser returns LocusLink IDs, but I would like to match UniGene IDs, and there seems to be no "gbNREF2UGParser"...

So I should probably rename a copy of gbUGParser to "gbNREF2UGParser" and add the "_" to the regular expression.

Regards,

Benjamin

-----Original Message-----
From: John Zhang [mailto:jzhang at jimmy.harvard.edu]
Sent: 17 January 2007 15:12
To: bioconductor at stat.math.ethz.ch; b.otto at uke.uni-hamburg.de
Subject: Re: [BioC] AnnBuilders paseData() doesn't recognize ACCs with underscore?

ADD REPLY • link 19.0 years ago Benjamin Otto ▴ 830
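Because the base file mixes GenBank accessions with RefSeq IDs, one option is to split it by identifier type before deciding which parser handles which part. This is a minimal sketch, assuming RefSeq IDs can be recognised by two capital letters followed by an underscore and digits (as in NM_001815); the data frame and its column names are made up for illustration and are not part of AnnBuilder.

```r
# Subset of the geneNMap entries from the original post
base <- data.frame(
  id  = c("32469_at", "NM_001815", "BF897514", "D90042"),
  acc = c("L00693",   "NM_001815", "BF897514", "D90042"),
  stringsAsFactors = FALSE
)

# Assumption: RefSeq-style accessions look like two capital letters, "_", digits
is_refseq <- grepl("^[A-Z]{2}_[0-9]+$", base$acc)

# Rows to hand to a RefSeq-aware parser vs. a GenBank-only parser
refseq_part  <- base[is_refseq, ]
genbank_part <- base[!is_refseq, ]

print(refseq_part)
print(genbank_part)
```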


Hi John,

Here comes a correction to my last email. Probably my brain is working in power-save mode today, but now I'm a little bit confused:

1) gbUGParser should take GenBank IDs (accessions) and return UniGene IDs, right?
2) NM_xxxxxx might denote reference sequences, but they still ARE accessions, right? AND they are GenBank identifiers.

So gbUGParser SHOULD recognize them as valid identifiers.

Regards,

Benjamin

ADD REPLY • link 19.0 years ago Benjamin Otto ▴ 830


Benjamin Otto ▴ 830

@benjamin-otto-1519

Last seen 11.4 years ago

Ah, sorry, I just solved the problem. I had to add the "_" to the regular expression in the gbUGParser file in the scripts folder...

Regards,

Benjamin

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On behalf of Benjamin Otto
Sent: 17 January 2007 14:50
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] AnnBuilders paseData() doesn't recognize ACCs with underscore?

ADD COMMENT • link 19.0 years ago Benjamin Otto ▴ 830
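The thread does not show the regular expression used inside gbUGParser, so the two patterns below are hypothetical stand-ins rather than the script's actual code. The sketch, written in R, only illustrates why a character class without "_" skips RefSeq-style accessions and how adding it changes the match.

```r
ids <- c("D90042", "BF897514", "NM_001815")

# Hypothetical pattern: letters and digits only, no underscore
strict  <- "^[A-Za-z0-9]+$"
# Same pattern with "_" added to the character class
relaxed <- "^[A-Za-z0-9_]+$"

matches_strict  <- grepl(strict, ids)
matches_relaxed <- grepl(relaxed, ids)

# Side-by-side view of which IDs each pattern accepts
print(data.frame(id = ids, strict = matches_strict, relaxed = matches_relaxed))
```

With the strict pattern only NM_001815 is rejected; with the relaxed pattern all three identifiers match.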


John Zhang ★ 2.9k

@john-zhang-6

Last seen 11.4 years ago

> So probably I should rename a copy of the gbUGParser to "gbNREF2UGParser"
> and add the "_" to the regular expression.

Yes, you can always write your own parsers to meet special requirements.

ADD COMMENT • link 19.0 years ago John Zhang ★ 2.9k
