Problems in AnnBuilder package function ABPkgBuilder
Asta Laiho ▴ 70
@asta-laiho-2025
Last seen 9.6 years ago
I have been using the AnnBuilder function ABPkgBuilder to create an annotation package for the Affymetrix array rat2302. I compared the package that I created (on March 23) with the rat2302 annotation package released on March 15 (Bioc 2.0) and detected some differences between the packages. I was wondering what could cause these differences. Here is the package information for the rat2302_1.15.13 package and for my own package.

BIOC:

Quality control information for rat2302
Date built: Created: Thu Mar 15 18:25:07 2007
Number of probes: 31099
Probe number missmatch: None
Probe missmatch: None

Mappings found for probe based rda files:
  rat2302ACCNUM found 31099 of 31099
  rat2302CHRLOC found 12177 of 31099
  rat2302CHR found 23212 of 31099
  rat2302ENTREZID found 23250 of 31099
  rat2302ENZYME found 1916 of 31099
  rat2302GENENAME found 23229 of 31099
  rat2302GO found 14228 of 31099
  rat2302MAP found 22526 of 31099
  rat2302PATH found 4535 of 31099
  rat2302PMID found 14511 of 31099
  rat2302REFSEQ found 23157 of 31099
  rat2302SUMFUNC found 0 of 31099
  rat2302SYMBOL found 23249 of 31099
  rat2302UNIGENE found 22825 of 31099

Mappings found for non-probe based rda files:
  rat2302CHRLENGTHS found 21
  rat2302ENZYME2PROBE found 586
  rat2302GO2ALLPROBES found 7649
  rat2302GO2PROBE found 5695
  rat2302ORGANISM found 1
  rat2302PATH2PROBE found 177
  rat2302PFAM found 18634
  rat2302PMID2PROBE found 24911
  rat2302PROSITE found 13246

My own package:

AnnBuilder_1.13.21
Affy: rat230_2.na22.annot.csv.zip (3/9/07)
GO: Built: 08-Feb-2007

Quality control information for rat2302Geno
Date built: Created: Fri Mar 23 14:18:24 2007
Number of probes: 31099
Probe number missmatch: None
Probe missmatch: None

Mappings found for probe based rda files:
  rat2302GenoACCNUM found 31099 of 31099
  rat2302GenoCHR found 23101 of 31099
  rat2302GenoENTREZID found 23140 of 31099
  rat2302GenoENZYME found 1913 of 31099
  rat2302GenoGENENAME found 23119 of 31099
  rat2302GenoGO found 18000 of 31099
  rat2302GenoMAP found 22424 of 31099
  rat2302GenoPATH found 4539 of 31099
  rat2302GenoPMID found 14539 of 31099
  rat2302GenoREFSEQ found 23045 of 31099
  rat2302GenoSUMFUNC found 0 of 31099
  rat2302GenoSYMBOL found 23140 of 31099
  rat2302GenoUNIGENE found 22755 of 31099

Mappings found for non-probe based rda files:
  rat2302GenoCHRLENGTHS found 21
  rat2302GenoENZYME2PROBE found 590
  rat2302GenoGO2ALLPROBES found 8313
  rat2302GenoGO2PROBE found 6348
  rat2302GenoORGANISM found 1
  rat2302GenoPATH2PROBE found 177
  rat2302GenoPFAM found 18575
  rat2302GenoPMID2PROBE found 25257
  rat2302GenoPROSITE found 13194

So my package was built 7 days after the official package. Yet some differences can be noticed. The Bioc package has 110 more Entrez IDs than my package. This is surprising since, in my experience, the number of Entrez IDs found should increase over time, not decrease. The Bioc package also has 55 more unique Entrez IDs than my package. I use the public representative id from the rat2302 Affymetrix annotation file as the primary source for the mappings and the Unigene and Entrez id columns as secondary sources, as I have been told is also done when creating the Bioc annotation package.

The most striking difference is in the GO information. Even though the Bioc package html info page declares that the same release of the GO information was used in building it (08-Feb-2007), it seems that an older version has actually been used. What else could explain that my package contains GO information for almost 4000 more probesets? When I last updated my own package in December, I also had information for 4000 fewer probesets.

My package is also totally missing the CHRLOC information. This, I assume, is because I get the following error message when building the annotation package:

"Error in loadFromUrl(srcUrl) : URL ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomesrefLink.txt.gz is incorrect or the target site is not responding!"
This file does not exist on the server anymore (it was already removed last year), and I hope that AnnBuilder can soon be updated accordingly.

Another thing that I faced was a problem in the Unigene data file format. I had to remove the "//" on the last row of the file before AnnBuilder was able to process the file.

Two weeks ago (when I built the package) the url for the KEGG data was still working fine, but this week I noticed that the url "ftp://ftp.genome.ad.jp/pub/kegg/pathways" had changed to "ftp://ftp.genome.ad.jp/pub/kegg/pathway". Something else in the file structure has changed as well, since fixing just the url did not help. I hope that this can also soon be updated for the package.

I have also attached the sessionInfo below. I also tested creating the same annotation package with R 2.4.0 and the previous version of AnnBuilder, but the created package was identical to the one I managed to create now.

Regards,
Asta Laiho

#---------------------------------------------------------------------------------------------
> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-02-11 r40701)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] "tools"     "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "methods"   "base"

other attached packages:
 AnnBuilder    annotate         XML     Biobase     rat2302 rat2302Geno
  "1.13.21"    "1.13.6"     "1.6-0"   "1.13.38"   "1.15.13"     "1.0.0"
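
The probe-based counts from the two QC reports can be put side by side to make the differences easier to read. The following is a minimal R sketch and not part of AnnBuilder; the names `bioc`, `mine` and `qc` are made up for illustration, and the numbers are copied from the two reports.

```r
# Probe-based mapping counts copied from the two QC reports.
# `bioc` and `mine` are illustrative names, not AnnBuilder objects.
bioc <- c(ACCNUM = 31099, CHRLOC = 12177, CHR = 23212, ENTREZID = 23250,
          ENZYME = 1916, GENENAME = 23229, GO = 14228, MAP = 22526,
          PATH = 4535, PMID = 14511, REFSEQ = 23157, SUMFUNC = 0,
          SYMBOL = 23249, UNIGENE = 22825)
mine <- c(ACCNUM = 31099, CHR = 23101, ENTREZID = 23140, ENZYME = 1913,
          GENENAME = 23119, GO = 18000, MAP = 22424, PATH = 4539,
          PMID = 14539, REFSEQ = 23045, SUMFUNC = 0, SYMBOL = 23140,
          UNIGENE = 22755)

# Align on the union of mapping names; a mapping that is absent from
# one package (here CHRLOC) shows up as NA.
maps <- union(names(bioc), names(mine))
qc <- data.frame(mapping = maps,
                 bioc = unname(bioc[maps]),
                 mine = unname(mine[maps]))
qc$diff <- qc$mine - qc$bioc  # positive: more probes mapped in my package

# Largest losses first, NA rows last.
print(qc[order(qc$diff), ], row.names = FALSE)
```

The `diff` column reproduces the 110 fewer ENTREZID mappings and the 3772 additional GO mappings discussed in the post, and CHRLOC appears as NA because the rat2302Geno build produced no such mapping.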
Nianhua Li ▴ 870
@nianhua-li-1606
Last seen 9.6 years ago
Hi, Asta,

> So my package was built 7 days after the official package.

Even though the package was "published" on March 15, the source data were downloaded from public data resources on Feb 28. So the difference in source data is about 3 weeks.

> I use the public representative id from the rat2302 Affymetrix annotation file as the primary source for the mappings and the Unigene and Entrez id columns as secondary sources, as I have been told is also done when creating the Bioc annotation package.

We used those files in the same way. The annotation file for Rat230_2 on the Affymetrix site is dated 3/9/07, which was not available when we created rat2302_1.15.13. Ours was dated 11/15/06. I guess this causes most of the differences in the probeset to Entrez mapping. The other possible contribution is UniGene. I guess you didn't give the "organism" argument correctly.

> The most striking difference is in the GO information. Even though the Bioc package html info page declares that the same release of the GO information was used in building it (08-Feb-2007), it seems that an older version has actually been used.

The gene to GO mapping information is obtained from the Entrez Gene website. The GO package is only used to get the ontology category (BP, MF, CC). I just re-created rat2302 with my local mirror of the source data that I downloaded on Feb 28 and got the same QC result. This time I am sure my GO version is 1.15.13. I also re-created rat2302 with the latest source data and GO 1.15.13 and found GO mappings for 18085 probeset IDs. So, the difference is not caused by the GO package.

> What else could explain that my package contains GO information for almost 4000 more probesets?

You normally want to look at the source data first. I downloaded gene2go.gz from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA and compared it with the one I downloaded on Feb 28.
The old gene2go.gz provides GO mappings for 11354 rat Entrez Gene IDs; the new file covers 16759 rat Entrez Gene IDs. 5411 rat Entrez Gene IDs only have GO mappings in the new file. 6 rat Entrez Gene IDs only have GO mappings in the old file.

I also checked human Entrez Gene IDs: 17324 IDs in the new file, 132 of them unique to the new file; 17281 IDs in the old file, 89 of them unique to the old file.

For mouse Entrez Gene IDs: 19424 IDs in the new file, 156 of them unique to the new file; 19354 IDs in the old file, 89 of them unique to the old file.

> My package is also totally missing the CHRLOC information. This, I assume, is because I get the following error message when building the annotation package:
>
> "Error in loadFromUrl(srcUrl) : URL ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomesrefLink.txt.gz is incorrect or the target site is not responding!"

Did you spell "Rattus norvegicus" correctly? The URL should be something like

ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/<organism_spec_dir>/database/refLink.txt.gz

The missing <organism_spec_dir> part in your error message suggests that the organism value was wrong.

> This file does not exist on the server anymore (it was already removed last year)

The file is at

ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Rattus_norvegicus/database/refLink.txt.gz

It has not been moved in the past two years.

> Another thing that I faced was a problem in the Unigene data file format. I had to remove the "//" on the last row of the file before AnnBuilder was able to process the file.

Again, this is because the "organism" value was wrong. By the way, this also affects the probeset to Entrez Gene mapping process.

> Two weeks ago (when I built the package) the url for the KEGG data was still working fine, but this week I noticed that the url "ftp://ftp.genome.ad.jp/pub/kegg/pathways" had changed to "ftp://ftp.genome.ad.jp/pub/kegg/pathway".
> Something else in the file structure has changed as well, since fixing just the url did not help. I hope that this can also soon be updated for the package.

Thanks for reporting. It was fixed yesterday. Try AnnBuilder 1.13.23.

I re-created rat2302 by using the latest data (but still the old annotation file from Affymetrix and GO 1.15.13) and compared it with rat2302 1.15.13. Here are the differences:

=== rat2302PATH2PROBE
The two rat2302PATH2PROBE have 176 variables in common.
1 objects in older version only: 00904 .
3 objects in newer version only: 00730 04012 05213 .
Among a random sample of 60 variables that are in common, older version has 0 NAs, newer version has 0 NAs; 40 have identical values in the two packages, and 20 have different values: 00602 05210 04520 04540 00740 ... .

=== rat2302PMID
Among a random sample of 60 variables that are in common, older version has 36 NAs, newer version has 35 NAs; 55 have identical values in the two packages, and 5 have different values: 1380046_at 1395212_at 1374353_x_at 1387054_at 1369818_at .

=== rat2302PMID2PROBE
The two rat2302PMID2PROBE have 24909 variables in common.
2 objects in older version only: 15448694 9462742 .
617 objects in newer version only: 10051504 10070498 10188224 10210204 10210776 ... .

=== rat2302GO
Among a random sample of 60 variables that are in common, older version has 32 NAs, newer version has 24 NAs; 29 have identical values in the two packages, and 31 have different values: 1398279_at 1372378_at 1387178_a_at 1387032_at 1385140_at ... .

=== rat2302PROSITE
Among a random sample of 60 variables that are in common, older version has 24 NAs, newer version has 23 NAs; 59 have identical values in the two packages, and 1 have different values: 1388028_at .

=== rat2302GO2PROBE
The two rat2302GO2PROBE have 5686 variables in common.
9 objects in older version only: GO:0000126 GO:0001747 GO:0004840 GO:0006384 GO:0007222 ... .
686 objects in newer version only: GO:0000003 GO:0000015 GO:0000028 GO:0000156 GO:0000224 ... .
Among a random sample of 60 variables that are in common, older version has 0 NAs, newer version has 0 NAs; 19 have identical values in the two packages, and 41 have different values: GO:0051056 GO:0043406 GO:0016051 GO:0042554 GO:0007274 ... .

=== rat2302ENZYME2PROBE
The two rat2302ENZYME2PROBE have 586 variables in common.
0 objects in older version only: .
7 objects in newer version only: 2.4.1.133 2.4.1.152 2.4.99.7 2.8.1.7 2.8.2.11 ... .
Among a random sample of 60 variables that are in common, older version has 0 NAs, newer version has 0 NAs; 59 have identical values in the two packages, and 1 have different values: 3.1.3.2 .

=== rat2302GO2ALLPROBES
The two rat2302GO2ALLPROBES have 7642 variables in common.
7 objects in older version only: GO:0000126 GO:0004840 GO:0006384 GO:0007222 GO:0030236 ... .
687 objects in newer version only: GO:0000015 GO:0000028 GO:0000156 GO:0000217 GO:0000224 ... .
Among a random sample of 60 variables that are in common, older version has 0 NAs, newer version has 0 NAs; 17 have identical values in the two packages, and 43 have different values: GO:0005234 GO:0042531 GO:0009262 GO:0008705 GO:0007130 ... .

=== rat2302PFAM
Among a random sample of 60 variables that are in common, older version has 21 NAs, newer version has 21 NAs; 59 have identical values in the two packages, and 1 have different values: 1394880_at .

So, there can be a lot of differences in the data even after just 1 month.

Hope this is helpful.

nianhua
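
A report of this kind can be approximated in a few lines of R. The sketch below is only an illustration under stated assumptions and is not the script that produced the output above: `compare_maps` is a hypothetical helper, both mappings are assumed to be plain named lists (for example the result of `as.list()` on an annotation environment), and the toy probe IDs are invented.

```r
# Hypothetical helper: compare two key -> value mappings held as named lists.
compare_maps <- function(old, new, n = 60) {
  common <- intersect(names(old), names(new))
  # Draw at most n shared keys; indexing through sample.int() avoids
  # sample()'s special handling of a length-one numeric vector.
  picked <- common[sample.int(length(common), min(n, length(common)))]
  # TRUE when an entry is a single NA, i.e. no mapping was found.
  is_na <- function(x) length(x) == 1 && is.na(x)
  same <- vapply(picked, function(k) identical(old[[k]], new[[k]]), logical(1))
  list(only_old  = setdiff(names(old), names(new)),
       only_new  = setdiff(names(new), names(old)),
       n_common  = length(common),
       na_old    = sum(vapply(old[picked], is_na, logical(1))),
       na_new    = sum(vapply(new[picked], is_na, logical(1))),
       identical = picked[same],
       different = picked[!same])
}

# Toy data shaped like a PATH2PROBE mapping (probe IDs are invented).
old <- list("00010" = c("a_at", "b_at"), "00904" = "c_at", "04520" = "d_at")
new <- list("00010" = c("a_at", "b_at"), "00730" = "e_at",
            "04520" = c("d_at", "f_at"))
res <- compare_maps(old, new)
str(res)
```

Because only a random sample of the shared keys is inspected, the identical and different counts vary from run to run on real data unless a seed is set with `set.seed()` beforehand.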