annotation package for chicken affyprobes


Nianhua Li ▴ 870


Hi Lina,

We are lucky in the case of chicken. I just updated AnnBuilder (v1.11.7) to support Gallus gallus (taxon id 9031). You can get it from svn right away, or wait 2 days to download it from the bioc website. Or, if you want to try it now, here are the changes:

================================================================
--- IPI.R (new)
+++ IPI.R (old)
@@ -21,8 +21,7 @@
     speciesNorganismTable <- rbind(
         c("human", "Homo sapiens"),
         c("mouse", "Mus musculus"),
-        c("rat", "Rattus norvegicus"),
-        c("chick", "Gallus gallus")
+        c("rat", "Rattus norvegicus")
     )
     colnames(speciesNorganismTable) <- c("species", "organism")
     return(speciesNorganismTable)
================================================================

The above change is in the function "speciesNorganism". It allows you to get annotation for PFAM and PROSITE.

================================================================
--- getSrcUrl.R (new)
+++ getSrcUrl.R (old)
@@ -65,7 +65,6 @@
         "DANIO RERIO" = "Danio_Rerio",
         "CAENORHABDITIS ELEGANS" = "Caenorhabditis_elegans",
         "DROSOPHILA MELANOGASTER" = "Drosophila_melanogaster",
-        "GALLUS GALLUS" = "Gallus_gallus",
         NA)
     if(is.na(key)) {
         warning(paste("Organism", organism, "is not supported by GoldenPath (GP)."))
@@ -170,7 +169,6 @@
     Sma = "Schistosoma mansoni", Ssa = "Salmo salar", Ssc = "Sus scrofa",
     Str = "Xenopus tropicalis", Xl = "Xenopus laevis",
     At = "Arabidopsis thaliana",
-    Gga = "Gallus gallus",
     Gma = "Glycine max", Han = "Helianthus annus", Hv = "Hordeum vulgare",
     Lsa = " Lactuca sativa", Les = "Lycopersicon esculentum",
     Lco = "Lotus corniculatus",
================================================================

The first change is in the function "getUCSCUrl", for chromosome location. The second is in the function "UGSciNames", for UniGene.

I tested it with this script:

===============================
library(AnnBuilder)
mypkg <- function(pkgPath, version) {
    ABPkgBuilder(baseName="mybase.txt",
                 baseMapType="ll",
                 pkgName="mypkg",
                 pkgPath=pkgPath,
                 organism="Gallus gallus",
                 version=version,
                 author=list(
                     authors="Nianhua Li",
                     maintainer="Nianhua Li <email at email.org>"
                 )
    )
}
mypkg(getwd(), "1.0.0")
===============================

mybase.txt is:

1    395929
2    395844
3    396017
4    415357
5    424377

QC data is:

Number of probes: 5
Probe number missmatch: None
Probe missmatch: None
Mappings found for probe based rda files:
    mypkgACCNUM found 0 of 5
    mypkgCHRLOC found 4 of 5
    mypkgCHR found 5 of 5
    mypkgENZYME found 4 of 5
    mypkgGENENAME found 0 of 5
    mypkgGO found 5 of 5
    mypkgLOCUSID found 5 of 5
    mypkgMAP found 0 of 5
    mypkgPATH found 5 of 5
    mypkgPMID found 3 of 5
    mypkgREFSEQ found 5 of 5
    mypkgSUMFUNC found 0 of 5
    mypkgSYMBOL found 5 of 5
    mypkgUNIGENE found 5 of 5
Mappings found for non-probe based rda files:
    mypkgENZYME2PROBE found 5
    mypkgGO2ALLPROBES found 107
    mypkgGO2PROBE found 19
    mypkgORGANISM found 1
    mypkgPATH2PROBE found 18
    mypkgPFAM found 5
    mypkgPMID2PROBE found 4
    mypkgPROSITE found 3

Let me know if you have any questions or concerns.

Cheers!
nianhua
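As a rough illustration of what the IPI.R change does, here is a standalone sketch of the species lookup table with the chicken row present. The table contents come from the diff in this post; the helper `organism_to_species` is hypothetical, is not part of AnnBuilder, and only shows how such a table might be queried.

```r
# Standalone copy of the table from the patched speciesNorganism() in the diff.
speciesNorganism <- function() {
    speciesNorganismTable <- rbind(
        c("human", "Homo sapiens"),
        c("mouse", "Mus musculus"),
        c("rat", "Rattus norvegicus"),
        c("chick", "Gallus gallus")
    )
    colnames(speciesNorganismTable) <- c("species", "organism")
    return(speciesNorganismTable)
}

# Hypothetical helper (NOT part of AnnBuilder): return the short species
# name for an organism, or NA when the organism is not in the table.
organism_to_species <- function(organism) {
    tbl <- speciesNorganism()
    hit <- match(tolower(organism), tolower(tbl[, "organism"]))
    unname(tbl[hit, "species"])
}

organism_to_species("Gallus gallus")  # found once the "chick" row exists
organism_to_species("Danio rerio")    # not in this table, so NA
```

Without the "chick" row the first lookup would also return NA, which is presumably why PFAM and PROSITE annotation needed this change.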


ADD COMMENT • link updated 17.7 years ago by Lina Hultin-Rosenberg ▴ 180 • written 17.7 years ago by Nianhua Li ▴ 870


Lina Hultin-Rosenberg ▴ 180


Hi Jim and Nianhua!

Thank you so much for your answers, perfect timing that the new AnnBuilder version will support chicken!!

Yes Jim, my goal is to output a list of the top probesets along with annotations as well as to be able to add gene names to volcano plots etc. And later on also perhaps use the functional annotations, like GO, for some analyses.

I have quite a lot of annotation data in the annotation file provided by affymetrix - mappings to different identifiers, functional annotations etc. I might have everything I need there but I don't know how to "connect" that to my affybatch or exprSet object or output from limma to handle it easily.

Thanks again!

Best,
Lina

ADD COMMENT • link 17.7 years ago Lina Hultin-Rosenberg ▴ 180


Lina Hultin-Rosenberg ▴ 180


Hi again Nianhua!

I have now installed the new version of AnnBuilder (1.11.8) and wanted to try it with your test script and test file before attempting to build the package for chicken. I created mybase.txt and ran the script, but got the following error/warning messages:

_________________________________________________________________________
Error in "[.data.frame"(temp, , 2) : undefined columns selected
In addition: Warning message:
data length [5] is not a sub-multiple or multiple of the number of rows [3] in matrix
_________________________________________________________________________

It might be something simple I did wrong, but I can't really figure it out myself. Do you know what might be the problem?

Thank you!
/Lina

ADD COMMENT • link 17.7 years ago Lina Hultin-Rosenberg ▴ 180


Dear Lina,

Yeah, the format of mybase.txt is a bit tricky. The data in the file is a table with 2 columns and 5 rows. The two columns must be separated by a tab character (the "Tab" key is usually located next to "Q" on the keyboard; the character is also written as "\t" on Linux). You will get this error message if you use a space (" ") as the separator between the 2 columns. I guess this is what happened in your case, but I may be wrong. So I have attached mybase.txt here for clarification. Hope it works this time.

thanks
nianhua

Attachment: mybase.txt
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060824/6c006e57/attachment.txt
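One way to avoid separator problems is to write the base file from R rather than typing it by hand. The following is a minimal sketch using only base R; the IDs are the ones from the example mybase.txt in this thread, and the temporary file path is just an assumption for illustration.

```r
# Probe IDs and Entrez Gene IDs from the example mybase.txt in this thread.
base <- data.frame(
    probe  = c("1", "2", "3", "4", "5"),
    entrez = c("395929", "395844", "396017", "415357", "424377"),
    stringsAsFactors = FALSE
)

# Written to a temporary location here; use your own path for a real build.
base_file <- file.path(tempdir(), "mybase.txt")
write.table(base, file = base_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Sanity check: every line should split into exactly two tab-separated fields.
n_fields <- count.fields(base_file, sep = "\t")
stopifnot(all(n_fields == 2))

# Reading the file back with a tab separator should give a 5 x 2 table.
check <- read.table(base_file, sep = "\t", header = FALSE,
                    colClasses = "character")
dim(check)
```

Running `count.fields()` with `sep = "\t"` on a hand-edited file is a quick way to spot lines where a space was typed instead of a tab, since those lines report one field instead of two.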

ADD REPLY • link 17.7 years ago Nianhua Li ▴ 870


Lina Hultin-Rosenberg ▴ 180


Hi again!

I managed to build the annotation package for chicken - thanks for all your help!

I was a bit surprised though by the low annotation coverage, see QC data below. I don't really know how the data is collected but I would think more information on chromosome location is known for the probesets. When reading about the new chicken genome assembly (http://genome.ucsc.edu) it says that around 95% of the sequence has been anchored to chromosomes. I thought the annotation process in R used this information?

What can be the reason for the very few anchored probesets? I might be doing something wrong or perhaps it is a problem of mapping probe id's to other identifiers? I used the genbank mappings as mybasefile and unigene and entrez mappings as other sources. Is there a way within R to increase annotation coverage? I am especially interested in chromosome location (number), but maybe this is a problem that is best solved outside R?

Would greatly appreciate some help!

Thank you,
Lina

=======================================================================
QC data:
Number of probes: 38535
Probe number missmatch: None
Probe missmatch: None
Mappings found for probe based rda files:
    chickenACCNUM found 25654 of 38535
    chickenCHR found 9707 of 38535
    chickenCHRLOC found 156 of 38535
    chickenENZYME found 52 of 38535
    chickenGENENAME found 0 of 38535
    chickenGO found 4224 of 38535
    chickenLOCUSID found 9722 of 38535
    chickenMAP found 0 of 38535
    chickenPATH found 87 of 38535
    chickenPMID found 283 of 38535
    chickenREFSEQ found 9709 of 38535
    chickenSUMFUNC found 0 of 38535
    chickenSYMBOL found 9722 of 38535
    chickenUNIGENE found 289 of 38535
Mappings found for non-probe based rda files:
    chickenENZYME2PROBE found 33
    chickenGO2ALLPROBES found 1785
    chickenGO2PROBE found 930
    chickenORGANISM found 1
    chickenPATH2PROBE found 31
    chickenPFAM found 7418
    chickenPMID2PROBE found 101
    chickenPROSITE found 5490
=======================================================================

ADD COMMENT • link 17.7 years ago Lina Hultin-Rosenberg ▴ 180


Dear Lina,

ABPkgBuilder first uses mybasefile and all the other mapping sources you provide to generate a mapping from probeset IDs to Entrez Gene IDs, and then uses the Entrez Gene IDs to retrieve annotations from public databases. The probeset ID to Entrez Gene ID mapping is therefore the basis of all other annotations. I noticed this line in your QC data:

chickenLOCUSID found 9722 of 38535

It means that only 9722 probeset IDs have been mapped to Entrez Gene. That is the reason for the low annotation coverage, I think. The GenBank to Entrez Gene mapping is based on ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz, and the UniGene to Entrez Gene mapping is based on ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2unigene (I am less sure about this one). Could you please trace a few probeset IDs that didn't map to Entrez Gene and see what is going on?

chickenCHRLOC should contain chromosome location information from the UCSC Genome database. The mapping is based on two files in http://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Gallus_gallus/database/ : refGene.txt.gz and refLink.txt.gz. The first file provides chromosome locations for RefSeq IDs. The second file provides the Entrez Gene to RefSeq mapping. It is surprising that only 156 out of 9722 probeset IDs found UCSC annotations. If you could provide some Entrez Gene IDs, I can trace the problem in the code.

chickenCHR and chickenMAP should contain chromosome location information from Entrez Gene. The information comes from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz.

Hope it helps.
nianhua
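To make the two-step process in this reply concrete, here is a small sketch with toy data showing why coverage of every annotation is bounded by the probeset to Entrez Gene mapping, and how one might list the probesets worth tracing. All table names, column names, probe IDs and chromosome values are made up for illustration; this is not AnnBuilder's internal code.

```r
# Step 1 (toy): probe -> Entrez Gene mapping; NA marks an unmapped probe.
probe2entrez <- data.frame(
    probe  = c("p1", "p2", "p3", "p4"),
    entrez = c("395929", "395844", NA, "424377"),
    stringsAsFactors = FALSE
)

# Toy Entrez Gene -> chromosome table with made-up chromosome values,
# standing in for what would be parsed from a file such as gene_info.
entrez2chr <- data.frame(
    entrez = c("395929", "424377"),
    chr    = c("1", "Z"),
    stringsAsFactors = FALSE
)

# Step 2: annotations can only attach to probes that got an Entrez Gene ID.
annotated <- merge(probe2entrez, entrez2chr, by = "entrez", all.x = TRUE)
annotated <- annotated[order(annotated$probe), c("probe", "entrez", "chr")]

# Probes to trace by hand: no Entrez Gene ID at all.
unmapped <- probe2entrez$probe[is.na(probe2entrez$entrez)]

# Counts comparable to the "found X of N" lines in the QC report.
coverage <- c(locusid = sum(!is.na(annotated$entrez)),
              chr     = sum(!is.na(annotated$chr)),
              total   = nrow(annotated))
coverage
```

In this toy case the chromosome count can never exceed the LOCUSID count, which mirrors the pattern in the QC data quoted in this thread.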

ADD REPLY • link 17.7 years ago Nianhua Li ▴ 870


Dear Lina,

Sorry for the late reply. As I mentioned in the previous email, the UCSC genome annotation of chicken is obtained from http://hgdownload.cse.ucsc.edu/goldenPath/galGal3/database/ , which "contains a dump of the UCSC genome annotation database for the May 2006 assembly of the chicken genome (galGal3, Chicken Genome Sequencing Consortium May 2006 release)". We use two files to get chromosome location information: refGene.txt.gz and refLink.txt.gz. I downloaded the current versions of these two files (both dated Aug 27, 2006), imported them into sqlite, and got some statistics on the data:

(1) refGene has only 3847 records, which means only 3847 sequences have chromosome location information.

(2) We draw annotations from refGene using the second column in the file: the accession number of the RefSeq record representing the mRNA sequence. There are only 3730 unique accession numbers in refGene.

(3) We use refLink to obtain the Entrez Gene to RefSeq mapping. refLink covers 154286 unique Entrez Gene IDs and 174215 unique RefSeq accession numbers (for mRNA). But after merging this information with refGene, we only get chromosome location information for 3710 unique Entrez Gene IDs. So 3710 is the maximum number of annotations one can get from UCSC given a list of chicken Entrez Gene IDs.

(4) The affy2entrez mapping file you provided covers 13229 unique Entrez Gene IDs, and 3609 of them overlap with the 3710 Entrez Gene IDs we got in (3). So the mapping file actually did a pretty good job.

Overall, I think it is a "problem" of the UCSC genome database. What do you think? If you have any suggestions for a better data source (i.e. a different file from the UCSC FTP site), I can modify the code accordingly if others agree.

nianhua
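The counting in steps (1) to (4) can be sketched in base R on toy tables. This is only an illustration of the join logic, not the sqlite queries actually used; the column names (`name`, `chrom`, `txStart`, `mrnaAcc`, `locusLinkId`) are assumptions modelled on the UCSC table layout, and all accessions and IDs are made up.

```r
# Toy stand-in for refGene: RefSeq mRNA accession plus chromosome location.
refGene <- data.frame(
    name    = c("NM_A", "NM_B", "NM_B", "NM_C"),
    chrom   = c("chr1", "chr2", "chr2", "chrZ"),
    txStart = c(100L, 500L, 900L, 50L),
    stringsAsFactors = FALSE
)

# Toy stand-in for refLink: RefSeq mRNA accession to Entrez Gene ID.
refLink <- data.frame(
    mrnaAcc     = c("NM_A", "NM_B", "NM_D", "NM_E"),
    locusLinkId = c("101", "102", "103", "104"),
    stringsAsFactors = FALSE
)

# (1) records and (2) unique accessions in refGene.
n_records   <- nrow(refGene)
n_accession <- length(unique(refGene$name))

# (3) inner join: only genes whose RefSeq accession has a location survive.
located  <- merge(refLink, refGene, by.x = "mrnaAcc", by.y = "name")
n_entrez <- length(unique(located$locusLinkId))

# (4) overlap with a user-supplied list of Entrez Gene IDs.
my_entrez <- c("101", "103", "105")
n_overlap <- length(intersect(my_entrez, located$locusLinkId))

c(records = n_records, accessions = n_accession,
  entrez_with_location = n_entrez, overlap = n_overlap)
```

The inner join is the step that shrinks the numbers: refLink may list many genes, but only those whose accession also appears in refGene end up with a location.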

ADD REPLY • link 17.7 years ago Nianhua Li ▴ 870


Dear Nianhua,

Our communication gets a bit delayed by the time difference :-).

I took a look at some of the other files on the UCSC FTP site, and it seems that the files all_mrna.txt.gz, mrnaOrientInfo.txt.gz, all_est.txt.gz and estOrientInfo.txt.gz provide better chromosome location information. They all contain chromosome information for GenBank IDs, and I matched them against the affy2genbank file I had (25654 probeset IDs mapped to GenBank IDs). The resulting matches can be seen below.

File              Total gb ids    Matching gb ids
mrnaOrientInfo    30946           10993
all_mrna          27098           10392
estOrientInfo     616962          11899
all_est           616978          11899

I guess it might be possible to increase the mappings even more. Still, the coverage isn't very good, so I think I will try to BLAST the probeset sequences against the genome to get chromosome locations. I might wait until the galGal3 version is available at Ensembl before doing that, though.

/Lina
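The GenBank ID matching described in this reply (comparing accessions in the UCSC mRNA/EST files with those in an affy2genbank mapping) could be done along the following lines. This is a sketch on made-up accessions; how the counts in the thread were actually produced is not stated, so the use of unique IDs here is an assumption.

```r
# Toy GenBank accessions from a probeset mapping (made-up IDs).
affy_gb <- c("GB001", "GB002", "GB003", "GB004")

# Toy accession columns standing in for two of the UCSC files.
ucsc_gb <- list(
    all_mrna = c("GB001", "GB002", "GB009"),
    all_est  = c("GB002", "GB003", "GB003", "GB004", "GB010")
)

# Per file: number of distinct accessions, and how many of them also
# occur in the probeset mapping.
summary_tbl <- data.frame(
    file     = names(ucsc_gb),
    total    = vapply(ucsc_gb, function(x) length(unique(x)), integer(1)),
    matching = vapply(ucsc_gb, function(x) length(intersect(x, affy_gb)),
                      integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
)
summary_tbl
```

With real files, each toy vector would be replaced by the accession column read from the corresponding downloaded table.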

ADD REPLY • link 17.7 years ago Lina Hultin-Rosenberg ▴ 180
