athPkgBuilder data source
Nianhua Li ▴ 870
@nianhua-li-1606
Last seen 8.3 years ago
Dear list,

I had some doubts about the data sources used by athPkgBuilder that I posted to the bioc-devel list two months ago, but got no reply. I would like to try one more time here. Sorry for the double posting.

----------------------------------------------------------------

I took a close look at the athPkgBuilder function in AnnBuilder (the builder of ath1121501 and ag) and have some questions about the data sources being used:

1. probeset id to gene mapping

The current mapping strategy is:
1) map probe id to "Representative.Public.ID" using the Affymetrix GeneChip annotation data
2) use "Representative.Public.ID" as if it were an AGI locus id to get other annotations (pathway, GO, etc.) from TAIR

It seems that "Representative.Public.ID" is a mix of AGI locus ids, UniGene clusters, and a small number of ids from other sources. In the Affymetrix annotation file there is another column called "Transcript ID (Array Design)", which has almost the same values as "Representative.Public.ID". I suspect it originated from ftp://ftp.tigr.org/pub/data/a_thaliana/Affymetrix/. I am not sure whether Affymetrix updates those two columns on a regular basis.

But if all the annotations (chromosome, GO, pathway) come from TAIR, maybe we should use TAIR's mapping of probeset id to AGI locus id, from ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ :

"The oligonucleotide sequences of the probes were mapped to the Arabidopsis Transcripts dataset from the Arabidopsis genome TAIR6 version (released November 11, 2005). The dataset included mitochondria and chloroplast genes, as well as pseudogenes and non-coding RNAs. The mapping to the TAIR6 Transcripts was performed using the BLASTN program with e-value cutoff < 9.9e-6. For the 25-mer oligo probes used on the Affy chips, the required match length to achieve this e-value is 23 or more identical nucleotides. To assign a probe set to a given locus, at least 9 of the probes included in the probe set were required to match a transcript at that locus."
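To make the assignment rule in the TAIR description concrete, here is a minimal Python sketch. It assumes the BLASTN step has already been run and filtered, so that each input row is one probe-to-locus hit that passed the e-value cutoff. The function name, the tuple layout, and the probeset and probe ids are hypothetical and only serve the illustration; the threshold of 9 comes from the quoted README.

```python
from collections import defaultdict

# Threshold quoted in the TAIR README: at least 9 probes of a probe set
# must match a transcript at the same locus.
MIN_MATCHING_PROBES = 9


def assign_probesets(probe_hits, min_probes=MIN_MATCHING_PROBES):
    """Assign probe sets to loci.

    probe_hits: iterable of (probeset_id, probe_id, locus_id) tuples, one per
    probe-to-transcript hit that already passed the BLASTN cutoff
    (hypothetical input layout).
    Returns {probeset_id: sorted list of loci supported by >= min_probes
    distinct probes}. Probe sets with no qualifying locus are left out.
    """
    # Collect the distinct probes supporting each (probe set, locus) pair,
    # so a probe hitting several transcripts of one locus counts once.
    support = defaultdict(set)
    for probeset_id, probe_id, locus_id in probe_hits:
        support[(probeset_id, locus_id)].add(probe_id)

    assigned = defaultdict(list)
    for (probeset_id, locus_id), probes in support.items():
        if len(probes) >= min_probes:
            assigned[probeset_id].append(locus_id)
    return {ps: sorted(loci) for ps, loci in assigned.items()}


# Made-up example: ps_A has 10 probes on one locus and 1 on another,
# ps_B has only 5 probes on a locus and so stays unassigned.
example_hits = (
    [("ps_A", f"p{i}", "AT1G01010") for i in range(10)]
    + [("ps_A", "p10", "AT1G01020")]
    + [("ps_B", f"q{i}", "AT2G01008") for i in range(5)]
)

if __name__ == "__main__":
    print(assign_probesets(example_hits))
```

Under this rule a probe set can end up with zero, one, or several loci, which is why the sketch returns a list per probe set instead of a single id.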
Not all probeset ids have matching AGI locus ids. Do we need to provide mappings to other gene identifiers, such as GenBank accession numbers or Entrez Gene IDs, to make the annotations more complete? Affymetrix has started to provide probeset id to Entrez Gene ID mappings in their annotation files. Should we include that information? Also, I can see three possible ways to get a probe-to-GenBank mapping:
1) from the Affymetrix annotation file directly
2) probe to AGI locus, then AGI locus to GenBank accession, all from TAIR
3) probe to Entrez Gene from Affymetrix, then Entrez Gene to GenBank from NCBI
Which way is best? Or should we use the "voting" algorithm used by ABPkgBuilder?

2. chromosome location

The current package gets chromosome locations from ftp://ftp.arabidopsis.org/home/tair/Genes/est_mapping/est.Assignment.Locus . Even though the file seems to be updated very often, the directory it is located in and the README file are not, so it is not clear to me how it is generated and updated. Any hints on that? Would ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ be a better source? The meaning of chromosome location in those two sources may be different, though: the former means the location of a GenBank EST, and the latter means "chromosome coordinates of the best probe set match to the Transcripts dataset".

3. gene description (ath1121501GENENAME)

The current package (1.12.1) gets the description from ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR_sequenced_genes . The descriptions are the same as those in ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ . Both give the description of the AGI locus corresponding to an Affy probeset. In the Affymetrix annotation file, there is a column called "Target Description", which is the description of the gene that a probeset targets. All probesets have descriptions there, so we would get better coverage than by taking descriptions from TAIR.
When the "Representative Public ID" (or "Transcript ID") is an AGI locus id, it seems the description was retrieved from TAIR. However, it is not clear how this information is updated, or whether it is synchronized with TAIR's updates. Another possible source of descriptions is Entrez Gene, since Affymetrix maps probesets to Entrez Gene.

4. pathway

Pathway information is currently obtained from AraCyc, a pathway tool in TAIR: http://www.arabidopsis.org/tools/aracyc/introduction.jsp . I believe it only contains metabolic pathways (I may be wrong, as I only read the introduction). KEGG contains regulatory pathways as well, and it is also manually curated. Those two sources are independent of each other. Shall we include both of them?

5. pubmed

The probeset to pubmed mapping is currently obtained from ftp://ftp.arabidopsis.org/home/tair/Ontologies/Plant_Ontology/stru-060309.txt . The pubmed ids represent the publications that TAIR used to map an AGI locus id to a concept in Plant Ontology. But I think an environment like ath1121501PUBMED should represent the publications about the gene matching a probeset. I didn't find an AGI locus to pubmed mapping in TAIR, so we have to get it from either the Entrez Gene id or the GenBank accession. This gets back to the first question: what is the best way to map probesets to GenBank/Entrez Gene?

I hope this email is not too long. Any feedback will be highly appreciated. If we decide to use a better data source, I will be happy to do the implementation.

Many thanks,

Nianhua Li
computational biology, public health, FHCRC
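Coming back to question 1, one quick way to see how mixed the "Representative Public ID" column is would be to classify its values by format. The Python sketch below is only an illustration under stated assumptions: that AGI locus ids look like At1g01010 (chromosomes 1 to 5, C, or M, followed by "g" and five digits), and that Arabidopsis UniGene cluster ids look like "At." followed by digits. The function names and the example ids are hypothetical, and reading the column out of the Affymetrix annotation file is left out.

```python
import re
from collections import Counter

# Assumed formats (not taken from the Affymetrix file documentation):
# AGI locus ids such as At1g01010, AtCg00490, AtMg00010.
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)
# Arabidopsis UniGene cluster ids such as At.12345.
UNIGENE_CLUSTER = re.compile(r"^At\.\d+$")


def classify_public_id(public_id):
    """Return 'agi_locus', 'unigene', or 'other' for one id string."""
    value = public_id.strip()
    if AGI_LOCUS.match(value):
        return "agi_locus"
    if UNIGENE_CLUSTER.match(value):
        return "unigene"
    return "other"


def summarize(public_ids):
    """Count how many ids fall into each class."""
    return Counter(classify_public_id(pid) for pid in public_ids)


# Illustrative values only, not taken from a real annotation file.
example_ids = ["At1g01010", "AT5G67640", "AtCg00490", "At.12345", "X16077"]

if __name__ == "__main__":
    print(summarize(example_ids))
```

Ids carrying a splice-variant suffix (for example "AT1G01010.1") would fall under "other" here, so the patterns would need adjusting if the column contains transcript-level ids.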
Annotation Pathways GO probe oligo
Björn Usadel ▴ 250
@bjorn-usadel-1492
Last seen 8.3 years ago
Dear Nianhua,

as Tine and you pointed out, there are some probesets that don't match a gene. We used to map the probesets ourselves, based on the oligos as well, and came to very similar conclusions as TAIR. Also, the very old mappings that matched every single probeset were based on the target sequences, that is, the sequences the oligos were designed against, not the actual oligos. Thus, by including the "missing" ones you would get spurious or wrong assignments in most cases. Most (if not all) of the missing ones simply don't hit a gene model from the latest TAIR release and should therefore not be annotated with any gene.

I personally would prefer not to map the ones that, given current knowledge, sample different genes or no genes at all. The only thing you might want to change is TAIR's thresholds, but I think they are quite reasonable, and if everyone relies on their mapping, we are at least all talking about the same thing.

Cheers,
Björn

--
Björn Usadel, PhD
Max Planck Institute of Molecular Plant Physiology
System Regulation Group
Am Mühlenberg 1
D-14476 Golm
Germany
Tel (+49 331) 567-8114
Email usadel at mpimp-golm.mpg.de
WWW mapman.mpimp-golm.mpg.de