SQLForge and probes that map to multiple genes
@cei-abreu-goodger-4433
Last seen 9.8 years ago
Mexico
Hi all,

I'm trying to generate an annotation package for a custom mouse Affy chip (GNF1M). I'm a bit confused about how the package deals with probes that are mapped to multiple genes. Sure, when I have a single column of identifiers everything works nicely, but what exactly happens when I have more than one gene per probe?

I tried a mock annotation, code below:

# Running code to build the annotation package
> library(AnnotationDbi)
> library(mouse.db0)
>
> refseqs <- "gnf1m.test.tab"
> read.table(refseqs)
               V1        V2        V3
1   gnf1m00050_at NM_008929 NM_172283
2 gnf1m00051_a_at NM_007487 NM_172283
3 gnf1m00052_a_at NM_178939 NM_172283
4 gnf1m00053_a_at NM_181666 NM_172283
5 gnf1m00054_a_at NM_026430 NM_172283
6 gnf1m00055_a_at NM_029916 NM_172283
7 gnf1m00056_a_at NM_181666 NM_172283
>
> makeMOUSECHIP_DB(affy=FALSE, prefix="test", fileName=refseqs, baseMapType="refseq",
+     outputDir=".", version="0.9", manufacturer="GNF-Affymetrix", chipName="gnf1m")

After installing, though, it seems to me that I have something strange. Although I added the refseq "NM_172283" to all of the probes, in the annotation it only went to two of them, the last one and another that was identical (see below). This might not be the best example, but if I do have probes that map to different genes, what's the best way of making SQLForge aware of this?

Thanks!

Cei

# loading and accessing the annotation package
> library(test.db)
> as.list(testREFSEQ)
$gnf1m00050_at
[1] "NM_008929" "NP_032955"

$gnf1m00051_a_at
[1] "NM_001039515" "NM_007487"    "NP_001034604" "NP_031513"

$gnf1m00052_a_at
[1] "NM_178939" "NP_849270"

$gnf1m00053_a_at
[1] "NM_172283" "NM_181666" "NP_758487" "NP_858052"

$gnf1m00054_a_at
[1] "NM_026430" "NP_080706"

$gnf1m00055_a_at
[1] "NM_029916" "NP_084192"

$gnf1m00056_a_at
[1] "NM_172283" "NM_181666" "NP_758487" "NP_858052"

> sessionInfo()
R version 2.7.0 (2008-04-22)
i386-apple-darwin8.10.1

locale:
C

attached base packages:
[1] stats     graphics  grDevices datasets  tools     utils     methods
[8] base

other attached packages:
[1] test.db_0.9         mouse.db0_2.1.4     AnnotationDbi_1.2.0
[4] RSQLite_0.6-8       DBI_0.2-4           Biobase_2.0.0

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
@sean-davis-490
Last seen 4 months ago
United States
On Mon, Jul 14, 2008 at 11:20 AM, Cei Abreu-Goodger <cei at sanger.ac.uk> wrote:
> Although I added the refseq "NM_172283" to all of the probes, in the
> annotation it only went to two of them, the last one and another that was
> identical (see below).

If you look up NM_181666 and NM_172283, they are transcripts for the same gene, so one would expect that they will always be included together in the *REFSEQ lookup. The reason this is important is that, although it appears that the third column in your data is being used, it is not.

You probably want to look at the otherSrc parameter for specifying additional IDs to map.

Sean

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
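Sean's observation can be illustrated with a toy lookup. The following is only a sketch in base R, assuming (as the reply suggests) that a supplied accession is first resolved to a gene and that every accession recorded for that gene is then reported. The gene labels `geneA` and `geneB` and the helper `refseq_lookup` are made-up placeholders, not part of AnnotationDbi; the accession groupings are copied from the `as.list(testREFSEQ)` output in the original post.

```r
# Toy gene -> RefSeq table. Gene labels are hypothetical placeholders;
# the accession groupings follow the output shown in the thread.
gene2refseq <- list(
  geneA = c("NM_008929", "NP_032955"),
  geneB = c("NM_172283", "NM_181666", "NP_758487", "NP_858052")
)

# Resolve a supplied accession to its gene, then return every accession
# recorded for that gene (hypothetical helper, for illustration only).
refseq_lookup <- function(accession) {
  hit <- names(gene2refseq)[vapply(gene2refseq,
                                   function(ids) accession %in% ids,
                                   logical(1))]
  if (length(hit) == 0) return(character(0))
  sort(unlist(gene2refseq[hit], use.names = FALSE))
}

# Only the first ID column of the mock file is consulted here.
probe2base <- c(gnf1m00050_at = "NM_008929", gnf1m00053_a_at = "NM_181666")
result <- lapply(probe2base, refseq_lookup)
print(result)
```

Under these assumptions NM_172283 shows up for gnf1m00053_a_at because it shares a gene with NM_181666, not because it was read from the third column.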
Hi Sean,

OK, so my example was even worse than I thought. And I had forgotten to mention that the otherSrc parameter wasn't what I needed. So, to return to my bad example, I now have two separate files, the first ID column in the first file, the second in the second file:

> refseqs <- "gnf1m.test.tab"
> refseqs2 <- "gnf1m.test2.tab"
>
> read.table(refseqs)
               V1        V2
1   gnf1m00050_at NM_008929
2 gnf1m00051_a_at NM_007487
3 gnf1m00052_a_at NM_178939
4 gnf1m00053_a_at NM_181666
5 gnf1m00054_a_at NM_026430
6 gnf1m00055_a_at NM_029916
7 gnf1m00056_a_at NM_181666
> read.table(refseqs2)
               V1        V2
1   gnf1m00050_at NM_172283
2 gnf1m00051_a_at NM_172283
3 gnf1m00052_a_at NM_172283
4 gnf1m00053_a_at NM_172283
5 gnf1m00054_a_at NM_172283
6 gnf1m00055_a_at NM_172283
7 gnf1m00056_a_at NM_172283

I now add the second file as an otherSrc:

> makeMOUSECHIP_DB(affy=FALSE, prefix="test", fileName=refseqs,
+     baseMapType="refseq", otherSrc=c(refseqs2),
+     outputDir=".", version="0.9", manufacturer="GNF-Affymetrix",
+     chipName="gnf1m")

But this still doesn't add the second gene's annotation to all the probes (the resulting package's annotation is exactly the same as in the first case). Is there any other way?

Thanks again,

Cei
On Mon, Jul 14, 2008 at 12:07 PM, Cei Abreu-Goodger <cei at sanger.ac.uk> wrote:
> But this still doesn't add the second gene's annotation to all the probes
> (the resulting package's annotation is exactly the same as in the first
> case). Is there any other way?

I think that the way SQLForge works now, it will only use the additional annotation if the first ID is not successfully mapped. (Someone else should probably confirm my assertion about this.) Since it appears that your first column contains all RefSeq IDs, you will never get to the second column. So, in short, I don't know how to make SQLForge do what you want.

Sean
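The fallback behaviour Sean describes can be sketched in a few lines of base R. This is only an illustration of the idea (use the other source when the base ID fails to map), not SQLForge's actual implementation. The `known` set, the probe names `p1` to `p3`, and the unmappable ID `XX_000000` are invented for the example.

```r
# Accessions the toy "database" can resolve. Purely illustrative.
known <- c("NM_008929", "NM_181666", "NM_172283")

# Base mapping and a secondary mapping for three hypothetical probes.
# "XX_000000" is a made-up ID that cannot be resolved.
base_ids  <- c(p1 = "NM_008929", p2 = "NM_181666", p3 = "XX_000000")
other_ids <- c(p1 = "NM_172283", p2 = "NM_172283", p3 = "NM_172283")

# Use the base ID when it maps; fall back to the other source otherwise.
base_ok <- base_ids %in% known
chosen  <- ifelse(base_ok, base_ids, other_ids)
names(chosen) <- names(base_ids)
print(chosen)
```

In this sketch only `p3` ever receives the secondary ID, which matches the observation that a file whose first column maps completely never reaches the second source.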
Sean Davis wrote:
> I think that the way SQLForge works now, it will only use the
> additional annotation if the first ID is not successfully mapped.
> (Someone else should probably confirm my assertion about this.)

Hi Guys,

Sean is correct about the purpose of the otherSrc parameter, and about the way that SQLForge currently works. The thing that has me scratching my head is why you would want to map multiple genes onto a single probe in your annotation package?

Marc
Hi Marc, Sean and list.

If I can follow up on Marc's comment:
"The thing that has me scratching my head is why you would want to map multiple genes onto a single probe in your annotation package?"

The genomics annotation problem (what does this ProbeSet detect, and which ProbeSets detect my gene of interest) is inherently many to many: one ProbeSet can map to many 'genes' (or at least many different accessions that point to the same gene), and one 'gene' can map to multiple ProbeSets (perhaps different isoforms).

Does SQLForge handle these inevitable situations nicely? Having read the SQLForge PDF documentation, and this post, it seems that you can only provide at most 2 accessions for each ProbeSet, perhaps a RefSeq accession, and if that is not known, a GenBank accession.

If this has been discussed elsewhere, can someone please point me in the right direction?

Cheers,

Mark
-----------------------------------------------------
Mark Cowley, BSc (Bioinformatics)(Hons)

Peter Wills Bioinformatics Centre
Garvan Institute of Medical Research, Sydney, Australia
-----------------------------------------------------
On Mon, Jul 14, 2008 at 8:42 PM, Mark Cowley <m.cowley0 at gmail.com> wrote:
> The genomics annotation problem (what does this ProbeSet detect, and which
> ProbeSets detect my gene of interest) is inherently many to many, that is,
> one ProbeSet can map to many 'genes' (or at least many different accessions
> that point to the same gene), and that 1 'gene' can map to multiple
> ProbeSets (perhaps different isoforms).

This is true, but the extent to which it needs to be "modeled" is up to the user. Our approach is to do everything based on probe (differential expression, etc.) and, for those probes that look VERY interesting but have unclear annotation, blast them against all known transcript databases for hints as to what they represent. I do not think the vast majority of probes/probesets need this special treatment on a daily basis.

> Does SQLforge handle these inevitable situations nicely?

It doesn't sound like SQLForge will handle the situation that you describe. I would suggest a custom SQL database for your mappings. Of course, that will not be useful as an annotation package, but including the many-to-many issues is not generally possible for algorithms using the annotation packages, anyway.

Hope that helps, at least practically speaking.

Sean
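As one possible shape for the custom SQL database Sean suggests, here is a minimal sketch using the DBI and RSQLite packages (both appear in the sessionInfo above; it assumes current versions are installed). The table name `probe_refseq` and its column names are hypothetical choices for this example, and the rows reuse IDs from the mock files.

```r
library(DBI)
library(RSQLite)

# In-memory database so the sketch leaves no files behind.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# One row per probe/accession pair, so a probe may appear many times.
probe_refseq <- data.frame(
  probe_id = c("gnf1m00050_at", "gnf1m00050_at", "gnf1m00053_a_at"),
  refseq   = c("NM_008929", "NM_172283", "NM_181666"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "probe_refseq", probe_refseq)

# Probes linked to more than one accession.
multi <- dbGetQuery(con, "
  SELECT probe_id, COUNT(DISTINCT refseq) AS n_refseq
  FROM probe_refseq
  GROUP BY probe_id
  HAVING COUNT(DISTINCT refseq) > 1
")
print(multi)

# Reverse direction: all probes recorded for one accession.
probes <- dbGetQuery(con,
  "SELECT probe_id FROM probe_refseq WHERE refseq = ?",
  params = list("NM_172283"))
print(probes)

dbDisconnect(con)
```

A long table like this keeps both directions of the many-to-many relationship queryable, at the cost of not being usable as a standard annotation package.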
Mark Cowley wrote:
> Does SQLforge handle these inevitable situations nicely?
> Having read the SQLForge pdf documentation, and this post, it seems
> that you can only provide at most 2 accessions for each ProbeSet,
> perhaps a RefSeq accession, and if that is not known, a GenBank
> accession.

Hi Mark,

In its current form, SQLForge takes as many IDs as you want to give it, but it currently assumes that you only intended to assign one kind of gene to a given probe at a time. That is, it assumes that when you made the probe you really only meant to measure one thing. It is well understood by all of us who make annotation packages that in practice this may not always work out as you intended. But what was confusing me was why you would want to deal with ambiguous probes by creating an ambiguous database?

It seems to me that it might really be better to just not make a gene assignment if you really don't know what your probe is measuring. If a probe is known to be sticking to more than one thing, then the interpretation of any measurement from that probe really becomes very speculative, since you will have no way of knowing what proportion of the signal belongs to what. I agree with Sean that in a rare case like this you will really want to look at a recent blast alignment for your mystery probe. But since a case like that really is (ultimately) a mystery probe, I feel quite hesitant to assign multiple identities to it inside of an annotation package...

Just for the sake of clarification, it is not the case that SQLForge will only take two kinds of IDs at a time for mapping. One of the parameters (otherSrc) takes a vector of filenames, so you can pass several different mappings into that parameter at once if desired. Many major ID types are supported as a way to tell SQLForge what gene to assign, but once it has an assignment it will then go and get all the data for the database from public sources. So all your mapping files are just a hook to let SQLForge find the rest of the information. In most cases, your initial mapping will probably be complete enough to render the extra data that is passed into the otherSrc parameter redundant.

I hope this clarifies things,

Marc
Thanks Marc, that does clarify things.

I completely agree with you about the ambiguous mapping problem, and that ignoring probesets that may stick to multiple different genes is probably the way to go. However, it is exceedingly difficult to determine when this is the case. For instance, if you trawl through the latest transcript.csv files for, say, the HuGene 1.0 ST array, each transcript cluster ID is annotated as mapping to many different things. Most of the time, these are just the different annotation databases' names for the 'same gene', e.g. RefSeq, ENSEMBL, ... In rare cases, though, these heterogeneous identifiers from heterogeneous databases will refer to different genes. The problem then is identifying these cases.

I'm just making an AffyGenePDInfoPkg for the HuGene and MoGene arrays, so I'll see how I go there.

cheers,
Mark

On 17/07/2008, at 3:11 AM, Marc Carlson wrote:

> Mark Cowley wrote:
>> Hi Marc, Sean and list.
>>
>> If I can follow up on Marc's comment:
>> "The thing that has me scratching my head is why you would want to
>> map multiple genes onto a single probe in your annotation package?"
>>
>> The genomics annotation problem (what does this ProbeSet detect,
>> and which ProbeSets detect my gene of interest) is inherently many
>> to many, that is, one ProbeSet can map to many 'genes' (or at least
>> many different accessions that point to the same gene), and that 1
>> 'gene' can map to multiple ProbeSets (perhaps different isoforms).
>>
>> Does SQLforge handle these inevitable situations nicely?
>> Having read the SQLForge pdf documentation, and this post, it seems
>> that you can only provide at most 2 accessions for each ProbeSet,
>> perhaps a RefSeq accession, and if that is not known, a GenBank
>> accession.
>>
>> If this has been discussed elsewhere, can someone please point me
>> in the right direction?
>>
>> Cheers,
>>
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, BSc (Bioinformatics)(Hons)
>>
>> Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research, Sydney, Australia
>> -----------------------------------------------------
>> On 15/07/2008, at 6:57 AM, Marc Carlson wrote:
>>
> Hi Mark,
>
> In its current form, SQLForge takes as many IDs as you want to give
> it, but it currently assumes that you only intended to assign one
> kind of gene to a given probe at a time. That is, it assumes that
> when you made the probe that you really only meant to measure one
> thing. It is well understood by all of us who make annotation
> packages that in practice this may not always work out as you
> intended. But what was confusing me was why you would want to deal
> with ambiguous probes by creating an ambiguous database? It seems
> to me that it might really be better to just not make a gene
> assignment if you really don't know what your probe is measuring.
> If a probe is known to be sticking to more than one thing, then the
> interpretation of any measurement from that probe really becomes
> very speculative since you will have no way of knowing what
> proportion of the signal belongs to what. I agree with Sean that in
> the rare case like this you will really want to look at a recent
> blast alignment for your mystery probe. But since a case like that
> really is (ultimately) a mystery probe, I feel quite hesitant to
> assign multiple identities to it inside of an annotation package...
>
> Just for the sake of clarification, it is not the case that SQLForge
> will only take two kinds of IDs at a time for mapping. One of the
> parameters (otherSrc) takes a vector of filenames so you can pass
> several different mappings into that parameter at once if desired.
> Many major ID types are supported as a way to tell SQLForge what
> gene to assign, but once it has an assignment it will then go and
> get all the data for the database from public sources. So all your
> mapping files are just a hook to let SQLForge find the rest of the
> information. In most cases, your initial mapping will probably be
> complete enough to render the extra data that is passed into the
> otherSrc parameter as redundant.
>
> I hope this clarifies things,
>
> Marc
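
The many-to-many relationship Mark describes can be inspected before any package is built. The following is only a minimal sketch in base R, not part of SQLForge: it assumes a hypothetical long-format table `map` with one probe/gene pair per row, the column names `probe` and `gene` are made up for illustration, and the gene labels are placeholders rather than real annotation.

```r
# Hypothetical long-format mapping: one row per probe/gene pair.
# Probe IDs are borrowed from the thread; gene labels are placeholders.
map <- data.frame(
  probe = c("gnf1m00050_at", "gnf1m00051_a_at", "gnf1m00051_a_at",
            "gnf1m00053_a_at", "gnf1m00056_a_at"),
  gene  = c("geneA", "geneB", "geneC", "geneD", "geneD"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate pairs so repeated rows are not counted twice.
map <- unique(map)

# Build both directions of the mapping as named lists.
genes_by_probe <- split(map$gene, map$probe)
probes_by_gene <- split(map$probe, map$gene)

# Probes assigned to more than one gene (ambiguous), and
# genes detected by more than one probe (redundant).
ambiguous_probes <- names(genes_by_probe)[lengths(genes_by_probe) > 1]
redundant_genes  <- names(probes_by_gene)[lengths(probes_by_gene) > 1]

print(ambiguous_probes)
print(redundant_genes)
```

This only counts distinct labels, so accessions from different databases that point to the same gene would first have to be collapsed to a common gene identifier, which is exactly the hard part raised above.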
@cei-abreu-goodger-4433
Last seen 9.8 years ago
Mexico
The thing is, the GNF chip was annotated a few years ago, and will probably not be updated. They did annotate multiple sources though, such as Entrez Gene, UniGene, RefSeq, etc. And many of these have multiple IDs for each probe. For Affy-type chips (and others), many probes _will_ actually map to multiple genes (not only multiple transcripts), so I wanted to be sure of what SQLForge was doing exactly.

Although for most cases it should definitely be good enough as it is, I don't see why it should be limited to passing on annotation only from the first gene that is mapped. In post-processing, you can always decide what to do with probes that map to multiple genes, but the way it stands, you might simply not realize when this occurs...

Cheers,

Cei

Marc Carlson <mcarlson@fhcrc.org> wrote:

> Sean Davis wrote:
> > On Mon, Jul 14, 2008 at 12:07 PM, Cei Abreu-Goodger <cei@sanger.ac.uk> wrote:
> >
> >> Hi Sean,
> >>
> >> Ok, so my example was even worse than I thought. And I had forgot to mention
> >> that the otherSrc parameter wasn't what I needed. So, to return to my bad
> >> example, I now have two separate files, the first column in the first file,
> >> the second in the second file:
> >>
> >> > refseqs <- "gnf1m.test.tab"
> >> > refseqs2 <- "gnf1m.test2.tab"
> >> >
> >> > read.table(refseqs)
> >>                V1        V2
> >> 1   gnf1m00050_at NM_008929
> >> 2 gnf1m00051_a_at NM_007487
> >> 3 gnf1m00052_a_at NM_178939
> >> 4 gnf1m00053_a_at NM_181666
> >> 5 gnf1m00054_a_at NM_026430
> >> 6 gnf1m00055_a_at NM_029916
> >> 7 gnf1m00056_a_at NM_181666
> >>
> >> > read.table(refseqs2)
> >>                V1        V2
> >> 1   gnf1m00050_at NM_172283
> >> 2 gnf1m00051_a_at NM_172283
> >> 3 gnf1m00052_a_at NM_172283
> >> 4 gnf1m00053_a_at NM_172283
> >> 5 gnf1m00054_a_at NM_172283
> >> 6 gnf1m00055_a_at NM_172283
> >> 7 gnf1m00056_a_at NM_172283
> >>
> >> I now add the second file as an otherSrc:
> >>
> >> > makeMOUSECHIP_DB(affy=FALSE, prefix="test", fileName=refseqs,
> >> +     baseMapType="refseq", otherSrc=c(refseqs2),
> >> +     outputDir=".", version="0.9", manufacturer="GNF-Affymetrix",
> >> +     chipName="gnf1m")
> >>
> >> But this still doesn't add the second gene's annotation to all the probes
> >> (the resulting package's annotation is exactly the same as in the first
> >> case). Is there any other way?
> >>
> >
> > I think that the way SQLForge works now, it will only use the
> > additional annotation if the first ID is not successfully mapped.
> > (Someone else should probably confirm my assertion about this). Since
> > it appears that your first column contains all RefSeq IDs, you will
> > never get to the second column. So, in short, I don't know how to
> > make SQLForge do what you want.
> >
> > Sean
> >
>
> Hi Guys,
>
> Sean is correct about the purpose of the otherSrc parameter, and
> about the way that SQLforge currently works. The thing that has me
> scratching my head is why you would want to map multiple genes onto a
> single probe in your annotation package?
>
> Marc
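
Since the secondary IDs are only consulted when the primary ID fails to map, one way to notice what is being left out is to compare the two mapping tables directly before building the package. This is a rough sketch in base R under stated assumptions: the data frames below re-type a few rows of the two mock files quoted above (in practice they would come from `read.table()`), and the column names `probe` and `id` are invented for illustration.

```r
# A few rows of the two mock mapping files from the quoted example,
# typed in by hand. Column names are an assumption for this sketch.
primary <- data.frame(
  probe = c("gnf1m00050_at", "gnf1m00051_a_at", "gnf1m00053_a_at"),
  id    = c("NM_008929", "NM_007487", "NM_181666"),
  stringsAsFactors = FALSE
)
secondary <- data.frame(
  probe = c("gnf1m00050_at", "gnf1m00051_a_at", "gnf1m00053_a_at"),
  id    = c("NM_172283", "NM_172283", "NM_172283"),
  stringsAsFactors = FALSE
)

# Join on probe; the clashing "id" columns become id.primary and id.other.
both <- merge(primary, secondary, by = "probe",
              suffixes = c(".primary", ".other"))

# Flag probes that have a primary ID and a different secondary ID.
# These are the probes whose secondary ID would not be consulted
# if the primary ID maps successfully.
flagged <- both[!is.na(both$id.primary) &
                both$id.primary != both$id.other, ]

print(flagged)
```

A differing accession string does not by itself mean a different gene. In the package output from the original post, NM_181666 and NM_172283 are returned together for the same probe, which suggests they belong to one gene, so the flagged rows are candidates to review rather than confirmed multi-gene probes.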