Question: makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array
11.2 years ago by
Heidi Dvinge2.0k
Heidi Dvinge2.0k wrote:
Dear all,

I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used the makecdfenv package to build a cdf environment based on the file MoGene-1_0-st-v1.r3.cdf from the Affymetrix website.

That worked without any problems, but out of curiosity I tried taking a closer look at the format of the array, to see how many probes were in each probe set etc.

I'm aware that some probes map to multiple probe sets and are removed when the cdfenv is produced, which seems to be the case for about 8% of the probes. My question is exactly how this happens? I would expect the multiple-mapping probes to be removed from all probe sets, but this doesn't seem to be the case.

Example with the two overlapping probe sets 10344719 and 10353008, where "raw" is my AffyBatch, and "cdf" is the raw cdf-file turned into only tab-delimited info and read into R, and "INDEX" being a unique probe identifier (the same as index-1 in the cdf env):

> cdf[cdf$QUAL=="10344719","INDEX"]
 [1]    7543  661828  575792  962890  963940  140756  337977  510591  860722  968182  387524  386474
[13]  385518  384468 1076441 1075391  850724   51881  957657  100610  862535  506651  505601   82272
[25]   83322  692860  691810  494417  932343  689216  836826  894914  715393  421443   92496  485600
[37]  253868  352083  594288 1049892  370822  369772  416675  928371  505790  506840  135781
> cdf[cdf$QUAL=="10353008","INDEX"]
 [1]  506840  505790  928371  416675  369772  370822 1049892  485600   92496  421443  715393  894914
[13] 1073586  110809  836826  689216  932343  494417  691810   83322   82272  505601  506651  862535
[25]  100610  957657   51881  850724 1075391 1076441  384468  385518  386474  387524  968182  860722
[37]  510591  337977  140756  963940  962890  575792  661828    7543
> indexProbes(raw, genenames="10344719")
$10344719
[1] 692861 253869 352084 594289 135782
> indexProbes(raw, genenames="10353008")
$10353008
 [1]  506841  505791  928372  416676  369773  370823 1049893  485601   92497  421444  715394  894915
[13] 1073587  110810  836827  689217  932344  494418  691811   83323   82273  505602  506652  862536
[25]  100611  957658   51882  850725 1075392 1076442  384469  385519  386475  387525  968183  860723
[37]  510592  337978  140757  963941  962891  575793  661829    7544

So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of which are overlapping. In the cdf environment 10344719 appears to have the 42 overlapping probes removed, but they're still present in 10353008.

A similar situation is seen for e.g. the overlapping probe sets 10461391 and 10487930 with 41 probes each, 40 of which are identical:

> cdf[cdf$QUAL=="10461391","INDEX"]
 [1]  483268 1022846  409057  703153  328783  372162  882399  569942  765746  868615  948367  413614
[13]  830931  434763  970910  600221  599171  135798    6746  455659  799186  912319  469313  145393
[25]  872191  126758  801051  774196  773146  965810  272742   19445  585800  999188 1012776  823868
[37]  156514  210874  645037  799505 1075142
> cdf[cdf$QUAL=="10487930","INDEX"]
 [1] 1075142  799505  645037  210874  156514  823868 1012776  999188  585800   19445  272742  965810
[13]  773146  774196  801051  126758  872191  145393  469313  912319  799186  839098    6746  135798
[25]  599171  600221  970910  434763  830931  413614  948367  868615  765746  569942  882399  372162
[37]  328783  703153  409057 1022846  483268
> indexProbes(raw, genenames="10461391")
$10461391
[1] 455660
> indexProbes(raw, genenames="10487930")
$10487930
 [1] 1075143  799506  645038  210875  156515  823869 1012777  999189  585801   19446  272743  965811
[13]  773147  774197  801052  126759  872192  145394  469314  912320  799187  839099    6747  135799
[25]  599172  600222  970911  434764  830932  413615  948368  868616  765747  569943  882400  372163
[37]  328784  703154  409058 1022847  483269

Any comments on this or on exactly how the cdf environment is created would be much appreciated.

Thanks
\Heidi

> sessionInfo()
R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
i386-apple-darwin8.10.1

locale:
en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] tools stats graphics grDevices utils datasets methods base

other attached packages:
[1] makecdfenv_1.17.0 affy_1.17.3 preprocessCore_1.1.5 affyio_1.7.17
[5] Biobase_1.99.4

Heidi Dvinge
EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD
Mail: heidi@ebi.ac.uk
Phone: +44 (0) 1223 494 444
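The overlap counts in the post can be checked from the two INDEX vectors alone. This is a minimal base-R sketch: the variable names (ps_a, ps_b, shared, only_a, only_b) are illustrative and not part of any package, and the numbers are copied from the output shown in the post for probe sets 10344719 and 10353008.

```r
# INDEX values of probe set 10344719, copied from the post (47 probes)
ps_a <- c(7543, 661828, 575792, 962890, 963940, 140756, 337977, 510591,
          860722, 968182, 387524, 386474, 385518, 384468, 1076441, 1075391,
          850724, 51881, 957657, 100610, 862535, 506651, 505601, 82272,
          83322, 692860, 691810, 494417, 932343, 689216, 836826, 894914,
          715393, 421443, 92496, 485600, 253868, 352083, 594288, 1049892,
          370822, 369772, 416675, 928371, 505790, 506840, 135781)

# INDEX values of probe set 10353008, copied from the post (44 probes)
ps_b <- c(506840, 505790, 928371, 416675, 369772, 370822, 1049892, 485600,
          92496, 421443, 715393, 894914, 1073586, 110809, 836826, 689216,
          932343, 494417, 691810, 83322, 82272, 505601, 506651, 862535,
          100610, 957657, 51881, 850724, 1075391, 1076441, 384468, 385518,
          386474, 387524, 968182, 860722, 510591, 337977, 140756, 963940,
          962890, 575792, 661828, 7543)

shared <- intersect(ps_a, ps_b)  # probes listed in both probe sets
only_a <- setdiff(ps_a, ps_b)    # probes listed only in 10344719
only_b <- setdiff(ps_b, ps_a)    # probes listed only in 10353008

c(a = length(ps_a), b = length(ps_b), shared = length(shared))

# The post states that INDEX is the cdf-environment index minus 1,
# so adding 1 should give the indices reported by indexProbes().
only_a + 1
```

The five values of only_a + 1 are the same five indices that indexProbes() returns for 10344719 in the post, which supports the reading that this probe set kept only the probes it does not share with 10353008.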
cdf probe makecdfenv • 500 views
modified 11.2 years ago by Sean Davis21k • written 11.2 years ago by Heidi Dvinge2.0k
Answer: makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array
11.2 years ago by
Sean Davis21k
United States
Sean Davis21k wrote:
On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
> I'm aware that some probes map to multiple probe sets and are removed
> when the cdfenv is produced, which seems to be the case for about 8%
> of the probes. My question is exactly how this happens? I would
> expect the multiple-mapping probes to be removed from all probe sets,
> but this doesn't seem to be the case.

I believe that the probes are kept in the first or last probeset (not sure which) seen. Someone with a little more affy experience can comment more fully.

Sean
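To illustrate how a "kept in one probeset only" outcome can arise without any explicit removal step, here is a toy base-R sketch. It is not the makecdfenv implementation; it only assumes a structure that has room for a single owner per probe, filled in probe set by probe set. The probe sets A and B and their indices are made up.

```r
# Toy layout: probe sets A and B share probes 2 and 3 (made-up indices)
probesets <- list(A = c(1, 2, 3), B = c(2, 3, 4))

# One owner slot per probe; a later assignment overwrites an earlier one
owner <- character(0)
for (ps in names(probesets)) {
  owner[as.character(probesets[[ps]])] <- ps
}

# Rebuild the probe sets from the single-owner mapping
rebuilt <- split(as.numeric(names(owner)), unname(owner))
rebuilt
```

In this sketch the shared probes end up only in B, the last probe set processed, and A keeps just its unshared probe. With a "first assignment wins" rule the roles would be reversed, so the sketch does not settle which of the two the package uses.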
On 11 May 2008, at 14:28, Sean Davis wrote:
> I believe that the probes are kept in the first or last probeset (not
> sure which) seen. Someone with a little more affy experience can
> comment more fully.

I figured it was probably something along those lines, but what's the reason for not just removing them completely, instead of keeping them in a 'random' probe set? Most probes that map multiple times map to > 2 probe sets. And in some cases it's large chunks of probe sets that 'overlap', whereas in other cases it's just a few or a single probe that 'jumps around'.

\Heidi
On Sun, May 11, 2008 at 10:38 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
> I figured it was probably something along those lines, but what's the reason
> for not just removing them completely, instead of keeping them in a 'random'
> probe set? Most probes that map multiple times map to > 2 probe sets. And in
> some cases it's large chunks of probe sets that 'overlap', whereas in other
> cases it's just a few or a single probe that 'jumps around'.

I think this probe "removal" is a side effect of the way the original affy package and Affy chips were designed. Before these newer arrays, there were no probes that mapped to multiple probe sets, so there was never a mechanism for "removing" probes, or even for maintaining multiple mappings. So the current behavior exists because there is no way to maintain the many-to-many mapping, if I understand it correctly, and it is not optimal in any particular way. Again, someone with more affy experience might have more to say.

Sean
> I think this probe "removal" is a side effect of the way the original
> affy package and affy chips were designed. Before these newer arrays,
> there were no probes that mapped to multiple probe sets, so there was
> never a mechanism for "removing" probes or even maintain multiple
> mappings. So, the current behavior is due to the fact that there is
> not a way to maintain the many-to-many mapping, if I understand it
> correctly and is not really in any particular way optimal. Again,
> someone with more affy experience might have more to say.
>
> Sean

The original use case was to be able to retrieve the probes in a given probe set, without further consideration. The need for possible alternative mappings was nevertheless considered, and it was made possible to replace the mapping used to process data at any given time (there is a vignette talking about that).

Regarding many-to-many association between probes and probesets, this is indeed an annoying case (the original design somehow assumed a perfect world). It is not at all impossible to have a "many-to-many" association, but it certainly makes the analysis of the data difficult. To keep things simple, the recommendation would be "each probe goes into one probe set"... and get rid of the rest.

The package "altcdfenvs" also proposes extensions to the CDF environments, with methods and functions to work with them.
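For comparison, the behaviour the original poster expected, dropping every multi-mapping probe from all probe sets, can be sketched in base R as follows. The probe sets and indices are made up, the names (multi, cleaned) are illustrative, and nothing here uses the affy, makecdfenv or altcdfenvs APIs.

```r
# Toy layout with made-up indices: probes 2, 3 and 4 each sit in two probe sets
probesets <- list(A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5))

# Stack all probe indices and find those that occur more than once
all_probes <- unlist(probesets, use.names = FALSE)
multi <- unique(all_probes[duplicated(all_probes)])

# Fraction of distinct probes that map to more than one probe set
frac_multi <- length(multi) / length(unique(all_probes))

# Drop multi-mapping probes from every probe set
cleaned <- lapply(probesets, function(p) p[!p %in% multi])
cleaned
```

Note that B is left with no probes at all in this toy case. The example in the original post points the same way: 10461391 and 10487930 have 41 probes each with 40 shared, so each would retain a single probe under this rule.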
On May 12, 2008, at 10:03 AM, lgautier at altern.org wrote: >> On Sun, May 11, 2008 at 10:38 AM, Heidi Dvinge <heidi at="" ebi.ac.uk=""> >> wrote: >>> On 11 May 2008, at 14:28, Sean Davis wrote: >>>> On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at="" ebi.ac.uk=""> >>>> wrote: >>>>> Dear all, >>>>> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have >>>>> used > the makecdfenv package to build a cdf environment based on the file > MoGene-1_0-st-v1.r3.cdf from the affymetrix webpage. >>>>> That worked without any problems, but out of curiosity I tried >>>>> taking > a closer look at the format of the array, to see how many probes > were in > each probe set etc. >>>>> I'm aware that some probes map to multiple probe sets and are >>>>> removed > when the cdfenv is produced, which seems to be the case for about 8% > of > the probes. My question is exactly how this happens? I would > expect the multiple-mapping probes to be removed from all probe > sets, but > this doesn't seem to be the case. >>>> I believe that the probes are kept in the first or last probeset >>>> (not > sure which) seen. Someone with a little more affy experience can > comment more fully. >>> I figured it was probably something along those lines, but what's >>> the > reason >>> for not just removing them completely, instead of keeping them in a > 'random' >>> probe set? Most probes that map multiple times map to > 2 probe >>> sets. > And in >>> some cases it's large chunks of probe sets that 'overlap', whereas >>> in > other >>> cases it's just a few or a single probe that 'jumps around'. >> >> I think this probe "removal" is a side effect of the way the original > affy package and affy chips were designed. Before these newer arrays, > there were no probes that mapped to multiple probe sets, so there was > never a mechanism for "removing" probes or even maintain multiple > mappings. 
So, the current behavior is due to the fact that there is > not a > way to maintain the many-to-many mapping, if I understand it > correctly and > is not really in any particular way optimal. Again, someone with more > affy experience might have more to say. > > The original use case was to be able to retrieve the probes in a given > probe set, without further consideration. The need for possible > alternative mappings was nevertheless considered, and it was made > possible > to replace the mapping used to process data at any given time (there > is a > vignette talking about that). > > Regarding many-to-many association between probes and probesets, > this is > indeed an annoying case (as in the original design, it was somehow > assumed > that this is a perfect world). It is not at all impossible to have > "many-to-many" association, but it is certainly making it for a > difficult > analysis of the data. To keep things simple, the recommendation > would be > "each probe goes into one probe set"... and get rid of the rest. One problem with having a probe in multiple probesets is that certain functions assume it does not happen. For example the pm method simply takes all the pm indices for the various probesets and stacks them. If you have a probe in multiple probesets, this means it is included multiple times in the resulting output from pm. And in many cases, functions using pm assumes that this is not the case. Kasper > The package "altcdfenvs" is also proposing extensions to the CDF > environments, with methods and functions to work with them. 
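The pattern in the original question (one probe set keeps only its unshared probes while the other keeps everything) is what one would see if each probe index could record only a single owning probe set. The base R sketch below shows that idea under stated assumptions: `raw_sets`, `owner` and `kept` are hypothetical names with made-up values, and the "later probe set overwrites earlier" rule is an assumption in line with Sean's guess, not confirmed makecdfenv behaviour.

```r
# Hypothetical raw CDF content: probe set name -> probe indices, in file order.
raw_sets <- list(
  ps1 = c(1L, 2L, 3L, 4L, 5L),
  ps2 = c(5L, 4L, 3L, 9L)
)

# Probes listed in both probe sets, as counted by hand in the question.
overlap <- intersect(raw_sets$ps1, raw_sets$ps2)

# Assumed mechanism: one owner slot per probe index, filled set by set,
# so a later probe set overwrites the owner recorded by an earlier one.
owner <- character(0)
for (ps in names(raw_sets)) {
  owner[as.character(raw_sets[[ps]])] <- ps
}

# Rebuild the probe sets from the resulting one-to-one owner map.
kept <- split(as.integer(names(owner)), unname(owner))
```

With these toy values `ps1` is left with only its two unshared probes and `ps2` keeps all four of its probes, mirroring the shape of the `indexProbes` output in the question. Whether the real ordering is first-seen or last-seen would need to be checked against the makecdfenv source.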