Re-mapped Affy CDF files


James W. MacDonald 65k

@james-w-macdonald-5106


United States

Has anybody taken a close look at the re-mapped CDF files that the MBNI at the University of Michigan has produced?

http://brainarray.mhri.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp

I find the idea of revisiting the mapping of probes to genes, as well as combining the disparate probesets for a given gene into a single probeset, appealing, but I think something more than 'Hey that looks like a good idea!' is needed before one could make the case for switching from Affy's CDF to these re-mapped ones.

Unfortunately, given the fact that there isn't a one-to-one correspondence between the Affy and MBNI probesets, doing a critical comparison of the results is not straightforward. For instance, using the Affy spike-in or GeneLogic dilution datasets won't work AFAIK, since it seems that both Affy and GeneLogic spiked in probe-specific cRNA rather than spiking in full-length transcripts for particular genes.

I know of at least one instance where a client was comparing samples with a particular gene 'knocked in' to wild type, and we didn't see any difference when using the Affy CDF, but the gene was at the top of the list of significant genes when using an MBNI CDF, so we have some small indication that the MBNI CDFs might be better.

So, does anybody have any experience with these CDFs that might give an indication of their usefulness? Barring that, does anybody have a good way that one could reasonably compare the two probe mappings?

Best,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

Microarray Cancer cdf probe affy • 1.2k views

ADD COMMENT • link 18.4 years ago • updated 18.3 years ago James W. MacDonald 65k


Sean Davis 21k

@sean-davis-490


United States

On 1/10/06 4:00 PM, "James MacDonald" <jmacdon at med.umich.edu> wrote:

> [...]
> So, does anybody have any experience with these CDFs that might give an
> indication of their usefulness? Barring that, does anybody have a good way
> that one could reasonably compare the two probe mappings?

I'm not sure what their build process is, but doesn't Ensembl do some probe-based mappings?

Sean

ADD COMMENT • link 18.4 years ago Sean Davis 21k


Sean Davis wrote:

> I'm not sure what their build process is, but doesn't Ensembl do some
> probe-based mappings?

Maybe. I couldn't find anything obvious in a cursory glance at their website.

Anyway, the main question for me is not the number or type of alternative mappings that exist for Affy arrays (there are 19 different CDFs that the MBNI folks produce, including several based on Ensembl mappings). I am more concerned with being able to establish a defensible rationale for using a particular mapping.

I guess what we do right now with the Affy CDFs isn't defensible except on a historical basis, but the weight of history is pretty strong. For instance, attributing significance at an alpha of < 0.05 has no rationale AFAIK, but is pretty much written in stone due to precedent.

OTOH, most if not all microarray data are caveat emptor - it is incumbent on the end user to take the magical list of differentially expressed genes and validate them with an alternative methodology.

Given that state of affairs, is it not reasonable to choose the probe mappings that one uses with the same logic that one uses for choosing the preferred way of computing expression values?

Jim

ADD REPLY • link 18.4 years ago James W. MacDonald 65k


Hi all,

I looked at the alternative mappings a few months ago after attending a seminar given by Stanley Watson, Director of the Mental Health Research Institute at the University of Michigan. He recommended that the alternative mappings always be used because of the large discrepancies they found between Affymetrix's mapping and their own mappings of the probes. I don't know whether they have any documentation on whether their mappings yield results that are more often validated through alternative methodologies, but they do have quite a lot of documentation on what they did and why they did it - see the description of the custom CDF files and their new paper, both linked from the page Jim put in his first post.

Even if Ensembl or Affymetrix updates their annotation based on remapping, the CDFs aren't changed, so the summarization and statistical analysis are still done using probes that may not all map uniquely to the same "gene". What these alternative mappings do is remap each probe and then redefine probe sets based on all the probes that map to a "gene"; it is these re-groupings that are most important. Many of the alternative mappings are subsets of other ones, like taking only the first 11 probes from the 3' end in cases where there are more than 11 probes, so there are not quite as many alternative mappings as it first appears.

I do agree with Jim that coming up with a defensible rationale is important, as I was having trouble deciding which mapping might be the best to use. Stan Watson would argue that any of them are better than the outdated Affymetrix groupings. If Affy did theirs based on Unigene clustering, then the new mapping and grouping based on Unigene might be a defensible choice. In the end, I succumbed to historical inertia and went with Affymetrix's CDF, in part because I do analyses for many organisms, and MBNI only has alternative CDFs for human, mouse, and rat. However, I was able to get the alternative CDFs to work in Bioconductor with little trouble.

As far as validating the genes on the magical "significant list", I did get some advice at a recent conference to ALWAYS first check the current probe mappings for those significant genes, then concentrate only on those that have most or all of their probes where they should be. Does anyone do this routinely? Should we, but we don't because it is too time consuming?

Cheers,

Jenny

Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801 USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu
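The check described above (look up the current probe mappings for each significant probeset and keep only those with most or all probes where they should be) could be sketched roughly as follows. This is only an illustration in Python, not something taken from the MBNI or Bioconductor tools: the function names, the dictionary layout, the example identifiers, and the 0.8 cutoff are all assumptions made for the example.

```python
def probeset_concordance(probes, probe_to_genes, intended_gene):
    """Fraction of a probeset's probes that currently map uniquely
    to the gene the probeset is annotated to (hypothetical helper)."""
    if not probes:
        return 0.0
    hits = sum(
        1 for probe in probes
        if probe_to_genes.get(probe, set()) == {intended_gene}
    )
    return hits / len(probes)


def filter_significant(significant, probesets, probe_to_genes,
                       annotation, min_fraction=0.8):
    """Keep significant probesets whose probes mostly map where they
    should. min_fraction is an arbitrary, assumed cutoff."""
    return [
        ps for ps in significant
        if probeset_concordance(probesets[ps], probe_to_genes,
                                annotation[ps]) >= min_fraction
    ]


# Made-up example data: two probesets of four probes each.
probesets = {"ps1": ["p1", "p2", "p3", "p4"],
             "ps2": ["p5", "p6", "p7", "p8"]}
annotation = {"ps1": "GENE_A", "ps2": "GENE_B"}
probe_to_genes = {
    "p1": {"GENE_A"}, "p2": {"GENE_A"}, "p3": {"GENE_A"}, "p4": {"GENE_A"},
    "p5": {"GENE_B"},             # maps where it should
    "p6": {"GENE_B", "GENE_C"},   # matches two genes
    "p7": set(),                  # no longer matches anything
    "p8": {"GENE_C"},             # matches a different gene
}

kept = filter_significant(["ps1", "ps2"], probesets,
                          probe_to_genes, annotation)
print(kept)
```

Requiring a unique match to the intended gene is the strictest reading of "where they should be"; a looser version could also count probes that match the intended gene among several.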

ADD REPLY • link 18.4 years ago Jenny Drnevich ★ 2.2k


The idea that the mappings provided by Affymetrix are to some extent outdated has been in the air for quite some time. Discrepancies between the Affymetrix mapping and one's own mapping, done with a set of recent and curated reference sequences (such as NCBI's RefSeq), have been reported, and attempts at quantifying the differences have been made, but this is no trivial task.

Some of the differences I observed when looking at that (now a couple of years ago)[1] were merely anecdotal, such as probe sets built to match former hypothetical genes (back in the days when the chips were designed) that no longer made sense because the hypothetical gene was later believed to be artefactual, individual probes having matches all over the place, or the MM (mismatch) in a probe pair matching a reference sequence while the PM (perfect match) did not match anything.

What appeared to be happening at a large scale was that a significant number of probe sets in an alternative mapping are in fact merges of separate probe sets in the Affymetrix mapping, which leads to the uncanny world of alternative splicing and its intricate complexities. I ended up mostly discarding the probes found matching in several places (which was a trivial task to automate), and curating the events I spotted while trying to figure out where the discrepancies between the two mappings were.

To my knowledge, the closest thing to experimental validation of the relevance of new mappings has been carried out by Carter et al.[2]. Their work suggests that newer mappings are indeed better.

Tools for building one's own alternative mapping have been in Bioconductor for quite some time (I am thinking of the packages "altcdfenvs" and "matchprobes"), but building one's own mapping for the larger recent chips from Affymetrix is admittedly a computationally expensive task (several days of CPU time). Dai et al.[3] have set up the automated building of such environments, and save people interested in using alternative environments the computing effort needed to get them.

However, this is not the end of it. By offering a complete toolkit for building alternative mappings, the rationale behind the 'altcdfenvs' package was not only to make the building of alternative CDF environments for mass consumption as easy as possible, but also to allow customizations for specific contexts. One obvious example is the use of stock Affymetrix chips designed for a particular species with a sample from a slightly different species, or with a sample in which particular genomic features are known to differ from the canonical case. An example I was giving back then was the use of the E. coli chip, knowing that there are quite a few E. coli strains around labs and that E. coli's genome can be easily "engineered". Another example is when a sample is known to be possibly a mixture of different cells and possible cross-hybridization is not wanted (and therefore some probes are discarded when remapping).

Handling different mappings introduces complexity (version number tracking, etc.), and one way to do that is through packages (a la annotation packages). The trouble is that this requires careful operations when replacing one mapping with another: accidents like using a mapping for one chip type with another chip type will give completely wrong results. I had a stab at that by having classes for CDF environments (as defined in the package 'altcdfenvs'), together with a rewrite of 'affy', and putting it in the repository a little more than a year ago (it should be in the subversion repository under the name 'affyplus'). I am not certain anyone has picked that up since then... (and some will say I should give up the idea that someone will ;) ).

Just some thoughts,

Laurent

[1]: Gautier et al., BMC Bioinformatics. 2004; 5: 111.
[2]: Carter et al., BMC Bioinformatics. 2005; 6: 107.
[3]: Dai et al., Nucleic Acids Res. 2005; 33(20)
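As a rough illustration of the regrouping idea discussed in this post (discard probes that match in several places, then group the remaining probes by the gene they match), here is a minimal Python sketch. It is not the 'altcdfenvs' or 'matchprobes' API; the function name, the input layout, the example identifiers, and the `min_probes` threshold are assumptions made purely for the example.

```python
from collections import defaultdict


def build_gene_probesets(probe_matches, min_probes=1):
    """Regroup probes into one probeset per gene.

    probe_matches maps a probe id to the set of genes whose reference
    sequence it matches. Probes matching zero genes or several genes
    are discarded. min_probes is an assumed, optional size cutoff.
    """
    by_gene = defaultdict(list)
    for probe_id, genes in probe_matches.items():
        if len(genes) != 1:
            continue  # unmatched or multi-matching probe: drop it
        (gene,) = genes
        by_gene[gene].append(probe_id)
    return {
        gene: sorted(probes)
        for gene, probes in by_gene.items()
        if len(probes) >= min_probes
    }


# Made-up matches; p1/p2 and p6 could come from two separate
# original probe sets that end up merged under the same gene.
probe_matches = {
    "p1": {"G1"},
    "p2": {"G1"},
    "p3": {"G1", "G2"},  # matches in several places, discarded
    "p4": {"G2"},
    "p5": set(),         # matches nothing, discarded
    "p6": {"G1"},
}

print(build_gene_probesets(probe_matches, min_probes=2))
```

A real build would also have to record which chip type the mapping belongs to, which is the kind of bookkeeping the post warns about.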

ADD REPLY • link 18.3 years ago lgautier@altern.org ▴ 950


James W. MacDonald 65k

@james-w-macdonald-5106


United States

Hi Karl,

Interesting results. I see about the same thing when I analyze a set of data using both mappings; the probesets with big differences are usually consistent, whereas the probesets with smaller differences may vary. The only time this would likely make much difference is when you have an experiment where there really are very limited differences between the groups (often the case with brain research, which I think is one of the reasons the MBNI folks started doing these things).

Anyway, it looks like we are going to be making these cdfenvs and probe packages available on BioC. Hopefully this will increase interest/experimentation.

Best,

Jim

Dykema, Karl wrote:

> Jim,
>
> I've spent some time investigating the re-mapped CDF files you asked
> about last week on the BioC mailing list.
>
> We used a quick-and-dirty (but surprisingly effective) categorical
> approach to address this problem.
>
> Basically, if you compare a sample that contains an extra chromosome
> (i.e. 3 copies of chromosome 7) to a sample that contains a normal
> chromosome number (i.e. two copies), many of the genes that map to that
> chromosome will show relatively increased expression.
>
> So the process is:
>
> 1) create gene lists that map genes to chromosomes
> 2) preprocess gene expression data
> 3) simply subtract the gene expression profile of the normal tissue from
> the gene expression profile of a sample suspected of harboring a
> chromosomal abnormality
> 4) see if there is an enrichment of positive gene expression values in
> any of the chromosome-derived gene lists. This would indicate a
> chromosome gain has occurred for that chromosome. Likewise, an
> enrichment of negative gene expression values would indicate a
> chromosome loss has occurred.
> 5) see (http://genomebiology.com/2002/3/12/RESEARCH/0075) for more
> details of this quick method.
>
> The advantage of this categorical approach is that certain tumors
> contain very reproducible sets of chromosomal abnormalities and can
> serve as positive controls.
>
> In this case, the tumor samples are all papillary renal cell (kidney)
> carcinomas, and this type of cancer has previously been shown to
> produce gains of chromosomes 7, 16, and 17.
>
> The data was preprocessed using RMA with both the U of M version 6
> Entrez CDF and the standard CDF included in BioC. Each tumor sample was
> compared to a pooled normal kidney reference and enrichment scored using
> a simple t.test of genes that map to each chromosome. Plotted is a
> heatmap of the resulting t.score (red high, blue low).
>
> It is easy to see the enrichment of positive values of genes that map
> to chromosomes 7, 16, 17 using either preprocessing method.
>
> While this is anecdotal evidence and does not prove that the custom CDF
> files are any 'better' or 'worse' than the old Affy mappings, it
> suggests to me that they are "reasonable" (as opposed to "unreasonable").
>
> I'd be interested to get your comments. Thanks. This was not cc'ed to
> the BioC mailing list. If you like you can forward it on.
>
> -------------------------------
> Karl Dykema
> Bioinformatics Programmer/Analyst
> Laboratory of Computational Biology
> Van Andel Research Institute
> 333 Bostwick Ave. NE
> Grand Rapids, MI 49503
> (616) 234-5554
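Steps 1 to 4 of the procedure quoted above can be sketched in a few lines of Python. This is a simplified illustration rather than Karl's actual code: it assumes the expression values are already preprocessed and on a log scale, it assumes a one-sample t statistic against zero for the "simple t.test" (the exact form of the test is not given in the message), and all names and numbers below are made up.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def one_sample_t(values):
    """t statistic for the mean of values against zero.
    Returns NaN when it cannot be computed."""
    n = len(values)
    if n < 2:
        return float("nan")
    sd = stdev(values)
    if sd == 0:
        return float("nan")
    return mean(values) / (sd / math.sqrt(n))


def chromosome_scores(tumor, normal, gene_to_chrom):
    """Subtract the normal profile from the tumor profile (step 3) and
    score each chromosome's gene list for enrichment of positive or
    negative differences (step 4)."""
    diffs = defaultdict(list)
    for gene, chrom in gene_to_chrom.items():
        if gene in tumor and gene in normal:
            diffs[chrom].append(tumor[gene] - normal[gene])
    return {chrom: one_sample_t(vals) for chrom, vals in diffs.items()}


# Step 1: hypothetical gene-to-chromosome lists.
gene_to_chrom = {"a": "7", "b": "7", "c": "7",
                 "d": "1", "e": "1", "f": "1"}
# Step 2 is assumed done: made-up log-scale expression values.
normal = {"a": 8.0, "b": 7.0, "c": 9.0, "d": 6.0, "e": 6.0, "f": 6.0}
tumor = {"a": 8.5, "b": 7.6, "c": 9.4, "d": 6.1, "e": 5.9, "f": 6.0}

scores = chromosome_scores(tumor, normal, gene_to_chrom)
for chrom, t in sorted(scores.items()):
    print(chrom, round(t, 2))
```

A strongly positive score suggests a gain of that chromosome and a strongly negative one a loss; running the same scoring on data summarized with each CDF gives the kind of side-by-side comparison described in the message.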

ADD COMMENT • link 18.3 years ago James W. MacDonald 65k
