Use of RMA in increasingly-sized datasets


David Kipling ▴ 110

@david-kipling-1252

Last seen 9.6 years ago

Hi

This is not a "how do I process 1000 chips with RMA" but rather something slightly different.

We're starting to get projects coming thru our Affy core that involve 1000+ chips. Obviously we can use MAS5 to process the .cel files, and irrespective of what happens with subsequent chips in the project the expression values from those chips will stay the same because of the single-chip nature of the algorithm.

It would be nice to run, in parallel, RMA-style processing of the data. The issue this raises for me relates to the desire of the scientists to look at their data before the end of the project (e.g. you'd want to explore the first 200 cancer samples rather than wait for all 1000 to be done), which is understandable. My concern is that the multi-chip nature of RMA means that, for any specific .cel file, the expression values will depend on the other chips included in the run, and so the expression values from that .cel file will be different in the early stages (200 chips) and at the end (1000 chips). Such a 'moving target' dataset may be confusing and would certainly cause an audit headache.

Has anyone explored this issue and proposed a solution? It's entirely possible that I am being totally paranoid and that after 100+ chips in a dataset the expression values plateau out and are stable in the face of additional .cel files being included; I don't yet have access to big-enough datasets to critically address that. I do have some recollection in the deep mists of time a comment (?from Ben Bolstad?) suggesting the use of a standard 'training set' of (say) 50 chips, to which you would add your new chips one at a time and process.

All comments, thoughts, or experiences gratefully received!

Regards

David

Prof David Kipling
Department of Pathology
School of Medicine
Cardiff University
Heath Park
Cardiff CF14 4XN

Tel: 029 2074 4847
Email: KiplingD@cardiff.ac.uk
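The 'moving target' concern in the question can be seen in a toy example. The sketch below is not RMA itself: it implements only a plain quantile normalization (the multi-chip normalization step), in pure Python, on made-up intensities. The function name and the chip values are assumptions for illustration.

```python
# Toy illustration: quantile normalization makes each chip's values
# depend on which other chips are included in the run.

def quantile_normalize(chips):
    """chips: list of equal-length lists of intensities, one list per chip."""
    n_probes = len(chips[0])
    sorted_chips = [sorted(c) for c in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(s[k] for s in sorted_chips) / len(chips)
                 for k in range(n_probes)]
    normalized = []
    for chip in chips:
        # Rank probes within the chip (ties broken by position; fine for a toy).
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities for three chips with four probes each.
chip_a = [2.0, 4.0, 6.0, 8.0]
chip_b = [1.0, 3.0, 5.0, 9.0]
chip_c = [10.0, 12.0, 14.0, 16.0]  # a later, brighter chip

early = quantile_normalize([chip_a, chip_b])[0]          # chip_a, run of 2
late = quantile_normalize([chip_a, chip_b, chip_c])[0]   # chip_a, run of 3

print(early)
print(late)
```

The same .cel file (`chip_a`) gets different normalized values in the two runs, because the reference distribution is recomputed from whichever chips are present. How large the shift is on real data with hundreds of chips is exactly the open question posed above; the toy only shows the mechanism.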

Cancer affy PROcess • 982 views

updated 18.9 years ago by Darlene Goldstein ▴ 230 • written 18.9 years ago by David Kipling ▴ 110


Ben Bolstad ★ 1.1k

@ben-bolstad-93

Last seen 9.6 years ago

To answer the "how do I process 1000 chips with RMA" question first:

While I don't usually promote it on the BioC lists, the latest version of RMAExpress can process virtually unlimited numbers of cel files (I have personally processed datasets of around 800 chips with no trouble while testing), provided you can allocate it sufficient temporary disk space. See: http://www.stat.berkeley.edu/~bolstad/RMAExpress/RMAExpress.html

On the second question, it is a matter of there not being an implementation that does what you want, rather than of it being impossible. The most important things for such an implementation are:

1. A consistent normalization step
2. Probe-effect estimates based on a reasonable number of arrays

The rmaPLM function in affyPLM will return the probe-effect estimates for RMA, and PLMset objects have a slot for a normalization vector (unfortunately not filled in by anything right now).

This issue has been discussed on this mailing list before, in this thread: http://files.protsuggest.org/biocond/html/1816.html but there are probably others as well.

Ben

--
Ben Bolstad <bolstad@stat.berkeley.edu>
http://www.stat.berkeley.edu/~bolstad
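As a rough illustration of the two requirements listed in this reply (a consistent normalization step, and probe effects estimated from a reasonable number of arrays), the sketch below fits both on a training set once, freezes them, and then processes a later chip on its own. It is pure Python on made-up log2 intensities, not the affyPLM API: all function names are hypothetical, and the probe-effect estimate is a crude stand-in for RMA's median polish.

```python
from statistics import median

def reference_quantiles(training_chips):
    """Frozen normalization vector: mean of the k-th smallest value across chips."""
    sorted_chips = [sorted(chip) for chip in training_chips]
    n_chips = len(sorted_chips)
    n_probes = len(sorted_chips[0])
    return [sum(s[k] for s in sorted_chips) / n_chips for k in range(n_probes)]

def normalize_to_reference(chip, reference):
    """Map a single chip onto the frozen reference distribution by rank."""
    order = sorted(range(len(chip)), key=lambda i: chip[i])
    out = [0.0] * len(chip)
    for rank, probe in enumerate(order):
        out[probe] = reference[rank]
    return out

def estimate_probe_effects(normalized_training, probe_indices):
    """Stand-in for median polish (an assumption, not RMA's fit): each probe's
    median deviation from its chip's probeset median, across training chips."""
    effects = {}
    for i in probe_indices:
        deviations = [chip[i] - median([chip[j] for j in probe_indices])
                      for chip in normalized_training]
        effects[i] = median(deviations)
    return effects

def summarize_probeset(normalized_chip, probe_indices, probe_effects):
    """Expression estimate: median of (value - frozen probe effect)."""
    return median([normalized_chip[i] - probe_effects[i] for i in probe_indices])

# Hypothetical log2 intensities: 3 training chips, one probeset of 4 probes.
training = [[7.0, 9.0, 8.0, 5.0], [7.2, 9.4, 8.1, 5.3], [6.8, 9.1, 7.9, 4.9]]
probeset = [0, 1, 2, 3]

# Fit once on the training set, then freeze.
reference = reference_quantiles(training)
normalized_training = [normalize_to_reference(c, reference) for c in training]
probe_effects = estimate_probe_effects(normalized_training, probeset)

# A chip arriving later is processed alone against the frozen values.
new_chip = [7.5, 9.9, 8.4, 5.6]
new_normalized = normalize_to_reference(new_chip, reference)
new_expression = summarize_probeset(new_normalized, probeset, probe_effects)
print(new_expression)
```

Because `reference` and `probe_effects` never change after the training fit, the value computed for `new_chip` does not depend on any other chips processed before or after it. The cost is that everything now hinges on how well the training set represents later chips.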

18.9 years ago Ben Bolstad ★ 1.1k


Darlene Goldstein ▴ 230

@darlene-goldstein-1004

Last seen 9.6 years ago

Hi David, BioC list,

Apologies in advance for the length of this email.

I have a few things to add to the advice already given; some might also be relevant to the thread that Ben Bolstad mentioned in his reply: http://files.protsuggest.org/biocond/html/1816.html

You asked if anyone has looked at this problem. I have studied 'subset-based' RMA strategies, including: the extrapolation approach (take e.g. 50 chips and extrapolate that model to get RMA values for the rest of the chips); partitioning the entire set of chips into subsets of manageable size (however many you can do in a run, like 50); and doing this partitioning multiple times and averaging to get RMA values. The 'partitioning' approaches depend on having the entire set available. To get an idea of how much RMA values can vary, as well as how inferences might vary, please see http://mbi.osu.edu/2004/ws1materials/goldstein.pdf

I have a working manuscript on this and will be happy to send a preprint when it's submitted.

You also ask if anyone has a solution. Unfortunately, I have to say no here (at least for myself), but I also think that there will not be a general solution. Rather, the way the issue is approached will depend on the specifics of the study.

There are many ways to get 1000 chips. For instance, a lab may process a bunch of stored samples over a relatively short period of time; alternatively, the same lab may process samples coming in over a longer period of time, as in a prospective trial where patients are recruited into the study over time. Another common possibility is that multiple centers are collaborating on a larger trial, with each center doing some processing of chips.

There may be different types of problems and artifacts in each of these scenarios. For example, the first 50 chips in a study occurring over a period of time may be qualitatively different from subsequent sets of chips if there is a time trend for some reason. In the multi-center case, between-lab variability is likely to be an important artifact.

Ben made the point that what you need are:

1. A consistent normalization step
2. Probe-effect estimates based on a reasonable number of arrays

I could not agree more with 1; however, in my opinion there is a problem in how to get that. Some people seem to think that quantile normalization of all chips together will safely remove all artifactual differences between chips. This is emphatically _not_ true (and many people are recognizing this). In an experiment replicated by the same lab a few months apart (using different animals each time but following the same protocols in all experimental aspects), the experimental 'batch' effect persists even if you RMA all chips together. This is really easy to see if you just cluster samples based on RMA values - the major split is between the two replications. So, if you're hoping to get rid of this kind of effect merely by RMAing all chips together, I think you are likely to be disappointed. I have a preprint of this study if you want more details.

As for 2, I think that the number of arrays is only one component. The arrays should also be somehow 'representative'. In practice, this might be difficult to achieve.

As you say, if the target is moving then it won't be easy to hit (and it will cause confusion as well). It is not only reasonable but, I would say, necessary that the scientists examine early/preliminary results. What I would do in this case is RMA the 'preliminary' set together if possible and base early analyses on that. As more chips come in, most likely I would re-RMA after 'enough' had come in. However, you still need to carry out careful exploratory analyses to ensure that you are really removing the artifacts that you think you are. What you should look for depends on the specifics of your study. Persistent artifacts will need to be removed by other means (by regression, for example).
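To make the remark about removal "by regression" concrete, here is a minimal sketch for a single gene, assuming the batch label of every chip is known. Regressing a gene's values on a batch indicator and keeping the residuals (shifted back to the overall mean) amounts to per-batch mean centring, which is what the hypothetical function below does. The expression values and batch names are made up.

```python
from statistics import mean

def remove_batch_means(values, batches):
    """Residuals from regressing one gene on a batch indicator,
    shifted back to the overall mean (per-batch mean centring)."""
    overall = mean(values)
    batch_means = {
        b: mean([v for v, label in zip(values, batches) if label == b])
        for b in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]

# Hypothetical log2 expression of one gene in two replications of a study.
gene = [8.0, 8.2, 7.8, 9.0, 9.2, 8.8]
batches = ["rep1", "rep1", "rep1", "rep2", "rep2", "rep2"]

adjusted = remove_batch_means(gene, batches)
print(adjusted)
```

After adjustment both replications have the same mean for this gene, while within-batch differences are unchanged. Note the limitation: if batch is confounded with the biology of interest (say, all tumours in one batch and all controls in the other), this adjustment removes the biological signal along with the artifact, so it is no substitute for the exploratory checks described above.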
In the event that you are unable to RMA all your chips together, I would recommend multiple partitioning to get 'final' RMA values for all chips. This is in contrast to extrapolating from a single subset. Yes, the RMA values will change, which may be confusing and an audit nightmare, but you will give yourself some protection against 'locking in' an artifact by averaging over different sets (which are likely to have different artifacts). I see this as a major benefit.

Don't hesitate to write back, on or off list, if any of this seems unclear.

Best regards,

Darlene

--
Darlene Goldstein
École Polytechnique Fédérale de Lausanne (EPFL)
Institut de Mathématiques
Batiment MA, Station 8
CH-1015 Lausanne
SWITZERLAND
Tel: +41 21 693 2552
Fax: +41 21 693 4303
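The multiple-partitioning recommendation in this reply can be sketched as follows. This is only a skeleton under stated assumptions: `process_subset` stands for whatever multi-chip procedure is run on each subset (a real RMA run in practice), and `toy_multichip` is a hypothetical stand-in that simply centres each chip on its subset's median so that the example runs. Chip names and values are made up.

```python
import random
from statistics import mean, median

def partition_average(chips, process_subset, subset_size, n_repeats, seed=0):
    """Randomly partition the chips into subsets, process each subset
    separately, repeat, and average each chip's values over the repeats.

    chips: dict chip_id -> raw value
    process_subset: callable taking a dict of chips, returning
                    dict chip_id -> expression value
    """
    rng = random.Random(seed)
    ids = sorted(chips)
    results = {cid: [] for cid in ids}
    for _ in range(n_repeats):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        # The last block is smaller when subset_size does not divide evenly.
        for start in range(0, len(shuffled), subset_size):
            block = shuffled[start:start + subset_size]
            processed = process_subset({cid: chips[cid] for cid in block})
            for cid, value in processed.items():
                results[cid].append(value)
    return {cid: mean(vals) for cid, vals in results.items()}

def toy_multichip(subset):
    """Hypothetical stand-in for an RMA run: centre on the subset median."""
    centre = median(subset.values())
    return {cid: v - centre for cid, v in subset.items()}

# Twelve made-up chips, processed four at a time, averaged over 20 partitions.
chips = {f"chip{i:02d}": float(i) for i in range(12)}
final = partition_average(chips, toy_multichip, subset_size=4, n_repeats=20)

# Reference case: one partition containing every chip.
all_together = partition_average(chips, toy_multichip, subset_size=12, n_repeats=3)
print(final["chip00"], all_together["chip00"])
```

Each chip's final value is an average over the different company it kept in each partition, which is the protection against 'locking in' one subset's artifact. As noted in the reply, this needs the entire set of chips to be available, and the values will change whenever it is rerun with more chips or a different seed.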

18.9 years ago Darlene Goldstein ▴ 230
