Question

Strategies for fRMA on custom dataset(s)?

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Scripps Research, La Jolla, CA

I'm going to be using frmaTools to generate experiment-specific fRMA vectors from a set of training data, and these vectors will be used to normalize that training data as well as individual samples that will subsequently become available over time (all arrays are/will be run by the same group). I have a couple of questions about the specifics of implementing this approach.

First, I actually have two datasets. The first one is all blood samples, while the second is kidney biopsy samples. So each dataset is a single tissue, but with several different conditions (the conditions are healthy transplant and several different types of transplant dysfunction/rejection). The datasets will be analyzed independently, but I wonder if it makes sense to pool them for the purpose of generating the fRMA vectors. Pooling would obviously result in one set of vectors based on a larger dataset, and therefore hopefully more robust. But with only a few tissues/conditions, would pooling them in this way cause problems for the analysis?

Second, what is the proper way to define a batch? Is there one correct way? Specifically, since my samples are all the same tissue, should I include the sample condition in my definition of batches in addition to technical variables like run date? What are the implications of this choice of batch definition, conceptually?

Finally, fRMA has single-array and multi-array normalization modes. Am I correct in assuming that I should use the single-array mode for all arrays, since the test arrays will need to be normalized in single-array mode as they come in?

frma frmatools • 1.5k views

updated 9.5 years ago by Matthew McCall ▴ 830 • written 9.5 years ago by Ryan C. Thompson ★ 7.9k

score 2 · Accepted Answer · 2015-02-06

Regarding combining the blood and kidney samples, it comes down to what you hope to pick up by looking across both tissue types. For example, let's say the blood training samples were all run in May-August and the kidney training samples were all run in December-March. But in the future you are planning on running both kidney and blood samples throughout the year. Then by including both tissues in your training data, you could pick up on differences in probe behavior between the different time periods. The downside could be that there are probes that behave differently between blood and kidney, but behave consistently within each tissue. If you are only analyzing samples within a tissue, then you may not really care if these probes behave differently across tissues.

I don't think there is any one "proper" way to define a batch. It could be any combination of tissue, run date, lab, lab tech, etc. Basically, you want to try to capture variables that could affect probe behavior (remember we're looking at the residuals, so you can include things that are biologically interesting as well as things that are not). We discuss this briefly in: http://www.biomedcentral.com/1471-2105/12/369 but I imagine you are aware of the much better discussion of batch effects in Jeff Leek's paper: http://www.nature.com/nrg/journal/v11/n10/abs/nrg2825.html
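To make the "any combination of variables" point concrete, here is a minimal sketch in plain Python (not part of frma or frmaTools). It assumes a hypothetical sample sheet with `run_date` and `condition` fields; the sample values, field names, and the helper functions `batch_labels` and `small_batches` are all made up for illustration. It builds one batch label per array from whichever annotation fields you choose, then reports which batches end up with very few arrays.

```python
from collections import Counter

# Hypothetical sample annotations; field names and values are illustrative only.
samples = [
    {"id": "S1", "tissue": "blood", "run_date": "2014-05", "condition": "healthy"},
    {"id": "S2", "tissue": "blood", "run_date": "2014-05", "condition": "rejection"},
    {"id": "S3", "tissue": "blood", "run_date": "2014-05", "condition": "healthy"},
    {"id": "S4", "tissue": "blood", "run_date": "2014-06", "condition": "healthy"},
    {"id": "S5", "tissue": "blood", "run_date": "2014-06", "condition": "rejection"},
    {"id": "S6", "tissue": "blood", "run_date": "2014-06", "condition": "rejection"},
]


def batch_labels(samples, variables):
    """Return one batch label per sample by joining the chosen annotation fields."""
    return ["|".join(str(s[v]) for v in variables) for s in samples]


def small_batches(labels, min_size):
    """Return a dict of batch label -> size for batches below min_size arrays."""
    counts = Counter(labels)
    return {batch: n for batch, n in counts.items() if n < min_size}


# Two candidate batch definitions: technical variable only, or technical + biological.
by_date = batch_labels(samples, ["run_date"])
by_date_and_condition = batch_labels(samples, ["run_date", "condition"])

print(Counter(by_date))
print(small_batches(by_date_and_condition, min_size=2))
```

The comparison shows the trade-off in the choice: adding condition to the batch definition splits the same arrays into more, smaller batches, so it is worth checking batch sizes before settling on a definition.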

Finally, once you have generated your frma vectors, I would use the default single-array method to analyze your samples. You would only want to use the multi-array version of frma (the "random_effect" summarization method) if you generated another (relatively large) group of arrays that you thought might differ somewhat from the training data.

Hope that helps.