Question

Why is batch effect correction an option in SCAN.UPC normalization?


y.li.1 • 0


Hi,

My goal is to normalize multiple GEO microarray datasets to make them comparable to each other and to future samples. I'm using the UPC function in the SCAN.UPC package. I saw there's an option to provide batch information for each sample so that batch effects are corrected. I read the following in the SCAN.UPC documentation, which I don't really understand:

"Batch adjusting will be performed after values have been SCAN normalized and summarized at the gene/probeset level. This is also true when UPC and UPCfast are being used—the data will be SCAN normalized and summarized, then batch adjusting will be performed, and lastly UPC transformation will occur. This process is different from when UPC or UPCfast are invoked without batch information; in this scenario, no SCAN normalization will occur."

I have some questions regarding the batch effect:

1. Is it true that SCAN and UPC build models on each sample individually and are applied to each sample individually? If so, why is batch effect relevant? Why perform batch correction after SCAN normalization?

2. The documentation says that "no SCAN normalization will occur" when no batch information is provided. Can someone please elaborate on what this implies? Can UPC skip the SCAN step? I guess the final normalized values will be different from those obtained using batch information. But which one is better?

3. Lastly, I'd like to know if I'll get the same results from the following three approaches:

a) Process each sample (GSM) individually, without batch information obviously

b) Process a dataset (GSE) at a time, with batch information

c) Process a dataset (GSE) at a time, without batch information

Thanks a lot!

Yunlei

scan.upc upc normalization batch effect correction • 2.0k views

ADD COMMENT • link updated 5.7 years ago by Stephen Piccolo ▴ 590 • written 5.7 years ago by y.li.1 • 0

score 2 · Answer 1 · 2018-08-15


Stephen Piccolo ▴ 590


Thanks for your question and sorry for the confusion. I put an option to do batch normalization in SCAN.UPC, but most people don't use it. I did this before the sva package was available. Now I would just point people to the sva package.

First, I'll answer your questions:

1. Yes, that's a key aspect of SCAN and UPC. They only use data from within a single sample for normalization. This has many benefits (see our paper) and does a pretty good job of standardizing data across samples. However, even though multiple samples may have the same/similar mean and variance, there may still be batch effects (systematic variations associated with when/how each batch was processed). So it's still a good idea in many cases to correct for these effects (I typically use ComBat). Having said that, batch effects **should** affect UPC relatively little compared to SCAN.

2. It depends on what type of data you are using. If you are using Affymetrix microarrays, it will skip SCAN and go straight to UPC. However, there is an option to first SCAN normalize, then batch-adjust, then UPC normalize. The reason for this sequence of steps is that batch-adjusting expects the data to be normally distributed. SCAN values are normally distributed, whereas UPC values are not (by design). In theory, this sequence should work well, but I haven't done thorough comparisons between this approach and simply using UPC.

3. a and c should give you the same results. b will be different because it will batch adjust across the samples.
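To illustrate point 1, here is a minimal base-R simulation of why a within-sample normalization can leave batch effects in place. It is only a sketch under stated assumptions: the data are simulated, the gene-specific batch shift is invented for the example, and `scale()` is a crude stand-in for a single-sample method, not the SCAN algorithm.

```r
set.seed(1)
n_genes <- 200
n_per_batch <- 5

# Simulated "true" expression: the same gene means underlie every sample
gene_means <- rnorm(n_genes, mean = 8, sd = 2)

# Batch 1: true signal plus sample-level noise
batch1 <- sapply(seq_len(n_per_batch),
                 function(i) gene_means + rnorm(n_genes, sd = 0.3))

# Batch 2: a gene-specific shift shared by all samples in that batch
# (assumed here purely for illustration)
batch_shift <- rnorm(n_genes, sd = 1)
batch2 <- sapply(seq_len(n_per_batch),
                 function(i) gene_means + batch_shift + rnorm(n_genes, sd = 0.3))

x <- cbind(batch1, batch2)
batch <- rep(c("b1", "b2"), each = n_per_batch)

# Within-sample standardization: each column is centered and scaled
# using only its own values (stand-in for a single-sample method)
z <- scale(x)

# Every sample now has mean ~0 and sd 1 ...
sample_means <- colMeans(z)
sample_sds <- apply(z, 2, sd)

# ... yet the per-gene difference between batches still tracks the batch shift
gene_diff <- rowMeans(z[, batch == "b2"]) - rowMeans(z[, batch == "b1"])
shift_correlation <- cor(gene_diff, batch_shift)
```

If `shift_correlation` is close to 1, the gene-level batch signal has survived the per-sample standardization, which is the situation a cross-sample method such as ComBat is meant to address.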

Let me know if that isn't clear.

ADD COMMENT • link 5.7 years ago Stephen Piccolo ▴ 590


Hi Stephen,

Thanks for your quick and elaborate answer! You answered all my questions! The only thing left is to decide whether to use batch information or not. What would you do? Would it still be a fair comparison if I normalize some datasets with batch information and some without?

Thanks!

Yunlei

ADD REPLY • link 5.7 years ago y.li.1 • 0


Are you combining it with other types of microarray data or RNA-seq data? Or will it all be Affymetrix data?

ADD REPLY • link 5.7 years ago Stephen Piccolo ▴ 590

There are other types of microarray data, not only Affymetrix.

ADD REPLY • link 5.7 years ago y.li.1 • 0

In that case, I would say not to use batch adjustment because, as you say, it would be inconsistent to use it in one place rather than another. However, one limitation of the UPC method is that the shape of the data is pretty different for UPC-normalized Affymetrix data compared to UPC-normalized data for other platforms. I admit it's a bit hokey, but the results might be more comparable if you first SCAN normalize the Affy data and then UPC transform those values using UPC_Generic.

ADD REPLY • link 5.7 years ago Stephen Piccolo ▴ 590

I appreciate your advice very much! For now I'll settle on UPC without batch adjustment, because UPC_Generic requires extra information like lengths and GC content... Do you recommend post-normalization of the UPC-normalized values? I mean, first run UPC on every sample from any type of platform, then apply quantile normalization or ComBat, taking the platform as the batch factor?

ADD REPLY • link 5.7 years ago y.li.1 • 0


Lengths and GC content are optional. UPC_Generic should just accept the expression values so should be easy to apply, but let me know if you find otherwise.

In theory you shouldn't need to do post-normalization on UPC-normalized values. However, realistically, there will still be some systematic differences among the datasets. It might be worth a try to do quantile norm or ComBat and just see how it behaves.
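As a rough illustration of the quantile-normalization option, here is a minimal base-R sketch. It assumes a complete numeric matrix (no missing values) with genes in rows and samples in columns; the function name `quantile_normalize` and the toy values are made up for this example. For real data, a maintained implementation such as `normalizeQuantiles` in limma or `normalize.quantiles` in preprocessCore would be the safer choice.

```r
# Force every column (sample) to share the same empirical distribution.
# Assumes no NA values; ties are broken by order of appearance.
quantile_normalize <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")  # per-sample ranks
  sorted <- apply(m, 2, sort)                        # per-sample sorted values
  reference <- rowMeans(sorted)                      # mean of each quantile
  out <- apply(ranks, 2, function(r) reference[r])   # map ranks to reference
  dimnames(out) <- dimnames(m)
  out
}

# Toy UPC-like values (between 0 and 1) for four genes on three samples
upc <- matrix(c(0.90, 0.10, 0.50, 0.30,
                0.99, 0.02, 0.60, 0.01,
                0.70, 0.20, 0.40, 0.35),
              nrow = 4,
              dimnames = list(paste0("gene", 1:4),
                              c("affy_1", "affy_2", "agilent_1")))

upc_qn <- quantile_normalize(upc)
```

After this step every column of `upc_qn` contains the same set of values, only in a different gene order, so it removes distributional differences between samples but cannot tell platform effects from biology. It is worth checking how it behaves on real data before relying on it.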

ADD REPLY • link 5.7 years ago Stephen Piccolo ▴ 590

Thanks! My last question: in your previous comment, "if you first SCAN normalize the Affy data and then UPC transform those values using UPC_Generic", do you mean this is only for Affy data, and that I should apply UPC_Generic alone for other data? And without batch correction, right? But again, you use different methods... I'll follow your lead if you think this is better than the post-normalization.

ADD REPLY • link 5.7 years ago y.li.1 • 0


Yes and yes.

ADD REPLY • link 5.7 years ago Stephen Piccolo ▴ 590


Ok! I'll follow your lead! Thank you so much!

ADD REPLY • link 5.7 years ago y.li.1 • 0

Hi Stephen! I've tried your approach (SCAN + UPC_Generic), but it produced a very polarized, bimodal distribution: only ~3% of the probesets are between 0.2 and 0.8. In contrast, UPC-transformed data has ~33% of the probesets in that range, which I thought was more reasonable.

Another issue I had with UPC_Generic is that it only accepts a vector as the input, not a matrix. Is that so? This is inconvenient when there are many samples per dataset.

The 3rd question is about Affy U133A and U133B. How can I combine them while using UPC? Is it OK to just run UPC on the two CEL files separately and then merge the results by gene name (and select the max?)?

Thanks!

ADD REPLY • link 5.7 years ago y.li.1 • 0


Hi Yunlei,

1. Although UPC-transformed data with ~33% in the middle range may seem more reasonable, the UPC method produces a more bimodal distribution for other data types (due to the underlying distributions of these other data types as well as assumptions we make). That's why I suggested doing it that way: it would make the values more consistent across different types of expression data. But to be frank, it's never going to be perfect. As with any method, it is limited by the assumptions we make as well as noise in the data.

2. You can just use the apply function. If your matrix is x and each column is a sample, you can do this: y <- apply(x, 2, UPC_Generic).

3. Some people do what you're suggesting. But I don't really have a tried-and-true method for that. I like to use the BrainArray annotations because they get rid of problematic probes. You could use those to map the probes to genes for each platform and then take the max (or average) of those gene-level values.
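Points 2 and 3 can be sketched together in base R. This is only an illustration: `toy_transform` is a hypothetical stand-in for `UPC_Generic` (so the block runs without the package), the gene names and values are invented, and the probe-to-gene mapping is assumed to have been done already.

```r
# Hypothetical stand-in for UPC_Generic: takes one sample's values and
# returns a vector of the same length with values in [0, 1]
toy_transform <- function(v) (rank(v) - 1) / (length(v) - 1)

# Toy gene-level matrices: rows are genes, columns are the same two
# patients profiled on each platform
u133a <- matrix(c(5, 9, 7, 6, 8, 4), nrow = 3,
                dimnames = list(c("TP53", "EGFR", "MYC"),
                                c("patient1", "patient2")))
u133b <- matrix(c(3, 8, 9, 2), nrow = 2,
                dimnames = list(c("EGFR", "BRCA2"),
                                c("patient1", "patient2")))

# Point 2: run a vector-only function on every column of a matrix
transform_columns <- function(m) {
  out <- apply(m, 2, toy_transform)
  dimnames(out) <- dimnames(m)  # restore names in case the function drops them
  out
}
a <- transform_columns(u133a)
b <- transform_columns(u133b)

# Point 3: stack both platforms, then collapse duplicated genes per patient
stacked <- rbind(a, b)
merged_max <- apply(stacked, 2,
                    function(col) tapply(col, rownames(stacked), max))
```

Swapping `max` for `mean` in the last step gives the averaging variant. With real data, `toy_transform` would be replaced by `UPC_Generic`, and the row names would come from whichever annotation is used to map probes to genes.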

ADD REPLY • link 5.7 years ago Stephen Piccolo ▴ 590


Thank you sincerely! And I appreciate your openness!

Yunlei

ADD REPLY • link 5.7 years ago y.li.1 • 0