Hi,
My goal is to normalize multiple GEO microarray datasets to make them comparable to each other and to future samples. I'm using the UPC function in the SCAN.UPC package. I saw there's an option to provide batch information for each sample so that the batch effect is corrected. I read the following in the SCAN.UPC documentation, which I don't really understand:
"Batch adjusting will be performed after values have been SCAN normalized and summarized at the gene/probeset level. This is also true when UPC and UPCfast are being used—the data will be SCAN normalized and summarized, then batch adjusting will be performed, and lastly UPC transformation will occur. This process is different from when UPC or UPCfast are invoked without batch information; in this scenario, no SCAN normalization will occur."
I have some questions regarding the batch effect:
1. Is it true that SCAN and UPC build models on each sample individually and are applied to each sample individually? If so, why is batch effect relevant? Why perform batch correction after SCAN normalization?
2. The documentation says that "no SCAN normalization will occur" when no batch information is provided. Can someone please elaborate on what this implies? Can UPC skip the SCAN step? I guess the final normalized values will be different from those obtained with batch information. But which one is better?
3. Lastly, I'd like to know whether I'll get the same results from the following three approaches:
a) Process each sample (GSM) individually, without batch information obviously
b) Process a dataset (GSE) at a time, with batch information
c) Process a dataset (GSE) at a time, without batch information
Thanks a lot!
Yunlei
Hi Stephen,
Thanks for your quick and elaborate answer! You answered all my questions! The only thing left is to decide whether to use batch information or not. What would you do? Would it still be a fair comparison if I normalize some datasets with batch information and some without?
Thanks!
Yunlei
Are you combining it with other types of microarray data or RNA-seq data? Or will it all be Affymetrix data?
There are other types of microarray data, not only Affymetrix.
In that case, I would say not to use batch adjustment because, as you say, it would be inconsistent to use it in one place but not another. However, one limitation of the UPC method is that the shape of the data is pretty different for UPC normalized Affymetrix data compared to UPC normalized data for other platforms. I admit it's a bit hokey, but the results might be more comparable if you first SCAN normalize the Affy data and then UPC transform those values using UPC_Generic.
I appreciate your advice very much! For now I'll settle on UPC without batch adjustment, because UPC_Generic requires extra information like lengths and GC content... Do you recommend post-normalization of the UPC-normalized values? I mean, first run UPC on every sample from any type of platform, then apply quantile normalization or ComBat with the platform as the batch factor?
Lengths and GC content are optional. UPC_Generic should just accept the expression values so should be easy to apply, but let me know if you find otherwise.
In theory you shouldn't need to do post-normalization on UPC-normalized values. However, realistically, there will still be some systematic differences among the datasets. It might be worth a try to do quantile norm or ComBat and just see how it behaves.
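To make the "quantile norm" option concrete, here is a minimal base-R sketch of quantile normalization across samples. It is only an illustration on made-up values: the function name quantile_normalize and the toy matrix are assumptions, not part of SCAN.UPC, and for real data a tested implementation such as preprocessCore::normalize.quantiles would be the safer choice.

```r
# Minimal quantile normalization sketch (base R only).
# Rows are genes/probesets, columns are samples.
# quantile_normalize is a hypothetical helper, not a SCAN.UPC function.
quantile_normalize <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) > 1, !anyNA(x))
  # Sort each column, then average across samples at each rank position.
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)
  # Replace every value with the target value at its within-sample rank.
  out <- apply(x, 2, function(col) target[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(x)
  out
}

# Made-up values standing in for UPC output from three samples.
x <- cbind(s1 = c(0.10, 0.90, 0.50, 0.30),
           s2 = c(0.20, 0.95, 0.60, 0.05),
           s3 = c(0.40, 0.80, 0.70, 0.15))
rownames(x) <- paste0("gene", 1:4)

xn <- quantile_normalize(x)
```

After this step every sample has exactly the same set of values, and only the ordering of genes within each sample differs. That is a strong assumption when the samples come from different platforms, so it is worth comparing the results before and after.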
Thanks! My last question -- In your previous comment "if you first SCAN normalize the Affy data and then UPC transform those values using UPC_Generic", do you mean this is only for Affy data, and apply UPC_Generic alone for other data? And without batch correction, right? But again, you use different methods... I'll follow your lead if you think this is better than the post-normalization.
Yes and yes.
Ok! I'll follow your lead! Thank you so much!
Hi Stephen! I've tried your approach (SCAN + UPC_Generic), but it produced a strongly bimodal distribution: only ~3% of the probesets are between 0.2 and 0.8. By contrast, plain UPC-transformed data has ~33% of the probesets in that range, which I thought was more reasonable.
Another issue I had with UPC_Generic is that it only accepts a vector as the input, not a matrix. Is that so? This is inconvenient when there are many samples per dataset.
The 3rd question is about Affy U133A and U133B. How can I combine them while using UPC? Is it OK to just run UPC on the two CEL files separately and then merge the results by gene name (and select the max?)?
Thanks!
Hi Yunlei,
1. Although UPC-transformed data with ~33% in the middle range may look more reasonable, the UPC method produces a more bimodal distribution for other data types (due to the underlying distributions of those data types as well as the assumptions we make). That's why I suggested SCAN followed by UPC_Generic for the Affy data: it should make the values more consistent across different types of expression data. But to be frank, it's never going to be perfect. As with any method, it is limited by the assumptions we make as well as by noise in the data.
2. You can just use the apply function. If your matrix is x and each column is a sample, you can do this: y <- apply(x, 2, UPC_Generic).
3. Some people do what you're suggesting. But I don't really have a tried-and-true method for that. I like to use the BrainArray annotations because they get rid of problematic probes. You could use those to map the probes to genes for each platform and then take the max (or average) of those gene-level values.
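Points 2 and 3 can be sketched together in base R. The example below is a toy illustration under stated assumptions: toy_transform is a hypothetical stand-in for UPC_Generic (a simple rescaling so the code runs without SCAN.UPC installed, not the UPC model), and the probe-to-gene mapping is invented; in practice it would come from an annotation source such as the BrainArray annotations mentioned above.

```r
# toy_transform is a hypothetical placeholder for UPC_Generic: it takes one
# sample's expression vector and returns a vector of the same length.
# It is NOT the UPC model, just a rescaling to [0, 1] for illustration.
toy_transform <- function(v) (v - min(v)) / (max(v) - min(v))

# Made-up expression matrix: rows are probes, columns are samples.
x <- matrix(c(5, 9, 7, 3,
              6, 8, 2, 4),
            ncol = 2,
            dimnames = list(c("p1", "p2", "p3", "p4"),
                            c("sampleA", "sampleB")))

# Point 2: apply a vector-only function to every sample (column).
y <- apply(x, 2, toy_transform)

# Point 3: collapse probe-level values to gene level by taking the max.
# The mapping below is invented for this example.
probe_to_gene <- c(p1 = "GENE1", p2 = "GENE1", p3 = "GENE2", p4 = "GENE3")
genes <- probe_to_gene[rownames(x)]
gene_max <- apply(y, 2, function(col) tapply(col, genes, max))
```

Swapping max for mean in the last line gives the averaging variant. For U133A and U133B, the same collapse could be run on the row-bound values from both arrays of a sample once the probes of each platform are mapped to genes.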
Thank you sincerely! And I appreciate your openness!
Yunlei