sva and svaseq for unsupervised analysis of expression data

Hi,

I would like to use the sva package to adjust for batch effects in two independent datasets. Ideally, one dataset would serve as the training set and the other as the prediction set. My understanding is that sva is designed to facilitate differential-expression-type questions; what I am trying to do, however, is to observe the intrinsic structure of the training set and then test whether the same prominent substructures exist in the prediction set. So this is not a classical differential expression question. Cases in the two sets are fairly randomly distributed in terms of their clinical features, and there is no major bias in the sets' compositions.

As I am interested in using svaseq and ComBat for the adjustment, my first question is which of these may be the better fit for this purpose. Again, I don't have any p- or q-values and am not interested in them right now (they may be useful later on). The focus is to keep the intrinsic biological variation intact while adjusting for the analytical variation.

Secondly, in the special case where the training and prediction sets were produced on different analytical platforms, how well supported is the idea of using sva for the correction? There are certainly differences stemming from the platform difference, but is there any rational, recommended approach to a) find the features that are relatively consistent between the sets, and b) adjust those features?

I hope the question is clear; if more details are needed, I would be glad to provide them.

Thank you!

sva combat gene expression batch effect rnaseq
Jeff Leek
@jeff-leek-5015
United States
Hi Farshad,

You are right, current methods aren't designed for prediction problems. We did make an effort at studying this problem:

https://peerj.com/articles/561/

The basic idea is to:

1. Freeze a training set

2. Identify surrogate variables by appending each test sample to the training set

3. Apply a regression-based cleaning step to remove the estimated surrogate variables from both the training and test data

This seems to work reasonably well when the data sets are measured on the same platform, the batch effects follow a reasonably similar distribution, and the training set is sufficiently large.

It sounds like you have at least one variable we didn't consider there (two platforms). But you could give it a shot and see how it does in cross-validation as a place to start. This is implemented in the "fsva" function in the sva package.
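The three steps above can be sketched in R with the fsva function. This is a hedged sketch, not a tested pipeline: `train_expr`, `test_expr`, `pheno`, and the `outcome` column are hypothetical placeholders for your own gene-by-sample matrices and phenotype data, and the two matrices are assumed to already share the same features on a comparable scale.

```r
# Sketch of frozen surrogate variable analysis (fsva) for a
# train/predict split. Assumes:
#   train_expr, test_expr : gene-by-sample expression matrices
#                           with matching rows (hypothetical names)
#   pheno                 : data frame of training-set covariates
#                           with a column `outcome` (hypothetical)
library(sva)

# Model matrix for the training set; sva protects the variation
# associated with the modeled variables while estimating surrogate
# variables for everything else (e.g., batch).
mod <- model.matrix(~ outcome, data = pheno)

# 1. Freeze the training set and estimate surrogate variables on it
sv_obj <- sva(train_expr, mod)

# 2-3. Frozen SVA: append each test sample, estimate its surrogate
# variable values, and regression-clean both training and test data
fsva_res <- fsva(dbdat = train_expr, mod = mod, sv = sv_obj,
                 newdat = test_expr, method = "exact")

adj_train <- fsva_res$db   # batch-adjusted training data
adj_test  <- fsva_res$new  # batch-adjusted test data
```

From here, `adj_train` and `adj_test` can feed into cross-validated clustering or prediction to check whether the substructures found in the training set persist in the prediction set.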

Cheers

Jeff