Question

Regarding batch effects in the DESeq2 package

szenitha ▴ 20

@szenitha-10863

Hello All,

I have some questions regarding batch effects. In the workflow, the sva package is used to find sources of unwanted variation, and the function stripchart is used to see how well the SVA method did at recovering these variables. Could you please explain the result returned by stripchart? I am unable to understand the meaning of this plot.
Also, could we pass DESeq2-normalized data as input to the svaseq method? There is another method, sva, which also finds surrogate variables and was designed for microarray data. The only difference is that sva takes transformed data as input, while svaseq takes either raw counts or normalized counts as input and then transforms the data by taking log2. So would it be better to use sva with rlog-transformed data as input, or to use svaseq?

deseq2 sva svaseq • 2.4k views

updated 9.2 years ago by Michael Love 43k • written 9.2 years ago by szenitha ▴ 20

score 2 · Accepted Answer · 2016-08-26

"Could you please explain the result returned by the method strip chart. I am unable to understand the meaning of this plot."

In the workflow, we know the batches (it's actually not "batch", but patient-of-origin for the cells, but I'll just use the word "batch" here), so we can just put the batch variable into the design, e.g. ~batch + condition. This is the recommended approach.
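As a minimal sketch of the known-batch case, the block below puts a batch factor into the design ahead of the condition of interest. It uses DESeq2's simulated example dataset; the batch labels b1 and b2 are hypothetical and are assigned here only so that the design has something to adjust for.

```r
library(DESeq2)

set.seed(1)
# Simulated dataset: 8 samples, condition A (4 samples) vs B (4 samples)
dds <- makeExampleDESeqDataSet(n = 1000, m = 8)

# Hypothetical batch labels, balanced within each condition
dds$batch <- factor(rep(c("b1", "b2"), times = 4))

# Known batch goes into the design; the variable of interest goes last
design(dds) <- ~ batch + condition

dds <- DESeq(dds)
# By default results() tests the last variable in the design (condition B vs A)
res <- results(dds)
```

Putting condition last means the default call to results() returns the condition effect adjusted for batch.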

However, as a way to demonstrate how svaseq works, we suppose we weren't provided with the batch variable. In this case, we could use svaseq to recover the batch information, as "surrogate variables". So we go through the steps we would take if we didn't have any batch information, and we produce two surrogate variables, SV1 and SV2.
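The steps described above can be sketched roughly as follows. This is an illustration on simulated counts, not the workflow's own data: the patient effect, the gene count, and names such as patient_lfc and coldata are assumptions made for the example. The patient column is kept only so the result can be checked afterwards; it is never passed to svaseq.

```r
library(DESeq2)
library(sva)

set.seed(42)
n_genes   <- 2000
patient   <- factor(rep(paste0("patient", 1:4), each = 2))  # "hidden" batch
condition <- factor(rep(c("untrt", "trt"), times = 4), levels = c("untrt", "trt"))

# Simulated negative binomial counts; the first 600 genes carry a patient effect
base <- 2^runif(n_genes, 4, 10)
patient_lfc <- matrix(0, n_genes, 4)
patient_lfc[1:600, ] <- rnorm(600 * 4, sd = 1)
mu  <- base * 2^patient_lfc[, as.integer(patient)]
cts <- matrix(rnbinom(n_genes * 8, mu = mu, size = 10), nrow = n_genes)
storage.mode(cts) <- "integer"
colnames(cts) <- paste0("s", 1:8)

coldata <- data.frame(condition, patient, row.names = colnames(cts))
dds <- DESeqDataSetFromMatrix(cts, coldata, design = ~ condition)
dds <- estimateSizeFactors(dds)

# Normalized counts, dropping genes with very low average counts
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

# Full model has the variable of interest; null model has intercept only
mod  <- model.matrix(~ condition, coldata)
mod0 <- model.matrix(~ 1, coldata)

# Ask for two surrogate variables, pretending patient is unknown
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# Use the surrogate variables as covariates in the design
ddssva <- dds
ddssva$SV1 <- svseq$sv[, 1]
ddssva$SV2 <- svseq$sv[, 2]
design(ddssva) <- ~ SV1 + SV2 + condition
```

From here, DESeq(ddssva) would fit the model with the surrogate variables standing in for the unknown batch.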

How can we test if SV1 and SV2 recover similar information about the samples as batch? If SV1 and SV2 help us to see the differences across the known batches, then we know that svaseq "worked". So in the stripchart we are plotting the values of SV1 and SV2 for the four different patients. We can see that SV1 explains the difference between one patient and the other three, and SV2 explains the difference between one patient and two others, etc. If we estimated another surrogate variable, it would likely help to explain the difference between the first two patients. So the surrogate variables do recover differences in gene expression correlated with patient information.
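To make the reading of the plot concrete, here is a small base R sketch. The surrogate variable values are made up for illustration (they are not output from the workflow) and are chosen to mimic the pattern described above: SV1 sets one patient apart from the other three, and SV2 sets one patient apart from two others. The R-squared at the end is an extra numeric check added here, not something the workflow computes.

```r
# Hypothetical surrogate variable values: 8 samples from 4 patients
patient <- factor(rep(paste0("patient", 1:4), each = 2))
sv <- cbind(
  SV1 = c(-0.60, -0.58,  0.20,  0.19, 0.21, 0.18, 0.20, 0.20),
  SV2 = c( 0.05,  0.02, -0.55, -0.57, 0.30, 0.28, 0.24, 0.23)
)

# One panel per surrogate variable: each point is a sample,
# each column of points is a patient
par(mfrow = c(2, 1), mar = c(3, 5, 3, 1))
for (i in seq_len(ncol(sv))) {
  stripchart(sv[, i] ~ patient, vertical = TRUE, main = colnames(sv)[i])
  abline(h = 0)
}

# Fraction of each surrogate variable's variance explained by patient
r2 <- apply(sv, 2, function(x) summary(lm(x ~ patient))$r.squared)
r2
```

If the points for a patient cluster tightly and sit at a different height from the other patients, that surrogate variable is capturing a difference associated with that patient.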

Re: passing normalized counts to svaseq vs. VST/rlog-transformed data to sva: I haven't tested this, but it shouldn't make a big difference. I think passing normalized counts to svaseq is fine; it's one less step to perform, so this is what I recommend in the workflow.
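For anyone who wants to check this on their own data, the two routes can be run side by side along these lines. This is an untested-in-practice sketch on simulated data: the hidden batch is injected artificially by tripling counts for some genes, and no particular level of agreement between the two routes is claimed.

```r
library(DESeq2)
library(sva)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 8)

# Inject an artificial hidden batch: triple the counts of 500 genes
# in every second sample (balanced across the two conditions)
hidden <- rep(c(FALSE, TRUE), times = 4)
cts <- counts(dds)
cts[1:500, hidden] <- cts[1:500, hidden] * 3L
counts(dds) <- cts
dds <- estimateSizeFactors(dds)

coldata <- as.data.frame(colData(dds))
mod  <- model.matrix(~ condition, coldata)
mod0 <- model.matrix(~ 1, coldata)

# Route 1: normalized counts into svaseq (it applies the log transform itself)
norm <- counts(dds, normalized = TRUE)
sv_counts <- as.numeric(svaseq(norm[rowMeans(norm) > 1, ], mod, mod0, n.sv = 1)$sv)

# Route 2: variance-stabilized data into sva
vsd <- varianceStabilizingTransformation(dds, blind = FALSE)
sv_vst <- as.numeric(sva(assay(vsd), mod, mod0, n.sv = 1)$sv)

# Surrogate variables have arbitrary sign, so compare absolute correlation
abs(cor(sv_counts, sv_vst))
```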