Question: Regarding batch effects in the DESeq2 package
szenitha0 wrote:

Hello All,

I have some questions regarding batch effects. In the workflow, the sva package is used to find sources of unwanted variation, and the function stripchart is used to see how well the SVA method did at recovering these variables. Could you please explain the result returned by stripchart? I am unable to understand the meaning of this plot.
Also, could we pass DESeq2-normalized data as input to the svaseq method? There is another method, sva, which also finds surrogate variables and was designed for microarray data. The only difference is that sva takes transformed data as input, while svaseq takes either raw or normalized counts as input and then applies a log transformation. So would it be better to use sva with rlog-transformed data as input, or to use svaseq?

modified 2.1 years ago by Michael Love19k • written 2.1 years ago by szenitha0
Michael Love19k (United States) wrote:

"Could you please explain the result returned by the method strip chart. I am unable to understand the meaning of this plot."

In the workflow, we know the batches (it's actually not "batch", but patient-of-origin for the cells, but I'll just use the word "batch" here), so we can just put the batch variable into the design, e.g. ~batch + condition. This is the recommended approach.
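
As a minimal sketch of the recommended approach, assuming the workflow's data is the airway dataset (the object name `airway` and the column names `cell` for patient-of-origin and `dex` for condition come from that dataset, not from this thread):

```r
library("DESeq2")
library("airway")   # example data; assumed to be the dataset used in the workflow

data("airway")

# 'cell' (patient-of-origin) plays the role of "batch"; 'dex' is the condition.
# The variable of interest goes last, so results() tests it by default.
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds)

# Condition effect, adjusted for the known batch variable
res <- results(dds)
```

With the batch variable in the design, no surrogate variables are needed at all.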

However, as a way to demonstrate how svaseq works, we suppose we weren't provided with the batch variable. In this case, we could use svaseq to recover the batch information, as "surrogate variables". So we go through the steps we would take if we didn't have any batch information, and we produce two surrogate variables, SV1 and SV2.

How can we test whether SV1 and SV2 recover similar information about the samples as batch? If SV1 and SV2 help us to see the differences across the known batches, then we know that svaseq "worked". So in the stripchart we are plotting the values of each surrogate variable for the four different patients. We can see that SV1 explains the difference from one patient to the other three, and SV2 explains the difference between one patient and two others, etc. If we estimated another surrogate variable, it would likely help to explain the difference between the first two patients. So the surrogate variables do recover differences in gene expression correlated with patient information.
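
The steps described above can be sketched roughly as follows. This again assumes the airway dataset, with `cell` as the known patient variable and `dex` as the condition; the low-count filter threshold and the object names are illustrative choices, not requirements.

```r
library("DESeq2")
library("sva")
library("airway")   # example data; assumed to be the dataset used in the workflow

data("airway")

# Pretend the patient variable 'cell' is unknown: the design has only the condition
dds <- DESeqDataSet(airway, design = ~ dex)
dds <- estimateSizeFactors(dds)

# Normalized counts, dropping genes with a very low average count
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

# Full model (with the condition) and null model (intercept only)
mod  <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Estimate two surrogate variables
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# One panel per surrogate variable, with samples grouped by the known patient
par(mfrow = c(2, 1), mar = c(3, 5, 3, 1))
for (i in 1:2) {
  stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))
  abline(h = 0)
}

# To adjust for the surrogate variables, add them to the design
ddssva <- dds
ddssva$SV1 <- svseq$sv[, 1]
ddssva$SV2 <- svseq$sv[, 2]
design(ddssva) <- ~ SV1 + SV2 + dex
```

Each panel has one column of points per patient. When the samples from one patient sit together on one side of the zero line and the remaining patients sit on the other side, that surrogate variable is separating that patient from the rest.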

Regarding normalized counts to svaseq versus VST/rlog-transformed data to sva: I haven't tested this, but it shouldn't make a big difference. I think passing normalized counts to svaseq is fine; it's one less step to perform, so this is what I recommend in the workflow.
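
Since this comparison is untested here, one way to check it on your own data is to run both routes and correlate the resulting surrogate variables. The sketch below is only an illustration under the same assumptions as before (airway dataset, `dex` as condition); it uses vst rather than rlog for speed, and rlog could be swapped in.

```r
library("DESeq2")
library("sva")
library("airway")   # example data; assumed to be the dataset used in the workflow

data("airway")
dds <- DESeqDataSet(airway, design = ~ dex)
dds <- estimateSizeFactors(dds)

# Use the same genes for both routes
norm_counts <- counts(dds, normalized = TRUE)
keep <- rowMeans(norm_counts) > 1

mod  <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Route 1: normalized counts into svaseq, which log-transforms them internally
sv_counts <- svaseq(norm_counts[keep, ], mod, mod0, n.sv = 2)$sv

# Route 2: variance-stabilized data into sva
vsd <- vst(dds, blind = FALSE)
sv_vst <- sva(assay(vsd)[keep, ], mod, mod0, n.sv = 2)$sv

# Absolute correlations between the two sets of surrogate variables;
# the sign of a surrogate variable is arbitrary, so only the magnitude matters
round(abs(cor(sv_counts, sv_vst)), 2)
```

Values near 1 on the diagonal would suggest that the two routes recover essentially the same surrogate variables for that dataset.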