"Could you please explain the result returned by the method strip chart. I am unable to understand the meaning of this plot."
In the workflow, we know the batches (strictly speaking it's not a "batch" but the patient of origin for the cells; I'll just use the word "batch" here), so we can simply put the batch variable into the design, e.g. ~ batch + condition. This is the recommended approach.
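As a minimal sketch of the recommended approach, assuming a DESeq2 analysis: the data below are simulated with makeExampleDESeqDataSet, and the column name `batch` and the patient labels P1 to P4 are made up for illustration, not taken from the workflow.

```r
library(DESeq2)

set.seed(1)

# Simulated counts (not the workflow's data): 1000 genes, 8 samples,
# with a `condition` factor A,A,A,A,B,B,B,B created by the helper.
dds <- makeExampleDESeqDataSet(n = 1000, m = 8)

# Hypothetical known batch: each of four patients contributes one sample
# to each condition, so batch and condition are not confounded.
dds$batch <- factor(rep(paste0("P", 1:4), times = 2))

# Put the known batch in the design, with the variable of interest last.
design(dds) <- ~ batch + condition

dds <- DESeq(dds)
res <- results(dds)  # condition B vs A, adjusted for batch
```

Putting `condition` last in the design means the default call to results() reports the condition effect while the patient differences are absorbed by the batch terms.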
However, as a way to demonstrate how svaseq works, we suppose we weren't provided with the batch variable. In this case, we could use svaseq to recover the batch information, as "surrogate variables". So we go through the steps we would take if we didn't have any batch information, and we produce two surrogate variables, SV1 and SV2.
How can we test whether SV1 and SV2 recover similar information about the samples as batch? If SV1 and SV2 help us see the differences across the known batches, then we know that svaseq "worked". So in the stripchart we are plotting the value of each surrogate variable for every sample, with the samples grouped by the four different patients. We can see that SV1 separates one patient from the other three, and SV2 separates one patient from two others, etc. If we estimated another surrogate variable, it would likely help to explain the difference between the first two patients. So the surrogate variables do recover differences in gene expression correlated with patient information.
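To make the check concrete, here is a rough sketch under stated assumptions: the counts are simulated (they stand in for normalized counts and are not the workflow's data), and the names `patient`, `condition`, `patient_shift` and `treat_effect` are invented for the example. The svaseq call takes the count matrix, a full model matrix without the batch, a null model matrix, and the number of surrogate variables to estimate.

```r
library(sva)

set.seed(1)  # simulated data only

n_genes   <- 2000
patient   <- factor(rep(paste0("P", 1:4), times = 2))        # the known "batch"
condition <- factor(rep(c("control", "treated"), each = 4))

# Log-scale means: baseline + per-gene patient shift + treatment effect
# in the first 200 genes.
patient_shift <- matrix(rnorm(n_genes * 4, sd = 0.5), nrow = n_genes)
treat_effect  <- c(rnorm(200), rep(0, n_genes - 200))
log_mu <- 5 + patient_shift[, as.integer(patient)] +
  outer(treat_effect, as.integer(condition == "treated"))

counts <- matrix(rnbinom(n_genes * 8, mu = exp(log_mu), size = 10),
                 nrow = n_genes)

# Pretend the patient information was not provided: the full model has
# only the condition, and the null model has only the intercept.
mod  <- model.matrix(~ condition)
mod0 <- model.matrix(~ 1, data = data.frame(condition))
svseq <- svaseq(counts, mod, mod0, n.sv = 2)

# One panel per surrogate variable, samples grouped by the known patient.
par(mfrow = c(2, 1), mar = c(3, 5, 3, 1))
for (i in 1:2) {
  stripchart(svseq$sv[, i] ~ patient, vertical = TRUE,
             main = paste0("SV", i))
  abline(h = 0)
}
```

Each point is one sample. If the points for a given patient sit together and away from zero in a panel, that surrogate variable is picking up the difference between that patient and the others.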
Re: giving normalized counts to svaseq vs. giving VST/rlog-transformed data to sva: I haven't tested this, but it shouldn't make a big difference. I think normalized counts with svaseq is fine; it's one less step to perform, so this is what I recommend in the workflow.