Hi!

I was wondering how could I compare an RNA-seq matrix of raw counts that I downloaded from a repository to my own RNA-seq experiment. Specifically, I want to see which samples from the downloaded paper are the closest to my samples. I have tried to combine all samples into one matrix and use DESeq2 to process the data and visualize PCA and the sample correlations without success. I can clearly see the differences between the download data and my own. The conditions of the total data set can be reproduced this way:

`colData <- data.frame(condition = c(rep(c("A","B"), each = 2), rep(c("C","D","E"),each = 3)), batch = c(rep("1",7),rep("2",6)))`

```
condition batch
1 A 1
2 A 1
3 B 1
4 B 1
5 C 1
6 C 1
7 C 1
8 D 2
9 D 2
10 D 2
11 E 2
12 E 2
13 E 2
```

Trying create the DESeq2 object with the design formula "~condition + batch" returns the "Model matrix not full rank" error. However, I am unsure about how to solve it since it seems to me this is a "perfect confounding" problem.

I have also tried to use the recommendation of using the limma package function removeBatchEffect after Variance stabilizing transformation with the design "~condition" only:

```
mat <- assay(vsd)
mat <- limma::removeBatchEffect(mat, vsd$batch)
assay(vsd) <- mat
plotPCA(vsd)
```

But I can still see the batch effect. I was wondering if I should use the Combat algorithm to remove the batch effects, although it seems that I need to clean and normalize the data. Several people have also recommended me to not use it and try to model it instead with DESeq.

It would be much appreciated if someone could point me in the right direction! Thank you very much!

Jose

Hi Michael! This is the plot after. You can see a bunch of samples clustered together (G,H,I and J are my samples) https://ibb.co/gmY7CrK

How do the batches line up with A-J?

From A to F is batch 1. From G to J are my samples (2nd batch). The conditions I wrote above are simplified, but end up being the same.

So what's happened here is that your samples are very similar to each other, and the other conditions are very different from each other. The middle of your samples and the average of all other conditions is now coincident: they both are centered on (0,0).

I'm not sure this is going to work as a method of comparison.

What are these conditions by the way? Are these very different than your samples?

Different early developmental stages and I want to check if how my samples reacts to a different media, that is, which stage do they seem to be closer to. I expected that at least the control should be similar to one of the downloaded dataset. Is there any other reasonable way to do this kind of comparison? How can be different datasets be compared to each other then? I am very curious about it now.

I'm not sure. I would probably not attempt to do the batch removal as you have above but just look at a PCA on the original VST data. Note that plotPCA has an argument ntop, and you can increase this to represent more genes in the PCA plot.

I tried that but unfortunately it does not change the look of the PCA. All my samples are still completely overlapping. Isn't there then a way to modify the design to approach it in a reasonable way? I have tried to take a look at the vignettes on "Model matrix not full rank”, but it seems that my case is indeed a perfect confounding problem, right?

The design matrix won't help here. I'm not sure if I have any further suggestions. I think the samples are so far from the stages because of batch and biology that it's not easy to say which is closest.

I have tried a couple of things more and it seems to be the right conclusion. Thank you very much! I will try with another dataset. Just one more question, was the approach I used with DESeq2 a correct one? I mean, should I try to do the same with another dataset?

I don’t have any other suggestions than what you tried here

Great, thank very much!

You could do nearest neighbors mapping of your samples onto the public data, under the assumption that the batch effect is orthogonal to the biological axes of variation. Just pick the top 500 or so genes with the largest variance/dispersion and use those to find nearest neighbors in transformed space.

I can do that but I do not quite get what implies that assumption. Thanks!

We'll pretend that you only have one sample, for the sake of this explanation.

Imagine a perfect world in which there is no batch effect between your data and the public data. Your sample is closest to, say, sample X in the public data, differing by some distance $d$. If you introduce a batch effect that is orthogonal to the vector difference between your samples, the distance between samples becomes $\sqrt{d^2 + b^2}$ where $b$ is the length of the batch vector - this is just good ol' Pythagoras.

Further assume that the batch vector is

alsoorthogonal to the vector difference between your sample and some other public sample Y. If the distance between your sample and Y was $f$, then the new distance after introducing a batch effect is $\sqrt{f^2 + b^2}$. But if $d < f$, then $\sqrt{d^2 + b^2} < \sqrt{f^2 + b^2}$, so X is still a closer neighbor to your sample than Y in the presence of this batch effect.Repeat the application of this logic to all public samples, and you can see that if the batch effect is orthogonal to any (presumably biological) variation between samples, the nearest neighbor identities will be unchanged. Of course, in practice, this depends on noise and such, but hopefully it should be enough to get a good indication of what's going on. One could also search for the $k$ nearest neighbors and look for a sharp increase in the distance, but YMMV as high-dimensional distances become very similar.

Would it work if you calculate the adjustment coefficients using the controls only and then apply the adjustment to all samples? In that way controls from both databases will be centered and the large biological effect will not affect the adjustment.