Question: How should I compare two different RNA-seq datasets using DESeq2 or EgdeR?
1
13 days ago by
joseale231010
joseale231010 wrote:

Hi!

I was wondering how could I compare an RNA-seq matrix of raw counts that I downloaded from a repository to my own RNA-seq experiment. Specifically, I want to see which samples from the downloaded paper are the closest to my samples. I have tried to combine all samples into one matrix and use DESeq2 to process the data and visualize PCA and the sample correlations without success. I can clearly see the differences between the download data and my own. The conditions of the total data set can be reproduced this way:

colData <- data.frame(condition = c(rep(c("A","B"), each = 2), rep(c("C","D","E"),each = 3)), batch = c(rep("1",7),rep("2",6)))

   condition batch
1          A     1
2          A     1
3          B     1
4          B     1
5          C     1
6          C     1
7          C     1
8          D     2
9          D     2
10         D     2
11         E     2
12         E     2
13         E     2


Trying create the DESeq2 object with the design formula "~condition + batch" returns the "Model matrix not full rank" error. However, I am unsure about how to solve it since it seems to me this is a "perfect confounding" problem.

I have also tried to use the recommendation of using the limma package function removeBatchEffect after Variance stabilizing transformation with the design "~condition" only:

mat <- assay(vsd)
mat <- limma::removeBatchEffect(mat, vsd$batch) assay(vsd) <- mat plotPCA(vsd)  But I can still see the batch effect. I was wondering if I should use the Combat algorithm to remove the batch effects, although it seems that I need to clean and normalize the data. Several people have also recommended me to not use it and try to model it instead with DESeq. It would be much appreciated if someone could point me in the right direction! Thank you very much! Jose limma deseq2 combat • 106 views ADD COMMENTlink modified 13 days ago by Michael Love24k • written 13 days ago by joseale231010 Answer: How should I compare two different RNA-seq datasets using DESeq2 or EgdeR? 0 13 days ago by Michael Love24k United States Michael Love24k wrote: Can you post the PCA plot where you still see the batch effect after applying removeBatchEffect? I'm curious what it looks like. ADD COMMENTlink written 13 days ago by Michael Love24k Hi Michael! This is the plot after. You can see a bunch of samples clustered together (G,H,I and J are my samples) https://ibb.co/gmY7CrK ADD REPLYlink modified 13 days ago • written 13 days ago by joseale231010 How do the batches line up with A-J? ADD REPLYlink written 13 days ago by Michael Love24k From A to F is batch 1. From G to J are my samples (2nd batch). The conditions I wrote above are simplified, but end up being the same. ADD REPLYlink written 13 days ago by joseale231010 1 So what's happened here is that your samples are very similar to each other, and the other conditions are very different from each other. The middle of your samples and the average of all other conditions is now coincident: they both are centered on (0,0). I'm not sure this is going to work as a method of comparison. What are these conditions by the way? Are these very different than your samples? ADD REPLYlink written 13 days ago by Michael Love24k Different early developmental stages and I want to check if how my samples reacts to a different media, that is, which stage do they seem to be closer to. I expected that at least the control should be similar to one of the downloaded dataset. Is there any other reasonable way to do this kind of comparison? How can be different datasets be compared to each other then? I am very curious about it now. ADD REPLYlink written 13 days ago by joseale231010 I'm not sure. I would probably not attempt to do the batch removal as you have above but just look at a PCA on the original VST data. Note that plotPCA has an argument ntop, and you can increase this to represent more genes in the PCA plot. ADD REPLYlink written 13 days ago by Michael Love24k I tried that but unfortunately it does not change the look of the PCA. All my samples are still completely overlapping. Isn't there then a way to modify the design to approach it in a reasonable way? I have tried to take a look at the vignettes on "Model matrix not full rank”, but it seems that my case is indeed a perfect confounding problem, right? ADD REPLYlink written 12 days ago by joseale231010 1 The design matrix won't help here. I'm not sure if I have any further suggestions. I think the samples are so far from the stages because of batch and biology that it's not easy to say which is closest. ADD REPLYlink written 12 days ago by Michael Love24k I have tried a couple of things more and it seems to be the right conclusion. Thank you very much! I will try with another dataset. Just one more question, was the approach I used with DESeq2 a correct one? I mean, should I try to do the same with another dataset? ADD REPLYlink written 12 days ago by joseale231010 1 I don’t have any other suggestions than what you tried here ADD REPLYlink written 12 days ago by Michael Love24k Great, thank very much! ADD REPLYlink written 12 days ago by joseale231010 You could do nearest neighbors mapping of your samples onto the public data, under the assumption that the batch effect is orthogonal to the biological axes of variation. Just pick the top 500 or so genes with the largest variance/dispersion and use those to find nearest neighbors in transformed space. ADD REPLYlink modified 12 days ago • written 12 days ago by Aaron Lun24k I can do that but I do not quite get what implies that assumption. Thanks! ADD REPLYlink written 12 days ago by joseale231010 We'll pretend that you only have one sample, for the sake of this explanation. Imagine a perfect world in which there is no batch effect between your data and the public data. Your sample is closest to, say, sample X in the public data, differing by some distance$d$. If you introduce a batch effect that is orthogonal to the vector difference between your samples, the distance between samples becomes$\sqrt{d^2 + b^2}$where$b$is the length of the batch vector - this is just good ol' Pythagoras. Further assume that the batch vector is also orthogonal to the vector difference between your sample and some other public sample Y. If the distance between your sample and Y was$f$, then the new distance after introducing a batch effect is$\sqrt{f^2 + b^2}$. But if$d < f$, then$\sqrt{d^2 + b^2} < \sqrt{f^2 + b^2}$, so X is still a closer neighbor to your sample than Y in the presence of this batch effect. Repeat the application of this logic to all public samples, and you can see that if the batch effect is orthogonal to any (presumably biological) variation between samples, the nearest neighbor identities will be unchanged. Of course, in practice, this depends on noise and such, but hopefully it should be enough to get a good indication of what's going on. One could also search for the$k\$ nearest neighbors and look for a sharp increase in the distance, but YMMV as high-dimensional distances become very similar.