Question

How should I compare two different RNA-seq datasets using DESeq2 or EgdeR?

1

Entering edit mode

joseale2310 ▴ 10

@joseale2310-20445

Last seen 4.9 years ago

Hi!

I was wondering how could I compare an RNA-seq matrix of raw counts that I downloaded from a repository to my own RNA-seq experiment. Specifically, I want to see which samples from the downloaded paper are the closest to my samples. I have tried to combine all samples into one matrix and use DESeq2 to process the data and visualize PCA and the sample correlations without success. I can clearly see the differences between the download data and my own. The conditions of the total data set can be reproduced this way:

colData <- data.frame(condition = c(rep(c("A","B"), each = 2), rep(c("C","D","E"),each = 3)), batch = c(rep("1",7),rep("2",6)))

   condition batch
1          A     1
2          A     1
3          B     1
4          B     1
5          C     1
6          C     1
7          C     1
8          D     2
9          D     2
10         D     2
11         E     2
12         E     2
13         E     2

Trying create the DESeq2 object with the design formula "~condition + batch" returns the "Model matrix not full rank" error. However, I am unsure about how to solve it since it seems to me this is a "perfect confounding" problem.

I have also tried to use the recommendation of using the limma package function removeBatchEffect after Variance stabilizing transformation with the design "~condition" only:

mat <- assay(vsd) 
mat <- limma::removeBatchEffect(mat, vsd$batch)
assay(vsd) <- mat 
plotPCA(vsd)

But I can still see the batch effect. I was wondering if I should use the Combat algorithm to remove the batch effects, although it seems that I need to clean and normalize the data. Several people have also recommended me to not use it and try to model it instead with DESeq.

It would be much appreciated if someone could point me in the right direction! Thank you very much!

Jose

deseq2 limma combat • 4.6k views

ADD COMMENT • link updated 4.9 years ago by Michael Love 42k • written 4.9 years ago by joseale2310 ▴ 10

score 0 · Answer 1 · 2019-06-11

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 13 hours ago

United States

Can you post the PCA plot where you still see the batch effect after applying removeBatchEffect? I'm curious what it looks like.

ADD COMMENT • link 4.9 years ago Michael Love 42k

0

Entering edit mode

Hi Michael! This is the plot after. You can see a bunch of samples clustered together (G,H,I and J are my samples) https://ibb.co/gmY7CrK

ADD REPLY • link 4.9 years ago joseale2310 ▴ 10

0

Entering edit mode

How do the batches line up with A-J?

ADD REPLY • link 4.9 years ago Michael Love 42k

0

Entering edit mode

From A to F is batch 1. From G to J are my samples (2nd batch). The conditions I wrote above are simplified, but end up being the same.

ADD REPLY • link 4.9 years ago joseale2310 ▴ 10

1

Entering edit mode

So what's happened here is that your samples are very similar to each other, and the other conditions are very different from each other. The middle of your samples and the average of all other conditions is now coincident: they both are centered on (0,0).

I'm not sure this is going to work as a method of comparison.

What are these conditions by the way? Are these very different than your samples?

ADD REPLY • link 4.9 years ago Michael Love 42k

0

Entering edit mode

Different early developmental stages and I want to check if how my samples reacts to a different media, that is, which stage do they seem to be closer to. I expected that at least the control should be similar to one of the downloaded dataset. Is there any other reasonable way to do this kind of comparison? How can be different datasets be compared to each other then? I am very curious about it now.

ADD REPLY • link 4.9 years ago joseale2310 ▴ 10

0

Entering edit mode

I'm not sure. I would probably not attempt to do the batch removal as you have above but just look at a PCA on the original VST data. Note that plotPCA has an argument ntop, and you can increase this to represent more genes in the PCA plot.

ADD REPLY • link 4.9 years ago Michael Love 42k

0

Entering edit mode

I tried that but unfortunately it does not change the look of the PCA. All my samples are still completely overlapping. Isn't there then a way to modify the design to approach it in a reasonable way? I have tried to take a look at the vignettes on "Model matrix not full rank”, but it seems that my case is indeed a perfect confounding problem, right?

ADD REPLY • link 4.9 years ago joseale2310 ▴ 10

1

Entering edit mode

The design matrix won't help here. I'm not sure if I have any further suggestions. I think the samples are so far from the stages because of batch and biology that it's not easy to say which is closest.

ADD REPLY • link 4.9 years ago Michael Love 42k

0

Entering edit mode

I have tried a couple of things more and it seems to be the right conclusion. Thank you very much! I will try with another dataset. Just one more question, was the approach I used with DESeq2 a correct one? I mean, should I try to do the same with another dataset?

ADD REPLY • link 4.9 years ago joseale2310 ▴ 10

1

Entering edit mode

I don’t have any other suggestions than what you tried here

ADD REPLY • link 4.9 years ago Michael Love 42k

0

Entering edit mode

Great, thank very much!

ADD REPLY • link 4.9 years ago joseale2310 ▴ 10

0

Entering edit mode

You could do nearest neighbors mapping of your samples onto the public data, under the assumption that the batch effect is orthogonal to the biological axes of variation. Just pick the top 500 or so genes with the largest variance/dispersion and use those to find nearest neighbors in transformed space.

ADD REPLY • link 4.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

I can do that but I do not quite get what implies that assumption. Thanks!

ADD REPLY • link 4.9 years ago joseale2310 ▴ 10

0

Entering edit mode

We'll pretend that you only have one sample, for the sake of this explanation.

Imagine a perfect world in which there is no batch effect between your data and the public data. Your sample is closest to, say, sample X in the public data, differing by some distance $d$. If you introduce a batch effect that is orthogonal to the vector difference between your samples, the distance between samples becomes $\sqrt{d^2 + b^2}$ where $b$ is the length of the batch vector - this is just good ol' Pythagoras.

Further assume that the batch vector is also orthogonal to the vector difference between your sample and some other public sample Y. If the distance between your sample and Y was $f$, then the new distance after introducing a batch effect is $\sqrt{f^2 + b^2}$. But if $d < f$, then $\sqrt{d^2 + b^2} < \sqrt{f^2 + b^2}$, so X is still a closer neighbor to your sample than Y in the presence of this batch effect.

Repeat the application of this logic to all public samples, and you can see that if the batch effect is orthogonal to any (presumably biological) variation between samples, the nearest neighbor identities will be unchanged. Of course, in practice, this depends on noise and such, but hopefully it should be enough to get a good indication of what's going on. One could also search for the $k$ nearest neighbors and look for a sharp increase in the distance, but YMMV as high-dimensional distances become very similar.

ADD REPLY • link 4.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Would it work if you calculate the adjustment coefficients using the controls only and then apply the adjustment to all samples? In that way controls from both databases will be centered and the large biological effect will not affect the adjustment.

ADD REPLY • link 4.9 years ago agustin.gonvi ▴ 20