We have an RNA-seq experiment where multiple tumour regions per patient were sequenced (average 4/patient) and we are using DESeq2 to compare the transcriptomes of treatment responders vs non-responders:

- In one comparison, we compared 31 regions (from 9 patients) vs 21 regions (from 6 patients) and got 1675 differentially expressed genes (DEGs)
- In a second comparison, we compared the same 31 regions vs 7 regions from 2 patients that are a subset of the 21 regions above, and we got a dramatic drop in the number of DEGs with 'only' 259 DEGs

Could the fact that the second comparison is less balanced than the first one at least contribute to such a drop (especially knowing that the 7 regions are a subset of the 21)?

Hi Michael, many thanks for your quick reply. It could indeed be due to increased sensitivity with higher number of samples - I will check the lfc I get from both comparisons. However, in two other comparisons we saw: 1193 DEGs in 31 vs 14 and 246 DEGs in 54 vs 21 (the 31 and 14 of the first are included in the 54 and 21 of the second). Here, although the second comparison has more samples than the first one, it had significantly less DEGs and this is why I am suspecting imbalance to play a role in DEG numbers in our case. Could this be a combination of both sensitivity and imbalance? Thanks again for your help.

I think it has to do with sample size and within-group variability. You can get a sense of this with the PCA plots.

Thanks Michael. I did the PCA before and there didn't seem to be any striking within-group variability. I will keep investigating this though. On another note, I tried EdgeR and got 154 DEGs for the same comparison where I got 1193 DEGs with DESEq2. In both tools, I chose my DEGs based on a Benjamini-Hochberg FDR threshold of <5%. A good thing is that 152/154 EdgeR DEGs were also in the 1193 from DESEq2. However, this is almost a ten-fold difference in terms of numbers. Not sure if I'm interpreting this right but it seems DESEq2 (as run using default parameters) is more sensitive than EdgeR, and this also shows in the smaller p-values (and therefore FDRs) returned by DESEq2. Any thoughts on this?

It really depends on a lot of things including how you filter. I don't have any general comment, but make sure you are actually testing the same set of genes.

Thanks. Going back to PCA, I did it (starting from the raw counts) on a cohort where we get 1193 DEGs, another one where we get 196 DEGs and another one with both cohorts combined where we get 246 DEGs:

First cohort (the one with 1193 DEGs):

https://imgur.com/a/Iex05z4

Second cohort (the one with 196 DEGs):

https://imgur.com/a/ViIuqQm

Both cohorts combined (246 DEGs):

https://imgur.com/a/Q6gUjfj

I may be wrong, but it looks like there is more variability within responders of the second cohort as they cluster more or less into 3 different clusters in the PCA plot (the second one). Would you say the same and, if so, could this contribute to seeing less DEGs in the second cohort and when we combine it with the first one as compared to the 1193 we see in the first cohort only? Many thanks for your help.

I guess it's not so easy to understand what's going on here, the top two PCs don't translate easily into how many DE genes you find across condition.

One thing that I notice is that there seem to be overlapping samples from responders and non-responders (the majority of the points are in the middle, around the origin), and then some responders which are very different than the rest of the samples. This last group could be driving the differences in the test. They appear to drive PC1 in all of the three plots.

Thanks for your comment Michael. Indeed in both cohorts there are overlapping samples from resp/nresp and the PCA variation in general seems to be driven by responders. But as you said, PCA does not seem to be very helpful in understanding the differences in terms of number of DEGs in this case.

I was thinking to play with some Deseq2 parameters to try and get around this. What I mean is that, for some DEGs in the 1193 DEG list, Deseq2 assigns a very small p value and a big abs(log2FC), although when looking at the count/CPM distribution of such DEGs, the signal seems to be driven by outliers, or the gene is not expressed in one group and only a fraction of samples in the other group express it at low levels. What would be the best parameter in DEseq2 to act on in order to control this sensitivity? Would you advise pre-filtering for lowly expressed genes?

"signal seems to be driven by outliers..."The issue I think is that your samples are a bit more complex. What looks to be happening -- and across a good number of genes according to the PCA -- is that you have substantial heterogeneity within the responders. I don't think these are easily dismissed as outliers because it is showing up as PC1. The null hypothesis is that all of these samples have the same mean expression, so you are rejecting for many genes where that is not plausible given the counts.

You may want to explore the data a bit more and perhaps simple two-group DE is not the right tool for the job here as the story may be more complicated. Or if you want to down-play the role of the samples that are showing up in the PCA plot, you could use SAMseq to perform Wilcoxon-based testing with permutation for FDR correction. This will give higher weight to genes where the DE signal is spread across more of the responders, as opposed to being driven by a subset. The reason this works is because, when transforming the data to ranks, the influence of the high expression in the subset of responders is reduced.

In the past I've worked on expression datasets where disease sub-types were discovered by EDA of such a dataset followed by independent validation in another dataset.

Thanks very much for all your feedback/suggestions Michael. As you said, there is definitely a high level of heterogeneity in some/subgroup of samples and I have been investigating specifically this. Since visualisation of the RNA-seq data with PCA or MDS plots was not really conclusive, we suspected clinical factors may be at play here. So I introduced a set of clinical co-variates to account for in the Deseq2 analysis and here's what I found in one of the comparisons:

linesoftreatment + germlinecondition + treatmentreceived + tumourstage + Response; gave 2405 DEGs!The DEGs are chosen with FDR<5% and |log2FC|>1.5. I wanted to ask your opinion from experience on whether the inclusion of such co-variates is sound or if it should be rather avoided? This is especially knowing that their inclusion seems to play a quite important role in this case as there is more than a 20-fold increase in the number of DEGs for the same comparison.

Are you including all the covariates in the null model? E.g. you are testing only the response in the second approach?

Yes, in both comparisons I am testing differential expression with respect to the Response variable.

Then it appears that those nuisance variables are important to control for, that they help to reduce the unexplained variance and then provide more accurate response estimation.