Question

RNAseq analysis: how to compare two very different samples?

0

Entering edit mode

hwu12 ▴ 10

@hwu12-12353

Last seen 15 months ago

United States

Hi everyone,

I heard that edgeR and DESeq2 assume that most genes in the samples are equally expressed, and only a small fraction of genes are differentially expressed. I was wondering how to compare two very different RNA samples. For example, one from muscle and the other from the liver. I know some people just use a more stringent criterion (e.g. 4-fold difference and FDR adjusted p-value <0.001) to reduce the number of significant genes. I have 2 questions related to this issue:

Q1. How different of the samples should we start worrying about this issue? For example, when we found >10 percent of genes are differentially expressed using edgeR or DESeq2 ?

Q2. Is there a more statistically sound way to fix this problem?

Thanks a lot!

p.s. I also saw this concept in a paper:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91
"Still, it is important to keep in mind that even these methods are based on an assumption that most genes are equivalently expressed in the samples, and that the differentially expressed genes are divided more or less equally between up- and downregulation"

edgeR deseq2 RNAseq • 1.3k views

ADD COMMENT • link updated 6.3 years ago by Aaron Lun ★ 28k • written 6.3 years ago by hwu12 ▴ 10

score 2 · Accepted Answer · 2018-01-16

The issue you describe is not inherent to the statistical modelling in edgeR (or DESeq2) itself, but rather, relates to how the normalization is performed. In the absence of other information, we need to assume that most genes are not DE between samples in order to estimate the sample-specific bias due to sequencing depth, composition effects, etc. This would not be possible unless you supplied some other prior knowledge, e.g., spike-in transcripts added at a constant amount, possibly something about cDNA concentrations.

The critical point for failure of the normalization procedure is pretty high - both edgeR and DESeq2 use robust estimators, so we're talking about 50-60% of genes needing to be DE before the normalization factor estimates completely fall apart. In your case, 10% DE is not a concern. Of course, every gene is probably truly DE to some extent after perturbation, given the nature of biological networks. But as long as the log-fold changes of most genes are close to zero, any error in the normalization factors will be tolerable.

With respect to the comment about balanced DE; it is true that the accuracy of the estimates will deteriorate somewhat with highly unbalanced DE, due to the nature of quantile-based trimming. However, it is quite rare to find, e.g., 30% of genes upregulated and 0% of genes downregulated. Moreover, it's only a problem when the absolute number of DE genes is high, and the variance of the log-fold changes among non-DE genes is also high. Experience suggests that such cases are fairly extreme.

I have, in fact, previously attempted to use estimators based on the mode to overcome the issues with unbalanced DE in other contexts - namely, comparison between ChIP-seq libraries and input controls where binding should only go up in the latter. I can't remember exactly why it wasn't satisfactory - I believe it was too sensitive to the shape of the M-value distribution. Indeed, if you had many genes being upregulated at the same log-fold change, they could form the highest-density mode and be (incorrectly) used for normalization!