Search
Question: RNASeq interaction between factors with edgeR when cpms tend to zero
1
10 months ago by
David Rengel70
European Union
David Rengel70 wrote:

Hi,

I am using edgeR to analyze RNASeq data and I wonder about how the interactions between two factors is dealt with, especially when the normalized cpm values tend to 0 on one of the factors.

I have the factors “time” (T0/T30) and “genotype” (A,B,C). In order to build my contrasts I created a dummy factor comprising all possible combinations among levels of those two factors.

Please find attached a table with 4 genes on which different contrasts (in columns) have been tested. The coding on the table tells us whether the given comparison is significant (i.e. “1” when BH<0.05) or not (“0”).

Please find also some plots where the normalized cpms for those genes are shown. The lines join the genotype-wise mean cpm values on each time point.

For those four genes, in my view, time does not play the same role on all genotypes, and yet the table tells me that edgeR does not find the interaction between factors significant. Is this because the cpms for those genes are close to 0 when T0? If this is the case, why is so? And if that is not the reason, what could it be? I guess Gene3 is worth mentioning cause cpms on genotype A are different from zero and yet no interaction is significant between time and genotype.

Thanks for your help on this.

Best,

David

modified 10 months ago • written 10 months ago by David Rengel70
2
10 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

You will run into this problem whenever you have a fitted value of zero (i.e., all counts are zero for that group). This is because the interaction terms are equivalent to differences (between genotypes) in the log-fold changes (between time points). As soon as zeroes get involved, the definition of a log-fold change becomes difficult.

Consider gene 3. This gene has a log-fold change of around 1 between your two time points in genotype A. It also has counts of zero (as far as I can tell) in both time points in genotype C. However, this does not mean that the log-fold changes are different between genotypes - it only means that you can't estimate the log-fold change in C at all. For all we know, the true log-fold change between time points in genotype C could be 1, it's just that the means are so low that we end up with zero counts most of the time. This means that edgeR (or any count-based test) cannot reliably state that the log-fold changes are significantly different between genotypes.

The same reasoning applies to the other genes. Plotting the CPMs on the raw scale is misleading as it suggests obvious differences between genotypes, whereas mathematically (for the log-fold changes), there are none. The exception is gene 4 where the log-fold changes in all genotypes are well-estimated. It is only in this case that your interaction terms are significantly non-zero.

For your other contrasts, the fact that the log-fold change is poorly defined is not a problem for detecting significant differences. This is because we only need to know if the means are different, we are not so concerned with the exact value of the log-fold change.

1
10 months ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

Yes, the reason is the zero counts. As Aaron has already explained, fold-changes are indeterminate when the counts are zero. This is a universal statistical issue, not particular to edgeR.

You can see the issue clearly from a Fisher exact test. Suppose the counts for a particular gene are as follows:

> y
T0   T30
gtypeA  0 10000
gtypeC  0     0


You might think that there's a huge time effect in genotype A but none in genotype C, so the interaction should be significant. But a Fisher test for interaction gives p=1:

> fisher.test(y)

Fisher's Exact Test for Count Data

data:  y
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0 Inf
sample estimates:
odds ratio
0


It actually doesn't matter what numbers you put in the first row of y, the p-value will always be equal to 1 if the counts in the second row are both zero. This shows why you can't possibly get a significant interaction between A and C for your genes 1-3.

The same issue arises when either column of the contingency table is all zero. You can enter any numbers you like to the T30 column of y. As long as the first column remains all zero, the p-value will still be 1. This shows why you can't possibly get a significant interaction between A and B for your genes 1-3.

On the other hand, you can meaningfully make a direct comparison between genotype A and genotype C at T30. That comparison might well be significant for all the genes you show. You can say "Gene 1" is not expressed at T0 for any genotype, but is significantly up-regulated in genotype A at T30".

0
10 months ago by
David Rengel70
European Union
David Rengel70 wrote:

Thanks Aaron and Gordon for your invaluable  answers. Things are so much clearer now, thanks a lot.

Best,

David