Question

Loads, really loads of DEGs after edgeR differential analysis

0

Entering edit mode

David R ▴ 90

@david-rengel-6321

Last seen 22 days ago

European Union

Hi,

I am dealing with RNASeq data that render ludicrously hig number of DEGs in most of the contrasts I have tested. In the attached image file I reproduced the summary table of the analysis. The table shows percentages over the whole set, “NS” meaning non-significant (i.e. BH>0.05) and “UP” or “DOWN” being the percentages of up or downregulated genes on every comparison (in columns).

You’ll soon realize that for many cases the DEGs (i.e. 100-NS) are around 70 to 80 percent of the genes. Reassuringly enough, the UP and DOWN numbers are balanced, but I still wonder if the numbers of DEGS make the analysis dubious. I have had big numbers of DEGs in the past, but this seems too much. I should add that only 45 genes are not differentially expressed under any of the conditions.

Do you have any views on the subject you’d be willing to share? I’d be grateful if I could to get any feedback on this.

Best

David

PS:

For some reason the Imgur link was not acceped. Please find below the above-mentioned table.

	GP_G	GP_Gpl	Gpl_G	T2T_T1T	P3T_T1T	A3T_T1T	P4T_T1T	J4T_T1T	A4T_T1T	J5T_T1T	A5T_T1T	J4H_P4H	A4H_P4H	J5F_J5H	A5H_J5H	A5FS_J5H	A5F_J5H
DOWN	37	34	35	12	35	23	36	37	38	36	37	30	30	36	29	36	33
UP	40	34	36	13	27	22	36	37	38	36	37	36	37	37	24	37	34
NS	23	32	29	75	38	55	28	25	25	28	26	33	34	27	47	27	32

rnaseq edger normalization differential gene expression • 1.9k views

ADD COMMENT • link 7.0 years ago David R ▴ 90

0

Entering edit mode

Well, it's hard to tell, because you don't post any code showing what you've done, or the experimental set-up.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron, thanks for your quick reply. THe column correspond to the contrast I've looked at. They are simple comparisons between two conditions at a time: FOr instance GP vs G, GP vs Gpl etc. I attach here below the same table as above, but transposed, which looks better. For every condition there weree three replicates.

	DOWN	UP	NS
GP_G	37	40	23
GP_Gpl	34	34	32
Gpl_G	35	36	29
T2T_T1T	12	13	75
P3T_T1T	35	27	38
A3T_T1T	23	22	55
P4T_T1T	36	36	28
J4T_T1T	37	37	25
A4T_T1T	38	38	25
J5T_T1T	36	36	28
A5T_T1T	37	37	26
J4H_P4H	30	36	33
A4H_P4H	30	37	34
J5F_J5H	36	37	27
A5H_J5H	29	24	47
A5FS_J5H	36	37	27
A5F_J5H	33	34	32

ADD REPLY • link 7.0 years ago David R ▴ 90

0

Entering edit mode

Code? And do you really only have 100 genes in total? Are those actually percentages?

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Yes, they are percentages. Please find below the real numbers and some code further below. Just note that what I call design factor corresponds to the modalities you may see on the contrasts and that reps, which I have implemented in the model, correspond to the replicate effect. Thanks again for your help.

	DOWN	UP	NS
GP_G	8978	9692	5562
GP_Gpl	8222	8199	7811
Gpl_G	8454	8693	7085
T2T_T1T	2959	3167	18106
P3T_T1T	8568	6575	9089
A3T_T1T	5538	5441	13253
P4T_T1T	8673	8729	6830
J4T_T1T	9041	9057	6134
A4T_T1T	9135	9106	5991
J5T_T1T	8703	8727	6802
A5T_T1T	8906	9081	6245
J4H_P4H	7306	8822	8104
A4H_P4H	7213	8878	8141
J5F_J5H	8726	8975	6531
A5H_J5H	7065	5853	11314
A5FS_J5H	8798	8886	6548
A5F_J5H	8078	8291	7863

Code:

design=model.matrix(~0+design.factor+reps)

colnames(design)=sub("design.factor|reps","",colnames(design))
rownames(design)=colnames(counts)
dge <- estimateGLMCommonDisp(dge,design,verbose=TRUE)
# Disp = 0.00196 , BCV = 0.0443
dge <- estimateGLMTrendedDisp(dge,design)
dge <- estimateGLMTagwiseDisp(dge,design)
fit <- glmFit (dge,design)

my.contrasts=makeContrasts(
GP_G = GP-G,
GP_Gpl = GP-Gplus,
Gpl_G = Gplus-G,

T2T_T1T = T2T-T1T,
P3T_T1T = P3T-T1T,
A3T_T1T = A3T-T1T,
P4T_T1T = P4T-T1T,
J4T_T1T = J4T-T1T,
A4T_T1T = A4T-T1T,
J5T_T1T = J5T-T1T,
A5T_T1T = A5T-T1T,

J4H_P4H = J4H-P4H,
A4H_P4H = A4H-P4H,

J5F_J5H = J5F-J5H,
A5H_J5H = A5H-J5H,
A5FS_J5H = A5FS-J5H,
A5F_J5H = A5F-J5H,

levels=design)

lrt=as.list(c(1:ncol(my.contrasts)))
names(lrt)=colnames(my.contrasts)
for (i in 1:ncol(my.contrasts))
lrt[[i]] <- glmLRT(fit, contrast=my.contrasts[,i])

ADD REPLY • link 7.0 years ago David R ▴ 90

score 1 · Answer 1 · 2017-12-04

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 9 weeks ago

Icahn School of Medicine at Mount Sinai…

Based on the clarifications in your comments, a few things are clear. First, you are using the old dispersion estimation pipeline. I recommend that you switch to using the newer estimateDisp function, which estimates the common, trended, and tagwise dispersions all in one, and runs faster as well. This won't make much difference in the number of DE genes, but it will ensure that you benefit from the most up-to-date dispersion estimation code available in edgeR. Second, I recommend that you use the quasi-likelihood F test instead of the likelihood ratio test. The QL test is more conservative than the LRT, and in my experience it seems less likely to produce false positives. To make the switch, replace glmFit with glmQLFit and glmLRT with glmQLFTest. This should give you a somewhat smaller list of genes, although with such a low dispersion, it might not make much of a difference.

Lastly, your data has a very low dispersion value, which means that even a relatively small fold change can be considered statistically significant. Hence, it is not surprising that many genes are considered significantly differentially expressed. Given this situation, the best way to narrow down your gene list would be to choose a minimum logFC threshold that represents the smallest change you are interested in, and test against that threshold using glmTreat instead of glmLRT (or glmQLFTest). This will prioritize genes with log fold changes significantly larger than your chosen threshold, rather than just significantly different from zero. When choosing a threshold, keep in mind that fold change values are in log2 space, so for example a threshold of 1 tests for genes that are changing by more than 2x up or down (i.e. more than 2-fold or less than 0.5-fold).

ADD COMMENT • link 7.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

HI Ryan,

Thanks a lot for your helpful reply. I had already tried using the new dispersion function but it gives me the following error:

Error in .C("slocfit", x = as.numeric(x), y = as.numeric(rep(y, length.out = n)), : "slocfit" not available for .C() for package "locfit"

I guess this has to do with a recent update in R (R version is 3.4.2, edgeR 3.16.5 and locfit 1.5-9.1)

I have run glmQLFTest as you suggested and the number of DEGs is smaller than before, but still too high. You suggest using glmTreat to set up a threshold on the FC. I have always been reluctant to do so because I consider that there are gene categories that will hardly present strong FCs and hence filtering FCs would not be even among all GOs. Now, that said, I have never been confronted to such a low dispersion and high DEG numbers so I may be willing to do so. However, I would like to ask a couple of questions before I take the FC-thresholding approach;

1-Is the normalization affected by the low dispersion and the subsequent exceedingly high DEG numbers.

2-Would lowering down the BH threshold value be an alternative to using a FC threshold? What I mean by this is: is it not in theory a good thing the fact that because of low dispersion genes with low FCs are declared as DEGs, or are those genes meant to be more likely false positives?

Thanks for your help,

David

ADD REPLY • link 7.0 years ago David R ▴ 90

1

Entering edit mode

I know edgeR::estimateDisp definitely works with R 3.4.2. Perhaps the locfit package or some other dependency needs to be reinstalled after updating R?

To answer your question about normalization, you should read the paper on TMM, the default edgeR normlization, and have a look at the help page for calcNormFactors. It has ways of adjusting the degree of robustness depending on how many genes are changing. However, as long as the number of changing genes is are approximately balanced between up and down, the default normalization shouldn't have a problem.

If you don't want to prioritize larger fold changes, you always have the option of setting a more stringent false discovery rate threshold. No one is going to complain that your differential expression results are too good.

One last comment: you should consider whether a very small dispersion is plausible for your data set. A near-zero dispersion value means that there is almost no detectable biological variation between replicates in your data. Is this expected for the kind of data you are dealing with? If not, you should check whether you have mis-specified your design matrix. (Check the edgeR manual for examples of typical dispersion values that you can compare yours against.)

ADD REPLY • link 7.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

To add to Ryan's answer:

TMM is not affected by the low dispersion, but it does assume most genes are not DE. If this is violated, then your normalization factors are not strictly accurate. However, the degree of inaccuracy depends on the log-fold changes of the DE genes, and if many of these (i.e., the remaining log-fold changes after taking away the top 30% up/down DE genes) are small, then I don't think it is a major problem. Having balanced up/down DE also improves accuracy, as Ryan says.
You seem reluctant to use a fold change threshold. Why? If you don't have any a priori assumptions about the genes of interest, and you have strong DE genes with large log-fold changes, they are much stronger candidates for follow-up work than genes with weak log-fold changes. Significant genes with low log-fold changes are not (necessarily) more likely to be false positives, but they are less easy to validate; less likely to result in a cellular/molecular phenotype; and less appealing to study. In any case, DE genes won't be spread evenly among the GO categories, regardless of whether you use a log-fold change threshold or not. That's sort of the whole point of overrepresentation tests.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Ryan and Aaron,

Thank you both for your very helpful comments. I have run the analysis running glmTreat as suggested by Ryan, with logFC threshold =1. This dramatically decreases the number of DEGs, hence the firs point of Aaron's reply applies. I will have a look at a more robust parametrization of TMM though. As for Aaron’s point of FC changes, I agree with everything you say about validation etc. and my position on this will change with whatever the ultimate goal of the project is. The reason why I am wary of using FCs to filter genes out is that plant biologists that I work with are interested in many cases in very finely tuned plant responses, involving transcriptional regulators with low FCs and yet very relevant biological roles. But I agree: low FCs might be a pain.

Thanks again you both.

Best,

David

ADD REPLY • link 7.0 years ago David R ▴ 90