Loads, really loads of DEGs after edgeR differential analysis
David R ▴ 90
@david-rengel-6321
Last seen 4 months ago
European Union

Hi,

I am dealing with RNA-seq data that yield a ludicrously high number of DEGs in most of the contrasts I have tested. In the attached image file I reproduced the summary table of the analysis. The table shows percentages over the whole gene set, “NS” meaning non-significant (i.e. BH-adjusted p > 0.05) and “UP” or “DOWN” being the percentages of up- or downregulated genes in every comparison (in columns).

You’ll soon realize that in many cases the DEGs (i.e. 100 − NS) amount to around 70 to 80 percent of the genes. Reassuringly, the UP and DOWN numbers are balanced, but I still wonder whether these numbers of DEGs make the analysis dubious. I have had large numbers of DEGs in the past, but this seems too much. I should add that only 45 genes are not differentially expressed under any of the conditions.

Do you have any views on the subject you’d be willing to share? I’d be grateful for any feedback on this.

Best

David

PS: For some reason the Imgur link was not accepted. Please find the above-mentioned table below.

      GP_G GP_Gpl Gpl_G T2T_T1T P3T_T1T A3T_T1T P4T_T1T J4T_T1T A4T_T1T J5T_T1T A5T_T1T J4H_P4H A4H_P4H J5F_J5H A5H_J5H A5FS_J5H A5F_J5H
DOWN    37     34    35      12      35      23      36      37      38      36      37      30      30      36      29       36      33
UP      40     34    36      13      27      22      36      37      38      36      37      36      37      37      24       37      34
NS      23     32    29      75      38      55      28      25      25      28      26      33      34      27      47       27      32

Tags: rnaseq, edger, normalization, differential gene expression
Well, it's hard to tell, because you haven't posted any code showing what you've done, or described the experimental set-up.

Hi Aaron, thanks for your quick reply. The columns correspond to the contrasts I've looked at. They are simple comparisons between two conditions at a time: for instance GP vs G, GP vs Gpl, etc. Below is the same table as above, but transposed, which looks better. For every condition there were three replicates.

  DOWN UP NS
GP_G 37 40 23
GP_Gpl 34 34 32
Gpl_G 35 36 29
T2T_T1T 12 13 75
P3T_T1T 35 27 38
A3T_T1T 23 22 55
P4T_T1T 36 36 28
J4T_T1T 37 37 25
A4T_T1T 38 38 25
J5T_T1T 36 36 28
A5T_T1T 37 37 26
J4H_P4H 30 36 33
A4H_P4H 30 37 34
J5F_J5H 36 37 27
A5H_J5H 29 24 47
A5FS_J5H 36 37 27
A5F_J5H 33 34 32
Code? And do you really only have 100 genes in total? Are those actually percentages?

Yes, they are percentages. Please find the real numbers below, and some code further down. Just note that what I call the design factor corresponds to the modalities you see in the contrasts, and that reps, which I have included in the model, corresponds to the replicate effect. Thanks again for your help.

  DOWN UP NS
GP_G 8978 9692 5562
GP_Gpl 8222 8199 7811
Gpl_G 8454 8693 7085
T2T_T1T 2959 3167 18106
P3T_T1T 8568 6575 9089
A3T_T1T 5538 5441 13253
P4T_T1T 8673 8729 6830
J4T_T1T 9041 9057 6134
A4T_T1T 9135 9106 5991
J5T_T1T 8703 8727 6802
A5T_T1T 8906 9081 6245
J4H_P4H 7306 8822 8104
A4H_P4H 7213 8878 8141
J5F_J5H 8726 8975 6531
A5H_J5H 7065 5853 11314
A5FS_J5H 8798 8886 6548
A5F_J5H 8078 8291 7863

Code:

library(edgeR)

## Design matrix: one coefficient per condition, plus a blocking factor
## for the replicate effect
design <- model.matrix(~0 + design.factor + reps)
colnames(design) <- sub("design.factor|reps", "", colnames(design))
rownames(design) <- colnames(counts)

## Dispersion estimation (old three-step pipeline)
dge <- estimateGLMCommonDisp(dge, design, verbose = TRUE)
# Disp = 0.00196 , BCV = 0.0443
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(dge, design)

my.contrasts <- makeContrasts(
  GP_G = GP - G,
  GP_Gpl = GP - Gplus,
  Gpl_G = Gplus - G,

  T2T_T1T = T2T - T1T,
  P3T_T1T = P3T - T1T,
  A3T_T1T = A3T - T1T,
  P4T_T1T = P4T - T1T,
  J4T_T1T = J4T - T1T,
  A4T_T1T = A4T - T1T,
  J5T_T1T = J5T - T1T,
  A5T_T1T = A5T - T1T,

  J4H_P4H = J4H - P4H,
  A4H_P4H = A4H - P4H,

  J5F_J5H = J5F - J5H,
  A5H_J5H = A5H - J5H,
  A5FS_J5H = A5FS - J5H,
  A5F_J5H = A5F - J5H,

  levels = design)

## One likelihood ratio test per contrast
lrt <- vector("list", ncol(my.contrasts))
names(lrt) <- colnames(my.contrasts)
for (i in 1:ncol(my.contrasts))
  lrt[[i]] <- glmLRT(fit, contrast = my.contrasts[, i])

Ryan C. Thompson
@ryan-c-thompson-5618
Last seen 8 months ago
Scripps Research, La Jolla, CA

Based on the clarifications in your comments, a few things are clear. First, you are using the old dispersion estimation pipeline. I recommend that you switch to using the newer estimateDisp function, which estimates the common, trended, and tagwise dispersions all in one, and runs faster as well. This won't make much difference in the number of DE genes, but it will ensure that you benefit from the most up-to-date dispersion estimation code available in edgeR. Second, I recommend that you use the quasi-likelihood F test instead of the likelihood ratio test. The QL test is more conservative than the LRT, and in my experience it seems less likely to produce false positives. To make the switch, replace glmFit with glmQLFit and glmLRT with glmQLFTest. This should give you a somewhat smaller list of genes, although with such a low dispersion, it might not make much of a difference.
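For concreteness, here is a minimal sketch of the updated pipeline, reusing the dge, design and my.contrasts objects from your code above (assumed unchanged):

dge <- estimateDisp(dge, design)  # common, trended and tagwise dispersions in one call
fit <- glmQLFit(dge, design)      # quasi-likelihood dispersion fit
qlf <- lapply(colnames(my.contrasts),
              function(cn) glmQLFTest(fit, contrast = my.contrasts[, cn]))
names(qlf) <- colnames(my.contrasts)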

Lastly, your data has a very low dispersion value, which means that even a relatively small fold change can be considered statistically significant. Hence, it is not surprising that many genes are considered significantly differentially expressed. Given this situation, the best way to narrow down your gene list would be to choose a minimum logFC threshold that represents the smallest change you are interested in, and test against that threshold using glmTreat instead of glmLRT (or glmQLFTest). This will prioritize genes with log fold changes significantly larger than your chosen threshold, rather than just significantly different from zero. When choosing a threshold, keep in mind that fold change values are in log2 space, so for example a threshold of 1 tests for genes that are changing by more than 2x up or down (i.e. more than 2-fold or less than 0.5-fold).
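As a sketch, assuming the quasi-likelihood fit from above and picking one of your contrasts purely for illustration:

tr <- glmTreat(fit, contrast = my.contrasts[, "GP_G"], lfc = 1)  # H0: |logFC| <= 1
topTags(tr)  # genes changing by significantly more than 2-fold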

Hi Ryan,

Thanks a lot for your helpful reply. I had already tried using the new dispersion function but it gives me the following error:

Error in .C("slocfit", x = as.numeric(x), y = as.numeric(rep(y, length.out = n)),  : "slocfit" not available for .C() for package "locfit"

I guess this has to do with a recent update of R (R version 3.4.2, edgeR 3.16.5 and locfit 1.5-9.1).

I have run glmQLFTest as you suggested and the number of DEGs is smaller than before, but still too high. You suggest using glmTreat to set a threshold on the FC. I have always been reluctant to do so because I consider that there are gene categories that will hardly ever show strong FCs, and hence filtering on FC would not be even across all GO categories. That said, I have never been confronted with such low dispersion and such high DEG numbers, so I may be willing to try it. However, I would like to ask a couple of questions before I take the FC-thresholding approach:

1. Is the normalization affected by the low dispersion and the resulting exceedingly high DEG numbers?

2. Would lowering the BH threshold be an alternative to using an FC threshold? What I mean is: isn't it in theory a good thing that, thanks to the low dispersion, genes with low FCs are declared DEGs, or are those genes more likely to be false positives?

Thanks for your help,

David

I know edgeR::estimateDisp definitely works with R 3.4.2. Perhaps the locfit package or some other dependency needs to be reinstalled after updating R?
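For example (biocLite was the Bioconductor installer for that generation of R; this is just the standard reinstall procedure, not a guaranteed fix):

install.packages("locfit")
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")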

To answer your question about normalization, you should read the paper on TMM, the default edgeR normalization, and have a look at the help page for calcNormFactors. It has ways of adjusting the degree of robustness depending on how many genes are changing. However, as long as the changing genes are approximately balanced between up and down, the default normalization shouldn't have a problem.
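As an illustration, the TMM trim fractions can be increased to make the normalization more robust when many genes are DE; the values below are arbitrary examples, not recommendations:

dge <- calcNormFactors(dge, method = "TMM",
                       logratioTrim = 0.4,  # default 0.3; trims more of the extreme log-ratios
                       sumTrim = 0.1)       # default 0.05; trims more extreme-abundance genes
dge$samples$norm.factors  # inspect the resulting factors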

If you don't want to prioritize larger fold changes, you always have the option of setting a more stringent false discovery rate threshold. No one is going to complain that your differential expression results are too good.
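For example, assuming qlf holds the per-contrast quasi-likelihood test results from the sketch above:

summary(decideTestsDGE(qlf[["GP_G"]], adjust.method = "BH", p.value = 0.01))  # FDR < 1% instead of 5%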

One last comment: you should consider whether a very small dispersion is plausible for your data set. A near-zero dispersion value means that there is almost no detectable biological variation between replicates in your data. Is this expected for the kind of data you are dealing with? If not, you should check whether you have mis-specified your design matrix. (Check the edgeR manual for examples of typical dispersion values that you can compare yours against.)
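A quick visual check, assuming dge has been through dispersion estimation:

plotBCV(dge)  # the edgeR user's guide quotes typical BCVs of ~0.4 for human data and ~0.1 for genetically identical model organisms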

To add to Ryan's answer:

  1. TMM is not affected by the low dispersion, but it does assume most genes are not DE. If this is violated, then your normalization factors are not strictly accurate. However, the degree of inaccuracy depends on the log-fold changes of the DE genes, and if many of these (i.e., the remaining log-fold changes after taking away the top 30% up/down DE genes) are small, then I don't think it is a major problem. Having balanced up/down DE also improves accuracy, as Ryan says.
  2. You seem reluctant to use a fold change threshold. Why? If you don't have any a priori assumptions about the genes of interest, and you have strong DE genes with large log-fold changes, they are much stronger candidates for follow-up work than genes with weak log-fold changes. Significant genes with low log-fold changes are not (necessarily) more likely to be false positives, but they are less easy to validate; less likely to result in a cellular/molecular phenotype; and less appealing to study. In any case, DE genes won't be spread evenly among the GO categories, regardless of whether you use a log-fold change threshold or not. That's sort of the whole point of overrepresentation tests.
Hi Ryan and Aaron,

Thank you both for your very helpful comments. I have rerun the analysis using glmTreat as Ryan suggested, with a logFC threshold of 1. This dramatically decreases the number of DEGs, so the first point of Aaron's reply applies. I will have a look at a more robust parametrization of TMM, though. As for Aaron’s point on fold changes, I agree with everything you say about validation etc., and my position on this will depend on the ultimate goal of the project. The reason I am wary of using FCs to filter genes out is that the plant biologists I work with are often interested in very finely tuned plant responses, involving transcriptional regulators with low FCs and yet very relevant biological roles. But I agree: low FCs might be a pain.

Thanks again to you both.

Best,

David
