treatDGE vs glmLRT
1
1
Entering edit mode
jet2020 ▴ 10
@jet2020-7460
Last seen 9.1 years ago
United States

Hi - I'm an R noob and an bioinformatics noob so forgive any ignorance in my post or responses.  

Doing DGE analysis from RNAseq counts using EdgeR.  I have the data set imported and have filtered, normalized, estimated dispersions and am now working on calling DE genes

I was toying with the idea of using treatDGE to keep from having to filter out low logFC data after using glmLRT but treatDGE is giving me huge logFC values that are very different compared to those from glmLRT using the same inputs (see below). 

I'm fine just using glmLRT, but just for my own edification I would appreciate an explanation of the behavior of treatDGE here.  Thanks.  -Jen

> BvsA.lrt <- glmLRT(fit, contrast=BvsA)
> BvsA.de <- treatDGE(fit, contrast=BvsA, lfc=2)
> topTags(BvsA.lrt)
Coefficient:  -1*A 1*B 
              logFC     logCPM       LR       PValue         FDR
RGS16      4.692429  6.4079442 22.79720 1.800281e-06 0.007220489
ADAMTSL3   3.899486  2.0392587 22.59436 2.000715e-06 0.007220489
KIAA1199   5.443129  3.8755438 22.49064 2.111697e-06 0.007220489
PBX4     -12.244102  2.7096437 22.07098 2.627527e-06 0.007220489
FGD3     -12.731544  4.7401369 21.54637 3.453767e-06 0.007220489
SLC16A7    6.788948  3.4398053 21.42148 3.686179e-06 0.007220489
ITIH5      5.546266  1.2237345 20.96641 4.674069e-06 0.007220489
SELL      -4.213155 -0.1066297 20.80724 5.079059e-06 0.007220489
SLA      -11.422386  1.9546705 20.63175 5.566514e-06 0.007220489
TMEM191A -12.067484  2.1965851 20.60017 5.659115e-06 0.007220489
> topTags(BvsA.de)
Coefficient:  -1*A 1*B 
                logFC    logCPM       PValue        FDR
PBX4       -144269489 2.7096437 1.535377e-05 0.05387531
FGD3       -144269490 4.7401369 1.656099e-05 0.05387531
CD84       -144269492 7.2249668 2.300909e-05 0.05387531
TMEM191A   -144269489 2.1965851 2.895685e-05 0.05387531
PRAMEL     -144269493 6.2552073 3.180367e-05 0.05387531
SLA        -144269489 1.9546705 3.497900e-05 0.05387531
AC007163.4 -144269491 3.8948523 4.228308e-05 0.05387531
ALDH3B2    -144269492 6.3312907 4.252703e-05 0.05387531
IGLV6-57   -144269490 4.0316593 4.292817e-05 0.05387531
C6orf99    -144269489 0.6054585 4.791796e-05 0.05387531
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.8.6  limma_3.22.6

 

edger • 2.2k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

I can tell you why, but you will have to wait until Gordon or Aaron get in to work for the rationale.

In treatDGE, there is a line that goes like this:

 if (glmfit$prior.count != 0) {
        coefficients.mle <- glmfit$unshrunk.coefficients
    }
    else {
        coefficients.mle <- glmfit$coefficients
    }

And you almost surely have a prior.count > 0, so the coefficients you are using for treatDGE are the unshrunk coefficients. In addition, if you look at the counts for say PBX4, most or all of the counts for your B group are very close to zero, so the (unshrunken) fold change is massive. Hence the huge fold changes.

As an example, if you run example(treatDGE), and then add in some extra counts where one group is large, and the other is zero, here is what you get:

> counts <- rbind(counts, c(420, 450, 370, 0,0,0))
> d <- DGEList(counts=counts,lib.size=rep(1e6, nlibs), group=group)
> tr <- treatDGE(gfit, coef=2, lfc=1)
> tr2 <- glmLRT(gfit, coef=2)
> topTags(tr)
Coefficient:  as.factor(group)2
            logFC   logCPM       PValue          FDR
101 -1.442695e+08 7.705051 1.048435e-37 1.058920e-35
1   -2.445750e+00 7.634200 1.302969e-04 6.579994e-03
5   -2.240345e+00 7.915874 7.426105e-04 2.357278e-02
3   -2.207999e+00 8.120664 9.335755e-04 2.357278e-02
2   -2.074439e+00 7.978228 2.801031e-03 5.362816e-02
4   -2.064800e+00 7.673598 3.185831e-03 5.362816e-02
20  -1.032605e+00 6.505144 4.671534e-01 9.996438e-01
92   8.681553e-01 6.779166 6.320188e-01 9.996438e-01
75  -8.353693e-01 6.594941 6.622075e-01 9.996438e-01
38   8.073549e-01 6.369228 6.863417e-01 9.996438e-01
> topTags(tr2)
Coefficient:  as.factor(group)2
          logFC   logCPM         LR       PValue          FDR
101 -11.6915981 7.705051 202.259971 6.709174e-46 6.776266e-44
1    -2.4433391 7.634200  36.197754 1.782759e-09 9.002933e-08
5    -2.2386470 7.915874  31.299969 2.210832e-08 7.443135e-07
3    -2.2065628 8.120664  30.698951 3.013279e-08 7.608529e-07
2    -2.0730075 7.978228  27.269348 1.769951e-07 3.575301e-06
4    -2.0630427 7.673598  26.731858 2.337360e-07 3.934555e-06
20   -1.0310269 6.505144   6.721532 9.525579e-03 1.374405e-01
92    0.8670885 6.779166   4.879078 2.718414e-02 3.431997e-01
75   -0.8342052 6.594941   4.469030 3.451457e-02 3.873302e-01
38    0.8060397 6.369228   4.108755 4.266176e-02 4.308838e-01

And you can see that the fold changes are comparable, except for the fake gene we added.

ADD COMMENT
1
Entering edit mode

Also please note that in the case of a gene with large counts in one condition, and very few or zero counts in another, the fold change isn't that useful. In other words, for the fake gene I introduced in the example, a log fold change of 11.7 (or ~3300 fold change on the linear scale) isn't materially different from the infinite fold change that treatDGE() returns. At some point, fold changes get so excessive that all you can say is 'WOW, that is really different', and leave it at that.

ADD REPLY
1
Entering edit mode

The use of unshrunk log-fold changes is intentional. This is because treatDGE uses the size of the log-fold change, to determine whether it is below the specified lfc threshold. This will affect the calculation of the p-value. In contrast, glmLRT does not use the log-fold change for any calculations, so it does a bit of shrinkage to make them look prettier.

P.S. treatDGE will be renamed to glmTreat in the next release.

ADD REPLY
0
Entering edit mode

Thanks for that.  I figured I was essentially dividing by zero somewhere with treatDGE.  

The more I read and explore the more it seems (and please correct me if my supposition is wrong here) for RNAseq data the fold changes are moderately soft anyway and its generally better to establish DE genes more as a true/false based on logical (for the biological system) p-value and FDR cutoffs rather than being married to a particular fold change threshold.

ADD REPLY

Login before adding your answer.

Traffic: 706 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6