Hello,
I am trying to use contrasts with a cell means model in edgeR to compute log fold changes. I am a little confused by the results I get when I modify the weights of the contrasts. For instance, I modified the following piece of code from one of the edgeR examples.
#------------ Defining the data and design
> ngenes <- 100
> n1 <- 3
> n2 <- 3
> n3 <- 3
> nlibs <- n1+n2+n3
> mu <- 100
> phi <- 0.1
> group <- c(rep(1,n1), rep(2,n2), rep(3,n3))
> design <- model.matrix(~0+as.factor(group))
> ### 4-fold change for the first 5 genes
> i <- 1:5
> fc <- 4
> mu <- matrix(mu, ngenes, nlibs)
> mu[i, 1:n1] <- mu[i, 1:n1]*fc
> counts <- matrix(rnbinom(ngenes*nlibs, mu=mu, size=1/phi), ngenes, nlibs)
> d <- DGEList(counts=counts, lib.size=rep(1e6, nlibs), group=group)
> gfit <- glmFit(d, design, dispersion=phi)
#-------------------------------------- Contrasts --------------
> tr <- glmTreat(gfit, contrast=c(1,1,-1), lfc=1)
> tr2 <- glmTreat(gfit, contrast=c(.5,.5,-1), lfc=1)
> topTags(tr)
Coefficient: 1*as.factor(group)1 1*as.factor(group)2 -1*as.factor(group)3
logFC unshrunk.logFC logCPM PValue FDR
7 -14.49698 -14.50068 6.428122 1.029359e-92 5.793101e-91
32 -14.47646 -14.47995 6.482914 1.158620e-92 5.793101e-91
74 -14.26280 -14.26602 6.621232 6.749246e-91 2.249749e-89
72 -14.11476 -14.11748 6.832884 6.579315e-90 1.537966e-88
79 -14.12208 -14.12489 6.711183 7.689829e-90 1.537966e-88
86 -14.07918 -14.08192 6.735456 1.716826e-89 2.861376e-88
11 -14.09690 -14.09979 6.547746 2.035350e-89 2.907644e-88
13 -13.87883 -13.88147 6.573221 1.673322e-87 1.921928e-86
76 -13.84859 -13.85116 6.785719 1.858638e-87 1.921928e-86
50 -13.91007 -13.91282 6.333895 1.921928e-87 1.921928e-86
> topTags(tr2)
Coefficient: 0.5*as.factor(group)1 0.5*as.factor(group)2 -1*as.factor(group)3
logFC unshrunk.logFC logCPM PValue FDR
3 1.2569501 1.2577449 7.994976 0.08410010 0.999649
5 1.2546815 1.2557728 7.541736 0.08618485 0.999649
4 1.1923287 1.1931614 7.749677 0.11551910 0.999649
1 1.1707078 1.1716912 7.535702 0.12723752 0.999649
2 0.9689215 0.9695723 7.570693 0.27704382 0.999649
60 0.7809965 0.7821674 6.599907 0.46447814 0.999649
47 0.7787079 0.7797861 6.698892 0.46783455 0.999649
32 -0.7165637 -0.7175457 6.482914 0.53317722 0.999649
72 -0.7129839 -0.7137471 6.832884 0.54020766 0.999649
68 0.6988273 0.6998642 6.588292 0.55360906 0.999649
> counts[72,]
[1] 133 60 94 104 76 87 134 165 155
> gfit$coefficients[72,]
as.factor(group)1 as.factor(group)2 as.factor(group)3
-9.253335 -9.325471 -8.795200
What confuses me here is that I expected the log fold change for the first contrast to be less negative than for the second, since the weights on the first two groups are reduced in the second contrast. However, since the coefficients are negative, it appears that the results are the opposite. In particular, for the first contrast I expected the log fold change to be positive, or at least closer to 0, given the counts, say for row 72 (realizing that the contrast does not use the counts directly and incorporates more information such as library sizes and dispersions); intuitively, I assumed (group 1 + group 2) would be more upregulated than group 3.
Am I misinterpreting the outputs? How should I interpret these results?
Thank you!
Thank you very much for your prompt response, Gordon Smyth! I suspected I was making a mistake with that. If you don't mind me asking - what value am I actually getting when I use the wrong contrast?
edgeR works internally with glm coefficients on the log-nucleotide-fraction scale. You have mu=100 and lib.size=1e6, so the glm coefficients are approximately equal to log(100 / 1e6) = -9.2.
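For example, re-using the gfit object from the code above (the exact values depend on the simulated counts):

> log(100 / 1e6)
[1] -9.21034
> gfit$coefficients[72, ]   # the values printed earlier for gene 72 are all roughly -9.2, up to sampling noise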
The contrast c(a,b,c) simply gives a logFC equal to a*coef1 + b*coef2 + c*coef3, but converted from natural log to log2. If the gene is not DE, then the three coefficients will be roughly equal. If a=1, b=1 and c=-1, then for non-DE genes you roughly get coef1, which on the log2 scale is log2(100 / 1e6) = -13.3.
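You can check this arithmetic directly by applying each contrast to the fitted coefficients; for gene 72, using the coefficients printed above (the values come from this particular simulation, and glmTreat adds only a very small adjustment of its own):

> co <- gfit$coefficients[72, ]
> sum(c(1, 1, -1) * co) / log(2)      # roughly -14.11, the logFC reported for gene 72 in topTags(tr)
> sum(c(0.5, 0.5, -1) * co) / log(2)  # roughly -0.71, the logFC reported for gene 72 in topTags(tr2)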
If the contrast vector adds to zero, then the internal scale cancels out. You have mu equal to 400, 100 and 100 for the DE genes. If you set contrast c(1,-1,0) or c(1,0,-1) or c(1,-0.5,-0.5), then you will get logFC equal to log2(400) - log2(100) = log2(400/100) = 2. If you set contrast c(0.5,0.5,-1), then you will get logFC equal to 0.5*log2(400) + 0.5*log2(100) - log2(100) = 1.
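For instance, re-running the test with a contrast that sums to zero (calling the result tr3 here just for illustration) should put the five simulated DE genes at the top with logFC close to 2:

> tr3 <- glmTreat(gfit, contrast=c(1, -0.5, -0.5), lfc=1)
> topTags(tr3)   # the first 5 genes should appear with logFC near log2(4) = 2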
Ah, that makes sense - so basically, because my contrast did not sum to zero, the large internal scale offset did not cancel out and I was essentially just getting the log2 coefficient for group 1 (say), which is negative due to the normalization by the library size.
Thanks very much for your detailed answer! It is very helpful to know how the computations work.