Differences when using voom on a design with or without intercept
@francesco-gatto-6440
Last seen 6.6 years ago
Sweden

I see non-negligible differences in the number of differentially expressed genes depending on whether the linear model is fitted with a design that includes an intercept or not, even for a factor whose coefficient should not depend on the parameterization.

Specifically, I have a design matrix with 13 cancer types and the presence of mutation A. Each sample belongs to exactly one cancer type, but it may or may not carry mutation A. The design matrix without intercept is:

> design.noI <- model.matrix(~0+ct+mutA)

where ct is a nSamples x 13 binary matrix and mutA is a nSamples x 1 binary vector. The version with the intercept is trivially:

> design.wI <- model.matrix(~ct+mutA)

where one of the columns of ct is dropped and replaced by the intercept vector. The two design matrices span the same column space, so the estimated coefficients in one parameterization can be derived from the other with some simple algebra. In particular, the coefficient for the factor mutA should be identical in both scenarios, and so should its statistical significance.
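A quick base-R sanity check of this equivalence (toy data with 3 cancer types instead of 13, and an ordinary lm() fit rather than limma): the fitted mutA coefficient is identical under both parameterizations.

```r
set.seed(1)
n    <- 60
ct   <- factor(sample(paste0("type", 1:3), n, replace = TRUE))  # toy cancer types
mutA <- rbinom(n, 1, 0.4)                                       # mutation indicator
y    <- rnorm(n) + 0.5 * mutA                                   # simulated log-expression

fit.noI <- lm(y ~ 0 + ct + mutA)   # cell-means parameterization (no intercept)
fit.wI  <- lm(y ~ ct + mutA)       # intercept parameterization

# The mutA coefficient is the same either way, because the two
# design matrices span the same column space.
all.equal(unname(coef(fit.noI)["mutA"]), unname(coef(fit.wI)["mutA"]))  # TRUE
```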

When I run the voom pipeline, however, the results are slightly different:

> v   <- voom(y,design.noI,plot=T)
> fit <- lmFit(v,design.noI)
> eB.noI  <- eBayes(fit)

> v   <- voom(y,design.wI,plot=T)
> fit <- lmFit(v,design.wI)
> eB.wI  <- eBayes(fit)

As predicted, the weights returned by voom and the coefficients for mutA are identical in both cases (up to negligible rounding error). However, when decideTests() is run on the two fits, the number of DE genes for the factor mutA differs:

> adjPFactor.noI <- decideTests(eB.noI, method="global", adjust.method="BH", p.value=minP, lfc=minFC)

   -1     0     1
   47 18861    48

> adjPFactor.wI <- decideTests(eB.wI, method="global", adjust.method="BH", p.value=minP, lfc=minFC)

   -1     0     1
   59 18841    56

Where does this slight difference arise from?

Thanks!

limma voom linear model
@james-w-macdonald-5106
Last seen 18 hours ago
United States

I think it is because when you have an intercept, all but one of the coefficients test something of potential interest, whereas when you don't have an intercept, the only useful coefficient is mutA.

In other words, in the model with an intercept, the intercept estimates the mean for one (baseline) cancer type, and all the other 'cancer type' coefficients are differences between each cancer type and that baseline. These differences will range from small to large, depending on how similar the two cancer types are, so when you run decideTests(), everything except the intercept term is a reasonable thing to be testing.

When you have no intercept, each coefficient estimates the mean expression of a given gene in a given cancer type. These values all tend to be large, and the p-values test whether each mean is different from zero. So you are now making 13 × nGenes tests of whether those genes are expressed at all, and since mean expression is almost never exactly zero, you generate a whole bunch of really small p-values.
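A base-R sketch of this point (simulated data, not the real experiment): a coefficient that estimates a group mean is tested against zero and is essentially always "significant" for an expressed gene, while a coefficient that estimates a between-group difference is not.

```r
set.seed(2)
grp <- factor(rep(c("A", "B"), each = 20))
y   <- rnorm(40, mean = 8)   # an expressed gene; no true difference between groups

# Cell-means parameterization: both coefficients are group means,
# tested against zero -> astronomically small p-values.
p.means <- summary(lm(y ~ 0 + grp))$coefficients[, "Pr(>|t|)"]

# Intercept parameterization: the grpB coefficient is the B - A
# difference; with no true effect, its p-value is not systematically small.
p.diff <- summary(lm(y ~ grp))$coefficients["grpB", "Pr(>|t|)"]

p.means  # both effectively zero
p.diff   # no true effect, so uniform under the null
```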

When you do decideTests() with method = "global", you just pile all of those p-values into one long vector and then do the BH adjustment, which (from stats::p.adjust, where lp and n are both length(p)) is

i <- lp:1L
o <- order(p, decreasing = TRUE)
ro <- order(o)
pmin(1, cummin(n/i * p[o]))[ro]

And the n/i * p[o] term is in essence a penalty for the position of each p-value in the ranked set of p-values. Since you are piling a bunch of nonsensical, very small p-values on top of the ones you care about, you are needlessly penalizing them.
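A base-R illustration of the mechanism (the p-values and pool sizes here are made up): the BH-adjusted value of any given p-value depends on the entire vector it is adjusted alongside, and pooling can push the adjusted values up or down depending on where the extra p-values rank.

```r
p.mutA     <- c(0.001, 0.004, 0.02)       # hypothetical p-values of interest
pool.big   <- c(p.mutA, rep(0.5, 1000))   # pooled with 1000 unremarkable p-values
pool.tiny  <- c(p.mutA, rep(1e-12, 1000)) # pooled with 1000 tiny p-values

p.adjust(p.mutA, method = "BH")           # 0.003 0.006 0.020
p.adjust(pool.big, method = "BH")[1:3]    # all dragged up to 0.5
p.adjust(pool.tiny, method = "BH")[1:3]   # close to the raw p-values
```

Either way, the adjusted values for the mutA coefficient change with the set of coefficients pooled into the global adjustment, which is exactly why the two parameterizations give different DE counts despite identical raw p-values for mutA.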

Plus, without an intercept and without fitting contrasts, you are making a bunch of tests that are essentially meaningless.


I see, the problem lies purely in the adjustment of the p-values, which is sensitive to the set of coefficients included in the design. Thanks!