Limma - discrepancy in number of significant genes
2
1
Entering edit mode
bcbio_uk ▴ 10
@bcbio_uk-6970
Last seen 5.0 years ago
United Kingdom

Dear all,

I have a question about differential gene expression calculation using limma. I have two disease groups - D1 (n=6) and D2 (n=7) that I have first compared to each other and obtained about 30 genes that have adj.P-value < 0.05. Please see the code that I've used below.

> design <- model.matrix(~0 + Treat)
> fit <- lmFit(eset,design)
> cm <- makeContrasts(coef1 = D1 - D2, levels=design)
> fit2 <- contrasts.fit(fit, cm)
> fit2 <- eBayes(fit2)

After I had calculated that, I obtained some healthy samples - H (n=23) to compare each of the disease types to. However, now when I calculate D1 - D2, I get a lot more significant genes (~200) with adj.P-value < 0.05 as opposed to the 30 genes I was getting before with the same comparison. The sample numbers or anything else for my D1 and D2 groups haven't changed at all, I've only added the H samples. Please see code below:

> design <- model.matrix(~0 + Treat)
> fit <- lmFit(eset,design)
> cm <- makeContrasts(coef1 = D1 - D2, coef2 = D1 - H, coef3= D2 - H, levels=design)
> fit2 <- contrasts.fit(fit, cm)
> fit2 <- eBayes(fit2)

I was wondering if anybody knew the reason for this discrepancy in the numbers of significant genes (with adj.P-value < 0.05) for 'coef1' from both of these codes.

Thank you,

Akul

0
Entering edit mode

Just for completeness, I assume you renamed the columns of design according to the levels of Treat in your code.

0
Entering edit mode

I did yes, I just didn't paste that part of the code as I wanted to keep it simple. Thank you for checking.

-Akul

6
Entering edit mode
@james-w-macdonald-5106
Last seen 20 hours ago
United States

The answer by svlachavas is not correct in several ways.

First, the reason you get more significant genes when you include the healthy samples is because you have increased your degrees of freedom substantially by adding in 23 more samples. In other words, when you compute a contrast, the numerator is the difference in means between the two groups, and the denominator is a measure of the intra-group variability. The measure of intra-group variability is in essence an average of the intra-group variability over all the groups in your model. When you add a group with 23 samples, you have added another group that you can use to estimate intra-group variability. Sample variance estimators are not very efficient, which means that you need relatively more observations for the estimate to converge to the population parameter that you are trying to estimate, so adding in 23 more samples is a big deal.

If you increase your degrees of freedom substantially, then the variance estimate will be smaller, which will reduce the denominator for all of your contrasts. If you reduce the denominator while keeping the numerator constant, your statistic gets bigger, so comparisons that were not significant with just the two groups are now significant with three because you have made the denominator smaller by adding in the extra group. This has nothing to do with healthy being a 'baseline'. It is only due to the increase in observations.

Second, if you fit the model that svlachavas recommends:

design <- model,matrix(~Treat)

and then fit the contrast D1 - D2, you are NOT comparing anything to the 'healthy baseline'. This is simple algebra. The contrast you are fitting is

(D1 - H) - (D2 - H)

Which is easy to see will reduce to

D1 - D2

And the healthy samples play no part in the comparison.

0
Entering edit mode

Dear Mr James W.MacDonald,

thank you for your valuable corrections and detail explanation, as it helps all new users to improve and understand better any subject. But, please correct me if im misunderstood, but in one of my previous posts, Mr Aaron Lun (C: Correct construction of design matrix in limma for multiple contrasts for gene e) i believe(and please correct if i have understand differently) in a similar design proposed:

"I got it. Use the initial design matrix. So, one last but most important question:"

as i use the design matrix with the intercept, the specified contrast (i.e.)

con <- makeContrasts(samplesT47D_C10_OHT - samplesT47D_C10, levels=design)

interogate the difference between the dual treatment (C10 + OHT) versus one substance(C10) for identify DE between the two "kinds" of therapy, but in respect to the baseline level, that is the control group right ?

ADD REPLY • C: Correct construction of design matrix in limma for multiple contrasts for gene e • edit • moderatewritten 21 days ago by svlachavas • 50

1

Yes, your contrast will identify DE genes between the two therapies, while your "baseline" (i.e., intercept) in your design is the DMSO control group.

2
Entering edit mode

If you re-read what Aaron says, I don't think he is agreeing with your statement, (the 'Yes' in his answer notwithstanding). He is saying that the contrast you have specified will give the difference between the two therapies, and he is also acknowledging that the baseline in your design is the DMSO group. But he isn't saying that the difference is in 'relation' to the baseline, because it isn't. The baseline term is cancelled out algebraically, as I showed above.

1
Entering edit mode

James is correct here. I read svlachavas' post as two separate questions - one about the DE contrast, and another about the identity of the baseline level. My apologies for any confusion.

0
Entering edit mode

Also my apologies for any misreading. I also have noticed the important fact that Mr James McDonald mentioned, but i didnt gave the attention needed. Anyway, as you both explained, thats why i wrote my answer above "to sum up"-so to get a final confirmation as far as i continued with your suggestion of design:

samples <- factor(targets2$Sample.Group, levels=c("T47D_DMSO", "T47D_OHT", "T47D_GDC", "T47D_C10", "T47D_GDC_OHT", "T47D_C10_OHT")) batch <- factor(targets2$replicate)

design <- model.matrix(~samples + batch)

and after design if i want to find DEG genes between the double therapy and any therapies, i proceed also

con <- makeContrasts("samplesT47D_C10_OHT-samplesT47D_C10", "samplesT47D_C10_OHT-samplesT47D_OHT", levels=design)

fit2 <- contrasts.fit(fit, contrasts=con)
ebayes <- eBayes(fit2, trend=TRUE)

without mentioning anything else in  makeContrasts, and this is correct for this specific purpose right ?

1
Entering edit mode

We're drifting off-topic here, but to conclude this discussion:

topTable(fit2, coef=1, n=Inf) # gives you DE between C10 + OHT and C10 alone
topTable(fit2, coef=2, n=Inf) # gives you DE between C10 + OHT and OHT alone
topTable(fit2, n=Inf) # gives you DE between C10 + OHT and either C10 or OHT

As you can see, the final definition of DE depends on your call to topTable. Of course, if you want to compare the GDC single or double treatments, then you'll have to change the contrast matrix.

0
Entering edit mode

Thank you very much Aaron i havent seen your answer !! Regarding the GDC treatment, i made also another contrast matrix after the same design matrix, and create it in the following way

con <- makeContrasts("samplesT47D_GDC_OHT-samplesT47D_GDC", "samplesT47D_GDC_OHT-samplesT47D_OHT", levels=design)

Please excuse me if i got off topic and disturbed you and the other people in this post with my many questions, but was to confirm and be sure that i didnt misread or understand anything different from my last post !!

0
Entering edit mode

I undertand completely. There is a long way to go ahead. Please excuse me for my English, but i meant the exact thing as you described but i didnt posted appropriately. I was misunderstood of my design matrix, as my first comparison in my post was each Treatment vs DMSO which i put as the first level, so i didnt have to explicitly specify each contrast. So far here im correct ?

Moreover, for the contrasts fit, in other words you mention that due to algebra, either with model,matrix(~samples..)

and then makeContrasts("samplesT47D_C10_OHT-samplesT47D_C10")

either with model.matrix(~0 + samples)

and then makeContrasts("(samplesT47D_C10_OHT-DMSO) -(samplesT47D_C10-DMSO)") will have the same output right ?

The only difference, and is related to my study, is if i want to compute each Treatment vsDMSO, when i use ~samples i dont have to use makeContrasts, and use directly after ebayes each coefficient to represent my difference in groups right ?

Where in the other hand, if i use model.matrix(~0 +samples)

then i have to specify my exact differences for each treatment vs DMSO, as i dont specify my intercept  ?

2
Entering edit mode

That's all correct, but do note that if you have no baseline (model.matrix(~ 0 + samples)), then these two are identical:

makeContrasts((samplesT47D_C10_OHT-DMSO) -(samplesT47D_C10-DMSO), levels = design)

makeContrasts(samplesT47D_C10_OHT-samplesT47D_C10, levels = design)

in the same way that

(X - Y) - (Z - Y) is identical to X - Z because (X - Y) - (Z - Y) = X - Y - Z + Y, so the two Ys cancel out.

0
Entering edit mode

I see. Thus, to sum up, as i have constructed in my post a model,matrix with intercept term, which is the DMSO group design <- model.matrix(~samples + batch), as you have probably see in the post, then to compare for instance my one of my double treatments with any of the simple treatments, i dont have at all to write this specific contrast (X - Y) - (Z - Y) and move directly to the comparison wright? that is

"samplesT47D_C10_OHT - samplesT47D_C10" ??

Thank you for your consideration on this matter !!

0
Entering edit mode

Dear James,

Thank you very much for your detailed response. It makes a lot of sense and explains why I'm seeing more significance. This was exactly what I was looking for.

Thanks,

Akul

1
Entering edit mode
svlachavas ▴ 780
@svlachavas-7225
Last seen 5 days ago
Germany/Heidelberg/German Cancer Resear…

Dear Akulsinghania,

in my opinion the most possible reason for the discrepancy in the number of significant genes, is the fact that in your first design matrix you didnt included the Healthy samples, so now your factor has one more level that it has to be taken into account. Also ,regarding your code you have to "refit" in way with contrasts.fit each disease group with the control group, in order for each of your coefficients to represent the log-fold change of your two disease groups versous your control group. In other words,

when you used cm<- makeContrasts(coef=D1-D2, levels=design) in the first code, you dont compute the difference between two two disease groups while your baseline group are the healthy samples. Thus, you could use:

Treat <- factor("your factor assignment", levels=c("H","D1","D2 )) # so the baseline level is the Healthy samples-group

design <- model,matrix(~Treat)

fit <- lmFit(eset, design)
fit2 <- eBayes(fit, trend=TRUE)
# trend argument optional if you have performed any initial non-intensity filtering prior testing

where you can see with head(fit2\$coefficients) that the two coefficients represent each of your two disease groups vs Healthy group

Furthermore, if you intend to compare the two disease groups, in respect to the Healthy samples(while the "baseline level" is "H"),

you can move after the same design matrix with:

con <- makeContrasts("D1-D2", levels=design) # or D2-D1, as the coefficient output of design matrix
fit2 <- contrasts.fit(fit, contrasts=con)
ebayes <- eBayes(fit2, trend=TRUE)

The only think that i dont know for sure is if the difference in your number of samples(unequal number), hampers the analysis

0
Entering edit mode

Dear svlachavas,

Thank you very much for your quick and very helpful reply. I've tried your suggestions (without the 'trend' attribute for now), and I get the same exact 3 gene lists for coef1, coef2 and coef3 that I got when I used my code:

> design <- model.matrix(~0 + Treat)
> fit <- lmFit(eset,design)
> cm <- makeContrasts(coef1 = D1 - D2, coef2 = D1 - H, coef3= D2 - H, levels=design)
> fit2 <- contrasts.fit(fit, cm)
> fit2 <- eBayes(fit2)

I do understand and realise that the sample numbers are not balanced at all, however, I'm still struggling with the fact that when the linear model hasn't seen the healthy group at all in the input matrix, I get about 30 genes different between D1 and D2. However, when I change my input matrix and introduce healthy samples, I get about 200 genes different between the same D1 and D2.

Also, I was hoping if you could expand a bit on the 'trend' attribute. I'm not sure what it does. Before calculating for differential expression, my data has already been log transformed and normalised, and I've also already performed non-specific filtering to remove redundant and unimportant genes.

Akul

0
Entering edit mode

Dear Akul,

regarding the trend argument, in simple words it accomodates  a mean-variance trend, and it is applicable especially if you have performed non-specific intensity filtering prior statistical analysis. You can check also the thread (http://permalink.gmane.org/gmane.science.biology.informatics.conductor/48552) which describes the matter in more detail.

Regarding the interprentation of limma, you could have in mind that as you use the ~ operator to form such models, an expression of the form y~model is translated as a specification that the response variable y is modelled by a linear predictor specified symbolically by a model(y~model). More generally, linear models are used to make comparisons between different groups. Thus, when you included a extra level in your factor Treat which includes the healthy samples, i believe that is the main reason to change the outcome. However, as im also a "freshman" in R, any more experience person could provide us with more appropriate info or suggestions. Anyway, i believe that the inbalance of your samples between your groups of comparisons affects a lot your study. Finally, i would not stick with the ~0 intercept, and move along with the code i posted above-not that there is a huge difference, but if you dont explicitly specify the specific contrasts in makeContrasts between your Disease versus the Control Group, there is no need to use it