Question

Limma Fold Change Manual Calculation

0

Entering edit mode

andrew.j.skelton73 ▴ 370

@andrewjskelton73-7074

Last seen 9 weeks ago

United Kingdom

Hi,

(I've asked a lot of questions about limma over the past week so I'm hoping this'll be the last for a while!)

Given the following details:

expression_data <- c(1.27135202935009, 1.41816160331787, 1.2572772420417, 1.70943398046296, 1.30290218641586, 0.632660015122616, 1.73084258791384, 0.863826352944684, 0.62481665344628, 0.356064235030147, 1.31542028558644, 0.30549909383238, 0.464963176430548, 0.132181421105667, -0.284799809563931, 0.216198538884642, -0.0841133304341238, -0.00184472290008803, -0.0924271878885008, -0.340291804468472, -0.236829711453303, 0.0529690806587626, 0.16321956624511, -0.310513510587778, -0.12970035111176, -0.126398635780533, 0.152550803185228, -0.458542514769473, 0.00243517688116406, -0.0190192219685527, 0.199329876859774, 0.0493831375210439, -0.30903829000185, -0.289604319193543, -0.110019942085281, -0.220289950537685, 0.0680403723818882, -0.210977291862137, 0.253649629045288, 0.0740109953273042, 0.115109148186167, 0.187043445057404, 0.705155251555554, 0.105479342752451, 0.344672919872447, 0.303316487542805, 0.332595721664644, 0.0512213943473417, 0.440756755046719, 0.091642538588249, 0.477236022595909, 0.109140019847968, 0.685001267317616, 0.183154080053337, 0.314190891668279, -0.123285017407119, 0.603094973500324, 1.53723917249845, 0.180518835745199, 1.5520102749957, -0.339656677699664, 0.888791974821514, 0.321402618155527, 1.31133008668306, 0.287587853884556, -0.513896569786498, 1.01400498573403, -0.145552182640197, -0.0466811491949621, 1.34418631328095, -0.188666887863983, 0.920227741574566, -0.0182196762358299, 1.18398082848213, 0.0680539755381465, 0.389472802053599, 1.14920099633956, 1.35363045061024, -0.0400907708395635, 1.14405154287124, 0.365672853509181, -0.0742688460368051, 1.60927415300638, -0.0312210890874907, -0.302097025523754, 0.214897201115632, 2.029775196118, 1.46210810601113, -0.126836819148653, -0.0799005522761045, 0.958505775644153, -0.209758749029421, 0.273568395649965, 0.488150388217536, -0.230312627718208, -0.0115780974342431, 0.351708198671371, 0.11803520077305, -0.201488605868396, 0.0814169684941098, 1.32266103732873, 1.9077004570343, 1.34748531668521, 1.37847539147601, 1.85761827653095, 1.11327229058024, 1.21377936983249, 1.167867701785, 1.3119314966728, 1.01502530573911, 1.22109375841952, 1.23026951795161, 1.30638557237133, 1.02569437924906, 0.812852833149196) 

treatment <- c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A', 'B', 'C', 'B', 'C', 'C', 'A', 'C', 'A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'C', 'A', 'C', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'C', 'C', 'B', 'B', 'C', 'C', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A')

variation <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)

And the model:

design               <- model.matrix(~0 + factor(treatment,
                                                 levels=unique(treatment) +
                                          factor(variation))
colnames(design)     <- c(unique(treatment),
                          paste0("b",
                                 unique(variation)[-1]))

#expression_data consists of more than the data given. The data given is just one row from the object
fit                  <- lmFit((expression_data), design)

cont_mat             <- makeContrasts(B-A,
                                      levels=design)
fit2                 <- contrasts.fit(fit,
                                      contrasts=cont_mat)
fit2                 <- eBayes(fit2)

If I was to do B-A and pull it up in topTable, I'd see a log fold change of -0.8709646. If I was to do the same again, but don't include the "variation" in my model design, I see a fold change of -0.8587972 (Which is the Mean(B)-Mean(A)). This is an adjustment in the log fold change of 0.01216742. How would I go about manually working out the log fold change of the model above?

Thanks for any advice given! I've been trying to figure this out for a while!

Limma • 1.8k views

ADD COMMENT • link updated 8.9 years ago by Aaron Lun ★ 28k • written 8.9 years ago by andrew.j.skelton73 ▴ 370

score 3 · Accepted Answer · 2015-07-04

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 5 hours ago

The city by the bay

I wouldn't worry about it. The unbalanced number of samples for each group in each batch means that the estimate of the log-fold change between A and B depends on the estimated size of the batch effect. In turn, the batch effect estimate depends on the log-fold change, as well as the expression of samples in group C. This results in a very large set of simultaneous equations that requires a lot of effort to solve. Well, maybe not that much effort, if you set it up as a linear model; but if you're going to do it like that, it's going to be the same as running lmFit directly, which defeats the purpose of trying to compute it manually.

ADD COMMENT • link 8.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

I guess it was more for my understanding than anything else, I like to be able to solve things manually where I can. Is there any way to set it up as a linear model and have it spit out the equations? - It was more for this particular example above that I was interested.

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

1

Entering edit mode

Well, the easiest way to do so is this:

qr.solve(design, expression_data)

This will solve the over-determined system of equations, giving a least-squares estimate for each coefficient. It is then a simple matter of subtracting the coefficient for treatment A from that of treatment B, to obtain the B-over-A log-fold change.

ADD REPLY • link 8.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

That's perfect. Thank you for that, and I really appreciate the explanations. The only bit that I'm struggling to understand is in a broad sense, how the model accounts for my "variation" variable. I understand the principle, in that it controls for that variable so that I can compare between "B" and "A", but I'm struggling to understand what it does mathematically. - I realise that it's most likely very complex and involves solving a lot of simultaneous equations, but is there a broad explanation? Thanks again!

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

1

Entering edit mode

Consider your design matrix. The log-average expression of each group in each batch is represented by

u_A1 = beta_A # group A in batch 1
u_B1 = beta_B # group B in batch 1
u_A2 = beta_A + beta_batch # group A in batch 2
u_B2 = beta_B + beta_batch # group B in batch 2

where beta_A or beta_B represents the coefficient for each group, and beta_batch represents the batch effect. Any differences in log-expression between A1 and A2, or between B1 and B2, are absorbed into the value of beta_batch. This ensures that the batch effect doesn't affect the DE comparison between groups.

---

Edit: okay, here's a simple practical example to help you understand. Let's say we have two groups (A, B) in two batches (X, Y). Consider a gene with the following (log-)expression pattern across samples:

Sample     Group     Batch     Expression
     1         A         X              1
     2         A         X              1
     3         B         X              2
     4         A         Y              3
     5         B         Y              4
     6         B         Y              4

If we were to ignore the batch effect, then the log-fold change between A and B would be:

(1 + 1 + 3)/3 - (2 + 4 + 4)/3 = -5/3

This is not correct, because the B samples have more samples in the second batch than the A samples, which means that the effect of the batch is not balanced between groups. As a result, the log-fold change between groups will be confounded by the batch. We'd rather avoid this, if possible.

If we add a batch effect in our model, the corresponding coefficient is equal to 2 (as this is the difference in the mean expression values between X and Y for samples of the same group). The coefficient absorbs the batch effect of Y over X, which is equivalent to subtracting 2 from all expression values in the second batch. So, we get a log-fold change of:

(1 + 1 + 1)/3  - (2 + 2 + 2)/3 = -1

This is the correct state of affairs.

ADD REPLY • link 8.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

That's exactly what I was looking for. Thanks for the explanation!

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

0

Entering edit mode

one other quick question... You subtract two from all expression values in batch 2, is there a way to get those adjustment constants from an lm call?

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

1

Entering edit mode

The correction for each batch will be the value of the corresponding coefficient, obtained with:

fit <- lm(expression_data ~ 0 + factor(treatment) + factor(variation))
coef(fit)

The factor(variation)2 coefficient refers to the correction for samples in batch 2, while factor(variation)3 refers to that for batch 3. Batch 1 is the baseline, so no correction is required. Note that if you just want to correct for the batch effect, the removeBatchEffect function may be more convenient.