Limma Fold Change Manual Calculation
1
0
Entering edit mode
@andrewjskelton73-7074
Last seen 8 months ago
United Kingdom

Hi, 

(I've asked a lot of questions about limma over the past week so I'm hoping this'll be the last for a while!)

Given the following details:

expression_data <- c(1.27135202935009, 1.41816160331787, 1.2572772420417, 1.70943398046296, 1.30290218641586, 0.632660015122616, 1.73084258791384, 0.863826352944684, 0.62481665344628, 0.356064235030147, 1.31542028558644, 0.30549909383238, 0.464963176430548, 0.132181421105667, -0.284799809563931, 0.216198538884642, -0.0841133304341238, -0.00184472290008803, -0.0924271878885008, -0.340291804468472, -0.236829711453303, 0.0529690806587626, 0.16321956624511, -0.310513510587778, -0.12970035111176, -0.126398635780533, 0.152550803185228, -0.458542514769473, 0.00243517688116406, -0.0190192219685527, 0.199329876859774, 0.0493831375210439, -0.30903829000185, -0.289604319193543, -0.110019942085281, -0.220289950537685, 0.0680403723818882, -0.210977291862137, 0.253649629045288, 0.0740109953273042, 0.115109148186167, 0.187043445057404, 0.705155251555554, 0.105479342752451, 0.344672919872447, 0.303316487542805, 0.332595721664644, 0.0512213943473417, 0.440756755046719, 0.091642538588249, 0.477236022595909, 0.109140019847968, 0.685001267317616, 0.183154080053337, 0.314190891668279, -0.123285017407119, 0.603094973500324, 1.53723917249845, 0.180518835745199, 1.5520102749957, -0.339656677699664, 0.888791974821514, 0.321402618155527, 1.31133008668306, 0.287587853884556, -0.513896569786498, 1.01400498573403, -0.145552182640197, -0.0466811491949621, 1.34418631328095, -0.188666887863983, 0.920227741574566, -0.0182196762358299, 1.18398082848213, 0.0680539755381465, 0.389472802053599, 1.14920099633956, 1.35363045061024, -0.0400907708395635, 1.14405154287124, 0.365672853509181, -0.0742688460368051, 1.60927415300638, -0.0312210890874907, -0.302097025523754, 0.214897201115632, 2.029775196118, 1.46210810601113, -0.126836819148653, -0.0799005522761045, 0.958505775644153, -0.209758749029421, 0.273568395649965, 0.488150388217536, -0.230312627718208, -0.0115780974342431, 0.351708198671371, 0.11803520077305, -0.201488605868396, 0.0814169684941098, 1.32266103732873, 1.9077004570343, 1.34748531668521, 1.37847539147601, 1.85761827653095, 1.11327229058024, 1.21377936983249, 1.167867701785, 1.3119314966728, 1.01502530573911, 1.22109375841952, 1.23026951795161, 1.30638557237133, 1.02569437924906, 0.812852833149196

treatment <- c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A', 'B', 'C', 'B', 'C', 'C', 'A', 'C', 'A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'C', 'A', 'C', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'C', 'C', 'B', 'B', 'C', 'C', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A')

variation <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)

And the model:

 

design               <- model.matrix(~0 + factor(treatment,
                                                 levels=unique(treatment) +
                                          factor(variation))
colnames(design)     <- c(unique(treatment),
                          paste0("b",
                                 unique(variation)[-1]))
#expression_data consists of more than the data given. The data given is just one row from the object
fit                  <- lmFit((expression_data), design)

cont_mat             <- makeContrasts(B-A,
                                      levels=design)
fit2                 <- contrasts.fit(fit,
                                      contrasts=cont_mat)
fit2                 <- eBayes(fit2)

If I was to do B-A and pull it up in topTable, I'd see a log fold change of -0.8709646. If I was to do the same again, but don't include the "variation" in my model design, I see a fold change of -0.8587972 (Which is the Mean(B)-Mean(A)). This is an adjustment in the log fold change of 0.01216742. How would I go about manually working out the log fold change of the model above?

Thanks for any advice given! I've been trying to figure this out for a while!

Limma • 2.0k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 18 hours ago
The city by the bay

I wouldn't worry about it. The unbalanced number of samples for each group in each batch means that the estimate of the log-fold change between A and B depends on the estimated size of the batch effect. In turn, the batch effect estimate depends on the log-fold change, as well as the expression of samples in group C. This results in a very large set of simultaneous equations that requires a lot of effort to solve. Well, maybe not that much effort, if you set it up as a linear model; but if you're going to do it like that, it's going to be the same as running lmFit directly, which defeats the purpose of trying to compute it manually.

ADD COMMENT
0
Entering edit mode

I guess it was more for my understanding than anything else, I like to be able to solve things manually where I can. Is there any way to set it up as a linear model and have it spit out the equations? - It was more for this particular example above that I was interested.

ADD REPLY
1
Entering edit mode

Well, the easiest way to do so is this:

qr.solve(design, expression_data)

This will solve the over-determined system of equations, giving a least-squares estimate for each coefficient. It is then a simple matter of subtracting the coefficient for treatment A from that of treatment B, to obtain the B-over-A log-fold change.

ADD REPLY
0
Entering edit mode

That's perfect. Thank you for that, and I really appreciate the explanations. The only bit that I'm struggling to understand is in a broad sense, how the model accounts for my "variation" variable. I understand the principle, in that it controls for that variable so that I can compare between "B" and "A", but I'm struggling to understand what it does mathematically. - I realise that it's most likely very complex and involves solving a lot of simultaneous equations, but is there a broad explanation? Thanks again!

ADD REPLY
1
Entering edit mode

Consider your design matrix. The log-average expression of each group in each batch is represented by

u_A1 = beta_A # group A in batch 1
u_B1 = beta_B # group B in batch 1
u_A2 = beta_A + beta_batch # group A in batch 2
u_B2 = beta_B + beta_batch # group B in batch 2

where beta_A or beta_B represents the coefficient for each group, and beta_batch represents the batch effect. Any differences in log-expression between A1 and A2, or between B1 and B2, are absorbed into the value of beta_batch. This ensures that the batch effect doesn't affect the DE comparison between groups.

---

Edit: okay, here's a simple practical example to help you understand. Let's say we have two groups (A, B) in two batches (X, Y). Consider a gene with the following (log-)expression pattern across samples:

Sample     Group     Batch     Expression
     1         A         X              1
     2         A         X              1
     3         B         X              2
     4         A         Y              3
     5         B         Y              4
     6         B         Y              4

If we were to ignore the batch effect, then the log-fold change between A and B would be:

(1 + 1 + 3)/3 - (2 + 4 + 4)/3 = -5/3

This is not correct, because the B samples have more samples in the second batch than the A samples, which means that the effect of the batch is not balanced between groups. As a result, the log-fold change between groups will be confounded by the batch. We'd rather avoid this, if possible.

If we add a batch effect in our model, the corresponding coefficient is equal to 2 (as this is the difference in the mean expression values between X and Y for samples of the same group). The coefficient absorbs the batch effect of Y over X, which is equivalent to subtracting 2 from all expression values in the second batch. So, we get a log-fold change of:

(1 + 1 + 1)/3  - (2 + 2 + 2)/3 = -1

This is the correct state of affairs.

ADD REPLY
0
Entering edit mode

That's exactly what I was looking for. Thanks for the explanation! 

ADD REPLY
0
Entering edit mode

one other quick question... You subtract two from all expression values in batch 2, is there a way to get those adjustment constants from an lm call? 

ADD REPLY
1
Entering edit mode

The correction for each batch will be the value of the corresponding coefficient, obtained with:

fit <- lm(expression_data ~ 0 + factor(treatment) + factor(variation))
coef(fit)

The factor(variation)2 coefficient refers to the correction for samples in batch 2, while factor(variation)3 refers to that for batch 3. Batch 1 is the baseline, so no correction is required. Note that if you just want to correct for the batch effect, the removeBatchEffect function may be more convenient.

ADD REPLY
0
Entering edit mode

Ah, I understand now. Thanks for your patience in explaining this! 

ADD REPLY

Login before adding your answer.

Traffic: 608 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6