Hi,
(I've asked a lot of questions about limma over the past week so I'm hoping this'll be the last for a while!)
Given the following details:
expression_data <- c(1.27135202935009, 1.41816160331787, 1.2572772420417, 1.70943398046296, 1.30290218641586, 0.632660015122616, 1.73084258791384, 0.863826352944684, 0.62481665344628, 0.356064235030147, 1.31542028558644, 0.30549909383238, 0.464963176430548, 0.132181421105667, -0.284799809563931, 0.216198538884642, -0.0841133304341238, -0.00184472290008803, -0.0924271878885008, -0.340291804468472, -0.236829711453303, 0.0529690806587626, 0.16321956624511, -0.310513510587778, -0.12970035111176, -0.126398635780533, 0.152550803185228, -0.458542514769473, 0.00243517688116406, -0.0190192219685527, 0.199329876859774, 0.0493831375210439, -0.30903829000185, -0.289604319193543, -0.110019942085281, -0.220289950537685, 0.0680403723818882, -0.210977291862137, 0.253649629045288, 0.0740109953273042, 0.115109148186167, 0.187043445057404, 0.705155251555554, 0.105479342752451, 0.344672919872447, 0.303316487542805, 0.332595721664644, 0.0512213943473417, 0.440756755046719, 0.091642538588249, 0.477236022595909, 0.109140019847968, 0.685001267317616, 0.183154080053337, 0.314190891668279, -0.123285017407119, 0.603094973500324, 1.53723917249845, 0.180518835745199, 1.5520102749957, -0.339656677699664, 0.888791974821514, 0.321402618155527, 1.31133008668306, 0.287587853884556, -0.513896569786498, 1.01400498573403, -0.145552182640197, -0.0466811491949621, 1.34418631328095, -0.188666887863983, 0.920227741574566, -0.0182196762358299, 1.18398082848213, 0.0680539755381465, 0.389472802053599, 1.14920099633956, 1.35363045061024, -0.0400907708395635, 1.14405154287124, 0.365672853509181, -0.0742688460368051, 1.60927415300638, -0.0312210890874907, -0.302097025523754, 0.214897201115632, 2.029775196118, 1.46210810601113, -0.126836819148653, -0.0799005522761045, 0.958505775644153, -0.209758749029421, 0.273568395649965, 0.488150388217536, -0.230312627718208, -0.0115780974342431, 0.351708198671371, 0.11803520077305, -0.201488605868396, 0.0814169684941098, 1.32266103732873, 1.9077004570343, 1.34748531668521, 1.37847539147601, 1.85761827653095, 1.11327229058024, 1.21377936983249, 1.167867701785, 1.3119314966728, 1.01502530573911, 1.22109375841952, 1.23026951795161, 1.30638557237133, 1.02569437924906, 0.812852833149196) treatment <- c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A', 'B', 'C', 'B', 'C', 'C', 'A', 'C', 'A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'C', 'A', 'C', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'C', 'C', 'B', 'B', 'C', 'C', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A') variation <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
And the model:
design <- model.matrix(~0 + factor(treatment, levels=unique(treatment) + factor(variation)) colnames(design) <- c(unique(treatment), paste0("b", unique(variation)[-1]))
#expression_data consists of more than the data given. The data given is just one row from the object fit <- lmFit((expression_data), design) cont_mat <- makeContrasts(B-A, levels=design) fit2 <- contrasts.fit(fit, contrasts=cont_mat) fit2 <- eBayes(fit2)
If I was to do B-A and pull it up in topTable, I'd see a log fold change of -0.8709646. If I was to do the same again, but don't include the "variation" in my model design, I see a fold change of -0.8587972 (Which is the Mean(B)-Mean(A)). This is an adjustment in the log fold change of 0.01216742. How would I go about manually working out the log fold change of the model above?
Thanks for any advice given! I've been trying to figure this out for a while!
I guess it was more for my understanding than anything else, I like to be able to solve things manually where I can. Is there any way to set it up as a linear model and have it spit out the equations? - It was more for this particular example above that I was interested.
Well, the easiest way to do so is this:
This will solve the over-determined system of equations, giving a least-squares estimate for each coefficient. It is then a simple matter of subtracting the coefficient for treatment A from that of treatment B, to obtain the B-over-A log-fold change.
That's perfect. Thank you for that, and I really appreciate the explanations. The only bit that I'm struggling to understand is in a broad sense, how the model accounts for my "variation" variable. I understand the principle, in that it controls for that variable so that I can compare between "B" and "A", but I'm struggling to understand what it does mathematically. - I realise that it's most likely very complex and involves solving a lot of simultaneous equations, but is there a broad explanation? Thanks again!
Consider your design matrix. The log-average expression of each group in each batch is represented by
where
beta_A
orbeta_B
represents the coefficient for each group, andbeta_batch
represents the batch effect. Any differences in log-expression between A1 and A2, or between B1 and B2, are absorbed into the value ofbeta_batch
. This ensures that the batch effect doesn't affect the DE comparison between groups.---
Edit: okay, here's a simple practical example to help you understand. Let's say we have two groups (A, B) in two batches (X, Y). Consider a gene with the following (log-)expression pattern across samples:
If we were to ignore the batch effect, then the log-fold change between A and B would be:
This is not correct, because the B samples have more samples in the second batch than the A samples, which means that the effect of the batch is not balanced between groups. As a result, the log-fold change between groups will be confounded by the batch. We'd rather avoid this, if possible.
If we add a batch effect in our model, the corresponding coefficient is equal to 2 (as this is the difference in the mean expression values between X and Y for samples of the same group). The coefficient absorbs the batch effect of Y over X, which is equivalent to subtracting 2 from all expression values in the second batch. So, we get a log-fold change of:
This is the correct state of affairs.
That's exactly what I was looking for. Thanks for the explanation!
one other quick question... You subtract two from all expression values in batch 2, is there a way to get those adjustment constants from an
lm
call?The correction for each batch will be the value of the corresponding coefficient, obtained with:
The
factor(variation)2
coefficient refers to the correction for samples in batch 2, whilefactor(variation)3
refers to that for batch 3. Batch 1 is the baseline, so no correction is required. Note that if you just want to correct for the batch effect, theremoveBatchEffect
function may be more convenient.Ah, I understand now. Thanks for your patience in explaining this!