Question

Analysis of 2*4 factorial design

0

Entering edit mode

Sudipta • 0

@eeaf45c1

Last seen 13 months ago

Ireland

Hi, first I want to thank you for creating limma package.

I have a 2(Factor A)*4(Factor B) experimental design.

I am trying to ask three questions,

Which genes are different between factor A (main affect, moderated t Statistic and log fold change),
Which genes are different due to factor B (main effect, moderated F Statistic),
Which genes are different between the levels of factor B given factor A (moderated t Statistic and log fold change), Example: (factorB_level2 - factorB_level1) under factorA_level1, etc.

A. Two solve first two questions (1 and 2), I used,

# Classic Interaction model with sum to zero parametrization

## setting contrasts
contrasts(factor_A) <- contr.sum(2)
contrasts(factor_B) <- contr.sum(4)

## creating design matrix
design_2 <- model.matrix(~factor_A*factor_B)

## model fitting
model2 <- lmFit(exprs_data, design_2)
model2_bayes <- eBayes(model2)

## extracting the main effect due to factor A and factor B
top2_A2_vs_A1 <- topTable(model2_bayes, coef = "factor_A1", n=Inf, adjust = "BH")
top2_B2_vs_B1 <- topTable(model2_bayes, coef = c("factor_B1", "factor_B2", "factor_B3"), n=Inf, adjust = "BH")

B. To solve the 3rd question with the above design, it became a bit complex, so I tried the below code (I believe the corresponding coeff of A2_vs_A1, B2_vs_B1_givenA2, etc., can be considered as an estimate of log FC)

# Analysis as a single factor

## creating the single factor
group <- paste(factor_A, factor_B, sep = "_")

## creating the design matrix
design <- model.matrix(~0+group)
colnames(design) <- gsub("group","", colnames(design))

## fitting the model
model <- lmFit(exprs_data, design)

## making the contrast
contrasts_group <- makeContrasts(
  A2_vs_A1 = (A2_B1+A2_B2+A2_B3+A2_B4)/8 - (A1_B1+A1_B2+A1_B3+A1_B4)/8,
  # B = ((A2_B2-A2_B1)+(A2_B3-A2_B1)+(A2_B4-A2_B1)+(A2_B3-A2_B2)+(A2_B4-A2_B3)+(A2_B4-A2_B2)) + ((A1_B2-A1_B1) + (A1_B3-A1_B1)+(A1_B4-A1_B1)+(A1_B3-A1_B2)+(A1_B4-A1_B3)+(A1_B4-A1_B2))/6,
  B2_vs_B1_givenA2 = (A2_B2-A2_B1),
  B3_vs_B1_givenA2 = (A2_B3-A2_B1),
  B4_vs_B1_givenA2 = (A2_B4-A2_B1),
  B3_vs_B2_givenA2 = (A2_B3-A2_B2),
  B4_vs_B3_givenA2 = (A2_B4-A2_B3),
  B4_vs_B2_givenA2 = (A2_B4-A2_B2),
  B2_vs_B1_givenA1 = (A1_B2-A1_B1),
  B3_vs_B1_givenA1 = (A1_B3-A1_B1),
  B4_vs_B1_givenA1 = (A1_B4-A1_B1),
  B3_vs_B2_givenA1 = (A1_B3-A1_B2),
  B4_vs_B3_givenA1 = (A1_B4-A1_B3),
  B4_vs_B2_givenA1 = (A1_B4-A1_B2),
  levels = colnames(design)
)

## fitting the model
model_contr <-  contrasts.fit(model, contrasts_group)
model_contr_bayes <- eBayes(model_contr)


## The main effect of factor A
top_A2_vs_A1 <- topTable(model_contr_bayes, coef = "A2_vs_A1", n=Inf, adjust = "BH")

## ~ factorB | factorA 
top_B2_vs_B1_givenA2 <- topTable(model_contr_bayes, coef = "B2_vs_B1_givenA2", n=Inf, adjust = "BH")
### same for others.

Here is how I created the data,

# fake data generation
set.seed(123) 

# Define parameters
num_genes <- 1000 # Number of genes
num_replicates <- 5 # Number of replicates per group

# Treatments/factors
factor_A <- factor(rep(c("A1", "A2"), each = num_replicates * 4))
factor_B <- factor(rep(c("B1", "B2", "B3", "B4"), times = num_replicates * 2))

# dummy data frame to hold the expression data
exprs_data <- matrix(nrow = num_genes, ncol = num_replicates * 8)

# Filling the matrix with random expression values
for (i in 1:(num_replicates * 8)) {
  exprs_data[, i] <- rnorm(num_genes, mean = sample(5:15, 1), sd = 2)
}

# Converting the matrix to data frame and naming the columns appropriately
colnames(exprs_data) <- paste(factor_A, factor_B, rep(1:num_replicates, times = 8), sep = "_")
exprs_data <- as.data.frame(exprs_data)

# Adding row names to represent gene IDs
rownames(exprs_data) <- paste0("Gene", 1:num_genes)

I am very new to limma trying to migrate from anova and posthoc based concept to linear modeling. I am wondering if I am doing it correct, also, for the second approach I am not sure, how to get the main effect of factor B. My real data is little more complex than this, it doesn't have values for one of the combination.

Thank you in advance.

limma • 1.2k views

ADD COMMENT • link updated 14 months ago by Gordon Smyth 53k • written 15 months ago by Sudipta • 0

score 2 · Accepted Answer · 2024-10-25

2

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

This is covered by Section 9.5 of the limma User's Guide.

The analysis with a single-factor is much better. Your definition of A2 vs A1 main (average) effect would be correct if you divided by 4 instead of by 8.

ADD COMMENT • link 15 months ago Gordon Smyth 53k

0

Entering edit mode

Hi Prof. Gordon, Thank you for your answer.

Is dividing with 4 will give the correct logFC values too? Because I see in case of dividing with 8, I am getting the same logFC as the "Classic Interaction model with sum to zero parametrization" analysis, not sure about the reasoning.
Is there anyway I can extract the main effect of B from the analysis with a single factor?

I was trying to do,

# Analysis as a single factor
top_B <- topTable(model_contr_bayes, coef = unlist(colnames(contrasts_group)[-1]), n=Inf, adjust = "BH") # where, unlist(colnames(contrasts_group)[-1]) is c(B2_vs_B1_givenA2,B3_vs_B1_givenA2,B4_vs_B1_givenA2,B3_vs_B2_givenA2,B4_vs_B3_givenA2,B4_vs_B2_givenA2,B2_vs_B1_givenA1,B3_vs_B1_givenA1,B4_vs_B1_givenA1,B3_vs_B2_givenA1,B4_vs_B3_givenA1,B4_vs_B2_givenA1)

but that doesn't match with the result of

# Classic Interaction model with sum to zero parametrization
top2_B <- topTable(model2_bayes, coef = c("factor_B1", "factor_B2", "factor_B3"), n=Inf, adjust = "BH")

I am not sure, how to do it correctly.

Thank you again for correcting me.

ADD REPLY • link 15 months ago Sudipta • 0

2

Entering edit mode

I have already answered your question, in the limma User's Guide and in my answer above.

I suggest that you have a rethink about your approach to analysing genomic or proteomics data, because you don't seem to be thinking about the biological meaning of the logFCs you are computing but are instead trying to compute an abstract mathematical idea of "main effect" without knowing exactly what that is. I have long argued on this forum that factorial models are not a helpful way to analyse genomic data. The idea of a "main effect" is a mathematical abstraction that is convenient in mathematical theorems but which doesn't have any biological meaning. It is much more biologically meaningful to estimate simple effects or conditional effects. Main effects do not represent "the effect of A regardless of B". That is a false interpretation that is repeated on many sites in the internet. You should not report main effects in a biological publication, a reviewer would likely object. Even purely mathematical texts will advise you that you should not be examining a main effect unless the interaction effect is absent. In your code, dividing by 4 at least gives you the average effect of A averaged equally over the levels of B. Dividing by 8 gives you something that has no meaning at all. It is not a logFC for anything.

For advice about main effects vs conditional effects, see:

ADD REPLY • link 15 months ago • updated 14 months ago Gordon Smyth 53k