I realized that I get different results for one factor depending on the order of levels within another factor. I have 2 factors with 2 levels each: Age (young, old) and Type (Treated, Control).
I have fitted a DESeq2 model with the design ~ Age + Type + Age*Type. When I use 'relevel' to change the order of the levels of the Age factor, I get different results for the Type factor depending on whether young or old comes first.
So the question is: why does the order of levels in Age (young, old OR old, young) influence the results for the Type factor? For Age and the interaction, the results are the same (apart from the sign of the log2FC values flipping). I have also noticed that this only happens when the interaction is included.
Summary code: Option 1, Age level 'old' mentioned first.
variables$Age = relevel(variables$Age, ref = "old")
levels(variables$Age) # "old" "young"
model_old_first = DESeqDataSetFromMatrix(countData = counts,
                                         colData = variables,
                                         design = ~ Age + Type + Age*Type)
model_old_first = DESeq(model_old_first) # fit the model before calling results()
model_old_first.type = as.data.frame(results(model_old_first, contrast = list("Type_Treated_vs_Control")))
head(model_old_first.type) # sample results for the Type factor
      baseMean    log2FoldChange  lfcSE      stat          pvalue        padj
GeneA 580134.903  -2.63734863     0.1444825  -18.25376109  1.931426e-74  2.375654e-72
GeneB 12104.815   -0.02514629     0.1419333  -0.17716979   8.593750e-01  9.112339e-01
GeneC 4238.506    -0.02010040     0.5115374  -0.03929409   9.686559e-01  9.801558e-01
Option 2, Age level 'young' mentioned first.
variables$Age = relevel(variables$Age, ref = "young")
levels(variables$Age) # "young" "old"
model_young_first = DESeqDataSetFromMatrix(countData = counts,
                                           colData = variables,
                                           design = ~ Age + Type + Age*Type)
model_young_first = DESeq(model_young_first) # fit the model before calling results()
model_young_first.type = as.data.frame(results(model_young_first, contrast = list("Type_Treated_vs_Control")))
head(model_young_first.type) # sample results for the Type factor
      baseMean    log2FoldChange  lfcSE      stat         pvalue        padj
GeneA 580134.903  -1.75705214     0.1444881  -12.1605335  5.042827e-34  5.597538e-32
GeneB 12104.815   0.06076783      0.1419378  0.4281300    6.685565e-01  8.718356e-01
GeneC 4238.506    0.30813657      0.5116091  0.6022891    5.469817e-01  8.414831e-01
I would really appreciate it if someone could explain what is going on.
Many thanks for your quick reply. Are you referring to this part in the vignette? "by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype". So in my case, does this mean that the "Type" factor only shows the results for the reference level of "Age"? So it's only the difference between treated and control in the young group (if young is the reference)? And the "Age" factor shows the difference between young and old only for the reference level of "Type" (i.e. the control level)? If I don't relevel, then 'old' would automatically be the reference level for "Age", since it comes first alphabetically.
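One way to check this reading outside DESeq2 is a small linear model in base R. The sketch below uses made-up, noise-free values (the `toy` data frame and the helper `fit_with_ref` are hypothetical, not from this thread) and an ordinary `lm` fit rather than the negative binomial GLM that DESeq2 uses, so it only illustrates how the coefficients are defined, not DESeq2's estimates.

```r
# Hypothetical toy data: one value per Age x Type cell (not from the thread).
# Row order from expand.grid: young/Control, old/Control, young/Treated, old/Treated.
toy <- expand.grid(Age = c("young", "old"), Type = c("Control", "Treated"),
                   stringsAsFactors = FALSE)
toy$Type <- factor(toy$Type, levels = c("Control", "Treated"))
toy$y <- c(10, 12, 13, 18)

# Fit the same interaction design with a chosen reference level for Age.
fit_with_ref <- function(ref) {
  d <- toy
  d$Age <- relevel(factor(d$Age), ref = ref)
  coef(lm(y ~ Age + Type + Age:Type, data = d))
}

coef_young <- fit_with_ref("young")
coef_old   <- fit_with_ref("old")

# The Type main effect is Treated - Control within the reference Age level only.
type_young_ref <- unname(coef_young["TypeTreated"])  # 13 - 10
type_old_ref   <- unname(coef_old["TypeTreated"])    # 18 - 12

# Main effect + interaction gives the Type effect in the non-reference level.
type_in_old <- unname(coef_young["TypeTreated"] + coef_young["Ageold:TypeTreated"])

print(c(young_ref = type_young_ref, old_ref = type_old_ref, in_old = type_in_old))
```

The Type coefficient changes with the reference level because it answers a different question each time, while the sum of main effect and interaction recovers the effect in the other group. In DESeq2 the analogous combination is, as far as I understand the interactions section of the vignette, requested by passing both coefficient names to `results()` in one list contrast; the exact names should be read from `resultsNames()` rather than guessed.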
Edit: If "Age" is modelled as a continuous variable, do the results of the Type factor reflect the overall effect across all ages? Or would the results only be shown for the oldest age? Thanks!
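For the continuous case, the usual linear-model reading is that, once an Age:Type interaction is in the design, the Type coefficient is the Treated vs Control difference at Age = 0, which is neither an overall effect nor the effect at the oldest age. The base R sketch below illustrates this with made-up, noise-free values (the `toy` data and the centered column `AgeC` are hypothetical) and a plain `lm` fit, so it shows the definition of the coefficient rather than anything DESeq2-specific.

```r
# Hypothetical toy data: Age in years, response without noise.
toy <- expand.grid(Age = c(20, 40, 60), Type = c("Control", "Treated"),
                   stringsAsFactors = FALSE)
toy$Type <- factor(toy$Type, levels = c("Control", "Treated"))

# Built so that the Treated - Control difference is 1 + 0.1 * Age.
toy$y <- 5 + 0.2 * toy$Age + (toy$Type == "Treated") * (1 + 0.1 * toy$Age)

# Raw Age: the Type coefficient is the difference extrapolated to Age = 0.
fit_raw <- coef(lm(y ~ Age * Type, data = toy))
type_at_zero <- unname(fit_raw["TypeTreated"])

# Centered Age: the Type coefficient becomes the difference at the mean age (40).
toy$AgeC <- toy$Age - mean(toy$Age)
fit_centered <- coef(lm(y ~ AgeC * Type, data = toy))
type_at_mean <- unname(fit_centered["TypeTreated"])

print(c(at_age_0 = type_at_zero, at_mean_age = type_at_mean))
```

Centering Age before fitting is one common way to make the Type coefficient refer to a meaningful age; whether that is appropriate here depends on the experiment.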
Yes, and I'd recommend consulting a statistician if you have questions about the interpretation of coefficients in a linear model.