edgeR - testing interaction in a scenario of 2 tissues and 2 ages
1
0
Entering edit mode
kobipe3 • 0
@kobipe3-11007
Last seen 7.7 years ago

Hi,

I am comparing gene expression measured in two different tissues (tissue1, tissue2) in two different ages (age1, age2).

I am trying to obtain three lists:

1. Genes differentially expressed between tissue1 and tissue2.

2. Genes differentially expressed between age1 and age2.

3. Genes for which the expression ratio between tissue1 and tissue2 is changing significantly with age progression (from age1 to age2).

 

I tried two different design matrices.

Option 1 - Defining each combination as a group, and defining contrasts

samples$Tissue <- factor(samples$Tissue, levels = c("tissue1", "tissue2"), ordered = F)
samples$Age <- factor(samples$Age, levels = c("age1", "age2"), ordered = F)
samples$Group <- factor(interaction(samples$Age, samples$Tissue))
design <- model.matrix(~0+Group, data = samples)
colnames(design) <- levels(samples$Group)
d <- DGEList(counts = counts)
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
contrasts <- makeContrasts((age2.tissue2 + age1.tissue2) - (age2.tissue1 + age1.tissue1), (age2.tissue2 + age2.tissue1) - (age1.tissue2 + age1.tissue1),
(age2.tissue2 - age2.tissue1) - (age1.tissue2 - age1.tissue1), levels = colnames(design))
colnames(contrasts) <- c("tissue", "age", "tissue.age")

Option 2 - Defining an interaction model

design <- model.matrix(~Tissue + Age + Tissue:Age, data = samples)
d <- DGEList(counts = counts)
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)

Then, I use glmLRT to test the contrasts (option 1) or coefficients (option 2).

When I use option 1, I get that the numbers of differentially expressed genes testing the contrasts "tissue", "age", "tissue.age" are (q-value < 0.05) are: 6541, 10196, 1956.

Using option 2, testing the coefficients 2,3,4, the numbers of differentially expressed genes are: 4490, 7635, 1956.

That is, I get higher numbers of genes differentially expressed between the different ages and between the different tissues when using option 1. However, in practice, I think that the results of option 1 are better, in the sense that I get better enrichments down the road.

What is the different assumptions that the two option of analysis make? Can you recommend me which one to follow?

edgeR R differential gene expression differential expression • 863 views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 12 hours ago
The city by the bay

Ah, the old problem of interpreting main effects in an interaction model. Let's have a look at a simple example:

Tissue <- factor(rep(1:2, each=2))
Age <- factor(rep(1:2, 2))
design <- model.matrix(~Tissue + Age + Tissue:Age)

Now, let's look at design:

  (Intercept) Tissue2 Age2 Tissue2:Age2
1           1       0    0            0
2           1       0    1            0
3           1       1    0            0
4           1       1    1            1

The column name of the second coefficient might suggest that it represents the tissue effect, such that dropping it would test for the difference between tissue 1 and 2. However, what actually happens is that dropping the second coefficient will only test for DE between samples 1 and 3 (i.e., between tissue 1 and 2 for age 1). In the null model, the expression of sample 2 will still be fitted perfectly due to the freedom of the third coefficient, while the expression of sample 4 will be fitted perfectly due to the fourth coefficient - thus, they will not contribute to the GLM deviance or the p-value. Similarly, dropping the third coefficient would only test for DE between samples 1 and 2 (i.e., between age 1 and 2 for tissue 1), even though the column name might suggest that it represents the age effect.

This is why you're getting differences in the number of DE genes, because you're fundamentally performing different DE contrasts. With the interaction model, the main effects can't be interpreted in the presence of a significantly non-zero interaction term, because that allows sample 4 to always be fitted perfectly (which then allows samples 2 or 3 to be fitted perfectly as well, depending on which coefficient is dropped). More intuitively, if there's a strong interaction effect, you might have a scenario where you get upregulation due to age in tissue 1 and downregulation due to age in tissue 2. How could you sensibly define a "main effect of age" in this situation?

The conventional approach to this problem is to test interaction terms first, and if they're not significant, drop them and proceed onto the tests for the main effects. This is a bit tricky to do for gene expression studies, especially if some of the genes have significant interaction terms and others do not, which makes it difficult to model the variance by sharing information across genes. The simpler approach is to just compare the averages within each group, as you've done in option 1, to gauge whether there is an average increase or decrease due to time or age. Note that if you want accurate fold changes for your tissue or age contrasts, you need to divide by two:

(age2.tissue2 + age1.tissue2)/2 - (age2.tissue1 + age1.tissue1)/2
ADD COMMENT

Login before adding your answer.

Traffic: 627 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6