RNA-Seq analysis with Limma/edgeR, time course experiment, aberrant results
0
Entering edit mode
astigliani31 • 10
@astigliani31-24235
Last seen 5 months ago

Hello everyone,

I am working on RNA-Seq data. I have 2 conditions (C = control, T = treatment), 4 time points (1, 5, 8 and 10) and 2 or 3 replicates (depending on the quality of the libraries) for each {condition, time point}

I am interested in the genes whose the expression patterns are different in the control and the treatment over time. It means that I want the genes whose the Fold change is different from 1 in at least one time point.

To this end, I build the model :

~ condi + time + time:condi


where condi is my condition (C or T) and time is the time point (1, 5, 8 and 10) so that my design matrix is :

Intercept  condiT  time5  time8  time10  condiT.time5  condiT.time8  condiT.time10
1          0       0      0      0       0             0             0
1          0       0      0      0       0             0             0
1          0       0      0      0       0             0             0
1          1       0      0      0       0             0             0
1          1       0      0      0       0             0             0
1          1       0      0      0       0             0             0
1          0       1      0      0       0             0             0
1          0       1      0      0       0             0             0
1          0       1      0      0       0             0             0
1          1       1      0      0       1             0             0
1          1       1      0      0       1             0             0
1          1       1      0      0       1             0             0
1          0       0      1      0       0             0             0
1          0       0      1      0       0             0             0
1          0       0      1      0       0             0             0
1          1       0      1      0       0             1             0
1          1       0      1      0       0             1             0
1          0       0      0      1       0             0             0
1          0       0      0      1       0             0             0
1          0       0      0      1       0             0             0
1          1       0      0      1       0             0             1
1          1       0      0      1       0             0             1
1          1       0      0      1       0             0             1


Note that at time point {1}, my control (C) and my treatments (T) are biologically identical (they are actual replicates). It means that the differences between C and T at a given time point (5,8 or 10) is modeled by the the interaction term condiT.time(5, 8, or 10), as the condiT term should be 0 for almost every gene. So, I don't need to build a contrast matrix.

For sanity checks, I look at the importance of the condiT term, It gives me only one gene for which this condiT term is significantly different from 0 (pval<0.05), and I am super happy with this. code : topTable(efit, coef=c(2), n=Inf, p.value=0.05) # the coeff '2' is the condiT term

If I check the importance of the condi.Time terms (which is what I am interested in), It gives me 16 genes. code : topTable(efit, coef=c(6:8), n=Inf, p.value=0.05) # the coeff '(6:8)' are the interaction terms

As this is a very few genes, I tried to see what was going on, so I added the condiT tem to the test :

code : topTable(efit, coef=c(2,6:8), n=Inf, p.value=0.05) # same line of code as the one above but with '(2,6:8)' instead of '(6:8)' And this gives me 1531 genes for which this four terms are important (pval<0.05) to model the gene expression...

I just don't get how 16 genes become 1531 genes by adding a term that is not significant taken independently.

Would someone be able to help me ? I am very confused and I have no idea what to trust. Thanks a lot in advance.

Here is my code, if you want to have a look:

design <- model.matrix(  ~  condi + time + time:condi , data=coldata)
colnames(design) <- str_replace_all(colnames(design),":",".") %>% str_replace_all(.,"[()]","") # remove parenthesis from Intercept and change ':' to '.'
idx <- filterByExpr(gse, design) # gse is my expression matrix
dds <- gse[idx,,keep.lib.size=FALSE]
dds <- calcNormFactors(dds)
v <- voom(dds, design)
fit <- lmFit(v, design)
efit <- eBayes(fit)


Best,

Arnaud Stigliani

University of Copenhagen

limma edger • 266 views
2
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

When you do

topTable(efit, 2, Inf, p.value = 0.05)



You are testing for any gene that is different between the two conditions.

when you do

topTable(efit, 6:8, Inf, p.value = 0.05)


You are now doing an F-test, where you are testing for any differences between those three coefficients, each of which is computing the interaction between T and C, between a given time and time zero. This makes some sense; you want to know if any genes are affected differently at one of the time points. I probably wouldn't do that, but instead would do the three t-tests for each interaction, but it's not my analysis.

If you then add in coefficient 2 to your F-test, you are testing for any gene that has a coefficient that is different between the three interaction terms and the treatment main effect. This doesn't make sense! Comparing four things, one of which is a completely different thing, isn't a reasonable thing to do. So yeah, you get more genes, but you should expect that because you are making a nonsensical comparison.

1
Entering edit mode

Dear James,

Can you elaborate when you say 'you are testing for any gene that has a coefficient that is different between the three interaction terms and the treatment main effect' ?

I thought that I was testing this model :

Intercept + condiT +  time5 + time8 +  time10 + condiT.time5 +  condiT.time8 +  condiT.time10


Against this one

Intercept + time5 + time8 +  time10


And I was surprised by the fact that this condiT influences the results as is seems to me that it is equal to 0 (Basically, I thought that it was useless to model the gene expression).

Best,

Arnaud

0
Entering edit mode

Could you elaborate on why the interpretation of c(2,6:8) is not just if the coefficient is different from zero in any of the 4 terms?

2
Entering edit mode

Well, what I said isn't 100% correct. An F-test in conventional ANOVA does just what I said - testing to see if the mean of any groups is different. Not testing if the mean of any group is equal to zero.

However, in this context none of the coefficients you are estimating is the mean of a group, but instead is a contrast of some sort. So what limma does is use a function called classifyTestsF, which is intended to compute an F-statistic using all the underlying t-statistics for each coefficient, with the goal of detecting any situations where one or more of the underlying t-statistics are significant. This is still pretty much the same thing as a conventional F-test, because if any of the t-statistics are significant, then by definition there are differences in the means of each group.

So why you get more significant genes using the F-test including the main effect is probably due to increased power due to the increased degrees of freedom? I don't know. In general I stay away from this sort of model formulation, because there's no profit in it. You really have 8 groups (four time points x two treatments), and in general I would just compute a cell means model by combining them

combo <- paste(condi, time, sep = "_")
design <- model.matrix(~ 0 + combo)
## make contrasts matrix here
fit <- lmFit(v, design)
fit2 <- eBayes(contrasts.fit(fit, contrasts))


You can make any contrast you care to make, without having to deal with a treatment-contrasts design.

0
Entering edit mode

Hi James. Thanks for the explanation it really helped although I guess the increased power is still a mystery. I'm sorry I'm not familiar enough with contrast fits but would that approach also result in using a F-test or in 4 individual comparisons? I like the former approach as the result is a single p-value instead of multiple.

3
Entering edit mode

Why do you want a single p-value? The goal is usually to find genes that are varying at a particular time point. I guess you could detect genes that vary at any of the time points, but then the next question is 'but which one?' at which point you have to do the individual comparisons anyway, so why not start there? Anyway, it's simple

> combo <- paste(rep(rep(c("N","T"), 4), c(3,3,3,3,3,2,3,3)), rep(c(1,5,8,10), c(6,6,5,6)), sep = "_")
> combo
[1] "N_1"  "N_1"  "N_1"  "T_1"  "T_1"  "T_1"  "N_5"  "N_5"  "N_5"  "T_5"
[11] "T_5"  "T_5"  "N_8"  "N_8"  "N_8"  "T_8"  "T_8"  "N_10" "N_10" "N_10"
[21] "T_10" "T_10" "T_10"
> design <- model.matrix(~0 + combo)
> colnames(design) <- gsub("combo","", colnames(design))
> contrast <- makeContrasts((T_5 - T_1) - (N_5 - N_1),
+ (T_8 - T_1) - (N_8 - N_1),
+ (T_10 - T_1) - (N_10 - N_1),
+ (T_1 + T_5 + T_8 + T_10)/4 - (N_1 + N_5 + N_8 + N_10)/4,
+ levels = design)



Now the first three coefficients are your interaction terms, and the last one is the main effect, testing if the average T is different from the average N. If you want the F-test, you can still do that. But you can also see that the F-test using all four coefficients is sort of weird, because three are doing similar tests and the last one isn't. I'm not smart enough to know if that is a problem or not, but it looks weird to me so I probably wouldn't do it to begin with.

0
Entering edit mode

Dear James,

I thank you very much for your explanations and your detailed answers. I think that you helped me a lot ! I will try to go for a contrast matrix, which seems more straightforward for my analysis.

Best, Arnaud

0
Entering edit mode

(T_1 + T_5 + T_8 + T_10)/4 - (N_1 + N_5 + N_8 + N_10)/4


Does it mean that I am comparing the mean of all the genes expressed in T vs N ? Or does it mean that I am testing at the same time 4 pair-wise comparisons ?

Thank you !

2
Entering edit mode

I think your definitions of the comparison are really the same thing, algebraically. The way I defined the contrast, it's the average of the T samples as compared to the N samples. But it's a simple agebraic manipulation to change to

(T_1 - N_1)/4 + (T_5 - N_5)/4 + (T_8 - N_8)/4 + (T_10 - N_10)/4


In which case it's the average of the within timepoint differences. But it's the same thing! We have just used words to describe it in a slightly different way, but algebraically they are identical.

0
Entering edit mode

Best, Arnaud