Dear all,
my apologies if this topic has some redundancy with my previous posts but I think this specific question deserves to be treated separately.
Previous posts:
https://www.biostars.org/p/321516/#321565
C: EdgeR analysis batch effect
I am trying to integrate a differential gene expression analysis into my automatized workflow.
For convenience reasons (automatic analyses), I do not specify my edgeR model with an intercept as recommended in the documentation.
Instead, I took inspiration from this sentence in chapter "3.3 Experiments with all combinations of multiple factors": "A simple, multi-purpose approach is to combine all the experimental factors into one combined factor".
This approach allows me to build a general purpose model that users can query with contrasts of interest.
To be honest, with my computer science background, I have a hard time understanding the 'formula'-based approach to specify an analysis model (though it allows for an efficient way to build complex matrices), and specially the use of intercept term.
In brief, I usually create a factor that summarizes all combinations of experimental conditions and generate a matrix using: model.matrix(~0+groupsFactor, data=myCountMatrix)
.
As an example, let's imagine we want to study bacterias in 2 different growing phases (G), under different temperature conditions (T).
Here is an example of such a "generic" design matrix :
G1_T1 G1_T2 G1_T3 G2_T1 G2_T2 G2_T3 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
Then, users can happily (?) provide a contrast matrix to query the model:
temp1 VS temp2: G1_T1 G1_T2 G1_T3 G2_T1 G2_T2 G2_T3 1 -1 0 1 -1 0
temp1 VS temp2 in G1 only: G1_T1 G1_T2 G1_T3 G2_T1 G2_T2 G2_T3 1 -1 0 0 0 0
temp1 VS temp2 in G2 only: G1_T1 G1_T2 G1_T3 G2_T1 G2_T2 G2_T3 0 0 0 1 -1 0
G2 VS G1: G1_T1 G1_T2 G1_T3 G2_T1 G2_T2 G2_T3 1 1 1 -1 -1 -1
Question 1: I managed to automatize this whole process but I would appreciate your opinion on how correct/wrong is this approach regarding the analysis ?
Now let's imagine that growing phase (G) condition drives most of the expression differences between my samples (it actually does).
Question 2: So far I was conducting analyses separately as shown above (temp1 VS temp2 in G1 only, then temp1 VS temp2 in G2 only), is this acceptable ?
However, if I understood correctly edgeR's documentation, it is recommended to block G for direct comparison of temperature conditions.
Question 3: In this case (and assuming answer to question 1 is ok), what would my model matrix needs to look like ?
I presume something like that but I might be completely wrong:
T1 T2 T3 G1 G2 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0 1
With such a matrix, I would expect that differences between G1 and G2 are fitted by the resulting model and that I can directly compare temperatures with appropriate contrasts.
Question 4: How does such approach compare to the first matrix in this post with the contrast "temp1 VS temp2": 1 -1 0 1 -1 0
When using the first matrix, I presumed that differences between temperatures in G1 and G2 would be taken in account by (larger) variance estimation.
Question 5 (last one): When I try to build the matrix (with blocking) using model.matrix(~0+temperature+growingPhase), data=myCountMatrix)
, I get:
T1 T2 T3 G1 1 0 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0
How correct is this matrix as compared to the one I was expecting (with additional G2 column) ?
Sorry for the long post and thank ou very much for your help.
R.
Hello James,
thank you so much for this comprehensive answer, it helps a lot.
About your sentence "But your original design is easier to use and interpret" in answer 3. Do you suggest that both method can answer similar questions ? I agree that original design is more intuitive but I thought that it failed to model differences between G1 and G2 (hence the need for implicit blocking in the design matrix).
In other words, would the following strategies be mathematically equivalent ?
Orignal design + contrast "0.5 -0.5 0 0.5 -0.5 0"
Second design + contrast "1 -1 0 0" (?)
In other words (again), is there a way to implement blocking at the "contrast specification stage" with original design (which was the original idea) ?
Thank you.
No, you can't introduce blocking in a contrast. At that point you are simply computing differences between your coefficients, and if there wasn't blocking to begin with you can't add it in.
As to my statement about the cell means model being easier to interpret than the factor effects model (e.g., the first vs third design you specify), this is dependent on my answer #2, which was that there is no universally correct way to analyze data. There are obviously wrong ways to do so, but reasonable people can argue that one model is more 'reasonable' than another, for some definition of reasonable.
The original design and contrast are testing to see if the average expression at T1 for the G1 and G2 phase is different from the average at T2 for those phases. Is it reasonable to do that? Or are the phases so different that it makes no sense? If there is an interaction between T and G for some of the genes then that contrast makes no sense for those genes.
The second design and contrast can be interpreted a couple of ways. One could say that you are comparing T1 and T2 for the G2 phase (if we use the third design you provided), and incorporating information from the G1 phase data by subtracting out the differences between G2 and G1. Or you could say that you are comparing T1 and T2 by controlling for differences between the two phases (which is my preferred way of thinking about it - the fact that we are adjusting all the data to G2 equivalent levels is irrelevant; we could use a different parameterization that controls for the G2 vs G1 differences but still get the same results). But this is dependent on the idea that the G1 and G2 data are simply shifted up or down, and by subtracting out the mean of one group we can adjust to make things comparable. As with the previous model, if there is an interaction between T and G for some genes, then that model and contrast make no sense for those genes.
But either way, the two different contrasts are testing different questions, and which you use is dependent on the question you are attempting to answer.
Hello James,
thank you again for this excellent answer !
I appreciate your effort to explain details, and I'll definitely use this post as reference for future analyses.
The example exposed here has been over-simplified so we could focus on the technical details about blocking. As you guessed, G1 and G2 phases are obviously too different to be combined in a single analysis (most of metabolism is supposed to be different). So far, I was computing temperature-dependent analyses separately for G1 and G2 phases. Your answers convinced me to keep doing it separately.
My concerns with using blocking in the model came when I received a second batch of (partially overlapping) dataset, that we hoped to integrate with previous analysis (see https://support.bioconductor.org/p/112953/#112995).
In this context, I believe that blocking makes sense in order to fit and 'correct' batch effect, based on common conditions between batches.
Now I can proceed with integration of blocking mechanism in my script.
Before I do however, could you confirm that when using the following contrast (in the theoretical example used in this post), the (average) differences between G1 and G2 phases are modeled (/compensated) when comparing T1 vs T2 ?
In other words, is the 4th value (0) fine because differences between G1 and G2 are implicitly taken in account by the model ?
Thank you.