Dear all,
I am analysing an Affymetrix microarray dataset using Affy and limma but have encountered a problem with my code.
The experiment has two genotypes (RAB and N50) and two treatments (C and D), with three reps of each combination. I would like to know 1) which genes are differentially expressed due to treatment, 2) which genes are differentially expressed due to genotype and 3) which genes are differentially expressed due to a genotype*treatment interaction.
Firstly, I have normalised and corrected for background etc using gcrma and written a mydata file which is a matrix of expression data for all genes for each of the 12 samples.
I have also made a targets file, which specifies the file name meanings as below:
FileName Genotype Treatment N50C2 N50 C N50C4 N50 C N50C7 N50 C N50D1 N50 D N50D4 N50 D N50D8 N50 D RABC3 RAB C RABC8 RAB C RABC12 RAB C RABD2 RAB D RABD9 RAB D RABD10 RAB D > targets <- readTargets("targets.txt") > GT <- paste(targets$Genotype, targets$Treatment, sep=".") > design <- model.matrix(~Genotype*Treatment) > Genotype <- factor(targets$Genotype, levels=c("N50", "RAB")) > Treatment <- factor(targets$Treatment, levels=c("C", "D")) > design <- model.matrix(~Genotype*Treatment) > fit <- lmFit(eset, design) > colnames(design) [1] "(Intercept)" "Genotype1" "Treatment1" [4] "Genotype1:Treatment1"
My question is how to extract the contrasts which I am interested in at this point? From reading the manual, I understand I need to make a contrast matrix but I am unsure how to do this. Once I have this line of code, I think the next steps are:
> fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2) > topTable (fit2, coef=1) # and so on for other coefficients
Is this correct? Any help you could give me would be very much appreciated. If I need to give any more information then please do let me know.
Many thanks,
Hazel
Following up on James' answer, the explicit command to make the contrast matrix (using the above notation) would be:
This would then be followed by:
The specification of
coef
is important here, as it refers to the comparison in the corresponding column of the contrast matrix.---
Edit: You can also accommodate comparisons between genotypes, e.g., by defining
RAB_D - N50_D
orRAB_C - N50_C
in your contrast matrix. If you want to compare between genotypes across both treatments, you have two options. One is to use the two pairwise comparisons that I just mentioned, and then intersect the resulting DE lists. The other is to define a contrast like(RAB_D + RAB_C) - (N50_D + N50_C)
, i.e., test for equality in the average of each genotype. The second approach is more powerful but tends to be harder to interpret, as there needn't be a consistent change in the pairwise comparisons for rejection to occur. This logic also applies for comparisons between treatments across the two genotypes.Hi Aaron and sorry for my intrusion! I would like to get more insight into the last advice you gave in your comment, i.e. the definition of a contrast like
(RAB_D + RAB_C) - (N50_D + N50_C)
.Why you said it's more powerful but harder to interpret? Is this difficult related to the fact that there could be an interaction between the genotype and the treatment? I don't know if this could be a good example, but let's suppose one have this expression values for a gene:
So this gene has means 15.64 for RAB_D, 9.46 for RAB_C, 9.47 for N50_D and 15.82 for N50_C. I would think this gene can be considered differentially expressed for both the pairwise comparisons of treatment inside the genotypes (with opposite signs of the FC), and also there should be a significant interaction term between the two factors. (Am I wrong?)
If this is true, when one use the contrast (RAB_D + RAB_C)/2 - (N50_D + N50_C)/2 the gene would probably not be called DE due to the compensation of the two effects, but actually I think it should be considered DE.
So I think that in general if there isn't a significant interaction between the factors one can expect that using the above contrast and take the intersection of the two pairwise comparisons should give more or less the same results. If, instead, there is a significant interaction of the factors for one gene, only if the two pairwise comparisons call that gene as DE, one should see if the signs of the two FC (of the pairwise comparisons) are opposite or not. If they are, one should insert that gene into the list of DE.
Do you think this could be true?
Marco
The contrast is more powerful because it can detect genes where, e.g.,
RAB_D
is much greater thanN50_D
, butRAB_C
is only slightly greater thanN50_C
. In such cases, a pairwise comparison ofRAB_C - N50_C
will fail to pick up this gene as being DE, so it wouldn't show up in the intersection. The average-based contrast will detect this gene asRAB_D + RAB_C
will still be greater thanN50_D + N50_C
, due to the greater size ofRAB_D
compared toN50_D
.The reason that it's difficult to interpret is that this approach will pick up a lot of possibly irrelevant genes. For example, if
RAB_D
is much greater thanN50_D
, butRAB_C
is slightly lower thanRAB_D
, then the average-based contrast will still detect this gene. This may not be desirable given that there is no consistent change in both RAB genotypes over their N50 counterparts.As to your specific example, you are correct in saying that this gene will not be called as DE by the average-based contrast. Lack of detection here is sensible to me, as there's no consistent change between treatments across the two genotypes. If you want it to be detected, then you should use the intersection approach on two pairwise comparisons instead. You may also want to filter on the sign of the log-fold change prior to intersection, depending on what you want to do.
Finally, these two approaches will indeed be qualitatively similar if the interaction term is not significant. However, in practice, the intersection operation will be much more conservative. This is because a gene must reject the null separately in each pairwise comparison, whereas the average-based contrast can combine information from both comparisons. Thus, I would generally expect to see fewer DE genes with the former approach.