Question: limma contrast matrix
0
4.9 years ago by
Smith H.10
United Kingdom
Smith H.10 wrote:

Dear all,

I am analysing an Affymetrix microarray dataset using Affy and limma but have encountered a problem with my code.

The experiment has two genotypes (RAB and N50) and two treatments (C and D), with three reps of each combination. I would like to know 1) which genes are differentially expressed due to treatment, 2) which genes are differentially expressed due to genotype and 3) which genes are differentially expressed due to a genotype*treatment interaction.

Firstly, I have normalised and corrected for background etc using gcrma and written a mydata file which is a matrix of expression data for all genes for each of the 12 samples.

I have also made a targets file, which specifies the file name meanings as below:

FileName      Genotype         Treatment
N50C2              N50                 C
N50C4              N50                 C
N50C7              N50                 C
N50D1              N50                 D
N50D4              N50                 D
N50D8              N50                 D
RABC3              RAB                 C
RABC8              RAB                 C
RABC12             RAB                 C
RABD2              RAB                 D
RABD9              RAB                 D
RABD10             RAB                 D

> GT <- paste(targets$Genotype, targets$Treatment, sep=".")
> design <- model.matrix(~Genotype*Treatment)
> Genotype <- factor(targets$Genotype, levels=c("N50", "RAB")) > Treatment <- factor(targets$Treatment, levels=c("C", "D"))
> design <- model.matrix(~Genotype*Treatment)
> fit <- lmFit(eset, design)
> colnames(design)
[1] "(Intercept)"          "Genotype1"            "Treatment1"
[4] "Genotype1:Treatment1"

My question is how to extract the contrasts which I am interested in at this point? From reading the manual, I understand I need to make a contrast matrix but I am unsure how to do this. Once I have this line of code, I think the next steps are:

> fit2 <- contrasts.fit(fit, cont.matrix)
> fit2 <- eBayes(fit2)
> topTable (fit2, coef=1) # and so on for other coefficients

Is this correct? Any help you could give me would be very much appreciated. If I need to give any more information then please do let me know.

Many thanks,

Hazel

modified 4.9 years ago • written 4.9 years ago by Smith H.10
1
4.9 years ago by
United States
James W. MacDonald51k wrote:

You don't need a contrasts matrix for this experiment, as your coefficients are implicit comparisons as it is. Let's do a design matrix in such a way that it is more obvious.

> gt <- gl(2,6,labels = c("N50","RAB"))
> gt
[1] N50 N50 N50 N50 N50 N50 RAB RAB RAB RAB RAB RAB
Levels: N50 RAB
> trt <- gl(2,3,12,c("C","D"))
> trt
[1] C C C D D D C C C D D D
Levels: C D
> model.matrix(~gt*trt)
(Intercept) gtRAB trtD gtRAB:trtD
1            1     0    0          0
2            1     0    0          0
3            1     0    0          0
4            1     0    1          0
5            1     0    1          0
6            1     0    1          0
7            1     1    0          0
8            1     1    0          0
9            1     1    0          0
10           1     1    1          1
11           1     1    1          1
12           1     1    1          1

The gtRAB coefficient is the difference between RAB and N50 (the genotype main effect), trtD is the difference between D and C (the treatment main effect), and gtRAB:trtD is the interaction.

But this is probably not what you want to do, as these main effects are probably not what you are after.

> z <- factor(paste(gt, trt, sep = "_"))
> design <- model.matrix(~0+z)
> colnames(design) <- gsub("z", "", colnames(design))
> design
N50_C N50_D RAB_C RAB_D
1      1     0     0     0
2      1     0     0     0
3      1     0     0     0
4      0     1     0     0
5      0     1     0     0
6      0     1     0     0
7      0     0     1     0
8      0     0     1     0
9      0     0     1     0
10     0     0     0     1
11     0     0     0     1
12     0     0     0     1

In which case each coefficient is estimating the mean of each genotype_treatment combination. So you can now test for differences in treatment for the N50 genotype (N50_D - N50_C), differences in treatment for the RAB genotype (RAB_D - RAB_C) and the interaction (N50_D - N50_C - RAB_D + RAB_C).

Following up on James' answer, the explicit command to make the contrast matrix (using the above notation) would be:

cont.matrix <- makeContrasts(
N50_D - N50_C,
RAB_D - RAB_C,
N50_D - N50_C - (RAB_D - RAB_C),
levels=design)

This would then be followed by:

fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
topTable(fit2, coef=1) # For differences in N50
topTable(fit2, coef=2) # For differences in RAB
topTable(fit2, coef=3) # For the interaction

The specification of coef is important here, as it refers to the comparison in the corresponding column of the contrast matrix.

---

Edit: You can also accommodate comparisons between genotypes, e.g., by defining RAB_D - N50_D or RAB_C - N50_C in your contrast matrix. If you want to compare between genotypes across both treatments, you have two options. One is to use the two pairwise comparisons that I just mentioned, and then intersect the resulting DE lists. The other is to define a contrast like (RAB_D + RAB_C) - (N50_D + N50_C), i.e., test for equality in the average of each genotype. The second approach is more powerful but tends to be harder to interpret, as there needn't be a consistent change in the pairwise comparisons for rejection to occur. This logic also applies for comparisons between treatments across the two genotypes.

Hi Aaron and sorry for my intrusion! I would like to get more insight into the last advice you gave in your comment, i.e. the definition of a contrast like (RAB_D + RAB_C) - (N50_D + N50_C).

Why you said it's more powerful but harder to interpret? Is this difficult related to the fact that there could be an interaction between the genotype and the treatment? I don't know if this could be a good example, but let's suppose one have this expression values for a gene:

    RAB_D     RAB_D     RAB_D     RAB_C     RAB_C     RAB_C     N50_D     N50_D     N50_D     N50_C     N50_C     N50_C
15.696070 15.855590 15.379840  9.460456  9.270855  9.660012  9.210566  9.584845  9.606120 15.767570 15.999980 15.697920

So this gene has means 15.64 for RAB_D, 9.46 for RAB_C, 9.47 for N50_D and 15.82 for N50_C. I would think this gene can be considered differentially expressed for both the pairwise comparisons of treatment inside the genotypes (with opposite signs of the FC), and also there should be a significant interaction term between the two factors. (Am I wrong?)

If this is true, when one use the contrast (RAB_D + RAB_C)/2 - (N50_D + N50_C)/2 the gene would probably not be called DE due to the compensation of the two effects, but actually I think it should be considered DE.

So I think that in general if there isn't a significant interaction between the factors one can expect that using the above contrast and take the intersection of the two pairwise comparisons should give more or less the same results. If, instead, there is a significant interaction of the factors for one gene, only if the two pairwise comparisons call that gene as DE, one should see if the signs of the two FC (of the pairwise comparisons) are opposite or not. If they are, one should insert that gene into the list of DE.

Do you think this could be true?

Marco

The contrast is more powerful because it can detect genes where, e.g., RAB_D is much greater than N50_D, but RAB_C is only slightly greater than N50_C. In such cases, a pairwise comparison of RAB_C - N50_C will fail to pick up this gene as being DE, so it wouldn't show up in the intersection. The average-based contrast will detect this gene as RAB_D + RAB_C will still be greater than N50_D + N50_C, due to the greater size of RAB_D compared to N50_D.

The reason that it's difficult to interpret is that this approach will pick up a lot of possibly irrelevant genes. For example, if RAB_D is much greater than N50_D, but RAB_C is slightly lower than RAB_D, then the average-based contrast will still detect this gene. This may not be desirable given that there is no consistent change in both RAB genotypes over their N50 counterparts.

As to your specific example, you are correct in saying that this gene will not be called as DE by the average-based contrast. Lack of detection here is sensible to me, as there's no consistent change between treatments across the two genotypes. If you want it to be detected, then you should use the intersection approach on two pairwise comparisons instead. You may also want to filter on the sign of the log-fold change prior to intersection, depending on what you want to do.

Finally, these two approaches will indeed be qualitatively similar if the interaction term is not significant. However, in practice, the intersection operation will be much more conservative. This is because a gene must reject the null separately in each pairwise comparison, whereas the average-based contrast can combine information from both comparisons. Thus, I would generally expect to see fewer DE genes with the former approach.

0
4.9 years ago by
Smith H.10
United Kingdom
Smith H.10 wrote:

Thank you all for your help and advice on this (and sorry for the delay, I have not been well). I have now tried your code and am finding no DE genes for coef 1 with adjusted p values all above 0.99. For coef 2 I get DE genes but if I change the order of my contrast matrix and reverse coef 1 and 2, the first now has no DE genes. Is there something I am doing wrong here?

Many thanks again for all of your help,

Hazel

What do you mean by reversing the coefficients? Do you mean setting, e.g., N50_C - N50_D instead of N50_D - N50_C? Or do you mean swapping, e.g., the N50 contrast with the RAB contrast?

Hi Aaron,

Sorry for the confusion. I meant that if I ask for the first coefficient to be RAB_D - RAB_C, rather than N50_D - N50_C, I get no significant DE genes for whichever is coefficient 1.

I hope that clarifies it.

Thanks,

Hazel

That's a bit odd. Can you show me the commands you used to construct each contrast matrix, as well as the matrices themselves? I'd also like to see the design matrix.

Hi Aaron,

> eset <- gcrma(Data)

> write.exprs(eset, file="mydata.txt")

> gt <- gl(2,6,labels = c("N50","RAB"))

> z <- factor(paste(gt, trt, sep = "_"))

> design <- model.matrix(~0+z)
> colnames(design) <- gsub("z", "", colnames(design))
> design
N50_C N50_D RAB_C RAB_D
1      1     0     0     0
2      1     0     0     0
3      1     0     0     0
4      0     1     0     0
5      0     1     0     0
6      0     1     0     0
7      0     0     1     0
8      0     0     1     0
9      0     0     1     0
10     0     0     0     1
11     0     0     0     1
12     0     0     0     1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")\$z
[1] "contr.treatment"

> fit <- lmFit(eset, design)

> cont.matrix <- makeContrasts(
+     N50_D - N50_C,
+     RAB_D - RAB_C,
+     N50_D - N50_C - (RAB_D - RAB_C),
+     levels=design)

> fit2 <- contrasts.fit(fit, cont.matrix)
> fit2 <- eBayes(fit2)
> topTable(fit2, coef=1)

If I change the contrastmatrix to the following, I also get no DE genes for coef 1:

> cont.matrix <- makeContrasts(
+     RAB_D - RAB_C,

+     N50_D - N50_C,

+     N50_D - N50_C - (RAB_D - RAB_C),
+     levels=design)

Any help would be greatly appreciated.

Many thanks,

Hazel

What are the values of cont.matrix in each case? I assume you're re-running contrasts.fit with the second cont.matrix. To be clear, can you also run:

cont.matrix <- makeContrasts(
N50_D - N50_C,
RAB_D - RAB_C,
N50_D - N50_C - (RAB_D - RAB_C),
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
cont.matrix.B <- makeContrasts(
RAB_D - RAB_C,
N50_D - N50_C,
N50_D - N50_C - (RAB_D - RAB_C),
levels=design)
fit3 <- contrasts.fit(fit, cont.matrix.B)
fit3 <- eBayes(fit3)

and show us topTable(fit2, coef=1) and topTable(fit3, coef=1)? This gives me different results on my toy example.

My contrast matrices seem fine - they're the same as when I construct them using your code as below:

> cont.matrix
Contrasts
Levels  N50_D - N50_C RAB_D - RAB_C N50_D - N50_C - (RAB_D - RAB_C)
N50_C            -1             0                              -1
N50_D             1             0                               1
RAB_C             0            -1                               1
RAB_D             0             1                              -1

> cont.matrix.B
Contrasts
Levels  RAB_D - RAB_C N50_D - N50_C N50_D - N50_C - (RAB_D - RAB_C)
N50_C             0            -1                              -1
N50_D             0             1                               1
RAB_C            -1             0                               1
RAB_D             1             0                              -1

I am very surprised that I am getting no genes which are differentially expressed in N50_D - N50_C (a genotype comparison of control versus drought) with adjusted p values of >0.09 for one gene and 1.00 for all of the others. I had assumed this was an error with my code but perhaps it is biologically correct.

What do the top tables look like with each contrast matrix? Specifically, run:

topTable(fit2, coef=1) # N50
topTable(fit2, coef=2) # RAB
topTable(fit3, coef=1) # RAB
topTable(fit3, coef=2) # N50

Do the two N50 comparisons give the same set of results? Similarly, do the two RAB comparisons give the same results?

The topTables are the same in the two fits but I wonder if I have done something wrong earlier in the analysis that the adjusted p values are so high for N50,

> topTable(fit2, coef=1)
PtpAffx.140895.1.A1_at -0.17416286 2.286320 -11.848888 1.502722e-06 0.09228666
PtpAffx.6581.6.S1_a_at  0.32858875 9.465015   7.936903 3.417794e-05 1.00000000
PtpAffx.58692.1.A1_at  -0.16993640 2.278799  -6.992487 8.758426e-05 1.00000000
PtpAffx.215125.1.S1_at  0.04326888 2.413074   6.709572 1.182694e-04 1.00000000
PtpAffx.20668.1.S1_at   0.53806953 3.459342   6.550605 1.405775e-04 1.00000000
PtpAffx.102151.1.A1_at -0.14145705 3.756571  -6.350864 1.754086e-04 1.00000000
PtpAffx.208291.1.S1_at  0.48555994 2.538415   6.327287 1.801095e-04 1.00000000
PtpAffx.98933.1.S1_at  -0.89817974 3.595177  -5.791974 3.345567e-04 1.00000000
Ptp.1663.1.A1_s_at      0.59402564 8.322859   5.676405 3.842785e-04 1.00000000
PtpAffx.218179.1.S1_at -0.03392192 2.239123  -5.645168 3.990654e-04 1.00000000
B
PtpAffx.140895.1.A1_at 5.6096651
PtpAffx.6581.6.S1_a_at 2.8675151
PtpAffx.58692.1.A1_at  1.9741810
PtpAffx.215125.1.S1_at 1.6842578
PtpAffx.20668.1.S1_at  1.5165472
PtpAffx.102151.1.A1_at 1.3007798
PtpAffx.208291.1.S1_at 1.2749351
PtpAffx.98933.1.S1_at  0.6661223
Ptp.1663.1.A1_s_at     0.5290209
PtpAffx.218179.1.S1_at 0.4916112

> topTable(fit3, coef=2) # N50
PtpAffx.140895.1.A1_at -0.17416286 2.286320 -11.848888 1.502722e-06 0.09228666
PtpAffx.6581.6.S1_a_at  0.32858875 9.465015   7.936903 3.417794e-05 1.00000000
PtpAffx.58692.1.A1_at  -0.16993640 2.278799  -6.992487 8.758426e-05 1.00000000
PtpAffx.215125.1.S1_at  0.04326888 2.413074   6.709572 1.182694e-04 1.00000000
PtpAffx.20668.1.S1_at   0.53806953 3.459342   6.550605 1.405775e-04 1.00000000
PtpAffx.102151.1.A1_at -0.14145705 3.756571  -6.350864 1.754086e-04 1.00000000
PtpAffx.208291.1.S1_at  0.48555994 2.538415   6.327287 1.801095e-04 1.00000000
PtpAffx.98933.1.S1_at  -0.89817974 3.595177  -5.791974 3.345567e-04 1.00000000
Ptp.1663.1.A1_s_at      0.59402564 8.322859   5.676405 3.842785e-04 1.00000000
PtpAffx.218179.1.S1_at -0.03392192 2.239123  -5.645168 3.990654e-04 1.00000000
B
PtpAffx.140895.1.A1_at 5.6096651
PtpAffx.6581.6.S1_a_at 2.8675151
PtpAffx.58692.1.A1_at  1.9741810
PtpAffx.215125.1.S1_at 1.6842578
PtpAffx.20668.1.S1_at  1.5165472
PtpAffx.102151.1.A1_at 1.3007798
PtpAffx.208291.1.S1_at 1.2749351
PtpAffx.98933.1.S1_at  0.6661223
Ptp.1663.1.A1_s_at     0.5290209
PtpAffx.218179.1.S1_at 0.4916112

> topTable(fit2, coef=2)
PtpAffx.209290.1.S1_at  -0.8563588  2.444446 -54.56404 4.697955e-12 2.885155e-07
PtpAffx.203251.1.S1_at   0.2429489  2.287828  49.60141 1.048114e-11 3.218391e-07
PtpAffx.224388.1.S1_at  -4.2811127  3.858417 -30.90063 5.571314e-10 1.140504e-05
PtpAffx.148.1.S1_at     -1.9477112  2.752318 -23.29488 5.895214e-09 9.051070e-05
PtpAffx.223794.1.S1_at  -2.6400515  2.938420 -22.05193 9.300969e-09 1.142401e-04
Ptp.3132.1.S1_s_at       1.4337413 11.293561  18.77346 3.529678e-08 3.612802e-04
Ptp.2674.1.S1_at        -0.1455180  2.263546 -15.97011 1.335758e-07 1.171899e-03
Ptp.1054.1.A1_at        -0.5961837  2.470003 -15.56056 1.652398e-07 1.268484e-03
PtpAffx.91488.1.A1_s_at -1.6223911  7.548499 -15.07968 2.135760e-07 1.457371e-03
PtpAffx.207775.1.S1_at  -1.3694709  2.716284 -14.87850 2.383214e-07 1.463603e-03
B
PtpAffx.209290.1.S1_at  12.175812
PtpAffx.203251.1.S1_at  12.028901
PtpAffx.224388.1.S1_at  10.876208
PtpAffx.148.1.S1_at      9.763806
PtpAffx.223794.1.S1_at   9.507059
Ptp.3132.1.S1_s_at       8.678152
Ptp.2674.1.S1_at         7.739341
Ptp.1054.1.A1_at         7.579513
PtpAffx.91488.1.A1_s_at  7.383316
PtpAffx.207775.1.S1_at   7.298375

> topTable(fit3, coef=1)
PtpAffx.209290.1.S1_at  -0.8563588  2.444446 -54.56404 4.697955e-12 2.885155e-07
PtpAffx.203251.1.S1_at   0.2429489  2.287828  49.60141 1.048114e-11 3.218391e-07
PtpAffx.224388.1.S1_at  -4.2811127  3.858417 -30.90063 5.571314e-10 1.140504e-05
PtpAffx.148.1.S1_at     -1.9477112  2.752318 -23.29488 5.895214e-09 9.051070e-05
PtpAffx.223794.1.S1_at  -2.6400515  2.938420 -22.05193 9.300969e-09 1.142401e-04
Ptp.3132.1.S1_s_at       1.4337413 11.293561  18.77346 3.529678e-08 3.612802e-04
Ptp.2674.1.S1_at        -0.1455180  2.263546 -15.97011 1.335758e-07 1.171899e-03
Ptp.1054.1.A1_at        -0.5961837  2.470003 -15.56056 1.652398e-07 1.268484e-03
PtpAffx.91488.1.A1_s_at -1.6223911  7.548499 -15.07968 2.135760e-07 1.457371e-03
PtpAffx.207775.1.S1_at  -1.3694709  2.716284 -14.87850 2.383214e-07 1.463603e-03
B
PtpAffx.209290.1.S1_at  12.175812
PtpAffx.203251.1.S1_at  12.028901
PtpAffx.224388.1.S1_at  10.876208
PtpAffx.148.1.S1_at      9.763806
PtpAffx.223794.1.S1_at   9.507059
Ptp.3132.1.S1_s_at       8.678152
Ptp.2674.1.S1_at         7.739341
Ptp.1054.1.A1_at         7.579513
PtpAffx.91488.1.A1_s_at  7.383316
PtpAffx.207775.1.S1_at   7.298375

That looks okay to me. The N50 comparisons are the same between fit2 and fit3, as are the RAB comparisons. This means that there's nothing wrong when the order of contrasts is switched in makeContrasts.

As for the N50 comparison itself, the lack of DE genes may be the true result. From the look of it, the top set of fold changes are quite low. This suggests that there may not be strong changes, rather than detection being compromised by large dispersions. I'd generate a MDS plot of all libraries and see how the N50 libraries cluster.

Hi @Aaron Lun . I didnt want to post a new question because I am actually a bit confused with the reversing that you just mentioned above ( e.g., N50_C - N50_D instead of N50_D - N50_C ), because by doing so this can swap the number of detected DE. I am wondering if what is generally accepted when I for example want to compare Normal versus cancer tissue? e.g, should it be normal - cancer or cancer - normal. I am asking this because I need to know which genes are up/ down regulated in cancer tissue compared to normal .

Thank a lot.

1

Doesn't matter. If you have Cancer - Normal, the genes with positive log-fold changes are upregulated in cancer. If you have Normal - Cancer, the genes with negative log-fold changes are upregulated in cancer. The choice of order here will only affect the sign of the log-fold changes; the p-values should not change.