Question: Building contrasts for combined treatment groups to compare to a control
0
8 months ago by
mart11390
Purdue University
mart11390 wrote:

Hello, my name is Alex and I am currently a graduate student at Purdue University. I am using edgeR to conduct differential gene expression for a project wherein I'd like to compare 3 treatment groups [control (c)treatment-low (tl), and treatment-high (th)] across 3 population groups (lmlc, and ct) and I have a question regarding forming contrasts. I want to make two general types of comparisons:

1. compare all treated individuals (i.e., combined th + tl) to all control individuals across populations
2. compare treated individuals (again, combined th+tl) to control individuals within each population

Using the edgeR User Guide section 3.3 as a guide, I combined population+treatment into a single factor and created the following design matrix:

design <- model.matrix(~0+group)

ct_c ct_th ct_tl lc_c lc_th lc_tl lm_c lm_th lm_tl
ct_c_1     1     0     0    0     0     0    0     0     0
ct_c_2     1     0     0    0     0     0    0     0     0
ct_c_3     1     0     0    0     0     0    0     0     0
ct_th_1    0     1     0    0     0     0    0     0     0
ct_th_2    0     1     0    0     0     0    0     0     0
ct_th_3    0     1     0    0     0     0    0     0     0
ct_tl_1    0     0     1    0     0     0    0     0     0
ct_tl_2    0     0     1    0     0     0    0     0     0
ct_tl_3    0     0     1    0     0     0    0     0     0
lc_c_1     0     0     0    1     0     0    0     0     0
lc_c_2     0     0     0    1     0     0    0     0     0
lc_c_3     0     0     0    1     0     0    0     0     0
lc_th_1    0     0     0    0     1     0    0     0     0
lc_th_2    0     0     0    0     1     0    0     0     0
lc_th_3    0     0     0    0     1     0    0     0     0
lc_tl_1    0     0     0    0     0     1    0     0     0
lc_tl_2    0     0     0    0     0     1    0     0     0
lc_tl_3    0     0     0    0     0     1    0     0     0
lm_c_1     0     0     0    0     0     0    1     0     0
lm_c_2     0     0     0    0     0     0    1     0     0
lm_c_3     0     0     0    0     0     0    1     0     0
lm_c_4     0     0     0    0     0     0    1     0     0
lm_c_5     0     0     0    0     0     0    1     0     0
lm_th_1    0     0     0    0     0     0    0     1     0
lm_th_2    0     0     0    0     0     0    0     1     0
lm_tl_1    0     0     0    0     0     0    0     0     1
lm_tl_2    0     0     0    0     0     0    0     0     1
lm_tl_3    0     0     0    0     0     0    0     0     1
lm_tl_4    0     0     0    0     0     0    0     0     1
lm_tl_5    0     0     0    0     0     0    0     0     1

With the above design matrix I created the following contrasts to make the comparisons outlined above:

1. allpops_contrast <- c(-1/2,1/4,1/4,-1/2,1/4,1/4,-1/2,1/4,1/4)

2. my.contrasts <- makeContrasts(lm_cvt=(lm_th+lm_tl)/2-lm_c, lc_cvt=(lc_th+lc_tl)/2-lc_c, ct_cvt=(ct_th+ct_tl)/2-ct_c, levels=design)

I then used edgeR's GLM functionality to carry out the comparisons. My question is whether these contrasts are comparing the groups I intend for them to compare based on my objectives listed above. The results for the all population comparison yield a large number of DEGs (> 4,000), whereas each individual population comparison yields a much more modest number (<80 in any population comparison). Am I missing something when it comes to forming contrasts?

Thanks in advance for any insight!

Alex

modified 8 months ago by Aaron Lun22k • written 8 months ago by mart11390
Answer: Building contrasts for combined treatment groups to compare to a control
3
8 months ago by
Aaron Lun22k
Cambridge, United Kingdom
Aaron Lun22k wrote:

For your first hypothesis; your allpops_contrast doesn't look right to me. It should look something like this:

allpops_contrast <- c(-1/3,1/6,1/6,-1/3,1/6,1/6,-1/3,1/6,1/6)

... because you want to know whether the grand average of the three control groups is different from the grand average of the six treatment groups. I would recommend being explicit about it and writing it out:

con <- makeContrasts(
(lm_th + lm_tl + lc_th + lc_tl + ct_th + ct_tl)/6
- (lm_c + lc_c + ct_c)/3, levels=design)

This will ensure that your code is robust to reordering of the columns in design.

For your second set of hypotheses; my.contrasts looks fine.

While the aforementioned fix to allpops_contrast should now yield correct log-fold changes, it shouldn't change the number of DE genes, unless you're using TREAT to test against a log-fold change threshold. Assuming you ran glmQLFTest correctly, the discrepancy in the number of DE genes is most likely caused by the fact that you have a lot more power when you're using information across 30 libraries (in the first contrast) compared to 9-12 libraries (in each of the population-specific contrasts).

To convince yourself, you can do a little thought experiment based on the standard errors of the log-fold changes (we'll use a limma-like linear model for convenience). The sample mean of each condition has a variance of v/n where v is the sample variance and n is the number of replicates. This means that the variance of the log-fold change in the first contrast would be:

(v/3 + v/3 + v/5)/3^2 + (v/3 + v/3 + v/2 + v/3 + v/3 + v/5)/6^2 = v * 0.152

... remembering that any scaling of a random variable has a squared effect on the variance (during calculation of the grand mean of each treated/control group). By comparison, the variance of the log-fold change in the ct-specific contrast is much larger:

(v/3 + v/3)/2^2 + v/3 = v * 0.5

... which results in reduced power to detect population-specific DE genes.

P.S. Keep in mind that when you supply a contrast matrix to the various GLM testing functions in edgeR, they will perform ANODEVs where the global null hypothesis (formed from the individual null hypotheses corresponding to each column) is tested. If you want to test the null hypothesis for each population, make sure you only supply the corresponding column to contrasts=.

Thanks Aaron, I see my mistake with the first set of contrasts now. Also, your explanation of the effects of increasing library size on standard error and power to detect DE genes was extremely clear and much appreciated. I was concerned that the all population had nearly 2 orders of magnitude more DEGs than the individual comparisons, but I now see how large differences in the standard error and variance can occur due to differences in 'n' between the all-population and single-population comparisons.

Just wanted to confirm the point you made in your P.S. statement -- If I want to test the null hypothesis within a given population given the contrasts below:

my.contrasts <- makeContrasts(lm_cvt=(lm_th+lm_tl)/2-lm_c, lc_cvt=(lc_th+lc_tl)/2-lc_c, ct_cvt=(ct_th+ct_tl)/2-ct_c, levels=design)

I could achieve this by calling the corresponding set of contrasts within the my.contrasts object (in this example, testing the null hypothesis in the lm population):

lrt_lm <- glmLRT(fit, contrast = my.contrasts[,"lm_cvt"])

Is my understanding correct or were you referring to something else? Thanks for your input.

Yes, that's correct.